[Swan-dev] why do we bother checking out_raw() et.al.'s result?

Mon Sep 24 04:43:53 UTC 2018

| From: Andrew Cagney <andrew.cagney at gmail.com>
| 
| On Sat, 22 Sep 2018 at 14:19, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
| >
| > | From: Andrew Cagney <andrew.cagney at gmail.com>
| >
| > | > There are no good exception mechnisms.
| > |
| > | There is, its called abort().  Part of libreswan's start up is dealing
| > | with any mess from a previous abort.
| >
| > Abort kills all operations.  I have no trouble advocating abort/assert
| > when we go into unknown territory.  When an invariant turns out not to
| > be true.
| 
| Such as the buffer overflowed?

No.

The invariant established by out_raw is that the bytes are inserted
into the buffer OR failure is returned.

That's the regime under which the code was written.  Changing that would 
require a significant security audit.

It is possible to support both regimes simultaneously but that does not
seem attractive to me.

| > Packet output problems are not uncontemplated.  They are a kind of
| > resource exhaustion.  (Although the last time I hit one was actually
| > due to a bug in the calling code (that I had just writtten); an
| > assertion failure would have been slightly more convenient to debug.)
| >
| > The packet output problems are currently handled in a way that stops a
| > particular negotiation.  Nothing else is affected.
| 
| I'm really not sure what you mean by 'stops a particular negotiation'
| means.  For instance if an IKEv2 AUTH response comes back with a fail,
| we migh stop the negotiation, re-start the negotiation, or do some
| funky negotation (a notify exchange to delete things).

I neither know deeply nor trust completely the IKEv2 code.

I'm referring to the IKEv1 code from a decade ago.  As far as I know,
no intentional design change has been made.  I do find accidental
changes (often bugs).

| > | > In security code, it is good to stop when things are going wrong.  Not
| > | > get further into the weeds.
| > |
| > | The real enemy of security is unnecessary complexity.
| >
| > Yes.
| 
| > | So how is this:
| > |
| > |         return ikev1_out_generic(np, &isakmp_keyex_desc, outs, &z) &&
| > |             out_zero(g->len, &z, "fake g^x") &&
| > |             (close_output_pbs(&z), TRUE);
| > |
| > | which goes hand-in-hand with code trying to propagate 'fatal' up the
| > | stack is somehow less weedy than:
| > |
| > |        ikev1_out_generic(np, &isakmp_keyex_desc, outs, &z);
| > |        out_zero(g->len, &z, "fake g^x");
| > |        close_output_pbs(&z);
| > |
| > | After all, if it really is fatal, abort().
| >
| > It's not fatal.
| >
| > For example, an adversary might be able to drive us into such an
| > overflow.  The way things are now, only the negotiation he's cast his
| > evil spell upon will fail.  No general DOS.
| 
| right, #2 below

I'm not sure what #2 is.

| > | I made these points:
| > |
| > | - calling an out function after a fail will core dump
| > | hopefully that isn't true - that code path needs to be robust
| > | regardless (I'm sure if I look I can find more code like the above
| > | where a status is ignored)
| >
| > Probably.  Or an error can be lost (not buffer full errors, but
| > others).
| 
| Such as, can you be specific?  If we can't enumerate potential
| failures when emitting a payload then I suspect we've a problem.

Look for any places in the code that can return false.  I'm a bit too
lazy to bother.

An example from when I coded it (and may no longer be the case):
outputting an enum value that is not known.

| Perhaps out_struct() can screw up?  Regardless of this thread, we
| should ensure that the first error is sticky.

The current logic has the effect of making the first error sticky: the 
first error is reacted-to immediately.

The current logic is intended to make the person coding calls to the
packet.h functions aware that errors are possible and responsible for
handling that case.

It's kind of like the difference between && and &.

But yes, reporting the first problem is important.  Sometimes
reporting all problems is even better, but that begs for smarts to
suppress cascades.

(libipsecconf used to save up an error and pass it up, but later
errors overwrote earlier ones, so only the last would be reported.  I
recently hacked a change in so that errors would accumulate.  Of
course much more work remains to be done with libipsecconf error
reporting.)

|  Something that can be
| tested (even if all a re-call does is trigger an abort()?).
| 
| A case where care does need to be taken is in the packet-encrypt code.
| It should probably check that the un-encrypted packet built ok before
| trying to encrypt it.  Since IKEv[12] each only have one encryption
| code paths, checking this should be easy,  Oh! wait ...
| 
| > | - pluto must never knowingly send out a corrupt packet (reflection
| > | attack and all that)
| > | (but I want to :-) so drop the packet as it goes out the door
| >
| > During state transition functions, we modify the internal state of our
| > system (XFRMs, for example).  Quiting an STF as soon as an error is
| > discovered prevents such changes, that either matters or it doesn't.
| > I don't wish to audit the code to determine which.
| 
| I don't want to be the bearer of bad news, but for IKEv2 I _know_ we
| screw this up.

We should not screw this up.  The code should be fixed.  With high
priority.

Great care was taken in certain places in IKEv1 to not half-do things.

|   The STF_* model fails.
|
| For instance, during an IKEv2 AUTH the initiator may find itself
| wanting to: accept the packet, transition the IKE state to
| "established", delete the child (it shouldn't exist), and starting the
| process of cleanly deleting the parent (which means sending a notify).
| STF_* can't describe all that.  Nor do I think it should.

I didn't design or implement IKEv2 code.  All my IKEv2 work has been
fighting fires and low-level code improvement.

I haven't thought about the architecture.  That would include best way
to handle this.  Certainly if the current model fails, it should be
fixed or replaced.  There's no reason to assume that IKEv1 and IKEv2
should be shoehorned into the same model.  But it might work.

It's not a legit packet, its a broken partial packet.  We should not
send it.  We should not save it.  So we certainly should not
retransmit it.

| > | Second, if the SA should be killed, then inject a KILL-SA event rather
| > | than trying to handle it in-line.
| >
| > I don't know what this means.  It doesn't sound conservative.
| 
| See above for an AUTH exchange.  If the initiator rejects auth it
| needs to transition to initiate a delete, most straight forward way of
| doing that is injecting a delete parent event to trigger the process.

The original idea of (timed) events doesn't encompass this kind of
use.  It seems to have been able to handle it.  One feature to be
aware of is that the two actions are not "atomic" but that's often
OK.  It's important to take this into account.

So many Pluto errors are due to programmers not taking into account
atomicity.  Both threading and event-driven styles seem to be hard for
many programmers.

A lot of complexity has been added to the timed event mechanism.  That
may mean that there are lurking errors.  Certainly I've fixed a few.
For example, there used to be exactly one timed event per state object
(as well as a few global timed events, not associated with states).
Each was always active.  Now there can be several timed events per
state object, some of which may not be active.

BTW, I regret calling them "events".  They a a special case of events 
(after all, packet arrival is an event too).  I should have called the 
"timed events" or "scheduled events" or something like that.

================

If we're redoing packet.c stuff, here's something that I would try
changing:

When I designed this stuff, I thought of in_* and out_* as symmetric
and that they should share a lot of properties.  That is probably
overdone.

I'd create different struct packet_byte_streams for input and output.

- certain fields (e.g. for backpatching) are only meaningful for output

- I don't think any PBS is used both for input and output, even though
  that could be done (accidentally or intentionally).  So I think that
  this feature is more dangerous than useful.

- "sticky" error handling (as proposed by Andrew) makes more sense for
  output rather than input.

  + an input value is used by subsequent operations and if it is crap
    subsequent operations are probably broken

  + an output operation that fails may only poison the actual PBS.  It
    might be safe to just keep going until something important depends
    on it.  What's important?  At least making some internal state
    change or actually sending the packet.

================

Some problems are created by underdocumented designs.  For example,
when you implemented NP backpatching, it appears that you didn't know
about the tree-structure-in-time of the PBSes.  Backpatching code that
reflected that is simpler than what you wrote.  Any sticky error
conditions would need to flow toward the root of that tree.