[Swan-dev] pluto: When Child state fails, don't tear down IKE SA

Andrew Cagney andrew.cagney at gmail.com
Mon Jul 5 18:46:39 UTC 2021

On Mon, 5 Jul 2021 at 13:22, Paul Wouters <paul at nohats.ca> wrote:

> On Mon, 5 Jul 2021, Andrew Cagney wrote:
> >       But that is changing the meaning of FAIL vs FATAL.
> >
> >
> > (sarcasm) It's hard to change the meaning of something that was, well,
> meaningless.
> >
> > This is what v3.23 had:
> >
> > It just signals to the state machine to do nothing (as I discovered last
> week, it also signals that the state on the stack state may have been
> deleted)!
> > -> I think it should go; internal errors are either recoverable (just
> kill IKE family), or passert
> This is supposed to core dump. We hit something that never is supposed
> to happen.

The most common case is realising that the send buffer is full.  Probably
not worth a core dump; but yes I'm sure there's other code where it should

I suspect the code deleting the state then returning STF_INTERNAL_ERROR is
something of an outlier (or IKEv1).

> > delete current state; but "be vewy vewy quiet"
> > -> this is gone
> >
> > delete current state
> > (for a child it would delete sending a notify, then see the ike sa was
> childless and delete that sending a second notify; except when there were
> multiple
> > children)
> > -> it now completes message exchange and deletes the IKE family; it is
> after all fatal
> That is wrong. Because a fatal child should not delete the IKE SA or
> other exitsing Child SA states, just because of the children is
> STF_FATAL. Eg this is now executed when two endpoints misconfigure
> their subnets. It makes it impossible for an admin to add a new subnet
> and then tell the other end to add it, since it now tears down the
> house.
Should not, vs cannot.
While I agree it shouldn't happen, it can.

> > do pretty much nothing
> > -> which is useless, the exchange failed so things are going nowhere
> > -> it now completes message exchange and deletes current state
> In IKEv1, I think this can still be valid. In IKEv2, things are more
> clear so it is either OK or FATAL.
I should have been clearer.  This is what the v2.32 *IKEv2* code did.

IKEv1 was already separated (but admitidly, the ikev2 code still looked
like a copy/paste of IKEv1).

> STF_FAIL+v2N_....:
> > in the responder send v2N and then deleted a state (during IKE_SA_INIT
> that was IKE, else it was the child, but see alsofor the initiator this did
> nothing
> > -> this code path is gone
> even in IKEv1?

I don't know what IKEv1 does.

> > it's been pretty consistent; pretend the message never happened
> >
> > So STF_DROP, STF_FATAL and STF_FAIL+v2N could all potentially delete a
> state; and STF_FAIL on its own did nothing.
> well, "nothing" meant it could retransmit the request, eg a Quick Mode
> Request that got denied before. Eg retrying. It works because of the
> unique msgids, you could just keep retransmitting such packets.

Not IKEv2.

Since the response is protected, there's no mistaking the answer: one thumb
up, one thumb down, two thumbs down.  Retransmitting the request won't
change this(1).  There is no middle ground here.

But that didn't stop broken IKEv2 code from trying, and hiding it behind a
release whack.  To spell it out:
- it would ignore the response; but release whack
- retransmit
- timeout and only then take the action it should have taken at the very
start; childless SAs caused this to be fixed

The only way to retry the SA by initiating a new exchange.

(1) Ok, technically a really crappy memory starved responder might, when
given a retransmit, try to re-perform the same operation (oh, wait, we've
got IKEv1 code doing that).

> >       FAIL used to mean "there is a non-fatal error, please keep trying
> this state"
> >       FATAL meant "there is no more hope, kill the state"
> >       Note that this applies to the _states_ and not the connection.
> That is,
> >       starting a new keying attempt after a timeout and/or revival are
> not
> >       related to these state changes.
> >
> >
> > If a state transition fails then it fails.  There is no "please keep
> trying this state".
> That depends on the retransmit logic/arhitecture. Also, in IKEv1 you
> could have Child SA states without having an IKE state. It was a
> different world :)
> > The only remaining question is how much damage should the failure
> inflict?  Delete the state or delete the IKE family (and yes any exchange
> can trigger a
> > delete family, a critical payload, bad payload lengths, ...).
> I still don't agree with this statement. A new CREATE_CHILD_SA that
> fails should not take down existing Child SA's or an IKE SA. Even an
> IKE REKEY failure should not take down the existing IKE SA.

Again.  Should not vs cannot.

> > However, what I have found and fixed is code trying to ignore failure
> (failure is not not an option :-).  For instance, code would ignore
> unexpected
> > notifications in an IKE_AUTH response, and instead let things
> retransmit.  This is futile.  The response was protected, so no number of
> retransmits will
> > change its content.
> Agreed, but IKEv1 might not be that clear.

This is IKEv2.

> >       In that sense, a CREATE_CHILD_SA that received an answer that
> leads to
> >       an unfixable isse, like TS_UNACCEPTABLE, is a FATAL state change
> for
> >       the child sa. FAIL would mean we might try to retransmit, which is
> >       what is wrongly happening here. FAIL would also happen if we got an
> >       unauthenticated reply we didn't like, so we would still wait for a
> >       possible unauthenticated reply we like (to prevent attackers from
> >       locking out legitimate packets).
> >
> > If a packet fails protection, it's STF_IGNOREd.  The exception is with
> IKE_SA_INIT bit it too is free to STF_IGNORE a message.
> >
> >
> >       The way IKEv2 works with serial MSGID, we really don't have
> >       that sense. We only have STF_OK or STF_FATAL. (well also
> >       The only STF_FAIL would be the if we get no reply, and we need to
> >       retransmit. I can't think of an STF_FAIL that would operate on the
> child
> >       SA state.
> >
> >
> > We're somewhat on the same page.  Any message can carry a notification
> that implies that the IKE family is dead.  For instance, INVALID_SYNTAX.
> So the
> > exchange outcome is either Ok, or fatal.
> See above. I don't agree that INVALID_SYNTAX should be interpreted to
> tear down the entire house. Or at least we should be really careful
> with sending INVALID_SYNTAX. I find these two bits of the RFC rather
> conflicting:

The RFCs pretty clear on what to do with INVALID_SYNTAX, and for the code
path we care about - crap payloads - when to send it.

>     If a peer parsing a request notices that it is badly formatted (after
>     it has passed the message authentication code checks and window
>     checks) and it returns an INVALID_SYNTAX notification, then this
>     error notification is considered fatal in both peers, meaning that
>     the IKE SA is deleted without needing an explicit Delete payload.
> vs
>   INVALID_SYNTAX                            7
>         Indicates the IKE message that was received was invalid because
>         some type, length, or value was out of range or because the
>         request was rejected for policy reasons.
> Specifically the "for policy reasons".
> But even so, INVALID_SYNTAX is rare and not the problem we are dealing
> with now, which is things like TS_UNACCEPTABLE. That is the case now
> where we tear down the house and we should not.

... presumably there's a code path that should have returned STF_FAIL.

> > However with a CREATE_CHILD_SA exchange, there's three outcomes:
> > - exchange succeeds; child succeeds
> > - exchange succeeds; child fails (ike survives)
> > - exchange fails, implying that the ike family dies
> > hence the need for three status
> >
> > Taking a step back, and ignoring our current code, which would you
> consider more serious: FAIL or FATAL?
> > Personally, I'd treat something fatal as more serious and assume it's
> wiping things out.
> As you said, I think we are on the same page, but FATAL in a child
> should not cause the IKE SA or other Child SAs to die (maybe the
> exception to this is indeed INVALID_SYNTAX - I don't care too much
> since we don't really see that case happening in real life, and I'm
> mostly concerned now with fixing the mismatching subnet cases that

Again, three possible outcomes requiring three clear status codes:
> - exchange succeeds; child succeeds STF_OK
> - exchange succeeds; child fails (ike survives) STF_FAIL
> - exchange fails, implying that the ike family dies STF_FATAL
we need a way to differentiate between them, and return accordingly.

> > The IKE_AUTH exchange is different.   The message is passed to the IKE
> SA, and then as a nested operation, it tries to create the child.  If the
> attempt
> > to create the child proves fatal, the IKE SA can proceed accordingly.
> That is not the case yet, which is why I added _two_ test cases with
> mismatched subnet. One of those mismatches the first subnet, thus
> causing a failure in IKE_AUTH, which currently also/still tears
> down the IKE SA (indirectly through STF_FATAL, even if your recent
> childless code made the IKE SA survive the initial failure)
> > If CREATE_CHILD_SA followed this model (I call it nested states), then
> we could only need two codes.
> Works for me, but we have to be careful not to break IKEv1 assumptions
> and running code.

None of this code is in the IKEv1 code path.  Hasn't been for years.
However, it isn't trivial.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.libreswan.org/pipermail/swan-dev/attachments/20210705/147edbff/attachment.html>

More information about the Swan-dev mailing list