[Swan-dev] pluto: When Child state fails, don't tear down IKE SA

Mon Jul 5 17:01:10 UTC 2021

On Mon, 5 Jul 2021 at 09:26, Paul Wouters <paul at nohats.ca> wrote:

> On Mon, 5 Jul 2021, Andrew Cagney wrote:
>
> >       commit 68fb298d059854253e8267680aeee1ee1f3158a3
> >       Author: Paul Wouters <paul.wouters at aiven.io>
> >       Date:   Sun Jul 4 22:15:51 2021 -0400
> >
> >           pluto: When Child state fails, don't tear down IKE SA
> >
> >           In complete_v2_state_transition() for a child SA state
> STF_FATAL
> >           error, don't call delete_ike_family()
> >
> >
> >
> > A create child sa transaction can finish in one of three ways:
> >
> > - ok ...
> > - fail, the specific sa needs to be deleted but the ike sa remains
> > - fatal, something bad happened the entire family is dead; thing
> INVALID_SYNTAX
> >
> >
> https://github.com/libreswan/libreswan/commit/1f72ba5ce87a34bc3140e2e8fcaf843011f6a959
> > went through and eliminated remaining cases where fail+v2n was returned;
> it sounds like we've still got cases where FATAL is being
> > returned.
> >
> > so this is going in the wrong direction
>
> But that is changing the meaning of FAIL vs FATAL.
>
>
(sarcasm) It's hard to change the meaning of something that was, well,
meaningless.

This is what v3.23 had:

STF_INTERNAL_ERROR:
It just signals to the state machine to do nothing (as I discovered last
week, it also signals that the state on the stack state may have been
deleted)!
-> I think it should go; internal errors are either recoverable (just kill
IKE family), or passert

STF_DROP:
delete current state; but "be vewy vewy quiet"
-> this is gone

STF_FATAL:
delete current state
(for a child it would delete sending a notify, then see the ike sa was
childless and delete that sending a second notify; except when there were
multiple children)
-> it now completes message exchange and deletes the IKE family; it is
after all fatal

STF_FAIL:
do pretty much nothing
-> which is useless, the exchange failed so things are going nowhere
-> it now completes message exchange and deletes current state

STF_FAIL+v2N_....:
in the responder send v2N and then deleted a state (during IKE_SA_INIT that
was IKE, else it was the child, but see alsofor the initiator this did
nothing
-> this code path is gone

STF_IGNORE:
it's been pretty consistent; pretend the message never happened

So STF_DROP, STF_FATAL and STF_FAIL+v2N could all potentially delete a
state; and STF_FAIL on its own did nothing.
However, without looking at the code that returns the status, knowing what
state was deleted was a struggle.

FAIL used to mean "there is a non-fatal error, please keep trying this
> state"
> FATAL meant "there is no more hope, kill the state"
>
>
Note that this applies to the _states_ and not the connection. That is,
> starting a new keying attempt after a timeout and/or revival are not
> related to these state changes.
>

If a state transition fails then it fails.  There is no "please keep trying
this state".
The only remaining question is how much damage should the failure inflict?
Delete the state or delete the IKE family (and yes any exchange can trigger
a delete family, a critical payload, bad payload lengths, ...).

However, what I have found and fixed is code trying to ignore failure
(failure is not not an option :-).  For instance, code would ignore
unexpected notifications in an IKE_AUTH response, and instead let things
retransmit.  This is futile.  The response was protected, so no number of
retransmits will change its content.

> In that sense, a CREATE_CHILD_SA that received an answer that leads to
> an unfixable isse, like TS_UNACCEPTABLE, is a FATAL state change for
> the child sa. FAIL would mean we might try to retransmit, which is
> what is wrongly happening here. FAIL would also happen if we got an
> unauthenticated reply we didn't like, so we would still wait for a
> possible unauthenticated reply we like (to prevent attackers from
> locking out legitimate packets).
>

If a packet fails protection, it's STF_IGNOREd.  The exception is with
IKE_SA_INIT bit it too is free to STF_IGNORE a message.

>
>
> The way IKEv2 works with serial MSGID, we really don't have STF_FAIL in
> that sense. We only have STF_OK or STF_FATAL. (well also STF_SUSPEND etc)
> The only STF_FAIL would be the if we get no reply, and we need to
> retransmit. I can't think of an STF_FAIL that would operate on the child
> SA state.
>

We're somewhat on the same page.  Any message can carry a notification that
implies that the IKE family is dead.  For instance, INVALID_SYNTAX.  So the
exchange outcome is either Ok, or fatal.

However with a CREATE_CHILD_SA exchange, there's three outcomes:
- exchange succeeds; child succeeds
- exchange succeeds; child fails (ike survives)
- exchange fails, implying that the ike family dies
hence the need for three status

Taking a step back, and ignoring our current code, which would you consider
more serious: FAIL or FATAL?
Personally, I'd treat something fatal as more serious and assume it's
wiping things out.

The IKE_AUTH exchange is different.   The message is passed to the IKE SA,
and then as a nested operation, it tries to create the child.  If the
attempt to create the child proves fatal, the IKE SA can proceed
accordingly.

If CREATE_CHILD_SA followed this model (I call it nested states), then we
could only need two codes.

>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.libreswan.org/pipermail/swan-dev/attachments/20210705/94b0ed08/attachment.html>