[Swan-dev] pluto: When Child state fails, don't tear down IKE SA

Mon Jul 5 17:22:36 UTC 2021

On Mon, 5 Jul 2021, Andrew Cagney wrote:

>       But that is changing the meaning of FAIL vs FATAL.
> 
> 
> (sarcasm) It's hard to change the meaning of something that was, well, meaningless.
> 
> This is what v3.23 had:
> 
> STF_INTERNAL_ERROR:
> It just signals to the state machine to do nothing (as I discovered last week, it also signals that the state on the stack state may have been deleted)!
> -> I think it should go; internal errors are either recoverable (just kill IKE family), or passert

This is supposed to core dump. We hit something that never is supposed
to happen.

> STF_DROP:
> delete current state; but "be vewy vewy quiet"
> -> this is gone
> 
> STF_FATAL:
> delete current state
> (for a child it would delete sending a notify, then see the ike sa was childless and delete that sending a second notify; except when there were multiple
> children)
> -> it now completes message exchange and deletes the IKE family; it is after all fatal

That is wrong. Because a fatal child should not delete the IKE SA or
other exitsing Child SA states, just because of the children is
STF_FATAL. Eg this is now executed when two endpoints misconfigure
their subnets. It makes it impossible for an admin to add a new subnet
and then tell the other end to add it, since it now tears down the
house.

> STF_FAIL:
> do pretty much nothing
> -> which is useless, the exchange failed so things are going nowhere
> -> it now completes message exchange and deletes current state

In IKEv1, I think this can still be valid. In IKEv2, things are more
clear so it is either OK or FATAL.

> STF_FAIL+v2N_....:
> in the responder send v2N and then deleted a state (during IKE_SA_INIT that was IKE, else it was the child, but see alsofor the initiator this did nothing
> -> this code path is gone

even in IKEv1?

> STF_IGNORE:
> it's been pretty consistent; pretend the message never happened
> 
> So STF_DROP, STF_FATAL and STF_FAIL+v2N could all potentially delete a state; and STF_FAIL on its own did nothing.

well, "nothing" meant it could retransmit the request, eg a Quick Mode
Request that got denied before. Eg retrying. It works because of the
unique msgids, you could just keep retransmitting such packets.

>       FAIL used to mean "there is a non-fatal error, please keep trying this state"
>       FATAL meant "there is no more hope, kill the state"

>       Note that this applies to the _states_ and not the connection. That is,
>       starting a new keying attempt after a timeout and/or revival are not
>       related to these state changes.
> 
> 
> If a state transition fails then it fails.  There is no "please keep trying this state".

That depends on the retransmit logic/arhitecture. Also, in IKEv1 you
could have Child SA states without having an IKE state. It was a
different world :)

> The only remaining question is how much damage should the failure inflict?  Delete the state or delete the IKE family (and yes any exchange can trigger a
> delete family, a critical payload, bad payload lengths, ...).

I still don't agree with this statement. A new CREATE_CHILD_SA that
fails should not take down existing Child SA's or an IKE SA. Even an
IKE REKEY failure should not take down the existing IKE SA.

> However, what I have found and fixed is code trying to ignore failure (failure is not not an option :-).  For instance, code would ignore unexpected
> notifications in an IKE_AUTH response, and instead let things retransmit.  This is futile.  The response was protected, so no number of retransmits will
> change its content.

Agreed, but IKEv1 might not be that clear.

>       In that sense, a CREATE_CHILD_SA that received an answer that leads to
>       an unfixable isse, like TS_UNACCEPTABLE, is a FATAL state change for
>       the child sa. FAIL would mean we might try to retransmit, which is
>       what is wrongly happening here. FAIL would also happen if we got an
>       unauthenticated reply we didn't like, so we would still wait for a
>       possible unauthenticated reply we like (to prevent attackers from
>       locking out legitimate packets).
> 
> If a packet fails protection, it's STF_IGNOREd.  The exception is with IKE_SA_INIT bit it too is free to STF_IGNORE a message.
>  
>
>       The way IKEv2 works with serial MSGID, we really don't have STF_FAIL in
>       that sense. We only have STF_OK or STF_FATAL. (well also STF_SUSPEND etc)
>       The only STF_FAIL would be the if we get no reply, and we need to
>       retransmit. I can't think of an STF_FAIL that would operate on the child
>       SA state.
> 
> 
> We're somewhat on the same page.  Any message can carry a notification that implies that the IKE family is dead.  For instance, INVALID_SYNTAX.  So the
> exchange outcome is either Ok, or fatal.

See above. I don't agree that INVALID_SYNTAX should be interpreted to
tear down the entire house. Or at least we should be really careful
with sending INVALID_SYNTAX. I find these two bits of the RFC rather
conflicting:

    If a peer parsing a request notices that it is badly formatted (after
    it has passed the message authentication code checks and window
    checks) and it returns an INVALID_SYNTAX notification, then this
    error notification is considered fatal in both peers, meaning that
    the IKE SA is deleted without needing an explicit Delete payload.

vs

  INVALID_SYNTAX                            7
        Indicates the IKE message that was received was invalid because
        some type, length, or value was out of range or because the
        request was rejected for policy reasons.

Specifically the "for policy reasons".

But even so, INVALID_SYNTAX is rare and not the problem we are dealing
with now, which is things like TS_UNACCEPTABLE. That is the case now
where we tear down the house and we should not.

> However with a CREATE_CHILD_SA exchange, there's three outcomes:
> - exchange succeeds; child succeeds
> - exchange succeeds; child fails (ike survives)
> - exchange fails, implying that the ike family dies
> hence the need for three status
> 
> Taking a step back, and ignoring our current code, which would you consider more serious: FAIL or FATAL?
> Personally, I'd treat something fatal as more serious and assume it's wiping things out.

As you said, I think we are on the same page, but FATAL in a child
should not cause the IKE SA or other Child SAs to die (maybe the
exception to this is indeed INVALID_SYNTAX - I don't care too much
since we don't really see that case happening in real life, and I'm
mostly concerned now with fixing the mismatching subnet cases that
throw TS_UNACCEPTABLE)

> The IKE_AUTH exchange is different.   The message is passed to the IKE SA, and then as a nested operation, it tries to create the child.  If the attempt
> to create the child proves fatal, the IKE SA can proceed accordingly.

That is not the case yet, which is why I added _two_ test cases with
mismatched subnet. One of those mismatches the first subnet, thus
causing a failure in IKE_AUTH, which currently also/still tears
down the IKE SA (indirectly through STF_FATAL, even if your recent
childless code made the IKE SA survive the initial failure)

> If CREATE_CHILD_SA followed this model (I call it nested states), then we could only need two codes.

Works for me, but we have to be careful not to break IKEv1 assumptions
and running code.

Paul