[Swan-dev] IKEv2 rekey initiator failures

Wed Apr 8 17:49:09 UTC 2020

On Wed, 8 Apr 2020 at 13:16, Antony Antony <antony at phenome.org> wrote:

> On Wed, Apr 08, 2020 at 10:07:43AM -0400, Andrew Cagney wrote:
> >
> >
> > On Wed, 8 Apr 2020 at 02:29, Antony Antony <antony at phenome.org> wrote:
> >
> >     Hi,
> >
> >     I am hunting a couple of corner cases, IKEv2 rekey initiator
> failures.
> >     These issues appear when testing clones. Think of 100 IKEv2 Child
> SAs under
> >     one IKE SA and rekeying them all. In the test rekey margin and
> salife are
> >     short. Short values do not matter, because looking back in Tuomo
> production
> >     systems logs, with larger rekey margin and saflifetime, I see the
> same
> >     issues.
> >
> >     First one is possibly fixed.
> >
> >     https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/
> >     ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz
> >
> >
> >
> >     ikev2: insert new v2D message to the tail of pending v2D requests
> >
> > It's certainly better:
> >
> >     optimized to send out v2D message before v2_CREATE_CHILD_SA jobs.
> >     Delete would also get a quicker response from the other end, because,
> >     there are no crypto operations i.e. v2DH,KE,nonce to respond to
> >     a  v2_INFORMATIONAL request.
> >
> > but is it needed?
>
> I noticed sometimes a several rekeys would get queued up. And delete would
> stay in the queue for longer. I think it is best to prioritize v2D ahead
> CREATE_CHILD_SA. Also can't think of any side effect of pritorizing v2D.
>

It re-orders the event queue, which will make debugging harder.

> >     this queue may have v2_CREATE_CHILD_SA or/and v2_INFORMATIONAL (v2D).
> >     add new v2_INFORMATIONAL, v2D request to the tail of v2D requests,
> >     instead adding to the tail of the queue.
> >
> > I'm also puzzled by:
>
> yes. It seems necessary for now. May be there could a way to call
> complete_v2_state_transition().
>

The processor is is most likely broken.

> > if (IS_CHILD_SA_ESTABLISHED(st)) {
> > /*
> > * this is a continuation of delete message.
> > * shortcut complete_v2_state_transition()
> > * to call complete_v2_state_transition, need more work
> > */
> > pending.cb(ike, st, NULL);
>
>
> Initially I imagined calling complete_v2_state_transition() would be
> cleaner, however it turned out to be harder. There are too many checks
> inside complete_v2_state_transition() and success_v2_state_transition()
> that
> would fail, or need hacks or fixes.
>
> Initiatiing INFROMATIONAL v2D does not have an svm entry or  fake md.  I
> am
> not sure faking all that is worth it.
>
> Message ID is already advanced in send_delete()
>

As in:

/*
* XXX: The record 'n' send call shouldn't be
* needed.  Instead, as part of this
* transition (live -> being-deleted) the
* standard success_v2_transition() code path
* should get to do the right thing.
*
* XXX: The record 'n' send call leads to an
* RFC violation.  The lack of a state
* transition means there's nothing set up to
* wait for the ack.  And that in turn means
* that the next packet will be sent before
* this one has had a response.
*/

> Now I am thinking a lighter version complete_v2_state_transition() is
> probably ideal this.
>
> when completeing an v2D all we really need is the next line.
> > v2_msgid_schedule_next_initiator(ike);
>
> >
> >     Second One seems to harder to trace. Here are the symptoms that I
> found so
> >     far.  Rekaying IPsec fails with the following message and the
> connections
> >     disappears. If you have auto=route it may come back again.
> auto=route is a
> >     bandaid.
> >
> >     Apr  6 13:58:50.367487: | ikev2_child_sa_respond returned
> >     STF_INTERNAL_ERROR
> >
> >
> >
> > Any ideas on what triggered the internal error?
>
> I know one step further. the line shown bellow "EVENT_SA_EXPIRE, timeout
> in
> 0 seconds" is the cause. However, I don't why pluto schedule this
> EVENT_SA_EXPIRE.
>
>
According to the below the internal-error triggered the expire; not the
reverse?

That's totally expected.

Did you look a the log?
>
> >
> >     Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500
> vanished
> >     abort this exchange
> >
> >     the internal error seems to caused  EVENT_SA_EXPIRE, timeout in 0
> seconds
> >     which removed predecessor state before rekey send the message.
> >
>

So there's an STF_INTERNAL_ERROR, and then ...

>     Apr  6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in
> 0
> >     seconds for #3500
>

>     Apr  6 13:58:50.230304: | duplicating state object #3606 "west/100x0"
> as
> >     #3699 for IPSEC SA
> >     Apr  6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child
> state #
> >     3500
> >
> >     It seems when replace is hit, code decided to rekey, and also
> scheduled an
> >     EXPIRE. Usually margin is 70s or so Which is enough to rekey. In
> this case
> >     it margin end as 0.
> >
> >     Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I
> ran it
> >     manually for longer time. It took 58 minutes before this error to
> occured.
> >
> >     WARNING: the following log is big 100s Mega bytes, careful when
> clicking on
> >     the link in a browser. It is better to download it using wget to
> read.
> >
> >
> https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/
> >     west.pluto.log.gz
> >
> >     I let the test run, it happend twice in about 6 hours.
> >
> >     Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500
> vanished
> >     abort this exchange
> >     Apr  6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320
> >     vanished abort this exchange
> >
> >     Looking back in older logs, Tuomo's log, I see the same sequence and
> >     connrection restarting either due to keying tries or traffic. He is
> running
> >     with default rekey margin and fuzz, while the test  has shorter rekey
> >     margin.  Note it is not just because of shorter rekey margin or
> salifetime.
> >
> >     The oldest seems to hapeend in
> >
> >     "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to
> rekey #
> >     912 vanished abort this exchange"
> >
> >     I am wondering where is the bug? pluto's rekey margin calcuation, or
> some
> >     other logic forcing the EXPIRE event.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.libreswan.org/pipermail/swan-dev/attachments/20200408/2f346936/attachment-0001.html>