[Swan-dev] IKEv2 rekey initiator failures

Wed Apr 8 14:07:43 UTC 2020

On Wed, 8 Apr 2020 at 02:29, Antony Antony <antony at phenome.org> wrote:

> Hi,
>
> I am hunting a couple of corner cases, IKEv2 rekey initiator failures.
> These issues appear when testing clones. Think of 100 IKEv2 Child SAs
> under
> one IKE SA and rekeying them all. In the test rekey margin and salife are
> short. Short values do not matter, because looking back in Tuomo
> production
> systems logs, with larger rekey margin and saflifetime, I see the same
> issues.
>
> First one is possibly fixed.
>
>
> https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz
>
>
    ikev2: insert new v2D message to the tail of pending v2D requests

It's certainly better:

    optimized to send out v2D message before v2_CREATE_CHILD_SA jobs.
    Delete would also get a quicker response from the other end, because,
    there are no crypto operations i.e. v2DH,KE,nonce to respond to
    a  v2_INFORMATIONAL request.

but is it needed?

    this queue may have v2_CREATE_CHILD_SA or/and v2_INFORMATIONAL (v2D).
    add new v2_INFORMATIONAL, v2D request to the tail of v2D requests,
    instead adding to the tail of the queue.

I'm also puzzled by:

if (IS_CHILD_SA_ESTABLISHED(st)) {
/*
* this is a continuation of delete message.
* shortcut complete_v2_state_transition()
* to call complete_v2_state_transition, need more work
*/
pending.cb(ike, st, NULL);
v2_msgid_schedule_next_initiator(ike);

The pluto log will show the following debugline and rekey response is
> dropped.
>
> Message ID: CHILD #1.#6 XXX: expecting initiator==NULL - suspect
> record'n'send with an out-of-order wrong packet response; discarding
> packet:
> ike.initiator.sent=7 ike.initiator.recv=6 ike.responder.sent=-1
> ike.responder.recv=-1 child.wip.initiator=5 child.wip.responder=-1
>

this is due to delete messages jumping the queue and Message ID seems to
> sync up again with next dpd or delete message. I am not sure that is RFC
> complaint.
>

It isn't.

> Second One seems to harder to trace. Here are the symptoms that I found so
> far.  Rekaying IPsec fails with the following message and the connections
> disappears. If you have auto=route it may come back again. auto=route is a
> bandaid.
>
> Apr  6 13:58:50.367487: | ikev2_child_sa_respond returned
> STF_INTERNAL_ERROR
>
>
Any ideas on what triggered the internal error?

Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished
> abort this exchange
>
> the internal error seems to caused  EVENT_SA_EXPIRE, timeout in 0 seconds
> which removed predecessor state before rekey send the message.
>
> Apr  6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0
> seconds for #3500
>
> Apr  6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as
> #3699 for IPSEC SA
> Apr  6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state
> #3500
>
> It seems when replace is hit, code decided to rekey, and also scheduled an
> EXPIRE. Usually margin is 70s or so Which is enough to rekey. In this case
> it margin end as 0.
>
> Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I ran it
> manually for longer time. It took 58 minutes before this error to occured.
>
> WARNING: the following log is big 100s Mega bytes, careful when clicking
> on
> the link in a browser. It is better to download it using wget to read.
>
>
> https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/west.pluto.log.gz
>
> I let the test run, it happend twice in about 6 hours.
>
> Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500
> vanished abort this exchange
> Apr  6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320
> vanished abort this exchange
>
> Looking back in older logs, Tuomo's log, I see the same sequence and
> connrection restarting either due to keying tries or traffic. He is
> running
> with default rekey margin and fuzz, while the test  has shorter rekey
> margin.  Note it is not just because of shorter rekey margin or salifetime.
>
> The oldest seems to hapeend in
>
> "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey
> #912 vanished abort this exchange"
>
> I am wondering where is the bug? pluto's rekey margin calcuation, or some
> other logic forcing the EXPIRE event.
>
> -antony
>
> _______________________________________________
> Swan-dev mailing list
> Swan-dev at lists.libreswan.org
> https://lists.libreswan.org/mailman/listinfo/swan-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.libreswan.org/pipermail/swan-dev/attachments/20200408/e8113216/attachment.html>