<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 8 Apr 2020 at 02:29, Antony Antony <<a href="mailto:antony@phenome.org">antony@phenome.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

I am hunting a couple of corner cases, IKEv2 rekey initiator failures.  <br>

These issues appear when testing clones. Think of 100 IKEv2 Child SAs under <br>

one IKE SA and rekeying them all. In the test rekey margin and salife are <br>

short. Short values do not matter, because looking back in Tuomo production <br>

systems logs, with larger rekey margin and saflifetime, I see the same <br>

issues.<br>

<br>

First one is possibly fixed.<br>

<br>

<a href="https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz" rel="noreferrer" target="_blank">https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz</a><br>

<br></blockquote><div><br></div><div>    ikev2: insert new v2D message to the tail of pending v2D requests<br>    <br></div><div>It's certainly better:</div><div><br></div><div>    optimized to send out v2D message before v2_CREATE_CHILD_SA jobs.<br>    Delete would also get a quicker response from the other end, because,<br>    there are no crypto operations i.e. v2DH,KE,nonce to respond to<br>    a  v2_INFORMATIONAL request.<br>    <br></div><div>but is it needed?<br></div><div><br></div><div>    this queue may have v2_CREATE_CHILD_SA or/and v2_INFORMATIONAL (v2D).<br>    add new v2_INFORMATIONAL, v2D request to the tail of v2D requests,<br>    instead adding to the tail of the queue.</div><div><br></div><div>I'm also puzzled by:<br></div><div><br><div style="margin-left:40px">           if (IS_CHILD_SA_ESTABLISHED(st)) {<br></div><div style="margin-left:40px"><div style="margin-left:40px">                    /*<br>                     * this is a continuation of delete message.<br>                   * shortcut complete_v2_state_transition()<br>                     * to call complete_v2_state_transition, need more work<br>                        */<br>                   pending.cb(ike, st, NULL);<br>                    v2_msgid_schedule_next_initiator(ike);</div></div></div></div><div class="gmail_quote"><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

The pluto log will show the following debugline and rekey response is <br>

dropped.<br>

<br>

Message ID: CHILD #1.#6 XXX: expecting initiator==NULL - suspect <br>

record'n'send with an out-of-order wrong packet response; discarding packet: <br>

ike.initiator.sent=7 ike.initiator.recv=6 ike.responder.sent=-1 <br>

ike.responder.recv=-1 child.wip.initiator=5 child.wip.responder=-1<br></blockquote><div><br></div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

this is due to delete messages jumping the queue and Message ID seems to <br>

sync up again with next dpd or delete message. I am not sure that is RFC <br>

complaint.<br></blockquote><div><br></div><div>It isn't.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Second One seems to harder to trace. Here are the symptoms that I found so <br>

far.  Rekaying IPsec fails with the following message and the connections <br>

disappears. If you have auto=route it may come back again. auto=route is a <br>

bandaid.<br>

<br>

Apr  6 13:58:50.367487: | ikev2_child_sa_respond returned STF_INTERNAL_ERROR<br>

<br></blockquote></div><div class="gmail_quote"><br></div><div class="gmail_quote">Any ideas on what triggered the internal error?<br></div><div class="gmail_quote"><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished <br>

abort this exchange<br>

<br>

the internal error seems to caused  EVENT_SA_EXPIRE, timeout in 0 seconds <br>

which removed predecessor state before rekey send the message.<br>

<br>

Apr  6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0 seconds for #3500<br>

<br>

Apr  6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as <br>

#3699 for IPSEC SA<br>

Apr  6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state #3500<br>

<br>

It seems when replace is hit, code decided to rekey, and also scheduled an <br>

EXPIRE. Usually margin is 70s or so Which is enough to rekey. In this case <br>

it margin end as 0.<br>

<br>

Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I ran it <br>

manually for longer time. It took 58 minutes before this error to occured. <br>

<br>

WARNING: the following log is big 100s Mega bytes, careful when clicking on <br>

the link in a browser. It is better to download it using wget to read.<br>

<br>

<a href="https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/west.pluto.log.gz" rel="noreferrer" target="_blank">https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/west.pluto.log.gz</a><br>

<br>

I let the test run, it happend twice in about 6 hours.<br>

<br>

Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished abort this exchange<br>

Apr  6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320 <br>

vanished abort this exchange<br>

<br>

Looking back in older logs, Tuomo's log, I see the same sequence and <br>

connrection restarting either due to keying tries or traffic. He is running <br>

with default rekey margin and fuzz, while the test  has shorter rekey <br>

margin.  Note it is not just because of shorter rekey margin or salifetime.<br>

<br>

The oldest seems to hapeend in <br>

<br>

"Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey #912 vanished abort this exchange"<br>

<br>

I am wondering where is the bug? pluto's rekey margin calcuation, or some <br>

other logic forcing the EXPIRE event.<br>

<br>

-antony<br>

<br>

_______________________________________________<br>

Swan-dev mailing list<br>

<a href="mailto:Swan-dev@lists.libreswan.org" target="_blank">Swan-dev@lists.libreswan.org</a><br>

<a href="https://lists.libreswan.org/mailman/listinfo/swan-dev" rel="noreferrer" target="_blank">https://lists.libreswan.org/mailman/listinfo/swan-dev</a><br>

</blockquote></div></div>