<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 8 Apr 2020 at 13:16, Antony Antony <<a href="mailto:antony@phenome.org">antony@phenome.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, Apr 08, 2020 at 10:07:43AM -0400, Andrew Cagney wrote:<br>
> <br>
> <br>
> On Wed, 8 Apr 2020 at 02:29, Antony Antony <<a href="mailto:antony@phenome.org" target="_blank">antony@phenome.org</a>> wrote:<br>
> <br>
> Hi,<br>
> <br>
> I am hunting a couple of corner cases, IKEv2 rekey initiator failures.<br>
> These issues appear when testing clones: think of 100 IKEv2 Child SAs under<br>
> one IKE SA, all being rekeyed. In the test, the rekey margin and salifetime are<br>
> short. The short values do not matter: looking back in Tuomo's production<br>
> system logs, with a larger rekey margin and salifetime, I see the same<br>
> issues.<br>
> <br>
> The first one is possibly fixed.<br>
> <br>
> <a href="https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/" rel="noreferrer" target="_blank">https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/</a><br>
> ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz<br>
> <br>
> <br>
> <br>
> ikev2: insert new v2D message to the tail of pending v2D requests<br>
> <br>
> It's certainly better:<br>
> <br>
> This is optimized to send out the v2D message before v2_CREATE_CHILD_SA jobs.<br>
> A delete would also get a quicker response from the other end, because<br>
> no crypto operations (DH, KE, nonce) are needed to respond to<br>
> a v2_INFORMATIONAL request.<br>
> <br>
> but is it needed?<br>
<br>
I noticed that sometimes several rekeys would get queued up, and a delete would<br>
stay in the queue for longer. I think it is best to prioritize v2D ahead of<br>
CREATE_CHILD_SA. Also, I can't think of any side effect of prioritizing v2D.<br></blockquote><div><br></div><div>It re-orders the event queue, which will make debugging harder.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> This queue may have v2_CREATE_CHILD_SA and/or v2_INFORMATIONAL (v2D) requests.<br>
> Add a new v2_INFORMATIONAL (v2D) request to the tail of the pending v2D requests,<br>
> instead of adding it to the tail of the whole queue.<br>
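The queueing rule described above can be sketched as follows. This is a hedged illustration in C, not pluto's actual code: `struct pending_req`, `queue_request()`, and the `req_kind` names are all stand-ins for whatever pluto really uses.

```c
#include <stddef.h>

/*
 * Illustrative pending-request queue: a new v2D request is inserted at
 * the tail of the run of pending v2D requests at the head, so deletes
 * stay ahead of queued v2_CREATE_CHILD_SA jobs; everything else gets a
 * plain tail insert.
 */
enum req_kind { REQ_CREATE_CHILD_SA, REQ_V2D };

struct pending_req {
	enum req_kind kind;
	struct pending_req *next;
};

void queue_request(struct pending_req **head, struct pending_req *req)
{
	struct pending_req **pp = head;
	if (req->kind == REQ_V2D) {
		/* skip over v2D requests already queued at the head */
		while (*pp != NULL && (*pp)->kind == REQ_V2D)
			pp = &(*pp)->next;
	} else {
		/* plain tail insert for CREATE_CHILD_SA jobs */
		while (*pp != NULL)
			pp = &(*pp)->next;
	}
	req->next = *pp;
	*pp = req;
}
```

With two CREATE_CHILD_SA jobs already queued, a newly queued v2D jumps ahead of both, and a second v2D lands behind the first but still ahead of the rekeys.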
> <br>
> I'm also puzzled by:<br>
<br>
Yes. It seems necessary for now. Maybe there could be a way to call<br>
complete_v2_state_transition().<br></blockquote><div><br></div><div>The processor is most likely broken.<br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> if (IS_CHILD_SA_ESTABLISHED(st)) {<br>
> /*<br>
> * This is a continuation of the delete message.<br>
> * This shortcuts complete_v2_state_transition();<br>
> * calling it for real needs more work.<br>
> */<br>
> pending.cb(ike, st, NULL);<br>
<br>
<br>
Initially I imagined calling complete_v2_state_transition() would be<br>
cleaner; however, it turned out to be harder. There are too many checks<br>
inside complete_v2_state_transition() and success_v2_state_transition() that<br>
would fail, or need hacks or fixes.<br>
<br>
Initiating an INFORMATIONAL v2D does not have an svm entry or a fake md. I am<br>
not sure faking all that is worth it.<br>
<br>
Message ID is already advanced in send_delete() <br></blockquote></div><div class="gmail_quote"><br></div><div class="gmail_quote">As in:</div><div class="gmail_quote"><br></div><div class="gmail_quote"> /*<br> * XXX: The record 'n' send call shouldn't be<br> * needed. Instead, as part of this<br> * transition (live -> being-deleted) the<br> * standard success_v2_transition() code path<br> * should get to do the right thing.<br> *<br> * XXX: The record 'n' send call leads to an<br> * RFC violation. The lack of a state<br> * transition means there's nothing set up to<br> * wait for the ack. And that in turn means<br> * that the next packet will be sent before<br> * this one has had a response.<br> */</div><div class="gmail_quote"><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
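The ordering concern in that XXX comment, sending the next request before this one has been acknowledged, amounts to a window-of-1 check on the IKE SA's Message IDs. A minimal model, with hypothetical names that are not pluto's:

```c
#include <stdbool.h>

/*
 * Illustrative model of an IKEv2 initiator Message-ID window of size 1
 * (requests must be acknowledged in order, per RFC 7296); the struct
 * and function names are stand-ins, not pluto's real ones.
 */
struct msgid_window {
	int sent;  /* Message ID of the last request we initiated */
	int recv;  /* Message ID of the last response we received */
};

/* a new request may go out only once the previous one was answered */
bool can_initiate(const struct msgid_window *w)
{
	return w->sent == w->recv;
}

/* advanced when the request is recorded and sent */
void record_request_sent(struct msgid_window *w)
{
	w->sent++;
}

void record_response_received(struct msgid_window *w, int msgid)
{
	if (msgid == w->sent)
		w->recv = msgid;
}
```

The RFC violation in the comment is then: the raw record 'n' send advances `sent` but sets nothing up to wait for `recv` to catch up, so the next packet goes out while `can_initiate()` should still be false.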
<br>
Now I am thinking a lighter version of complete_v2_state_transition() is<br>
probably ideal for this.<br>
<br>
When completing a v2D, all we really need is the next line:<br>
> v2_msgid_schedule_next_initiator(ike); <br>
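A "lighter" completion along those lines could be little more than that one call. A hedged skeleton, where the structs and the stub of `v2_msgid_schedule_next_initiator()` are illustrative stand-ins for pluto's real types and functions:

```c
#include <stdbool.h>
#include <stddef.h>

/* stand-ins for pluto's state objects, purely for illustration */
struct ike_sa { bool next_initiator_scheduled; };
struct state { bool established; };

static void v2_msgid_schedule_next_initiator(struct ike_sa *ike)
{
	/* stub: in pluto this would kick the next queued exchange */
	ike->next_initiator_scheduled = true;
}

/*
 * Sketch of a lighter completion for a locally initiated v2D: no svm
 * entry, no faked md, no full complete_v2_state_transition(); the
 * Message ID was already advanced in send_delete(), so all that is
 * left is letting the next pending exchange run.
 */
static bool complete_v2d_initiate(struct ike_sa *ike, struct state *st)
{
	if (!st->established)
		return false;  /* predecessor gone; nothing to continue */
	v2_msgid_schedule_next_initiator(ike);
	return true;
}
```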
<br>
> <br>
> The second one seems to be harder to trace. Here are the symptoms that I found so<br>
> far. Rekeying the IPsec SA fails with the following message and the connection<br>
> disappears. If you have auto=route it may come back again; auto=route is a<br>
> band-aid.<br>
> <br>
> Apr 6 13:58:50.367487: | ikev2_child_sa_respond returned<br>
> STF_INTERNAL_ERROR<br>
> <br>
> <br>
> <br>
> Any ideas on what triggered the internal error?<br>
<br>
I know one step further: the line shown below, "EVENT_SA_EXPIRE, timeout in<br>
0 seconds", is the cause. However, I don't know why pluto schedules this<br>
EVENT_SA_EXPIRE.<br>
<br></blockquote></div><div class="gmail_quote"><br></div><div class="gmail_quote">According to the below, the internal error triggered the expire, not the reverse?</div><div class="gmail_quote"><br></div><div class="gmail_quote">That's totally expected.</div><div class="gmail_quote"><br></div><div class="gmail_quote"><br></div><div class="gmail_quote"><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Did you look at the log?<br>
<br>
> <br>
> Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished<br>
> abort this exchange<br>
> <br>
> The internal error seems to have been caused by "EVENT_SA_EXPIRE, timeout in 0 seconds",<br>
> which removed the predecessor state before the rekey sent its message.<br>
><br></blockquote></div><div class="gmail_quote"><br></div><div class="gmail_quote">So there's an STF_INTERNAL_ERROR, and then ...</div><div class="gmail_quote"><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> Apr 6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0<br>
> seconds for #3500<br></blockquote><br></div><div class="gmail_quote"><br></div><div class="gmail_quote"><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> Apr 6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as<br>
> #3699 for IPSEC SA<br>
> Apr 6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state #<br>
> 3500<br>
> <br>
> It seems that when the replace event is hit, the code decided to rekey, and also scheduled an<br>
> EXPIRE. Usually the margin is 70s or so, which is enough to rekey. In this case<br>
> the margin ended up as 0.<br>
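One hypothetical way the margin could collapse to 0 (purely a guess, not a confirmed reading of pluto's code): if the expire timeout is computed from the time remaining until the hard salifetime at the moment EVENT_SA_REPLACE is actually handled, then a replace event that runs late, say behind an event queue backed up by 100 rekeying children, finds nothing left:

```c
/*
 * Illustrative model only, not pluto's code. Assume EVENT_SA_REPLACE
 * is scheduled at salifetime - margin, and when handled the
 * predecessor gets EVENT_SA_EXPIRE for whatever time remains until the
 * hard salifetime. Handled late enough, the remaining time collapses
 * to 0 and the predecessor is removed before the rekey request is sent.
 */
long expire_timeout_at_replace(long salifetime, long handled_at)
{
	long remaining = salifetime - handled_at;
	return remaining > 0 ? remaining : 0;
}
```

With a 300s lifetime and a 70s margin, a replace handled on time (at 230s) leaves 70s before the hard expire; handled at or past 300s, it schedules the expire with a timeout of 0 seconds, matching the log above.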
> <br>
> Below is a link to the full log of the test ikev2-child-rekey-08-deadlock. I ran it<br>
> manually for a longer time. It took 58 minutes before this error occurred.<br>
> <br>
> WARNING: the following log is big, hundreds of megabytes; be careful when clicking on<br>
> the link in a browser. It is better to download it with wget to read it.<br>
> <br>
> <a href="https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/" rel="noreferrer" target="_blank">https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/</a><br>
> west.pluto.log.gz<br>
> <br>
> I let the test run; it happened twice in about 6 hours.<br>
> <br>
> Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished<br>
> abort this exchange<br>
> Apr 6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320<br>
> vanished abort this exchange<br>
> <br>
> Looking back in older logs, Tuomo's logs, I see the same sequence, with the<br>
> connection restarting due to either keying tries or traffic. He is running<br>
> with the default rekey margin and fuzz, while the test has a shorter rekey<br>
> margin. Note it is not just because of a shorter rekey margin or salifetime.<br>
> <br>
> The oldest one seems to have happened here:<br>
> <br>
> "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey #<br>
> 912 vanished abort this exchange"<br>
> <br>
> I am wondering where the bug is: pluto's rekey margin calculation, or some<br>
> other logic forcing the EXPIRE event?<br>
</blockquote></div></div>