[Swan-dev] IKEv2 rekey initiator failures

Antony Antony antony at phenome.org
Wed Apr 8 17:16:08 UTC 2020


On Wed, Apr 08, 2020 at 10:07:43AM -0400, Andrew Cagney wrote:
> 
> 
> On Wed, 8 Apr 2020 at 02:29, Antony Antony <antony at phenome.org> wrote:
> 
>     Hi,
> 
>     I am hunting a couple of corner cases: IKEv2 rekey initiator failures.
>     These issues appear when testing clones. Think of 100 IKEv2 Child SAs under
>     one IKE SA, all being rekeyed. In the test the rekey margin and salifetime
>     are short, but the short values are not the cause: looking back at Tuomo's
>     production system logs, with a larger rekey margin and salifetime, I see
>     the same issues.
> 
>     The first one is possibly fixed:
> 
>     https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/
>     ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz
> 
> 
> 
>     ikev2: insert new v2D message to the tail of pending v2D requests
>    
> It's certainly better:
> 
>     optimized to send out the v2D message before v2_CREATE_CHILD_SA jobs.
>     A delete would also get a quicker response from the other end, because
>     there are no crypto operations (DH, KE, nonce) needed to respond to
>     a v2_INFORMATIONAL request.
>    
> but is it needed?

I noticed that sometimes several rekeys would get queued up, and a delete would 
stay in the queue longer. I think it is best to prioritize v2D ahead of 
CREATE_CHILD_SA. I also can't think of any side effect of prioritizing v2D.

>     this queue may have v2_CREATE_CHILD_SA and/or v2_INFORMATIONAL (v2D)
>     requests. Add a new v2_INFORMATIONAL (v2D) request to the tail of the
>     v2D requests, instead of adding it to the tail of the queue.
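
For illustration, a minimal sketch of that insertion policy; the struct and
function names are invented, this is not pluto's actual pending-queue code.
It assumes the queue is drained from the head and that deletes are only ever
added this way, so the queued v2D jobs stay contiguous at the front:

enum job_kind { JOB_V2D, JOB_CREATE_CHILD_SA };

struct pending_job {
	enum job_kind kind;
	struct pending_job *next;
};

/* Queue a new v2D (delete) job after the last queued v2D job, i.e.
 * ahead of every pending CREATE_CHILD_SA job, instead of appending
 * it to the tail of the whole queue. */
static void queue_v2d(struct pending_job **head, struct pending_job *job)
{
	job->kind = JOB_V2D;

	/* skip the v2D jobs already sitting at the front of the queue */
	struct pending_job **pp = head;
	while (*pp != NULL && (*pp)->kind == JOB_V2D)
		pp = &(*pp)->next;

	/* splice the delete in before the first CREATE_CHILD_SA job */
	job->next = *pp;
	*pp = job;
}
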
> 
> I'm also puzzled by:

Yes. It seems necessary for now. Maybe there is a way to call 
complete_v2_state_transition().

> if (IS_CHILD_SA_ESTABLISHED(st)) {
>         /*
>          * this is a continuation of delete message.
>          * shortcut complete_v2_state_transition()
>          * to call complete_v2_state_transition, need more work
>          */
>         pending.cb(ike, st, NULL);


Initially I imagined calling complete_v2_state_transition() would be 
cleaner; however, it turned out to be harder. There are too many checks 
inside complete_v2_state_transition() and success_v2_state_transition() that 
would fail, or would need hacks or fixes.

Initiating an INFORMATIONAL v2D has no svm entry and no fake md. I am 
not sure faking all of that is worth it.

The Message ID is already advanced in send_delete().

Now I am thinking a lighter version of complete_v2_state_transition() is 
probably ideal for this.

When completing a v2D, all we really need is the next line:
> v2_msgid_schedule_next_initiator(ike); 
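
A minimal sketch of what such a lighter completion could look like; the helper
name and placement are invented, and only the call to
v2_msgid_schedule_next_initiator() is taken from the line above:

/*
 * Hypothetical lighter-weight completion for a locally initiated v2D
 * (INFORMATIONAL delete).  This is illustrative only, not pluto's
 * actual code; it assumes pluto's internal headers for struct ike_sa
 * and v2_msgid_schedule_next_initiator().
 */
static void complete_v2_delete_transition(struct ike_sa *ike)
{
	/*
	 * There is no svm entry and no md for a locally initiated v2D,
	 * and send_delete() already advanced the Message ID, so skip
	 * the full complete_v2_state_transition() machinery and just
	 * let the next queued initiator run.
	 */
	v2_msgid_schedule_next_initiator(ike);
}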
 
> 
>     The second one seems harder to trace. Here are the symptoms I have found
>     so far. Rekeying IPsec fails with the following message and the connection
>     disappears. If you have auto=route it may come back again; auto=route is a
>     band-aid.
> 
>     Apr  6 13:58:50.367487: | ikev2_child_sa_respond returned
>     STF_INTERNAL_ERROR
> 
> 
> 
> Any ideas on what triggered the internal error?

I know one step further: the line shown below, "EVENT_SA_EXPIRE, timeout in 
0 seconds", is the cause. However, I don't know why pluto schedules this 
EVENT_SA_EXPIRE.

Did you look at the log?

> 
>     Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished
>     abort this exchange
> 
>     the internal error seems to be caused by EVENT_SA_EXPIRE, timeout in 0
>     seconds, which removed the predecessor state before the rekey sent its
>     message.
> 
>     Apr  6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0
>     seconds for #3500
> 
>     Apr  6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as
>     #3699 for IPSEC SA
>     Apr  6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state #
>     3500
> 
>     It seems that when the replace event is hit, the code decided to rekey and
>     also scheduled an EXPIRE. Usually the margin is 70s or so, which is enough
>     to rekey. In this case the margin ended up as 0.
> 
>     Below is a link to the full log of the test ikev2-child-rekey-08-deadlock.
>     I ran it manually for a longer time; it took 58 minutes before this error
>     occurred.
> 
>     WARNING: the following log is big, hundreds of megabytes; be careful when
>     clicking on the link in a browser. It is better to download it with wget
>     to read it.
> 
>     https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/
>     west.pluto.log.gz
> 
>     I let the test run; it happened twice in about 6 hours.
> 
>     Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished
>     abort this exchange
>     Apr  6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320
>     vanished abort this exchange
> 
>     Looking back at older logs, Tuomo's logs, I see the same sequence and the
>     connection restarting, either due to keying tries or traffic. He is running
>     with the default rekey margin and fuzz, while the test has a shorter rekey
>     margin. Note that it is not just because of a shorter rekey margin or
>     salifetime.
> 
>     The oldest one seems to have happened in:
> 
>     "Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey #
>     912 vanished abort this exchange"
> 
>     I am wondering where the bug is: in pluto's rekey margin calculation, or
>     in some other logic forcing the EXPIRE event?
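
To make the question concrete, here is an illustrative-only sketch of the kind
of arithmetic involved; the numbers and the program are made up, and this is
not pluto's actual calculation. It only shows one way the EXPIRE delay
scheduled while handling EVENT_SA_REPLACE can come out as 0 seconds, so the
predecessor Child SA vanishes before the rekey exchange completes:

#include <stdio.h>

int main(void)
{
	unsigned salifetime = 60;	/* seconds; short, as in the test */
	unsigned margin     = 20;	/* rekey margin after fuzz */

	/* REPLACE is scheduled 'margin' seconds before the SA expires. */
	unsigned replace_at = salifetime - margin;	/* t = 40s */

	/* Suppose REPLACE actually runs only at t = 60s, e.g. because the
	 * initiator was busy with other queued exchanges. */
	unsigned now = 60;

	/* When REPLACE fires, EXPIRE is scheduled for whatever lifetime is
	 * left; if nothing is left, the delay is 0 and the predecessor is
	 * gone before the CREATE_CHILD_SA rekey response arrives. */
	unsigned expire_in = (now < salifetime) ? salifetime - now : 0;

	printf("replace scheduled at t=%us, ran at t=%us, expire in %us\n",
	       replace_at, now, expire_in);
	return 0;
}

A margin that computes to 0, or fuzz that pushes it past the remaining
lifetime, would give the same symptom even without a late-running event.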

