[Swan-dev] IKEv2 rekey initiator failures

Wed Apr 8 06:28:07 UTC 2020

Hi,

I am hunting a couple of corner cases, IKEv2 rekey initiator failures.  
These issues appear when testing clones. Think of 100 IKEv2 Child SAs under 
one IKE SA and rekeying them all. In the test rekey margin and salife are 
short. Short values do not matter, because looking back in Tuomo production 
systems logs, with larger rekey margin and saflifetime, I see the same 
issues.

First one is possibly fixed.

https://testing.libreswan.org/v3.30-418-g2a5319bd84-master/ikev2-child-rekey-07-deadlock/OUTPUT/west.pluto.log.gz

The pluto log will show the following debugline and rekey response is 
dropped.

Message ID: CHILD #1.#6 XXX: expecting initiator==NULL - suspect 
record'n'send with an out-of-order wrong packet response; discarding packet: 
ike.initiator.sent=7 ike.initiator.recv=6 ike.responder.sent=-1 
ike.responder.recv=-1 child.wip.initiator=5 child.wip.responder=-1

this is due to delete messages jumping the queue and Message ID seems to 
sync up again with next dpd or delete message. I am not sure that is RFC 
complaint.

Second One seems to harder to trace. Here are the symptoms that I found so 
far.  Rekaying IPsec fails with the following message and the connections 
disappears. If you have auto=route it may come back again. auto=route is a 
bandaid.

Apr  6 13:58:50.367487: | ikev2_child_sa_respond returned STF_INTERNAL_ERROR

Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished 
abort this exchange

the internal error seems to caused  EVENT_SA_EXPIRE, timeout in 0 seconds 
which removed predecessor state before rekey send the message.

Apr  6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0 seconds for #3500

Apr  6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as 
#3699 for IPSEC SA
Apr  6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state #3500

It seems when replace is hit, code decided to rekey, and also scheduled an 
EXPIRE. Usually margin is 70s or so Which is enough to rekey. In this case 
it margin end as 0.

Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I ran it 
manually for longer time. It took 58 minutes before this error to occured. 

WARNING: the following log is big 100s Mega bytes, careful when clicking on 
the link in a browser. It is better to download it using wget to read.

https://swantest.libreswan.fi/s2/ikev2-child-rekey-08-deadlock/OUTPUT/west.pluto.log.gz

I let the test run, it happend twice in about 6 hours.

Apr  6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished abort this exchange
Apr  6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320 
vanished abort this exchange

Looking back in older logs, Tuomo's log, I see the same sequence and 
connrection restarting either due to keying tries or traffic. He is running 
with default rekey margin and fuzz, while the test  has shorter rekey 
margin.  Note it is not just because of shorter rekey margin or salifetime.

The oldest seems to hapeend in 

"Apr 18 00:57:47 foo-gw pluto[5139]: "antony-ams" #915: Child SA to rekey #912 vanished abort this exchange"

I am wondering where is the bug? pluto's rekey margin calcuation, or some 
other logic forcing the EXPIRE event.

-antony