[Swan-dev] IKEv2 rekey initiator failures
antony at phenome.org
Wed Apr 8 06:28:07 UTC 2020
I am hunting a couple of corner cases, IKEv2 rekey initiator failures.
These issues appear when testing clones. Think of 100 IKEv2 Child SAs under
one IKE SA and rekeying them all. In the test rekey margin and salife are
short. Short values do not matter, because looking back in Tuomo production
systems logs, with larger rekey margin and saflifetime, I see the same
First one is possibly fixed.
The pluto log will show the following debugline and rekey response is
Message ID: CHILD #1.#6 XXX: expecting initiator==NULL - suspect
record'n'send with an out-of-order wrong packet response; discarding packet:
ike.initiator.sent=7 ike.initiator.recv=6 ike.responder.sent=-1
ike.responder.recv=-1 child.wip.initiator=5 child.wip.responder=-1
this is due to delete messages jumping the queue and Message ID seems to
sync up again with next dpd or delete message. I am not sure that is RFC
Second One seems to harder to trace. Here are the symptoms that I found so
far. Rekaying IPsec fails with the following message and the connections
disappears. If you have auto=route it may come back again. auto=route is a
Apr 6 13:58:50.367487: | ikev2_child_sa_respond returned STF_INTERNAL_ERROR
Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished
abort this exchange
the internal error seems to caused EVENT_SA_EXPIRE, timeout in 0 seconds
which removed predecessor state before rekey send the message.
Apr 6 13:58:50.237455: | inserting event EVENT_SA_EXPIRE, timeout in 0 seconds for #3500
Apr 6 13:58:50.230304: | duplicating state object #3606 "west/100x0" as
#3699 for IPSEC SA
Apr 6 13:58:50.226728: | handling event EVENT_SA_REPLACE for child state #3500
It seems when replace is hit, code decided to rekey, and also scheduled an
EXPIRE. Usually margin is 70s or so Which is enough to rekey. In this case
it margin end as 0.
Bellow is link to full log of test ikev2-child-rekey-08-deadlock. I ran it
manually for longer time. It took 58 minutes before this error to occured.
WARNING: the following log is big 100s Mega bytes, careful when clicking on
the link in a browser. It is better to download it using wget to read.
I let the test run, it happend twice in about 6 hours.
Apr 6 13:58:50.367426: "west/68x0" #3699: CHILD SA to rekey #3500 vanished abort this exchange
Apr 6 17:49:44.228126: "west/39x0" #18513: CHILD SA to rekey #18320
vanished abort this exchange
Looking back in older logs, Tuomo's log, I see the same sequence and
connrection restarting either due to keying tries or traffic. He is running
with default rekey margin and fuzz, while the test has shorter rekey
margin. Note it is not just because of shorter rekey margin or salifetime.
The oldest seems to hapeend in
"Apr 18 00:57:47 foo-gw pluto: "antony-ams" #915: Child SA to rekey #912 vanished abort this exchange"
I am wondering where is the bug? pluto's rekey margin calcuation, or some
other logic forcing the EXPIRE event.
More information about the Swan-dev