[Swan] Fwd: Problem with random rekey failures

Miguel Ponce Antolin mponce at paradigmadigital.com
Thu Jun 17 11:09:47 UTC 2021


Hi Paul,

Thanks so much for your comments.

Finally, I upgraded the Libreswan binary on the Amazon Linux 2 instance from
version 3.25 to 4.4.

We have not separated the connections as you proposed yet, because we want to
test it this way first.

For now it is working well; we are going to test it for a few weeks and check
whether the reconnection problems come back.
That said, it still drops connections after 30 minutes without traffic flow.

This is the log, focused on subnet 1x5: #127 is the connection established by
generating traffic and #135 is the attempt to reconnect. It was filtered with
this bash command:

    grep -e "#135" -e "#127" -e "vpn/1x5" /var/log/libreswan.log
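For future captures, the same filter can be kept as a small helper (a sketch;
`filter_conn_log` is just an illustrative name, and the default path is the
one from this report):

```shell
#!/bin/sh
# filter_conn_log: reduce the pluto debug log to the lines relevant to
# this report: child SA #127 (established with traffic), child SA #135
# (the failed reconnect), and the vpn/1x5 connection itself.
filter_conn_log() {
  # $1 = path to the pluto log (defaults to the path used in this report)
  grep -e '#127' -e '#135' -e 'vpn/1x5' "${1:-/var/log/libreswan.log}"
}
```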

Jun 17 07:41:52.659233: |  conn vpn/1x5 mark 0/00000000, 0/00000000
Jun 17 07:41:52.662959: |  conn vpn/1x5 mark 0/00000000, 0/00000000

### Reconnection of vpn/1x5 after sending traffic:

Jun 17 07:41:58.748356: | find_connection: conn "vpn/1x5" has compatible
peers: xxx.xxx.xxx.80/28 -<all>-> xxx.xxx.xxx.75/32 [pri: 29425674]
Jun 17 07:41:58.748360: | find_connection: first OK "vpn/1x5"
[pri:29425674]{0x55cbb5eb0628} (child none)
Jun 17 07:41:58.748366: | find_connection: concluding with "vpn/1x5"
[pri:29425674]{0x55cbb5eb0628} kind=CK_PERMANENT
Jun 17 07:41:58.748377: | priority calculation of connection "vpn/1x5" is
2082750 (0x1fc7be)
Jun 17 07:41:58.748455: "vpn/1x5": assign_holdpass() delete_bare_shunt()
failed
Jun 17 07:41:58.748490: | creating state object #127 at 0x55cbb5ede168
Jun 17 07:41:58.748494: | State DB: adding IKEv2 state #127 in UNDEFINED
Jun 17 07:41:58.748507: | pstats #127 ikev2.child started
Jun 17 07:41:58.748510: | duplicating state object #115 "vpn/1x18" as #127
for IPSEC SA
Jun 17 07:41:58.748515: | #127 setting local endpoint to xxx.xxx.xxx.92:500
from #115.st_localport (in duplicate_state() at state.c:1492)
Jun 17 07:41:58.748577: | Message ID: CHILD #115.#127 initializing (CHILD
SA): ike.initiator.sent=8 ike.initiator.recv=8
ike.initiator.last_contact=1899656.764788 ike.responder.sent=1240
ike.responder.recv=1240 ike.responder.last_contact=1899579.044082
child.wip.initiator=0->-1 child.wip.responder=0->-1
Jun 17 07:41:58.748583: | child state #127: UNDEFINED(ignore) =>
V2_NEW_CHILD_I0(established IKE SA)
Jun 17 07:41:58.748588: | #127.st_v2_transition NULL ->
V2_NEW_CHILD_I0->V2_NEW_CHILD_I1 (in new_v2_child_state() at state.c:1579)
Jun 17 07:41:58.748596: | using existing local ESP/AH proposals for vpn/1x5
(ESP/AH initiator emitting proposals):
1:ESP=AES_CBC_256-HMAC_SHA2_256_128-MODP2048-DISABLED
Jun 17 07:41:58.748609: | #127 schedule initiate IPsec SA
PSK+ENCRYPT+TUNNEL+PFS+UP+IKE_FRAG_ALLOW+ESN_NO using IKE# 115 pfs=MODP2048
Jun 17 07:41:58.748650: | inserting event EVENT_v2_INITIATE_CHILD, timeout
in 0 seconds for #127
Jun 17 07:41:58.748694: | handling event EVENT_v2_INITIATE_CHILD for child
state #127
Jun 17 07:41:58.748726: | job 196 for #127: Child Initiator KE? and nonce
(dh): adding job to queue
Jun 17 07:41:58.748729: | state #127 has no .st_event to delete
Jun 17 07:41:58.748733: | #127 STATE_V2_NEW_CHILD_I0: retransmits: cleared
Jun 17 07:41:58.748741: | inserting event EVENT_CRYPTO_TIMEOUT, timeout in
60 seconds for #127
Jun 17 07:41:58.748772: | job 196 for #127: Child Initiator KE? and nonce
(dh): helper 1 starting job
Jun 17 07:41:58.750556: | "vpn/1x5" #127: spent 1.78 (1.78) milliseconds in
helper 1 processing job 196 for state #127: Child Initiator KE? and nonce
(dh)
Jun 17 07:41:58.750560: | job 196 for #127: Child Initiator KE? and nonce
(dh): helper thread 1 sending result back to state
Jun 17 07:41:58.750563: | scheduling resume sending helper answer back to
state for #127
Jun 17 07:41:58.750590: | #127 spent 0.0691 (1.9) milliseconds in
timer_event_cb() EVENT_v2_INITIATE_CHILD
Jun 17 07:41:58.750605: | processing resume sending helper answer back to
state for #127
Jun 17 07:41:58.750610: | unsuspending #127 MD (nil)
Jun 17 07:41:58.750614: | job 196 for #127: Child Initiator KE? and nonce
(dh): processing response from helper 1
Jun 17 07:41:58.750617: | job 196 for #127: Child Initiator KE? and nonce
(dh): calling continuation function 0x55cbb573feb0
Jun 17 07:41:58.750621: | ikev2_child_outI_continue() for #127
STATE_V2_NEW_CHILD_I0
Jun 17 07:41:58.750636: | adding CHILD SA #127 to IKE SA #115 message
initiator queue
Jun 17 07:41:58.750644: | Message ID: CHILD #115.#127 wakeing IKE SA for
next initiator (unack 0): ike.initiator.sent=8 ike.initiator.recv=8
ike.initiator.last_contact=1899656.764788 ike.responder.sent=1240
ike.responder.recv=1240 ike.responder.last_contact=1899579.044082
child.wip.initiator=-1 child.wip.responder=-1
Jun 17 07:41:58.750656: | job 196 for #127: Child Initiator KE? and nonce
(dh): calling cleanup function 0x55cbb573fef0
Jun 17 07:41:58.750680: | #127 complete_v2_state_transition()
V2_NEW_CHILD_I0->V2_NEW_CHILD_I1 with status STF_SUSPEND
Jun 17 07:41:58.750693: | #127 spent 0.0822 (0.0822) milliseconds in resume
sending helper answer back to state
Jun 17 07:41:58.750716: | Message ID: CHILD #115.#127 resuming SA using IKE
SA (unack 0): ike.initiator.sent=8 ike.initiator.recv=8
ike.initiator.last_contact=1899656.764788 ike.responder.sent=1240
ike.responder.recv=1240 ike.responder.last_contact=1899579.044082
child.wip.initiator=-1 child.wip.responder=-1
Jun 17 07:41:58.750719: | unsuspending #127 MD (nil)
Jun 17 07:41:58.751818: | #127 complete_v2_state_transition()
V2_NEW_CHILD_I0->V2_NEW_CHILD_I1 with status STF_OK
Jun 17 07:41:58.751825: | Message ID: updating counters for #127
Jun 17 07:41:58.751838: | Message ID: CHILD #115.#127 scheduling
EVENT_RETRANSMIT: ike.initiator.sent=9 ike.initiator.recv=8
ike.initiator.last_contact=1899656.764788 ike.responder.sent=1240
ike.responder.recv=1240 ike.responder.last_contact=1899579.044082
child.wip.initiator=9 child.wip.responder=-1
Jun 17 07:41:58.751847: | inserting event EVENT_RETRANSMIT, timeout in 0.5
seconds for #127
Jun 17 07:41:58.751857: | #127 STATE_V2_NEW_CHILD_I0: retransmits: first
event in 0.5 seconds; timeout in 60 seconds; limit of 12 retransmits;
current time is 1899662.835279
Jun 17 07:41:58.751863: | Message ID: CHILD #115.#127 updating initiator
sent message request 9: ike.initiator.sent=8->9 ike.initiator.recv=8
ike.initiator.last_contact=1899656.764788 ike.responder.sent=1240
ike.responder.recv=1240 ike.responder.last_contact=1899579.044082
child.wip.initiator=-1->9 child.wip.responder=-1
Jun 17 07:41:58.751874: | child state #127: V2_NEW_CHILD_I0(established IKE
SA) => V2_NEW_CHILD_I1(established IKE SA)
Jun 17 07:41:58.751881: "vpn/1x5" #127: sent CREATE_CHILD_SA request for
new IPsec SA
Jun 17 07:41:58.752025: | state #127 deleting .st_event EVENT_CRYPTO_TIMEOUT
Jun 17 07:41:58.788755: | State DB: found IKEv2 state #127 in
V2_NEW_CHILD_I1 (find_v2_sa_by_initiator_wip)
Jun 17 07:41:58.788759: | #127 is idle
Jun 17 07:41:58.788762: | #127 idle
Jun 17 07:41:58.788791: | #127 in state V2_NEW_CHILD_I1: sent
CREATE_CHILD_SA request for new IPsec SA
Jun 17 07:41:58.789357: | forcing ST #127 to CHILD #115.#127 in FSM
processor
Jun 17 07:41:58.789371: | using existing local ESP/AH proposals for vpn/1x5
(CREATE_CHILD_SA initiator accepting remote ESP/AH proposal):
1:ESP=AES_CBC_256-HMAC_SHA2_256_128-MODP2048-DISABLED
Jun 17 07:41:58.789576: | updating #127's .st_oakley with preserved PRF,
but why update?
Jun 17 07:41:58.789658: | job 197 for #127: DH shared secret (dh): adding
job to queue
Jun 17 07:41:58.789662: | state #127 has no .st_event to delete
Jun 17 07:41:58.789666: | #127 requesting EVENT_RETRANSMIT-pe at 0x55cbb5f028d8
be deleted
Jun 17 07:41:58.789683: | #127 STATE_V2_NEW_CHILD_I1: retransmits: cleared
Jun 17 07:41:58.789691: | inserting event EVENT_CRYPTO_TIMEOUT, timeout in
60 seconds for #127
Jun 17 07:41:58.789717: | job 197 for #127: DH shared secret (dh): helper 1
starting job
Jun 17 07:41:58.790543: | "vpn/1x5" #127: spent 0.824 (0.824) milliseconds
in helper 1 processing job 197 for state #127: DH shared secret (dh)
Jun 17 07:41:58.790546: | job 197 for #127: DH shared secret (dh): helper
thread 1 sending result back to state
Jun 17 07:41:58.790550: | scheduling resume sending helper answer back to
state for #127
Jun 17 07:41:58.790571: | #127 spent 0.344 (1.2) milliseconds in
processing: Process CREATE_CHILD_SA IPsec SA Response in v2_dispatch()
Jun 17 07:41:58.790578: | #127 complete_v2_state_transition()
V2_NEW_CHILD_I1->ESTABLISHED_CHILD_SA with status STF_SUSPEND;
.st_v2_transition=V2_NEW_CHILD_I0->V2_NEW_CHILD_I1
Jun 17 07:41:58.790581: | suspending state #127 and saving MD 0x55cbb5f104c8
Jun 17 07:41:58.790593: | #127 is busy; has suspended MD 0x55cbb5f104c8
Jun 17 07:41:58.790620: | processing resume sending helper answer back to
state for #127
Jun 17 07:41:58.790625: | unsuspending #127 MD 0x55cbb5f104c8
Jun 17 07:41:58.790629: | job 197 for #127: DH shared secret (dh):
processing response from helper 1
Jun 17 07:41:58.790633: | job 197 for #127: DH shared secret (dh): calling
continuation function 0x55cbb5740220
Jun 17 07:41:58.790640: | ikev2_child_inR_continue() for #127
STATE_V2_NEW_CHILD_I1
Jun 17 07:41:58.790750: | evaluating our conn="vpn/1x5"
I=xxx.xxx.xxx.80/28:0/0 R=xxx.xxx.xxx.75/32:0/0 to their:
Jun 17 07:41:58.790864: | #127 STATE_V2_NEW_CHILD_I1 add g^ir to child key
0x55cbb5ec5600
Jun 17 07:41:58.796821: | install_ipsec_sa() for #127: inbound and outbound
Jun 17 07:41:58.796825: | could_route called for vpn/1x5; kind=CK_PERMANENT
that.has_client=yes oppo=no this.host_port=500
Jun 17 07:41:58.796833: |  conn vpn/1x5 mark 0/00000000, 0/00000000 vs
Jun 17 07:41:58.796842: |  conn vpn/1x5 mark 0/00000000, 0/00000000 vs
...
Jun 17 07:41:58.796946: |  conn vpn/1x5 mark 0/00000000, 0/00000000 vs
Jun 17 07:41:58.796953: |  conn vpn/1x5 mark 0/00000000, 0/00000000 vs
Jun 17 07:41:58.796963: | route owner of "vpn/1x5" prospective erouted:
self; eroute owner: self
Jun 17 07:41:58.797003: | NIC esp-hw-offload not for connection 'vpn/1x5'
not available on interface eth0
Jun 17 07:41:58.798009: | NIC esp-hw-offload not for connection 'vpn/1x5'
not available on interface eth0
Jun 17 07:41:58.798142: | priority calculation of connection "vpn/1x5" is
2082750 (0x1fc7be)
Jun 17 07:41:58.798280: | sr for #127: prospective erouted
Jun 17 07:41:58.798293: |  conn vpn/1x5 mark 0/00000000, 0/00000000 vs
...
Jun 17 07:41:58.798412: |  conn vpn/1x5 mark 0/00000000, 0/00000000 vs
Jun 17 07:41:58.798419: | route owner of "vpn/1x5" prospective erouted:
self; eroute owner: self
Jun 17 07:41:58.798424: | route_and_eroute with c: vpn/1x5 (next: none)
ero:vpn/1x5 esr:{(nil)} ro:vpn/1x5 rosr:{(nil)} and state: #127
Jun 17 07:41:58.798431: | priority calculation of connection "vpn/1x5" is
2082750 (0x1fc7be)
Jun 17 07:41:58.798896: | executing up-client: 2>&1 PLUTO_VERB='up-client'
PLUTO_CONNECTION='vpn/1x5' PLUTO_CONNECTION_TYPE='tunnel'
PLUTO_VIRT_INTERFACE='NULL' PLUTO_INTERFACE='eth0' PLUTO_XFRMI_ROUTE=''
PLUTO_NEXT_HOP='xxx.xxx.xxx.81' PLUTO_ME='xxx.xxx.xxx.92'
PLUTO_MY_ID='xxx.xxx.xxx.120' PLUTO_MY_CLIENT='xxx.xxx.xxx.80/28'
PLUTO_MY_CLIENT_NET='xxx.xxx.xxx.80' PLUTO_MY_CLIENT_MASK='xxx.xxx.xxx.240'
PLUTO_MY_PORT='0' PLUTO_MY_PROTOCOL='0' PLUTO_SA_REQID='16404'
PLUTO_SA_TYPE='ESP' PLUTO_PEER='xxx.xxx.xxx.45'
PLUTO_PEER_ID='xxx.xxx.xxx.45' PLUTO_PEER_CLIENT='xxx.xxx.xxx.75/32'
PLUTO_PEER_CLIENT_NET='xxx.xxx.xxx.75'
PLUTO_PEER_CLIENT_MASK='xxx.xxx.xxx.255' PLUTO_PEER_PORT='0'
PLUTO_PEER_PROTOCOL='0' PLUTO_PEER_CA='' PLUTO_STACK='xfrm'
PLUTO_ADDTIME='1623908518'
PLUTO_CONN_POLICY='PSK+ENCRYPT+TUNNEL+PFS+UP+IKE_FRAG_ALLOW+ESN_NO'
PLUTO_CONN_KIND='CK_PERMANENT' PLUTO_CONN_ADDRFAMILY='ipv4' XAUTH_FAILED=0
PLUTO_MY_SOURCEIP='xxx.xxx.xxx.92' PLUTO_MOBIKE_EVENT=''
PLUTO_IS_PEER_CISCO='0' PLUTO_PEER_DNS_INFO='' PLUTO_PEER_DOMAIN_INFO=''
PLUTO_PEER_Bxxx.xxx.xxx.
Jun 17 07:41:58.798910: | cmd(   0):2>&1 PLUTO_VERB='up-client'
PLUTO_CONNECTION='vpn/1x5' PLUTO_CONNECTION_TYPE='tu:
Jun 17 07:41:58.814208: | route_and_eroute: instance "vpn/1x5", setting
eroute_owner {spd=0x55cbb5eb0788,sr=0x55cbb5eb0788} to #127 (was #0)
(newest_ipsec_sa=#0)
Jun 17 07:41:58.814293: | inR2: instance vpn/1x5[0], setting IKEv2
newest_ipsec_sa to #127 (was #0) (spd.eroute=#127) cloned from #115
Jun 17 07:41:58.814370: | job 197 for #127: DH shared secret (dh): calling
cleanup function 0x55cbb5740840
Jun 17 07:41:58.814792: | #127 complete_v2_state_transition()
V2_NEW_CHILD_I1->ESTABLISHED_CHILD_SA with status STF_OK;
.st_v2_transition=V2_NEW_CHILD_I0->V2_NEW_CHILD_I1
Jun 17 07:41:58.814930: | Message ID: updating counters for #127
Jun 17 07:41:58.815001: | Message ID: CHILD #115.#127 XXX: no
EVENT_RETRANSMIT to clear; suspect IKE->CHILD switch: ike.initiator.sent=9
ike.initiator.recv=8 ike.initiator.last_contact=1899656.764788
ike.responder.sent=1240 ike.responder.recv=1240
ike.responder.last_contact=1899579.044082 child.wip.initiator=-1
child.wip.responder=-1
Jun 17 07:41:58.815098: | Message ID: CHILD #115.#127 updating initiator
received message response 9: ike.initiator.sent=9 ike.initiator.recv=8->9
ike.initiator.last_contact=1899656.764788->1899662.898429
ike.responder.sent=1240 ike.responder.recv=1240
ike.responder.last_contact=1899579.044082 child.wip.initiator=9->-1
child.wip.responder=-1
Jun 17 07:41:58.815184: | Message ID: CHILD #115.#127 skipping update_send
as nothing to send: ike.initiator.sent=9 ike.initiator.recv=9
ike.initiator.last_contact=1899662.898429 ike.responder.sent=1240
ike.responder.recv=1240 ike.responder.last_contact=1899579.044082
child.wip.initiator=-1 child.wip.responder=-1
Jun 17 07:41:58.815332: | child state #127: V2_NEW_CHILD_I1(established IKE
SA) => ESTABLISHED_CHILD_SA(established CHILD SA)
Jun 17 07:41:58.815399: | pstats #127 ikev2.child established
Jun 17 07:41:58.815530: "vpn/1x5" #127: negotiated connection
[xxx.xxx.xxx.80-xxx.xxx.xxx.95:0-65535 0] ->
[xxx.xxx.xxx.75-xxx.xxx.xxx.75:0-65535 0]
Jun 17 07:41:58.815666: "vpn/1x5" #127: IPsec SA established tunnel mode
{ESP=>0xa875208f <0x670fba52 xfrm=AES_CBC_256-HMAC_SHA2_256_128-MODP2048
NATOA=none NATD=none DPD=active}
Jun 17 07:41:58.815738: | releasing #127's fd-fd@(nil) because IKEv2
transitions finished
Jun 17 07:41:58.815923: | unpending #127's IKE SA #115
Jun 17 07:41:58.815982: | unpending state #115 connection "vpn/1x5"
Jun 17 07:41:58.816226: | #127 will start re-keying in 85763 seconds with
margin of 637 seconds (attempting re-key)
Jun 17 07:41:58.816301: | state #127 deleting .st_event EVENT_CRYPTO_TIMEOUT
Jun 17 07:41:58.816581: | inserting event EVENT_SA_REKEY, timeout in 85763
seconds for #127
Jun 17 07:41:58.816913: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 07:41:58.817294: | #127 spent 4.1 (26.7) milliseconds in resume
sending helper answer back to state
Jun 17 07:42:04.574695: |  conn vpn/1x5 mark 0/00000000, 0/00000000
...
Jun 17 07:42:23.814086: |  conn vpn/1x5 mark 0/00000000, 0/00000000
Jun 17 07:42:28.791618: | handling event EVENT_v2_LIVENESS for child state
#127
Jun 17 07:42:28.791676: | liveness: #127 scheduling next check for
xxx.xxx.xxx.45:500 in 30 seconds (successful exchange was 4.958916 seconds
ago)
Jun 17 07:42:28.791687: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 07:42:28.791704: | #127 spent 0.122 (0.122) milliseconds in
timer_event_cb() EVENT_v2_LIVENESS
Jun 17 07:42:38.750512: |  conn vpn/1x5 mark 0/00000000, 0/00000000
Jun 17 07:42:38.754174: |  conn vpn/1x5 mark 0/00000000, 0/00000000
Jun 17 07:42:43.047564: | pending review: connection "vpn/1x5" checked
Jun 17 07:42:58.796286: | handling event EVENT_v2_LIVENESS for child state
#127
Jun 17 07:42:58.796352: | liveness: #127 scheduling next check for
xxx.xxx.xxx.45:500 in 30 seconds (successful exchange was 19.227116 seconds
ago)
Jun 17 07:42:58.796363: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 07:42:58.796391: | #127 spent 0.148 (0.147) milliseconds in
timer_event_cb() EVENT_v2_LIVENESS
Jun 17 07:43:28.799819: | handling event EVENT_v2_LIVENESS for child state
#127
Jun 17 07:43:28.799877: | liveness: #127 scheduling next check for
xxx.xxx.xxx.45:500 in 30 seconds (successful exchange was 19.229236 seconds
ago)
Jun 17 07:43:28.799888: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 07:43:28.799906: | #127 spent 0.129 (0.128) milliseconds in
timer_event_cb() EVENT_v2_LIVENESS
Jun 17 07:43:58.803738: | handling event EVENT_v2_LIVENESS for child state
#127
Jun 17 07:43:58.804125: | liveness: #127 scheduling next check for
xxx.xxx.xxx.45:500 in 30 seconds (successful exchange was 19.231563 seconds
ago)
Jun 17 07:43:58.804317: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 07:43:58.804509: | #127 spent 0.211 (1) milliseconds in
timer_event_cb() EVENT_v2_LIVENESS
Jun 17 07:44:28.807474: | handling event EVENT_v2_LIVENESS for child state
#127
Jun 17 07:44:28.807826: | liveness: #127 scheduling next check for
xxx.xxx.xxx.45:500 in 30 seconds (successful exchange was 19.234285 seconds
ago)
Jun 17 07:44:28.808034: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 07:44:28.808223: | #127 spent 0.294 (0.989) milliseconds in
timer_event_cb() EVENT_v2_LIVENESS
Jun 17 07:44:43.049301: | pending review: connection "vpn/1x5" checked
Jun 17 07:44:58.811299: | handling event EVENT_v2_LIVENESS for child state
#127
...

### After 30 minutes, the disconnection of subnet vpn/1x5 begins

Jun 17 08:11:59.028357: | handling event EVENT_v2_LIVENESS for child state
#127
Jun 17 08:11:59.028422: | liveness: #127 scheduling next check for
xxx.xxx.xxx.45:500 in 30 seconds (successful exchange was 19.366489 seconds
ago)
Jun 17 08:11:59.028433: | inserting event EVENT_v2_LIVENESS, timeout in 30
seconds for #127
Jun 17 08:11:59.028450: | #127 spent 0.134 (0.133) milliseconds in
timer_event_cb() EVENT_v2_LIVENESS
Jun 17 08:12:03.187220: | v2 CHILD SA #127 found using their inbound (our
outbound) SPI, in STATE_V2_ESTABLISHED_CHILD_SA
Jun 17 08:12:03.187223: | State DB: found IKEv2 state #127 in
ESTABLISHED_CHILD_SA (find_v2_child_sa_by_outbound_spi)
Jun 17 08:12:03.187232: "vpn/1x18" #115: received Delete SA payload:
replace CHILD SA #127 now
Jun 17 08:12:03.187237: | #127 requesting EVENT_SA_REKEY-pe at 0x55cbb5ee7578
be deleted
Jun 17 08:12:03.187257: | inserting event EVENT_SA_REPLACE, timeout in 0
seconds for #127
Jun 17 08:12:03.187712: | handling event EVENT_SA_REPLACE for child state
#127
Jun 17 08:12:03.187726: | picked newest_ipsec_sa #127 for #127

### A new state object is created, but it will not reconnect phase 2

Jun 17 08:12:03.187755: | creating state object #135 at 0x55cbb5ed54a8
Jun 17 08:12:03.187759: | State DB: adding IKEv2 state #135 in UNDEFINED
Jun 17 08:12:03.187770: | pstats #135 ikev2.child started

### I assume that here subnet 1x5 is trying to initiate phase 1 using IKE SA
### #115, which is attached to vpn/1x18

Jun 17 08:12:03.187774: | duplicating state object #115 "vpn/1x18" as #135
for IPSEC SA
Jun 17 08:12:03.187779: | #135 setting local endpoint to xxx.xxx.xxx.92:500
from #115.st_localport (in duplicate_state() at state.c:1492)
Jun 17 08:12:03.187824: | Message ID: CHILD #115.#135 initializing (CHILD
SA): ike.initiator.sent=14 ike.initiator.recv=14
ike.initiator.last_contact=1901443.745334 ike.responder.sent=1243
ike.responder.recv=1243 ike.responder.last_contact=1901467.27103
child.wip.initiator=0->-1 child.wip.responder=0->-1
Jun 17 08:12:03.187829: | child state #135: UNDEFINED(ignore) =>
V2_REKEY_CHILD_I0(established IKE SA)
Jun 17 08:12:03.187834: | #135.st_v2_transition NULL ->
V2_REKEY_CHILD_I0->V2_REKEY_CHILD_I1 (in new_v2_child_state() at
state.c:1579)
Jun 17 08:12:03.187842: | using existing local ESP/AH proposals for vpn/1x5
(ESP/AH initiator emitting proposals):
1:ESP=AES_CBC_256-HMAC_SHA2_256_128-MODP2048-DISABLED
Jun 17 08:12:03.187854: | #135 schedule rekey initiate IPsec SA
PSK+ENCRYPT+TUNNEL+PFS+UP+IKE_FRAG_ALLOW+ESN_NO to replace #127 using IKE#
115 pfs=MODP2048
Jun 17 08:12:03.187866: | inserting event EVENT_v2_INITIATE_CHILD, timeout
in 0 seconds for #135
Jun 17 08:12:03.187874: | #127 requesting
EVENT_v2_LIVENESS-pe at 0x55cbb5ebef58 be deleted
Jun 17 08:12:03.187889: | inserting event EVENT_SA_EXPIRE, timeout in 0
seconds for #127
Jun 17 08:12:03.187900: | #127 spent 0.19 (0.19) milliseconds in
timer_event_cb() EVENT_SA_REPLACE
Jun 17 08:12:03.187911: | handling event EVENT_v2_INITIATE_CHILD for child
state #135
Jun 17 08:12:03.187937: | job 210 for #135: Child Rekey Initiator KE and
nonce ni (dh): adding job to queue
Jun 17 08:12:03.187940: | state #135 has no .st_event to delete
Jun 17 08:12:03.187944: | #135 STATE_V2_REKEY_CHILD_I0: retransmits: cleared
Jun 17 08:12:03.187951: | inserting event EVENT_CRYPTO_TIMEOUT, timeout in
60 seconds for #135
Jun 17 08:12:03.190011: | job 210 for #135: Child Rekey Initiator KE and
nonce ni (dh): helper 1 starting job
Jun 17 08:12:03.191762: | "vpn/1x5" #135: spent 1.75 (1.75) milliseconds in
helper 1 processing job 210 for state #135: Child Rekey Initiator KE and
nonce ni (dh)
Jun 17 08:12:03.191765: | job 210 for #135: Child Rekey Initiator KE and
nonce ni (dh): helper thread 1 sending result back to state
Jun 17 08:12:03.191769: | scheduling resume sending helper answer back to
state for #135
Jun 17 08:12:03.192169: | #135 spent 0.0656 (4.26) milliseconds in
timer_event_cb() EVENT_v2_INITIATE_CHILD
Jun 17 08:12:03.192376: | handling event EVENT_SA_EXPIRE for child state
#127
Jun 17 08:12:03.192670: | picked newest_ipsec_sa #127 for #127
Jun 17 08:12:03.192837: | pstats #127 ikev2.child re-failed exchange-timeout
Jun 17 08:12:03.193157: "vpn/1x5" #127: deleting state
(STATE_V2_ESTABLISHED_CHILD_SA) aged 1804.444672s and NOT sending
notification
Jun 17 08:12:03.193269: | child state #127:
ESTABLISHED_CHILD_SA(established CHILD SA) => delete
Jun 17 08:12:03.193361: | pstats #127 ikev2.child deleted completed
Jun 17 08:12:03.193434: | #127 main thread spent 15.7 (63.8) milliseconds
helper thread spent 2.61 (2.61) milliseconds in total
Jun 17 08:12:03.193619: "vpn/1x5" #127: ESP traffic information: in=0B
out=60B
Jun 17 08:12:03.193717: | unsuspending #127 MD (nil)
Jun 17 08:12:03.193909: | child state #127:
ESTABLISHED_CHILD_SA(established CHILD SA) => CHILDSA_DEL(informational)
Jun 17 08:12:03.194008: | state #127 has no .st_event to delete
Jun 17 08:12:03.194017: | #127 STATE_CHILDSA_DEL: retransmits: cleared
Jun 17 08:12:03.194095: | executing down-client: 2>&1
PLUTO_VERB='down-client' PLUTO_CONNECTION='vpn/1x5'
PLUTO_CONNECTION_TYPE='tunnel' PLUTO_VIRT_INTERFACE='NULL'
PLUTO_INTERFACE='eth0' PLUTO_XFRMI_ROUTE='' PLUTO_NEXT_HOP='xxx.xxx.xxx.81'
PLUTO_ME='xxx.xxx.xxx.92' PLUTO_MY_ID='xxx.xxx.xxx.120'
PLUTO_MY_CLIENT='xxx.xxx.xxx.80/28' PLUTO_MY_CLIENT_NET='xxx.xxx.xxx.80'
PLUTO_MY_CLIENT_MASK='xxx.xxx.xxx.240' PLUTO_MY_PORT='0'
PLUTO_MY_PROTOCOL='0' PLUTO_SA_REQID='16404' PLUTO_SA_TYPE='ESP'
PLUTO_PEER='xxx.xxx.xxx.45' PLUTO_PEER_ID='xxx.xxx.xxx.45'
PLUTO_PEER_CLIENT='xxx.xxx.xxx.75/32'
PLUTO_PEER_CLIENT_NET='xxx.xxx.xxx.75'
PLUTO_PEER_CLIENT_MASK='xxx.xxx.xxx.255' PLUTO_PEER_PORT='0'
PLUTO_PEER_PROTOCOL='0' PLUTO_PEER_CA='' PLUTO_STACK='xfrm'
PLUTO_ADDTIME='1623908518'
PLUTO_CONN_POLICY='PSK+ENCRYPT+TUNNEL+PFS+UP+IKE_FRAG_ALLOW+ESN_NO'
PLUTO_CONN_KIND='CK_PERMANENT' PLUTO_CONN_ADDRFAMILY='ipv4' XAUTH_FAILED=0
PLUTO_MY_SOURCEIP='xxx.xxx.xxx.92' PLUTO_MOBIKE_EVENT=''
PLUTO_IS_PEER_CISCO='0' PLUTO_PEER_DNS_INFO='' PLUTO_PEER_DOMAIN_INFO=''
PLUTO_PExxx.xxx.xxx.
Jun 17 08:12:03.194105: | cmd(   0):2>&1 PLUTO_VERB='down-client'
PLUTO_CONNECTION='vpn/1x5' PLUTO_CONNECTION_TYPE=':
Jun 17 08:12:03.205133: | shunt_eroute() called for connection 'vpn/1x5' to
'replace with shunt' for rt_kind 'prospective erouted' using protoports
xxx.xxx.xxx.80/28 --0->- xxx.xxx.xxx.75/32
Jun 17 08:12:03.205381: | priority calculation of connection "vpn/1x5" is
2082750 (0x1fc7be)
Jun 17 08:12:03.206065: | priority calculation of connection "vpn/1x5" is
2082750 (0x1fc7be)
Jun 17 08:12:03.206868: | State DB: deleting IKEv2 state #127 in CHILDSA_DEL
Jun 17 08:12:03.206969: | connection vpn/1x5 is not an instance, skipping
delete-unused
Jun 17 08:12:03.207070: | child state #127: CHILDSA_DEL(informational) =>
UNDEFINED(ignore)
Jun 17 08:12:03.207162: | releasing #127's fd-fd@(nil) because deleting
state
Jun 17 08:12:03.209170: | State DB: found IKEv2 state #135 in
V2_REKEY_CHILD_I0 (v2_expire_unused_ike_sa)
Jun 17 08:12:03.209238: | can't expire unused IKE SA #115; it has the child
#135
Jun 17 08:12:03.209305: | in statetime_stop() and could not find #127
Jun 17 08:12:03.209605: | processing resume sending helper answer back to
state for #135
Jun 17 08:12:03.209667: | unsuspending #135 MD (nil)
Jun 17 08:12:03.209725: | job 210 for #135: Child Rekey Initiator KE and
nonce ni (dh): processing response from helper 1
Jun 17 08:12:03.209790: | job 210 for #135: Child Rekey Initiator KE and
nonce ni (dh): calling continuation function 0x55cbb573feb0
Jun 17 08:12:03.209798: | ikev2_child_outI_continue() for #135
STATE_V2_REKEY_CHILD_I0
Jun 17 08:12:03.209820: | adding CHILD SA #135 to IKE SA #115 message
initiator queue
Jun 17 08:12:03.209829: | Message ID: CHILD #115.#135 wakeing IKE SA for
next initiator (unack 0): ike.initiator.sent=14 ike.initiator.recv=14
ike.initiator.last_contact=1901443.745334 ike.responder.sent=1243
ike.responder.recv=1243 ike.responder.last_contact=1901467.27103
child.wip.initiator=-1 child.wip.responder=-1
Jun 17 08:12:03.209844: | job 210 for #135: Child Rekey Initiator KE and
nonce ni (dh): calling cleanup function 0x55cbb573fef0
Jun 17 08:12:03.209868: | #135 complete_v2_state_transition()
V2_REKEY_CHILD_I0->V2_REKEY_CHILD_I1 with status STF_SUSPEND
Jun 17 08:12:03.209883: | #135 spent 0.115 (0.215) milliseconds in resume
sending helper answer back to state
Jun 17 08:12:03.209930: | Message ID: CHILD #115.#135 resuming SA using IKE
SA (unack 0): ike.initiator.sent=14 ike.initiator.recv=14
ike.initiator.last_contact=1901443.745334 ike.responder.sent=1243
ike.responder.recv=1243 ike.responder.last_contact=1901467.27103
child.wip.initiator=-1 child.wip.responder=-1

### At this point the reconnection attempt fails with STF_INTERNAL_ERROR;
### with version 4.4 it correctly deletes the failed state #135

Jun 17 08:12:03.209934: | unsuspending #135 MD (nil)
Jun 17 08:12:03.210236: "vpn/1x5" #135: CHILD SA to rekey #127 vanished
abort this exchange
Jun 17 08:12:03.210243: | #135 complete_v2_state_transition()
V2_REKEY_CHILD_I0->V2_REKEY_CHILD_I1 with status STF_INTERNAL_ERROR
Jun 17 08:12:03.210247: "vpn/1x5" #135: state transition function for
STATE_V2_REKEY_CHILD_I0 had internal error
Jun 17 08:12:03.210251: | release_pending_whacks: state #135 has no whack fd
Jun 17 08:12:43.047717: | pending review: connection "vpn/1x5" checked
Jun 17 08:13:03.191120: | handling event EVENT_CRYPTO_TIMEOUT for child
state #135
Jun 17 08:13:03.191135: | event crypto_failed on state #135, aborting
Jun 17 08:13:03.191139: | pstats #135 ikev2.child failed crypto-timeout
Jun 17 08:13:03.191148: "vpn/1x5" #135: deleting state
(STATE_V2_REKEY_CHILD_I0) aged 60.003398s and NOT sending notification
Jun 17 08:13:03.191151: | child state #135: V2_REKEY_CHILD_I0(established
IKE SA) => delete
Jun 17 08:13:03.191155: | pstats #135 ikev2.child deleted crypto-timeout
Jun 17 08:13:03.191160: | #135 main thread spent 0.18 (4.47) milliseconds
helper thread spent 1.75 (1.75) milliseconds in total
Jun 17 08:13:03.191164: | unsuspending #135 MD (nil)
Jun 17 08:13:03.191171: | child state #135: V2_REKEY_CHILD_I0(established
IKE SA) => CHILDSA_DEL(informational)
Jun 17 08:13:03.191175: | state #135 has no .st_event to delete
Jun 17 08:13:03.191178: | #135 STATE_CHILDSA_DEL: retransmits: cleared
Jun 17 08:13:03.191183: | priority calculation of connection "vpn/1x5" is
2082750 (0x1fc7be)
Jun 17 08:13:03.191214: | State DB: deleting IKEv2 state #135 in CHILDSA_DEL
Jun 17 08:13:03.191221: | connection vpn/1x5 is not an instance, skipping
delete-unused
Jun 17 08:13:03.191225: | child state #135: CHILDSA_DEL(informational) =>
UNDEFINED(ignore)
Jun 17 08:13:03.191231: | releasing #135's fd-fd@(nil) because deleting
state
Jun 17 08:13:03.191329: | in statetime_stop() and could not find #135
Jun 17 08:14:00.555170: |   investigating connection "vpn/1x5" as a better
match
Jun 17 08:14:00.555185: | evaluating our conn="vpn/1x5"
I=xxx.xxx.xxx.75/32:0/0 R=xxx.xxx.xxx.80/28:0/0 to their:
Jun 17 08:14:00.567265: |  conn vpn/1x5 mark 0/00000000, 0/00000000
Jun 17 08:14:00.572021: |  conn vpn/1x5 mark 0/00000000, 0/00000000
Jun 17 08:14:43.047625: | pending review: connection "vpn/1x5" checked
Jun 17 08:16:43.048038: | pending review: connection "vpn/1x5" checked
Jun 17 08:18:43.048999: | pending review: connection "vpn/1x5" checked
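As a side note, when reading a capture like this it helps to see at a glance
when a state appeared and when it was torn down; a tiny helper along these
lines can do that (a sketch; `state_span` is just an illustrative name):

```shell
#!/bin/sh
# state_span: print the first and last log line that mention a given
# state number, i.e. roughly when the state was created and deleted.
# (If only one line matches, it is printed twice.)
state_span() {
  # $1 = path to the pluto log, $2 = state number without the leading '#'
  grep -F "#$2" "$1" | sed -n '1p;$p'
}
```

For example, `state_span /var/log/libreswan.log 135` would bracket the short
life of the failed rekey state.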

This log explains itself much better compared to older versions; thanks for
the effort that went into this improvement.

Best regards!

On Wed, 16 Jun 2021 at 21:14, Paul Wouters (<paul at nohats.ca>) wrote:

> On Wed, 16 Jun 2021, Miguel Ponce Antolin wrote:
>
> > Some questions that came to mind with the upgrade option,
> > - Is it still necessary to separate the rightsubnets? And do you create
> > them in different files? I have understood that you create them in the
> > same conf file.
>
> I would first try the upgrade and see if your problem remains. If it
> does, try separating the conns. It can be in 1 file.
>
> > - Are the ikelifetime and salifetime for rekeying still a problem in
> > version 4.4-1? I think tweaking them is recommended anyway.
>
> It is not so much a code problem as an interop/configuration
> problem. Tweaking the lifetimes tweaks who will decide to rekey
> first. That can work around implementation bugs. We are not aware
> of libreswan having a bug here. I just gave you two methods that
> ensure either libreswan always rekeys, or libreswan never rekeys.
> That usually works around any other implementation bugs.
>
> Paul
>
> > Thanks again,
> >
> > Best Regards!
> >
> >
> > On Tue, 15 Jun 2021 at 17:40, Paul Wouters (<paul at nohats.ca>) wrote:
> >       On Tue, 15 Jun 2021, Miguel Ponce Antolin wrote:
> >
> >       > I have been suffering a random problem with libreswan v3.25 when
> connecting an AWS EC2 Instance running Libreswan and a Cisco ASA on the
> >       other end.
> >
> >       Is it possible to test v4.4? We have rpms built on
> download.libreswan.org/binaries/
> >
> >       Specifically, with the many subnets you are likely needing this
> fix from 4.4:
> >
> >       * IKEv2: Connections would not always switch when needed
> [Andrew/Paul]
> >
> >       But the changelog between 3.25 and 4.4 is huge. There might be
> other
> >       items you need too.
> >
> >       Alternatively, you can try and split up your subnets into
> different
> >       conns, eg:
> >
> >
> >               conn vpn
> >                   type=tunnel
> >                   authby=secret
> >                   # use auto=ignore, will be read in via also= statements
> >                   auto=ignore
> >                   left=%defaultroute
> >                   leftid=xxx.xxx.xxx.120
> >                   leftsubnets=xxx.xxx.xxx.80/28
> >                   right=xxx.xxx.xxx.45
> >                   rightid=xxx.xxx.xxx.45
> >                   # no rightsubnet= here
> >                   # don't use this with more than one subnet...
> >                   leftsourceip=xxx.xxx.xxx.92
> >                   ikev2=insist
> >                   ike=aes256-sha2;dh14
> >                   esp=aes256-sha256
> >                   keyexchange=ike
> >                   ikelifetime=28800s
> >                   salifetime=28800s
> >                   dpddelay=30
> >                   dpdtimeout=120
> >                   dpdaction=restart
> >                   encapsulation=no
> >
> >              conn vpn-1
> >               also=vpn
> >               auto=start
> >               rightsubnet=10.subnet.1.0/22
> >
> >              conn vpn-2
> >               also=vpn
> >               auto=start
> >               rightsubnet=10.subnet.2.0/20
> >
> >              [...]
> >
> >              conn vpn-18
> >               also=vpn
> >               auto=start
> >               rightsubnet=10.subnet.18.9/32
> >
> >
> >       This uses a slightly different code path to get all the tunnels
> loaded and active.
> >
> >       > We tried to "force" the reconnect using the ping command to an IP
> >       > in various rightsubnets, but when the problem is active we are
> >       > continuously seeing this kind of logs:
> >
> >       That would be hacky and not really solve race conditions.
> >
> >       > Jun 11 11:17:25.795153: "vpn/1x15" #221: message id deadlock? wait
> >       > sending, add to send next list using parent #165 unacknowledged 1
> >       > next message id=63 ike exchange window 1
> >
> >       Note that this is a bit of a concern. You can only have one IKE
> message
> >       outstanding, and this indicates that the Cisco might not be
> answering
> >       that outstanding message, and so the only thing libreswan can do is
> >       wait longer or restart _everything_ related to that IKE SA, so that
> >       means all tunnels. We did reduce the chance of message id deadlock
> >       at some point in the past with our pending() code, so again testing
> >       with an upgraded libreswan would be a useful test.
> >
> >       > Is there any troubleshooting we could do in order to know where
> the rekey request is lost or why is not trying to rekey at all when this
> >       problem is
> >       > active?
> >
> >       Depending on what the issues are, you can try to ensure either
> libreswan
> >       or Cisco is always the rekey initiator by tweaking the ikelifetime
> and
> >       salifetime. Eg try ikelifetime=24h with salifetime=8h and most
> likely
> >       Cisco will trigger all the rekeys. Or use ikelifetime=2h and
> >       salifetime=1h to make libreswan likely always initiate the rekeys.
> >
> >       Paul
> >
> >
> >
> > --
> >
> > Miguel Ponce Antolín.
> >
>


-- 

*Miguel Ponce Antolín.*

