[Swan] Fwd: Problem with random rekey failures

Tue Jun 15 15:40:35 UTC 2021

On Tue, 15 Jun 2021, Miguel Ponce Antolin wrote:

> I have been suffering a random problem with libreswan v3.25 when connecting an AWS EC2 Instance running Libreswan and a Cisco ASA on the other end.

Is it possible to test v4.4? We have RPMs built at download.libreswan.org/binaries/

Specifically, with that many subnets you likely need this fix from 4.4:

* IKEv2: Connections would not always switch when needed [Andrew/Paul]

But the changelog between 3.25 and 4.4 is huge. There might be other
items you need too.

Alternatively, you can try to split up your subnets into different
conns, e.g.:

        conn vpn
            type=tunnel
            authby=secret
            # use auto=ignore, will be read in via also= statements
            auto=ignore
            left=%defaultroute
            leftid=xxx.xxx.xxx.120
            leftsubnets=xxx.xxx.xxx.80/28
            right=xxx.xxx.xxx.45
            rightid=xxx.xxx.xxx.45
            # no rightsubnet= here
            # don't use this with more than one subnet...    leftsourceip=xxx.xxx.xxx.92
            ikev2=insist
            ike=aes256-sha2;dh14
            esp=aes256-sha256
            keyexchange=ike
            ikelifetime=28800s
            salifetime=28800s
            dpddelay=30
            dpdtimeout=120
            dpdaction=restart
            encapsulation=no

        conn vpn-1
            also=vpn
            auto=start
            rightsubnet=10.subnet.1.0/22

        conn vpn-2
            also=vpn
            auto=start
            rightsubnet=10.subnet.2.0/20

        [...]

        conn vpn-18
            also=vpn
            auto=start
            rightsubnet=10.subnet.18.9/32

This uses a slightly different code path to get all the tunnels loaded and active.

> We tried to "force" a reconnect using the ping command to an IP in various rightsubnets but when the problem is active we are continuously seeing this
> kind of logs:

That would be hacky and would not really solve the race conditions.

> Jun 11 11:17:25.795153: "vpn/1x15" #221: message id deadlock? wait sending, add to send next list using parent #165 unacknowledged 1 next message
> id=63 ike exchange window 1

Note that this is a bit of a concern. With an IKE exchange window of 1,
you can only have one IKE message outstanding. This log line indicates
that the Cisco might not be answering that outstanding message, so the
only thing libreswan can do is wait longer or restart _everything_
related to that IKE SA, which means all tunnels. We did reduce the
chance of a message id deadlock at some point in the past with our
pending() code, so again, testing with an upgraded libreswan would be
useful.

> Is there any troubleshooting we could do in order to know where the rekey request is lost or why it is not trying to rekey at all when this problem is
> active?

Depending on what the issues are, you can try to ensure that either
libreswan or Cisco is always the rekey initiator by tweaking ikelifetime
and salifetime. E.g., try ikelifetime=24h with salifetime=8h and most
likely Cisco will trigger all the rekeys. Or use ikelifetime=2h and
salifetime=1h so that libreswan most likely always initiates the rekeys.
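
As a rough sketch, the two options would look like this in the shared
"conn vpn" section from the example above (only the two lifetime lines
change, the rest stays as it was). Which side actually rekeys first
still depends on the lifetimes configured on the Cisco end, which are
not shown here, so treat the comments as the expected outcome rather
than a guarantee:

        # option 1: long lifetimes, Cisco most likely triggers the rekeys
        conn vpn
            [...]
            ikelifetime=24h
            salifetime=8h

        # option 2: short lifetimes, libreswan most likely triggers the rekeys
        conn vpn
            [...]
            ikelifetime=2h
            salifetime=1h

Use only one of the two, and restart the connections afterwards so the
new lifetimes apply to freshly negotiated SAs.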

Paul