[Swan] Fwd: Problem with random rekey failures

Paul Wouters paul at nohats.ca
Wed Jun 16 19:14:29 UTC 2021


On Wed, 16 Jun 2021, Miguel Ponce Antolin wrote:

> Some questions that came to me with the upgrade option,
> - Is it still necessary to separate the rightsubnets? And do you create them in different files? I understood
> that you create them in the same conf file.

I would first try the upgrade and see if your problem remains. If it
does, try separating the conns. It can be in 1 file.

> - Is the ikelifetime and salifetime tweak for rekeying still needed on version 4.4-1? I think it is recommended anyway.

It is not so much a code problem as an interop/configuration problem.
Tweaking the lifetimes changes who will decide to rekey first, which
can work around implementation bugs. We are not aware of libreswan
having a bug here. I just gave you two methods that ensure either
libreswan always rekeys, or libreswan never rekeys. That usually
works around bugs in the other implementation.
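
As a rough sketch (the exact numbers are only starting points and depend
on what lifetimes the ASA side is configured with), the two approaches
look like this in the conn:

        # option 1: give libreswan the longer lifetimes, so the Cisco
        # side expires first and triggers the rekeys
        ikelifetime=24h
        salifetime=8h

        # option 2: give libreswan the shorter lifetimes, so libreswan
        # expires first and triggers the rekeys
        ikelifetime=2h
        salifetime=1h

Whichever side has the shorter lifetimes will normally be the one that
initiates the rekey.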

Paul

> Thanks again,
> 
> Best Regards!
> 
> 
> On Tue, 15 Jun 2021 at 17:40, Paul Wouters (<paul at nohats.ca>) wrote:
>       On Tue, 15 Jun 2021, Miguel Ponce Antolin wrote:
>
>       > I have been experiencing an intermittent problem with libreswan v3.25 when connecting an AWS EC2
>       > instance running libreswan and a Cisco ASA on the other end.
>
>       Is it possible to test v4.4? We have RPMs built on download.libreswan.org/binaries/
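>
>       If it helps, on a RHEL/CentOS style system the upgrade is roughly
>       this (a sketch only; the exact rpm file names under binaries/ will
>       differ):
>
>               # download the libreswan 4.4 rpms from download.libreswan.org/binaries/
>               # into the current directory, then:
>               dnf install ./libreswan-4.4*.rpm
>               systemctl restart ipsec
>               ipsec --version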
>
>       Specifically, with that many subnets you likely need this fix from 4.4:
>
>       * IKEv2: Connections would not always switch when needed [Andrew/Paul]
>
>       But the changelog between 3.25 and 4.4 is huge. There might be other
>       items you need too.
>
>       Alternatively, you can try splitting up your subnets into different
>       conns, e.g.:
> 
>
>               conn vpn
>                   type=tunnel
>                   authby=secret
>                   # use auto=ignore, will be read in via also= statements
>                   auto=ignore
>                   left=%defaultroute
>                   leftid=xxx.xxx.xxx.120
>                   leftsubnets=xxx.xxx.xxx.80/28
>                   right=xxx.xxx.xxx.45
>                   rightid=xxx.xxx.xxx.45
>                   # no rightsubnet= here
>                   # don't use this with more than one subnet... leftsourceip=xxx.xxx.xxx.92
>                   ikev2=insist
>                   ike=aes256-sha2;dh14
>                   esp=aes256-sha256
>                   keyexchange=ike
>                   ikelifetime=28800s
>                   salifetime=28800s
>                   dpddelay=30
>                   dpdtimeout=120
>                   dpdaction=restart
>                   encapsulation=no
>
>              conn vpn-1
>               also=vpn
>               auto=start
>               rightsubnet=10.subnet.1.0/22
>
>              conn vpn-2
>               also=vpn
>               auto=start
>               rightsubnet=10.subnet.2.0/20
>
>              [...]
>
>              conn vpn-18
>               also=vpn
>               auto=start
>               rightsubnet=10.subnet.18.9/32
> 
>
>       This uses a slightly different code path to get all the tunnels loaded and active.
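>
>       After a restart, a quick way to verify the split conns is something
>       like this (a sketch; the conn names follow the example above):
>
>               ipsec status | grep vpn-     # are all the vpn-N conns loaded?
>               ipsec trafficstatus          # which tunnels are established
>               ipsec auto --up vpn-3        # bring up a single conn by hand if needed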
>
>       > We tried to "force" a reconnect using the ping command to an IP in various rightsubnets, but when
>       > the problem is active we continuously see this kind of log:
>
>       That would be hacky and not really solve race conditions.
>
>       > Jun 11 11:17:25.795153: "vpn/1x15" #221: message id deadlock? wait sending, add to send next list using parent #165 unacknowledged 1 next message id=63 ike exchange window 1
>
>       Note that this is a bit of a concern. You can only have one IKE message
>       outstanding, and this indicates that the Cisco might not be answering
>       that outstanding message, and so the only thing libreswan can do is
>       wait longer or restart _everything_ related to that IKE SA, so that
>       means all tunnels. We did reduce the chance of a message id deadlock
>       at some point in the past with our pending() code, so again testing
>       with an upgraded libreswan would be a useful test.
>
>       > Is there any troubleshooting we could do in order to know where the rekey request is lost, or why it
>       > is not trying to rekey at all when this problem is active?
>
>       Depending on what the issues are, you can try to ensure either libreswan
>       or Cisco is always the rekey initiator by tweaking the ikelifetime and
>       salifetime. E.g. try ikelifetime=24h with salifetime=8h and most likely
>       Cisco will trigger all the rekeys. Or use ikelifetime=2h and
>       salifetime=1h to make libreswan likely always initiate the rekeys.
>
>       Paul
> 
> 
> 
> --
> 
> Miguel Ponce Antolín.
> Sistemas    ·    +34 670 360 655
> 
> 
>


More information about the Swan mailing list