[Swan] mis-matched phase 2 settings cause infinite rekeys, high load, and broad failure across unrelated tunnels

Paul Wouters paul at nohats.ca
Sat Oct 20 12:02:30 UTC 2018


On Fri, 19 Oct 2018, Dharma Indurthy wrote:

> Hey, folks.
> My colleague Terell described this issue about a month ago.  For background, we have libreswan server running that supports ~150 connections.  We proceeded with a
> libreswan upgrade to 3.25.

> The upgrade seemed to be successful.  However, we just encountered the infinite rekey loop problem.  What appeared to happen is that the rekeys looped like crazy
> and persisted until pluto became unresponsive, and systemd then killed the process.  Here's the gist.

> Here's the beginning of the logs.  We haven't reread secrets, so we can't connect:

[lots of attempts]

So we know this is a double-edged sword. On the one hand, you want an
auto=start connection to always be up, so regardless of why or how it
goes down, we attempt to bring it back up. This can be aggressive when
the failure is quick (e.g. a missing PSK, or receiving a Delete/Notify).
We could do some exponential backoff, but that would somewhat violate
the auto=start directive.
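
You can, however, cap how long a connection keeps retrying with the
keyingtries option in ipsec.conf. A minimal sketch (the connection name
is just an example; the default of %forever retries indefinitely):

    conn mytunnel
        auto=start
        # stop retrying after 3 failed attempts instead of looping forever
        keyingtries=3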

> We have no limit on keying retries, so this continues for a few minutes.  We delete the connection, and things settle down:

> Oct 18 14:53:39 ip-172-20-109-76 pluto[23193]: "baycare4059/2x2" #4363088: ignoring informational payload INVALID_ID_INFORMATION, msgid=00000000, length=352
> 
> This goes on for ~5 seconds, and then pluto stops logging anything:

Since you confirmed via the systemd watchdog that pluto hung, it would
be helpful to attach gdb at that point and see where it is
hanging/looping. That is a bug we can and should fix.
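
Something like this would capture it (assuming gdb and the pluto debug
symbols are installed):

    # attach to the running pluto and dump backtraces of all threads
    gdb -p $(pidof pluto) -batch -ex 'thread apply all bt full' > /tmp/pluto-bt.txt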

> We hit the watchdog limit, and systemd kills pluto and restarts:
> 
> Oct 18 14:56:18 ip-172-20-109-76 systemd[1]: ipsec.service: Watchdog timeout (limit 3min 20s)!
> 
> Oct 18 14:56:18 ip-172-20-109-76 systemd[1]: ipsec.service: Killing process 23193 (pluto) with signal SIGABRT.

> Here's some output from the core dump:

That does not contain a backtrace, so there isn't much I can do with it.
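
If systemd-coredump caught the crash, something along these lines should
produce one (the pluto binary path is an example; adjust to your install):

    # open the most recent pluto core with the matching binary and symbols
    coredumpctl gdb pluto
    (gdb) thread apply all bt full

    # or run gdb directly against a raw core file
    gdb /usr/libexec/ipsec/pluto /path/to/core -batch -ex 'thread apply all bt full'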

> At this point, we're planning on adding a keyingretries limit to all connections and alerting on failures, but I expect it's gonna be noisy and require
> intervention.  If there's no way to throttle the rekeys, then I'm not sure what else we can do.  If there are any other options, we'd love to hear them.

Currently there is not. Throttling would have its own issues. E.g. if the
remote is temporarily misconfigured, you wouldn't want to back off too
much. But we should look into being a little less aggressive when there
is a chain of repeated failures. From a practical point of view, we don't
keep state on those now, and each new keying attempt restarts with a
clean slate.
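
For the alerting side, you could periodically compare the tunnels that
are actually up against the set you expect, rather than parsing the logs.
A rough sketch (the grep pattern is just an example from your logs):

    # list established IPsec SAs with traffic counters
    ipsec whack --trafficstatus

    # or inspect the state of a single connection
    ipsec status | grep baycare4059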

Paul

