[Swan] Trouble with connection dropping

Mon Jun 9 05:01:12 EEST 2014

On 06/08/2014 07:44 PM, Paul Wouters wrote:
> On Sun, 8 Jun 2014, zip wrote:
>
>> Using libreswan 3.8.1 between two household networks each running
>> Fedora 20
>
> You mean libreswan-3.8-1 ?
>> Left's DSL connection must use PPPOE, so its MTU is 8 bytes less than
>> Right's MTU.  In the config below I set the MTU to 1422.  (in the old
>> days this MTU problem caused ssh untold grief, and why I stopped
>> using it).
>
> You might need to use TCP clamping:
>
> iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS
> --clamp-mss-to-pmtu
>
> If that does not help, try hardcoding it yourself:
>
> iptables -I FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss
> 1460
>
>> Back to the problem:
>> When I service restart both sides, the VPN starts up fine, both
>> networks can ping / ssh both directions.  Then at some random point
>> in time, Right stops routing traffic through the VPN, but rather goes
>> directly out the public interface; so all ping/ssh traffic
>> originating from Right and its network stops.  However Left can still
>> ping any host in Right including the firewall. ssh however doesn't
>> work in either direction after the failure.
>
> Could it be that this coincides with your DHCP lease getting renewed
> (even if it is renewed to the same IP address) ?
>
>> Finding log output is difficult.  From Left's side, I have
>> /var/log/secure logs but there isn't an immediate entry corresponding
>> to when the VPN drops. The log on Right's side... well for what I
>> think is an unrelated problem, /var/log/secure is empty and I've
>> opened a Fedora bug describing:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1105828
>> so I don't know what's happening on Right's side.  (Seems like
>> problems always happen in two's and three's).
>
> You can use plutostderrlog=/var/log/pluto.log if you get tired of all the
> ways rsyslog and systemd interfere with logging....
>
>> ipsec.conf's are below (note for unknown reasons I've had to use
>> slightly different "rightnexthop" statements).
>
> There are some bugs in the nexthop handling we addressed that will be in
> libreswan-3.9. (already committed to git master on github)
>
>> config setup
>>        protostack=netkey
>
>>        mtu=1422
>
> When the tunnel is up, do you see a route entry with the mtu specified?
>
> I think you might be seeing the DHCP lease issue bug, which has also
> been filed already for rhel by Patrick:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1078593
>
> Paul

Paul,
Thanks for the reply.
Yes I'm using libreswan-3.8-1.fc20.i686 on the box with the problem.

With the tunnel up and working, my route output looks like this (on the Right host):
[root@windward ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         66.43.233.126   0.0.0.0         UG    1024   0        0 enp0s4
10.20.0.0       66.43.233.126   255.255.255.0   UG    0      0        0 enp0s4
10.20.1.0       0.0.0.0         255.255.255.0   U     0      0        0 p9p1
10.20.128.0     0.0.0.0         255.255.255.0   U     0      0        0 p9p1
66.43.233.0     0.0.0.0         255.255.255.128 U     0      0        0 enp0s4
dhcp.netins.net 66.43.233.126   255.255.255.255 UGH   1      0        0 enp0s4
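
The classic `route` output above has no MTU column, so it cannot answer whether a route carries the configured mtu. `ip route show` prints an `mtu N` attribute on routes that have one. A minimal sketch of filtering that output follows; the function name and the sample text are hypothetical and were not captured from the hosts in this thread.

```python
def routes_with_mtu(ip_route_output):
    """Return {destination: mtu} for route lines that carry an explicit mtu."""
    found = {}
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if "mtu" not in tokens:
            continue
        idx = tokens.index("mtu") + 1
        # iproute2 prints "mtu lock N" when the value is locked.
        if idx < len(tokens) and tokens[idx] == "lock":
            idx += 1
        if idx < len(tokens) and tokens[idx].isdigit():
            found[tokens[0]] = int(tokens[idx])
    return found


# Hypothetical sample text in the style of `ip route show`.
SAMPLE = """\
default via 66.43.233.126 dev enp0s4
10.20.0.0/24 via 66.43.233.126 dev enp0s4 mtu 1422
10.20.1.0/24 dev p9p1 scope link
"""

print(routes_with_mtu(SAMPLE))
```

On a live host the text could come from `subprocess.run(["ip", "route", "show"], capture_output=True, text=True).stdout`.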

But tracepath shows the MTU is correct:
tracepath -n 10.20.0.1
  1?: [LOCALHOST]                                         pmtu 1500
  1?: [LOCALHOST]                                         pmtu 1422
  1:  10.20.0.1                                           130.652ms reached
  1:  10.20.0.1                                           139.447ms reached

As for MSS clamping, I'm using the Shorewall config line:
CLAMPMSS=Yes

That adds this rule:
iptables-save|grep -i clamp
-A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -m policy --dir out --pol none -j TCPMSS --clamp-mss-to-pmtu

At this point, I believe the MTU issue is just a relic of the past.
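
For reference, the arithmetic behind MSS clamping can be sketched as below. It assumes IPv4 and TCP headers without options (20 bytes each), and it assumes Right's MTU is the usual 1500, which makes the PPPoE side 1492. The 1422 figure is the pmtu reported by tracepath above.

```python
IPV4_HEADER = 20  # bytes, no IP options (assumption)
TCP_HEADER = 20   # bytes, no TCP options (assumption)


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


for label, mtu in [("Ethernet", 1500), ("PPPoE", 1492), ("tunnel pmtu", 1422)]:
    print(f"{label:12} mtu={mtu} mss={mss_for_mtu(mtu)}")
```

Under these assumptions the tunnel path would want an MSS of about 1382. The hardcoded 1460 suggested earlier is the MSS of a plain 1500-byte path, so a lower value would probably be needed if the hardcoded rule were used in place of --clamp-mss-to-pmtu.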

I found the logging directive, but didn't find any interesting content 
in it when the connection drops.

Your point about DHCP could definitely be the problem.  The Right 
(problem) host gets its address via DHCP, even though the address never 
changes.  I found the lease file; its dhcp-lease-time is 21600 (6 hrs).  
I signed up to be cc'd on that bug report.
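
To check whether the drops line up with lease renewal, the expected renewal times can be estimated from the lease length. This is a rough sketch using the DHCP default timers (T1 at 50% of the lease, T2 at 87.5%). The lease start time below is made up for illustration, and real clients may add some random fuzz to these timers.

```python
from datetime import datetime, timedelta


def dhcp_timers(lease_start, lease_seconds):
    """Return the default renew (T1), rebind (T2), and expiry times of a lease."""
    return {
        "T1 (renew)": lease_start + timedelta(seconds=lease_seconds * 0.5),
        "T2 (rebind)": lease_start + timedelta(seconds=lease_seconds * 0.875),
        "expiry": lease_start + timedelta(seconds=lease_seconds),
    }


# Hypothetical lease start; 21600 is the dhcp-lease-time from the lease file.
start = datetime(2014, 6, 8, 12, 0, 0)
for name, when in dhcp_timers(start, 21600).items():
    print(f"{name:12} {when}")
```

With a 6 hour lease, a renewal attempt would be expected roughly every 3 hours, so drops that recur at about that interval would point toward the DHCP bug.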

Thanks,
Brian