[Swan-dev] debugging traffic status diffs in parallel testruns

Tue Jul 30 10:04:10 UTC 2019

I looked a bit into the ipsec trafficstatus fluctuations we see during
testruns. Here is what I found so far. And this is a long e-mail!

Here is an example from a testrun, just a diff for the purpose of discussion:
https://testing.libreswan.org/v3.28-520-g0ca419163-master/xauth-pluto-14/OUTPUT/road.console.diff
This is a random example, not the specific one I used. While debugging 
manually I used xauth-pluto-26 without the ipsec stop in it!
There may be more than one pattern of traffic counter differences; the 
recent diffs are similar to the one above. I know there is another pattern 
with the dpd/liveness tests. I haven't looked into those. Also, in the past 
there may have been another case where trafficstatus would go below the 
clear text bytes sent or received. That suggests a leak of clear text 
traffic. This is more worrying than extra encrypted bytes, but do we still 
see those? I don't notice them any more.

I think the diff shown above is more likely to happen in an xauth test 
with a 0/0 tunnel. There is some other traffic between the hosts. This extra 
traffic gets encrypted and gets counted. It is likely host-to-host traffic 
and it occurs intermittently, possibly related to host load (either CPU or 
network). Maybe ICMP port unreachable or something like that.

With xauth or IKEv2 with CP, the client-server model, we cannot depend on 
'ipsec trafficstatus' for 0/0 tunnels. Currently there are about 20 - 30 
test cases that show such diffs?

Typically a ping in a test sends and receives 336 bytes: 4 pings, echo and 
echo response. One would expect ipsec trafficstatus to show 336 bytes in and 
out. However, sometimes it shows weird numbers such as:
inBytes=1584, outBytes=1374

From my manual debugging session:
ipsec trafficstatus
006 #2: "east-any"[1] 192.1.3.209, username=xroad, type=ESP, add_time=1564437448, inBytes=1584, outBytes=1374, lease=192.0.2.201/32
Clearly some extra traffic arrived on east.
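
To make the mismatch concrete, here is a minimal Python sketch that pulls
inBytes/outBytes out of a trafficstatus line like the one above and subtracts
what the ping should account for. The 336 figure assumes the default 84-byte
IPv4 echo packet (20 IP + 8 ICMP + 56 data) times 4 pings per direction. The
function names are made up for illustration and are not part of libreswan.

```python
import re

# Assumption: default ping payload, 20 (IP) + 8 (ICMP) + 56 (data) bytes.
PING_PACKET_BYTES = 20 + 8 + 56
PING_COUNT = 4
EXPECTED_BYTES = PING_PACKET_BYTES * PING_COUNT  # 336 per direction


def parse_counters(line):
    """Return (inBytes, outBytes) from one 'ipsec trafficstatus' line."""
    found = dict(re.findall(r"\b(inBytes|outBytes)=(\d+)", line))
    return int(found["inBytes"]), int(found["outBytes"])


def extra_bytes(line, expected=EXPECTED_BYTES):
    """Return how many bytes each counter shows beyond the ping traffic."""
    in_bytes, out_bytes = parse_counters(line)
    return in_bytes - expected, out_bytes - expected


line = ('006 #2: "east-any"[1] 192.1.3.209, username=xroad, type=ESP, '
        'add_time=1564437448, inBytes=1584, outBytes=1374, '
        'lease=192.0.2.201/32')
print(extra_bytes(line))  # (1248, 1038)
```

So in this session east counted 1248 bytes in and 1038 bytes out that the
4 pings do not explain.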

Here is how I looked into this issue. I built kvm-install with 22 prefixes.
The host is a 32 threads/core system. Load on the host is important.

Then I changed the prefixes to 21 and started make kvm-test. Using the last 
set, t22, I ran xauth-pluto-26 manually. When I saw the difference I logged 
into the console and looked around.

After a few tries, 10-15, I got t22.east and t22.road into this weird 
situation where it shows more traffic than the ping sent. I logged in to 
east, north and nic, and from north sent ping -c 4. Most of the time I would 
see an increase of 336 bytes as expected, and 8 ESP packets on the nic. 
However, sometimes there was more traffic in ipsec trafficstatus, and 
correspondingly more ESP packets on the nic!
I could not correlate the clear text on east to the extra ESP packets yet. 
There was a lot of noise traffic going around. I need a better tcpdump 
filter rule.

Also, when I was idle, I occasionally saw extra ESP packets going by on nic.
This is even when I was not pinging. I have yet to figure out exactly what 
those extra ESP packets were. My guess is some host-to-host traffic getting 
encrypted because of the 0/0 tunnel. Maybe some ICMP unreachable or 
something. Now we can sort of reproduce it, so we could look into it 
further. It takes a bit of time and focus.

So far I don't have the complete story, but I still thought sharing this 
would help. My suggestions are based on a hunch that trafficstatus alone is 
not enough for a test.

In the meanwhile Tuomo keeps insisting we switch to fping! A +1 on that! 
I will add fping to the kvm packages and we should move to it.

While at this, I will throw in a few more ideas for discussion and for the record.

To debug this further we could install some iptables counter rules on nic 
and east, to see what else is going between these hosts.
Maybe run tcpdump on east or road to see the extra traffic. It should be 
easy to capture; we don't have to sanitize it, we just need a packet capture.

Another observation is that this extra traffic appears to be related to load 
on the host, or probably a leak from the network? I wonder why/how?
While I was debugging the single test manually the test run finished, and 
everything appeared to be very stable, with no more extra traffic for the 
next 30 minutes. Then I gave up and went to bed!
If the theory holds, non-parallel testruns would see fewer trafficstatus diffs.
Does anyone running tests without KVM_PREFIXES= specified in 
Makefile.inc.local notice these trafficstatus diffs?

Another solution floated around is more iptables rules to block clear text 
and log it, either to the iptables log or to the console. For this we need 
more iptables block + log rules. The current, libreswan specific, iptables 
target LOGDROP, created in swan-prep, is not easily portable to docker or 
namespace testing, because it sends information to 'console', which does not 
exist, at least so far, in namespace testing or docker testing. Also, when 
running tests in parallel, namespace testing will blow up with too many 
iptables rules. A hint about the iptables scaling issue is this error:
"Another app is currently holding the xtables lock. Perhaps you want to use 
the -w option?" Using -w180 does not seem to solve it completely! Maybe 
nftables would solve it... Or Tuomo suggested using iptables-restore?
iptables-restore does not easily fit into our model? Paul thinks LOGDROP is 
the best way? AFAIK he came up with the idea of LOGDROP.

Maybe a simple alternative is to wrap fping + "ipsec trafficstatus" into a 
shell script. This script would process the output and compare it to what 
the ping sent. If inBytes and/or outBytes are more than what the ping sent, 
it is ok? This would fix many cases we see now. Please use fping here! And 
investigate the liveness/dpd test cases before fixing things.
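
The check such a wrapper would make could look roughly like the sketch
below. It is in Python rather than shell only to keep it short. It takes the
counters read before and after the ping and accepts any increase that is at
least what the ping accounts for. The function name, the return strings and
the rule that "more is ok" are assumptions for discussion, not existing
test-suite behaviour.

```python
def judge_counters(before, after, expected):
    """Compare the growth of (inBytes, outBytes) with the ping's byte count.

    before/after are (inBytes, outBytes) tuples read around the ping;
    expected is the number of bytes the ping should add per direction.
    """
    grew_in = after[0] - before[0]
    grew_out = after[1] - before[1]
    if grew_in < expected or grew_out < expected:
        # Fewer encrypted bytes than the ping sent: possible clear text leak.
        return "FAIL: counters below ping traffic"
    if grew_in == expected and grew_out == expected:
        return "OK: exact"
    # Extra encrypted traffic, e.g. host-to-host noise through a 0/0 tunnel.
    return "OK: extra encrypted traffic"


print(judge_counters((0, 0), (336, 336), 336))
print(judge_counters((0, 0), (1584, 1374), 336))
print(judge_counters((336, 336), (672, 588), 336))
```

Only the "below" case fails the test, since that is the one that hints at
clear text leaking; the console output would then stay stable when extra
encrypted bytes show up.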

Another possibility is to send fping with a specific clear text pattern in 
the payload and use a tcpdump rule to detect a leak of this specific 
traffic. The pattern would be like 'ping -p'. I am not sure if fping 
supports this? However, the pattern can't be the same for all traffic in all 
tests. It could leak between tests. The hard part is using a dynamic 
pattern. Maybe just the testname or a checksum of the testname!
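
A per-test pattern could be derived along these lines. This is only a
sketch: it assumes iputils ping, whose -p option takes up to 16 pad bytes
written as hex, and it uses SHA-256 of the test name as the "checksum". The
helper names are hypothetical, and how the captured bytes get from tcpdump
to the check is left open.

```python
import hashlib


def ping_pattern(testname, pad_bytes=16):
    """Derive a per-test hex pattern for 'ping -p' from the test name."""
    digest = hashlib.sha256(testname.encode("utf-8")).hexdigest()
    # Two hex digits per byte; 16 bytes is the assumed limit of ping -p.
    return digest[: pad_bytes * 2]


def payload_has_pattern(payload, pattern_hex):
    """True if the raw packet bytes contain the pattern in clear text."""
    return bytes.fromhex(pattern_hex) in payload


pattern = ping_pattern("xauth-pluto-26")
print("ping -c 4 -p", pattern, "<destination>")
```

Because the pattern depends on the test name, a packet carrying another
test's pattern would stand out as a leak between tests rather than be
mistaken for this test's own traffic.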

Recently, I feel we are sprinkling more "ipsec trafficstatus" into many 
tests, including the old tests. Be careful when adding these. It introduces 
instability to the testrun.

-antony