[Swan-dev] debugging traffic status diffs in parallel testruns
Antony Antony
antony at phenome.org
Tue Jul 30 10:04:10 UTC 2019
I looked a bit into ipsec trafficstatus fluctuations we see during testruns.
Here is what I found so far. And this is a long e-mail!
Here is an example from a testrun, just a diff for the purpose of discussion:
https://testing.libreswan.org/v3.28-520-g0ca419163-master/xauth-pluto-14/OUTPUT/road.console.diff
This is a random example, not the specific one I used. While debugging
manually I used xauth-pluto-26 without ipsec stop in it!
There may be more than one pattern of traffic counter differences; the
recent diffs are similar to the one above. I know there is another pattern
with dpd/liveness tests. I haven't looked into those. Also, in the past
there may have been another case where trafficstatus would go below the
clear text bytes sent or received. That suggests a leak of clear text
traffic. This is more worrying than extra encrypted bytes, but do we still
see those? I don't notice them any more.
I think the diff shown above is more likely to happen in an xauth test
with a 0/0 tunnel. There is some other traffic between the hosts. This extra
traffic gets encrypted and gets counted. It is likely host-to-host traffic
and it occurs intermittently, possibly related to host load (either cpu or
network). Maybe ICMP port unreachable or something like that.
With xauth or IKEv2 with CP, the client-server model, we cannot depend on
'ipsec trafficstatus' for 0/0 tunnels. Currently there are about 20 - 30
test cases that show such diffs?
Typically a ping in a test sends and receives 336 bytes: 4 pings, echo and
echo response. One would expect ipsec trafficstatus to show 336 bytes in and
out. However sometimes it shows weird numbers such as:
inBytes=1584, outBytes=1374
From my manual debugging session:
ipsec trafficstatus
006 #2: "east-any"[1] 192.1.3.209, username=xroad, type=ESP, add_time=1564437448, inBytes=1584, outBytes=1374, lease=192.0.2.201/32
Clearly some extra traffic arrived on east.
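As a rough check on these numbers, here is a small Python sketch of the
arithmetic. It assumes ping's default 56-byte payload and that trafficstatus
counts the inner IP packet (20 bytes IP header + 8 bytes ICMP header +
payload); both are my assumptions, and the function names are made up for
illustration.

```python
# Sketch only: assumes the default 56-byte ping payload and that the
# counters reflect the inner (clear text) IP packet size.
IP_HEADER = 20
ICMP_HEADER = 8
DEFAULT_PAYLOAD = 56


def expected_ping_bytes(count, payload=DEFAULT_PAYLOAD):
    """Bytes one direction should carry for `count` echo packets."""
    return count * (IP_HEADER + ICMP_HEADER + payload)


def excess(observed, count):
    """How many bytes beyond the ping traffic the counter shows."""
    return observed - expected_ping_bytes(count)


print(expected_ping_bytes(4))   # 336
print(excess(1584, 4))          # 1248 extra bytes in
print(excess(1374, 4))          # 1038 extra bytes out
```

So in this session more than a kilobyte in each direction was not ping
traffic.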
Here is how I looked into this issue. I built kvm-install with 22 prefixes.
The host is a 32 thread/core system. Load on the host is important.
Then I changed the prefixes to 21 and started make kvm-test. Using the last
set, t22, I ran xauth-pluto-26 manually. When I saw the difference I logged
into the console and looked around.
After a few tries, 10-15, I got t22.east and t22.road into this weird
situation where they show more traffic than the ping sent. I logged in to
east, north and nic, and from north sent ping -c 4. Most of the time I would
see an increase of 336 bytes as expected, and 8 ESP packets on the nic.
However, sometimes there was more traffic in ipsec trafficstatus, and
correspondingly more ESP packets on the nic!
I could not correlate the clear text on east to the extra ESP packets yet.
There was a lot of noise traffic going around. I need a better tcpdump
filter rule.
Also, when I was idle, I occasionally saw extra ESP packets going by on nic.
This is even when I am not pinging. I am yet to figure out exactly what
those extra ESP packets were. My guess is some host-to-host traffic getting
encrypted because of the 0/0 tunnel. Maybe some icmp unreachable or something.
Now that we can sort of reproduce it, we could look into it further. It
takes a bit of time and focus.
So far I don't have the complete story, but I still thought sharing this
would help. My suggestions are based on a hunch that trafficstatus alone is
not enough for a test.
In the meanwhile Tuomo keeps insisting that we switch to fping! A +1 on that!
I will add fping to the kvm packages and we should move to it.
While at this, I will throw in a few more ideas for discussion and to record them.
To debug this further we could install some iptables counter rules on nic
and east, to see what else is going between these hosts.
Maybe run tcpdump on east or road to see the extra traffic. It should be easy
to capture; we don't have to sanitize it, we just need a packet capture.
Another observation: this extra traffic appears to be related to load on
the host, or possibly a leak from the network? I wonder why/how?
While I was debugging the single test manually, the test run finished and
everything appeared to be very stable, with no more extra traffic for the
next 30 minutes. Then I gave up and went to bed!
If the theory holds, non-parallel testruns would see fewer trafficstatus
diffs. Does anyone running tests without KVM_PREFIXES= specified in
Makefile.inc.local notice these trafficstatus diffs?
Another solution floated around is more iptables rules to block clear text
and log it, either to the iptables log or to the console. For this we need
more iptables block + log rules. The current, libreswan specific, iptables
target LOGDROP, created in swan-prep, is not easily portable to docker or
namespace testing, because it sends information to the 'console', which does
not exist, at least so far, in namespace or docker testing. Also, when
running tests in parallel, namespaces will blow up with too many iptables
rules. A hint about the scaling issue is this iptables error:
"Another app is currently holding the xtables lock. Perhaps you want to use
the -w option?" Using -w180 does not seem to solve it completely! Maybe
nftables would solve it... Or Tuomo suggested using iptables-restore?
iptables-restore does not easily fit into our model? Paul thinks LOGDROP is
the best way? AFAIK he came up with the idea of LOGDROP.
Maybe a simple alternative is to wrap fping + "ipsec trafficstatus" into a
shell script. This script processes the output and compares it to what the
ping sent. If the inBytes and/or outBytes are more than what the ping sent,
it is ok? This would fix many of the cases we see now. Please use fping
here! And investigate the liveness/dpd test cases before fixing things.
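To make the comparison idea concrete, here is a minimal sketch of just the
checking part, written in Python rather than shell because the parsing is
easier to show. The function names are hypothetical; only the trafficstatus
line format is taken from the output earlier in this mail. Running fping and
"ipsec trafficstatus" themselves is left out.

```python
import re

# Matches the counters as they appear in the trafficstatus line quoted above.
TRAFFIC_RE = re.compile(r"inBytes=(\d+), outBytes=(\d+)")


def parse_trafficstatus(line):
    """Return (inBytes, outBytes) from one trafficstatus line, or None."""
    match = TRAFFIC_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def counters_ok(line, expected_bytes):
    """True if both counters are at least what the ping sent.

    Extra encrypted bytes are tolerated; lower counters would hint at
    a clear text leak and are reported as a failure.
    """
    counters = parse_trafficstatus(line)
    if counters is None:
        return False
    in_bytes, out_bytes = counters
    return in_bytes >= expected_bytes and out_bytes >= expected_bytes


sample = ('006 #2: "east-any"[1] 192.1.3.209, username=xroad, type=ESP, '
          'add_time=1564437448, inBytes=1584, outBytes=1374, '
          'lease=192.0.2.201/32')
print(parse_trafficstatus(sample))   # (1584, 1374)
print(counters_ok(sample, 336))      # True
```

A test would then print only the ok/not-ok result, so the fluctuating
counters never reach the console diff.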
Another possibility is to send fping with a specific clear text pattern in
the payload and use a tcpdump rule to detect a leak of this specific
traffic. The pattern would be like 'ping -p'. I am not sure if fping
supports this? However, the pattern can't be the same for all traffic in all
tests. It could leak between tests. The hard part is to use a dynamic
pattern. Maybe just the testname or a checksum of the testname!
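One way the checksum-of-testname idea might look, as a sketch: derive a
16-byte hex pattern from the test name, since as far as I know 'ping -p'
takes up to 16 pad bytes written in hex. The function name and the choice of
SHA-256 truncated to 16 bytes are my assumptions, nothing the test suite
does today.

```python
import hashlib


def ping_pattern(testname):
    """Derive a per-test pad pattern: 32 hex digits, i.e. 16 bytes.

    Assumption: a truncated SHA-256 of the test name is unique enough
    to tell the tests in one run apart.
    """
    digest = hashlib.sha256(testname.encode("utf-8")).hexdigest()
    return digest[:32]


pattern = ping_pattern("xauth-pluto-26")
# <destination> is a placeholder, not a real address.
print("ping -c 4 -p", pattern, "<destination>")
```

The same hex string could then be used on the capture side to look for
that payload in clear text.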
Recently, I feel we are sprinkling more "ipsec trafficstatus" into many
tests, including the old tests. Be careful when adding these. It introduces
instability to the testrun.
-antony