[Swan-dev] test failures due to IKE retransmissions
Andrew Cagney
andrew.cagney at gmail.com
Mon Oct 22 17:19:21 UTC 2018
On Sun, 21 Oct 2018 at 11:48, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
>
> My previous message to the list described the times taken to run our
> test suite on three different machines.
>
> This one focuses on tests failing due to unexpected IKE retransmissions.
>
> This is so common that I have a procedure for rerunning tests that
> failed due to exactly one IKE retransmission. (Some fail with
> multiple IKE retransmissions, but I don't detect those.)
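For concreteness, a hypothetical sketch of that sort of filter -- the
console-log file name and the 'retransmission' message text below are
assumptions for illustration, not the harness's actual format:

```shell
# Hypothetical sketch: count retransmission lines in a test's console
# log; a count of exactly 1 marks the test as a rerun candidate.
# The file name and log format are made up for this demo.
count_retransmits() {
    grep -c 'retransmission' "$1"
}

# Demo against a fabricated console log:
cat > /tmp/east.console.txt <<'EOF'
002 "westnet-eastnet" #1: STATE_PARENT_I1: retransmission; will wait 0.5 seconds
004 "westnet-eastnet" #2: IPsec SA established
EOF

count_retransmits /tmp/east.console.txt   # prints 1 -> rerun candidate
```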
Adding h/w specs and ranking below; tossing firebird, which has a different CPU.
> redtiny: 10
redtiny: 5:46:40.231926 real time
Lenovo ThinkCentre M93p Tiny,
i5-4570T CPU @ 2.90GHz, 4 core,
16G RAM (two sticks)
laptop HDD 7200RPM
> ikev2-11-simple-psk
> ikev2-61-any-psk
> ikev2-algo-03-aes-ccm
> ikev2-algo-06-aes-aes_xcbc
> ikev2-ecdsa-01
> ikev2-hostpair-01
> ikev2-liveness-11-silent
> ikev2-x509-17-multicert-rightid-san-wildcard
> ikev2-x509-20-multicert-rightid-san-wildcard
> netkey-passthrough-02
> redox: 3
redox: 5:23:38.344347 real time
Lenovo ThinkCentre M93p Tiny,
i5-4570T CPU @ 2.90GHz, 4 core,
8G RAM (one stick)
SATA SSD
> ikev2-ecdsa-01
> ikev2-x509-17-multicert-rightid-san-wildcard
> ikev2-x509-20-multicert-rightid-san-wildcard
> Notice that redox's set is a subset of redbird's, which in turn is a
> subset of redtiny's.
>
> I blame the HDD -- what else is inferior about redtiny?
Well, it isn't RAM (redtiny has 2 identical sticks?), and it isn't the
CPU (identical) or I/O (same board), which leaves the disk.
The root file system is mounted from:
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='@@POOLSPACE@@/@@NAME@@.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
which has 'writeback' caching enabled (see below).
But what pushes these tests over the edge? Here are some other
theories, ordered least to most credible:
- it's bad timing (not very credible, redox tends to rebut this theory)
If a test needs lots of VMs, then booting them all in parallel impacts
other test results (see web page).
To test this theory, what about putting TESTLIST through 'shuf'?
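Something like the following -- /tmp/TESTLIST here is a stand-in path,
since the real harness keeps its test list elsewhere:

```shell
# Sketch: randomize test order to probe the bad-timing theory.
# The list contents and path are stand-ins for this demo.
printf '%s\n' ikev2-11-simple-psk ikev2-ecdsa-01 ikev2-mobike-01 > /tmp/TESTLIST

# 'shuf' (GNU coreutils) emits the same lines in random order.
shuf /tmp/TESTLIST > /tmp/TESTLIST.shuffled
```

If retransmission failures track position in the list rather than the
test itself, a shuffled run should move them around.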
- it's the number of VMs used by the test
Easy to check? (perhaps it is the number of active disks)
- the tests use certificates (plausible)
There's a suspicion that pluto is hitting the NSS database hard -
since that lives on the guest's root file system, it could be doing
lots of synchronous writes.
- the total log file size for those tests is simply bigger
Easy to check? (or data, more writing through 9p)
Several ways to mitigate it:
- if it is NSS, then mounting /etc/ipsec.d as tmpfs should confirm this
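Inside the guest, that could look something like the fstab fragment
below -- the 64m size is an arbitrary assumption, and the NSS database
would need to be (re)populated after each boot, which the test harness
may already do when it sets a test up:

```
# guest /etc/fstab -- sketch: keep /etc/ipsec.d (NSS db) off the qcow2
# root disk; one-off equivalent:
#   mount -t tmpfs -o size=64m tmpfs /etc/ipsec.d
tmpfs  /etc/ipsec.d  tmpfs  defaults,size=64m  0 0
```

If the retransmissions vanish with this in place, the NSS-write theory
gains a lot of credibility.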
- writeback->unsafe
Since the test system re-builds images as part of kvm-build,
'unsafe' (the host may cache all disk I/O, and sync requests from the
guest are ignored) is probably 'safe' (the files to tweak are in
testing/libvirt/vm/*).
See https://libvirt.org/formatdomain.html#elementsDisks
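Concretely, that would mean changing the disk element shown earlier to
something like the following (same fragment, only the cache attribute
differs):

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='unsafe'/>
  <source file='@@POOLSPACE@@/@@NAME@@.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```

Per the libvirt docs, 'unsafe' turns guest flush/sync requests into
no-ops, which is exactly the behaviour we want for throwaway test
images.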
- KVM_LOCALDIR=/tmp/pool (assuming /tmp is tmpfs)
The make variable KVM_LOCALDIR controls the location of the test
machine disk images. You'll need to first create /tmp/pool; /tmp
itself won't work because it is sticky (mode 1777).
Technical nit. build, nic, east, west, ... all have read-only
references back to 'clone' and that still lives in the default pool
directory.
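As a sketch, assuming /tmp is tmpfs on the host (and /tmp/pool is just
an example path, not a harness default):

```shell
# Create a pool directory on tmpfs for the local (writable) VM images.
mkdir -p /tmp/pool

# Then rebuild/run with the override (shown, not executed here):
#   make kvm-build KVM_LOCALDIR=/tmp/pool
```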
I kind of prefer 'unsafe'.
> This makes the HDD system quite annoying. Some changes that Andrew
> made have reduced this effect considerably (Andrew: thanks!). I have
> a script that reruns tests that fail this way.
>
> In general, I don't like the idea of "rerun a test until it passes":
> this would hide some real errors that are non-deterministic. On the
> other hand, spurious error reports are a large waste of my time.
>
> As a check on how many tests failed due to exactly two retries, I
> found that on redtiny there were:
> 48 tests with unexpected first retries
> 45 tests with unexpected second retries
> 35 tests with unexpected third retries
> 33 tests with unexpected fourth retries
> 33 tests with unexpected fifth retries.
>
> Note that these numbers should be monotonically non-increasing since any
> test with an unexpected n+1 retry would have an unexpected n retry.
>
> These numbers are AFTER I reran any test that only failed due to one
> unexpected first retry.
>
> Many of these tests may have failed for other reasons. The 33 look
> like real failures. 10 might be failures due to only two sequential
> retries.
>
> I just reran all the tests that failed on redtiny with only unexpected
> first and second IKE retransmissions. This picks up tests that failed
> in more than one place due to a first retransmission, or due to any
> second retransmission. Here are the 12:
>
> fips-default-ikev2-01-nofips-east
> ikev2-49-hub-spoke
> ikev2-algo-07-aes_ctr
> ikev2-algo-11-gcm-prop2
> ikev2-algo-ike-sha2-04
> ikev2-ecdsa-01
> ikev2-invalid-ke-02-wrong-modp
> ikev2-liveness-11-silent
> ikev2-mobike-01
> ikev2-nat-pluto-03
> ikev2-x509-17-multicert-rightid-san-wildcard
> ikev2-x509-20-multicert-rightid-san-wildcard
>
> After a single rerun, 5 of the 12 passed, 6 of them still had only one or
> two IKE retransmissions, and 1 failed for some other reason. Here are the
> six with only one or two IKE retransmissions:
>
> fips-default-ikev2-01-nofips-east
> ikev2-algo-07-aes_ctr
> ikev2-ecdsa-01
> ikev2-liveness-11-silent
> ikev2-x509-17-multicert-rightid-san-wildcard
> ikev2-x509-20-multicert-rightid-san-wildcard
>
> Doing this again leaves four:
> fips-default-ikev2-01-nofips-east
> ikev2-ecdsa-01
> ikev2-x509-17-multicert-rightid-san-wildcard
> ikev2-x509-20-multicert-rightid-san-wildcard