[Swan-dev] test failures due to IKE retransmissions

Mon Oct 22 17:19:21 UTC 2018

On Sun, 21 Oct 2018 at 11:48, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
>
> My previous message to the list described the times taken to run our
> test suite on three different machines.
>
> This one focuses on tests failing due to unexpected IKE retransmissions.
>
> This is so common that I have a procedure for rerunning tests that
> failed due to exactly one IKE retransmission.  (Some fail with
> multiple IKE retransmissions but I don't detect those.)

adding h/w specs and ranking, tossing firebird - different CPU

> redtiny: 10

redtiny: 5:46:40.231926 real time
Lenovo ThinkCentre M93p Tiny,
i5-4570T CPU @ 2.90GHz, 4 core,
16G RAM (two sticks)
laptop HDD 7200RPM

>   ikev2-11-simple-psk
>   ikev2-61-any-psk
>   ikev2-algo-03-aes-ccm
>   ikev2-algo-06-aes-aes_xcbc
>   ikev2-ecdsa-01
>   ikev2-hostpair-01
>   ikev2-liveness-11-silent
>   ikev2-x509-17-multicert-rightid-san-wildcard
>   ikev2-x509-20-multicert-rightid-san-wildcard
>   netkey-passthrough-02

> redox: 3

redox: 5:23:38.344347 real time
Lenovo ThinkCentre M93p Tiny,
i5-4570T CPU @ 2.90GHz, 4 core,
8G RAM (one stick)
SATA SSD

>   ikev2-ecdsa-01
>   ikev2-x509-17-multicert-rightid-san-wildcard
>   ikev2-x509-20-multicert-rightid-san-wildcard

> Notice that redox's set is a subset of redbird's, which in turn is a
> subset of redtiny's.
>
> I blame the HDD -- what else is inferior about redtiny?

Well, it isn't RAM (redtiny has 2 identical sticks?), and it isn't the
CPU (identical) or I/O (same board), which leaves disk.

The root file system is mounted from:

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='@@POOLSPACE@@/@@NAME@@.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>

which has 'writeback' enabled (see below).

But what pushes these tests over the edge?  Here are some other
theories, ordered least to most credible:

- it's bad timing (not very credible, redox tends to rebut this theory)
If a test needs lots of VMs then booting them all in parallel impacts
other test results (see web page).
To test this theory, what about putting TESTLIST through 'shuf'?

- it's the number of VMs used by the test
Easy to check?  (perhaps it is the number of active disks)

- the tests use certificates (plausible)
There's a suspicion that pluto is hitting the NSS database hard -
since that lives on the guest's root file system it could be doing
lots of synchronized writes.

- the total log file size for those tests is simply bigger
Easy to check?  (more data means more writing through 9p)
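
To make the "bad timing" check concrete, here is a rough sketch of
shuffling the test order.  It assumes TESTLIST has one test per line,
with blank lines and '#' lines being comments; shuffle_testlist is a
made-up name, not something in the tree.  Unlike a plain shuffle of
the file, it keeps the comment lines together at the top.

```python
import random


def _is_test(line):
    # Assumption: anything that is not blank and not a '#' comment is a test.
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith("#")


def shuffle_testlist(text, seed=None):
    """Return TESTLIST text with its test entries in random order.

    Comment and blank lines are kept, in their original order, ahead of
    the shuffled test entries.
    """
    lines = text.splitlines()
    header = [line for line in lines if not _is_test(line)]
    tests = [line for line in lines if _is_test(line)]
    rng = random.Random(seed)  # a fixed seed makes a given order repeatable
    rng.shuffle(tests)
    return "\n".join(header + tests) + "\n"


if __name__ == "__main__":
    sample = "# sample entries\nikev2-ecdsa-01\nikev2-hostpair-01\nikev2-mobike-01\n"
    print(shuffle_testlist(sample, seed=2018), end="")
```

Passing a seed means a run that provokes the failures can be repeated
in the same order on another machine.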

Several ways to mitigate it:

- if it is NSS, then mounting /etc/ipsec.d as tmpfs should confirm this

- writeback->unsafe
Since the test system rebuilds images as part of kvm-build,
'unsafe' (host may cache all disk I/O, and sync requests from the guest are
ignored) is probably 'safe' (the files to tweak are in
testing/libvirt/vm/*).
See https://libvirt.org/formatdomain.html#elementsDisks

- KVM_LOCALDIR=/tmp/pool (assuming /tmp is tmpfs)
The make variable KVM_LOCALDIR controls the location of the test
machine disk images.  You'll need to first create /tmp/pool, and /tmp
won't work because it is tacky.
Technical nit: build, nic, east, west, ... all have read-only
references back to 'clone' and that still lives in the default pool
directory.

I kind of prefer 'unsafe'.
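
For what it's worth, the writeback->unsafe change is small enough to
script.  This is only a sketch: set_cache_unsafe is a made-up helper,
and it treats the domain file as plain text rather than parsing it,
since the templates contain @@...@@ placeholders.  It only touches a
cache attribute that sits on a <driver .../> element.

```python
import re

# Matches cache='writeback' (single or double quotes) on a <driver ...> element.
_CACHE = re.compile(r"""(<driver\b[^>]*\bcache=)(['"])writeback\2""")


def set_cache_unsafe(xml_text):
    """Return xml_text with the disk driver cache mode changed to 'unsafe'."""
    return _CACHE.sub(r"\1\2unsafe\2", xml_text)


if __name__ == "__main__":
    disk = (
        "    <disk type='file' device='disk'>\n"
        "      <driver name='qemu' type='qcow2' cache='writeback'/>\n"
        "      <source file='@@POOLSPACE@@/@@NAME@@.qcow2'/>\n"
        "      <target dev='vda' bus='virtio'/>\n"
        "    </disk>\n"
    )
    print(set_cache_unsafe(disk), end="")
```

Anything other than the cache attribute, including the placeholders,
is passed through untouched.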

> This makes the HDD system quite annoying.  Some changes that Andrew
> made have reduced this effect considerably (Andrew: thanks!).  I have
> a script that reruns tests that fail this way.
>
> In general, I don't like the idea of "rerun a test until it passes":
> this would hide some real errors that are non-deterministic.  On the
> other hand, spurious error reports are a large waste of my time.
>
> As a check on how many tests failed due to exactly two retries, I
> found that on redtiny there were:
> 48 tests with unexpected first retries
> 45 tests with unexpected second retries
> 35 tests with unexpected third retries
> 33 tests with unexpected fourth retries
> 33 tests with unexpected fifth retries.
>
> Note that these numbers should be monotonically non-increasing since any
> test with an unexpected (n+1)th retry would also have an unexpected nth retry.
>
> These numbers are AFTER I reran any test that only failed due to one
> unexpected first retry.
>
> Many of these tests may have failed for other reasons.  The 33 look
> like real failures.  10 might be failures due to only two sequential
> retries.
>
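
The arithmetic behind the "10" can be spelled out.  A sketch, assuming
each count is "tests with at least an n-th unexpected retry", so that
the difference between neighbouring counts is the number of tests
whose retries stopped at exactly n (exact_counts is just an
illustrative name):

```python
def exact_counts(at_least):
    """Turn 'at least n retries' counts into 'exactly n retries' counts.

    at_least[i] is the number of tests with an unexpected (i+1)-th retry.
    The last entry has no successor, so the result is one element shorter.
    """
    pairs = list(zip(at_least, at_least[1:]))
    # Sanity check: a test with an (n+1)-th retry also had an n-th retry.
    if any(a < b for a, b in pairs):
        raise ValueError("counts must be monotonically non-increasing")
    return [a - b for a, b in pairs]


if __name__ == "__main__":
    redtiny = [48, 45, 35, 33, 33]  # the numbers quoted above
    print(exact_counts(redtiny))  # [3, 10, 2, 0]
```

So 3 tests stopped at one retry, 10 at two, 2 at three and none at
four, with the remaining 33 going to at least five.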
> I just reran all the tests that failed on redtiny with only unexpected
> first and second IKE retransmissions.  This picks up tests that failed
> in more than one place due to a first retransmission or any second
> retransmission.  Here are the 12.
>
>   fips-default-ikev2-01-nofips-east
>   ikev2-49-hub-spoke
>   ikev2-algo-07-aes_ctr
>   ikev2-algo-11-gcm-prop2
>   ikev2-algo-ike-sha2-04
>   ikev2-ecdsa-01
>   ikev2-invalid-ke-02-wrong-modp
>   ikev2-liveness-11-silent
>   ikev2-mobike-01
>   ikev2-nat-pluto-03
>   ikev2-x509-17-multicert-rightid-san-wildcard
>   ikev2-x509-20-multicert-rightid-san-wildcard
>
> After a single rerun, 5 of the 12 passed, 6 of them still had only one or
> two IKE retransmissions, and 1 failed for some other reason.  Here are the
> six with only one or two IKE retransmissions:
>
>   fips-default-ikev2-01-nofips-east
>   ikev2-algo-07-aes_ctr
>   ikev2-ecdsa-01
>   ikev2-liveness-11-silent
>   ikev2-x509-17-multicert-rightid-san-wildcard
>   ikev2-x509-20-multicert-rightid-san-wildcard
>
> Doing this again leaves four:
>   fips-default-ikev2-01-nofips-east
>   ikev2-ecdsa-01
>   ikev2-x509-17-multicert-rightid-san-wildcard
>   ikev2-x509-20-multicert-rightid-san-wildcard