[Swan-dev] test failures that are easy to fix

Tue Jan 3 16:10:47 UTC 2017

On 2 January 2017 at 21:37, Paul Wouters <paul at nohats.ca> wrote:
>>
>> testing/pluto/newoe-08-ike-replace-responder unresolved east:output-missing nic:output-missing road:output-missing
>> testing/pluto/newoe-11-failike unresolved east:output-missing nic:output-missing road:output-missing
>> testing/pluto/newoe-18-cop-private unresolved east:output-missing nic:output-missing road:output-missing
>>
>>         These used to pass.  Why was the reference output deleted?
>
>
> I don't know?
>
> Note that I'm focused now on the oe-cleanup-aa branch, which is causing
> some output changes too. But I'll go through master again tomorrow and
> sync it up.

The reference output is still there.  What is completely missing is
the output from all the test domains.

I'd strongly suspect that kvmrunner.py abandoned the test after
waiting 2 minutes for one of the domains to boot (since I think of a
30 second boot as slow, 120 seconds is very generous).

After about a week of up time, and re-using the same domains,
testing.libreswan.org starts to show these symptoms.  When it happens
I cast the following spell:

  # delete the problem domains
  make kvm-uninstall
  # I suspect not necesssary, but ...
  reboot
  # get things started again
  make kvm-install

If you happen to have a copy of the test logs, you might see:

  INFO gh1.virsh road 7:04:18.00/5:21.04: domain shutdown after 2.6 seconds
  INFO gh1.virsh west 7:04:18.01/5:21.05: starting domain
  ...
  INFO gh1.runner aggr-pluto-01 7:06:20.02/7:23.07: trying to cancel
job <Future at 0x7f724d1a90b8 state=running> on gh1.east
  ...
  ERROR gh1.runner aggr-pluto-01 7:06:20.06/7:24.00: timeout while
booting domains

following this is a stack trace that contains more information (just
filter out lines that don't make sense):

  pexpect.exceptions.TIMEOUT: Timeout exceeded.
  <pexpect.pty_spawn.spawn object at 0x7f724d1a9a20>
  command: /usr/bin/sudo
  args: ['/usr/bin/sudo', 'virsh', 'console', '--force', 'gh1.west']
  searcher: None
  buffer (last 100 chars): "alling v9fs 9p2000 file system
support\r\n[   69.475542] FS-Cache: Netfs '9p' registered for
caching\r\n"
  before (last 100 chars): "alling v9fs 9p2000 file system
support\r\n[   69.475542] FS-Cache: Netfs '9p' registered for
caching\r\n"
  after: <class 'pexpect.exceptions.TIMEOUT'>

(given this is a "shouldn't happen", I've never tried to make the
messages more user friendly).

Andrew

PS: Here are the gory details I extracted from OUTPUT/debug.log I've on hand:

- first I look for TIMEOUT:

  DEBUG gh1.runner aggr-pluto-01 7:06:20.02/7:23.07: job <Future at
0x7f724d1a9978 state=finished raised TIMEOUT> on gh1.west completed

  i.e., kvmrunner.py timed out while talking to west at the 7:23.07 (7
minutes and 23 second) mark (the 7:06:20.02 - 7 hours, 6 minutes - is
the total run time for all tests)

at this point I can even grep debug.log for the word west as nothing
else is relevant.

- almost immediately above the timeout should be the last output from
"west" vis:

  DEBUG gh1.shell west 7:06:01.09/7:05.03: read <<"9.475542] FS-Cache:
Netfs '9p'">>>
  DEBUG gh1.shell west 7:06:01.09/7:05.03: read <<' registered for
caching\r\n'>>>

  i.e., no output from west since the 7:05 (7 minutes and 5 seconds) mark

- and scrolling back a long way (or just look for 'west' from the top,
since the domain 'west' is what messed up here) I found:

  INFO gh1.virsh west 7:04:18.01/5:21.05: starting domain
  ...
  DEBUG gh1.shell west 7:04:18.01/5:21.05: prompt
'\[(?P<username>[-\.a-z0-9]+)@(?P<hostname>west)
(?P<basename>[-\.a-z0-9A-Z_~]+)(?P<status>| [0-9]+)\](?P<dollar>[#\$])
'

  i.e., it starts booting the domain and the first signs of life are
seen around the 5:21.05 (5 minutes and 21 seconds) mark.

So kvmrunner.py abandoned 'west' after waiting >2 minutes for it to boot.

Since all the console output from west is captured it is possible to
look for something obvious going wrong.  There wasn't.  It simply look
>2 minutes to boot.