[Swan-dev] my test run hung last night

Sat Sep 5 20:17:56 EEST 2015

My test run from last night is hung in ikev1-impair-gx-01/OUTPUT.  No 
progress for 12 hours.

east.pluto.log is 17618 lines long.  Most of it looks like this:
| expiring aged bare shunts
| event_schedule called for 20 seconds
| event_schedule_tv called for about 20 seconds and change
| inserting event EVENT_SHUNT_SCAN, timeout in 20.000000 seconds
| handling event EVENT_PENDING_DDNS
| event_schedule called for 60 seconds
| event_schedule_tv called for about 60 seconds and change
| inserting event EVENT_PENDING_DDNS, timeout in 60.000000 seconds
| elapsed time in connection_check_ddns for hostname lookup 0.000000
| handling event EVENT_SHUNT_SCAN

I don't know why.  Certainly west was a problem: the (west.init?)
script failed to cd into the test directory and stopped.

At Paul's suggestion, I rebooted east and west, hoping that would get
the test run unstuck.  It did not.

When I do a ps -laxgwf, I see:

0  1105  5868  4363  20   0 431820 19000 futex_ Sl+  pts/2      0:17                                      \_ swantest
4  1105 18674  5868  20   0  27796  7616 poll_s S+   pts/2      0:00                                          \_ /usr/sbin/tcpdump -w OUTPUT/swan12.pcap -i swan12 -s 0 -n not stp and not port 22
4     0 19041  5868  20   0      0     0 exit   Zs   ?          0:00                                          \_ [sudo] <defunct>
4     0 19044  5868  20   0      0     0 exit   Zs   ?          0:00                                          \_ [sudo] <defunct>

<defunct> should not happen.  It means that the parent process (here
swantest, pid 5868) isn't reaping its exited children.
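
For anyone not familiar with what <defunct> means, here is a minimal
sketch of how a zombie comes about and how a wait call clears it.  It is
Linux-only (it reads /proc), and the names proc_state and demo_zombie are
made up for illustration; this is not how swantest manages its children.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, nothing to clean up

    # Parent: the child stays in state Z until somebody waits for it.
    deadline = time.time() + 5
    state = proc_state(pid)
    while state != "Z" and time.time() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    before = state

    os.waitpid(pid, 0)  # reap the child; its process table entry goes away
    after = proc_state(pid)
    return before, after


if __name__ == "__main__":
    print(demo_zombie())
```

Until the waitpid call, ps would show the child as <defunct>, which is
what the two sudo entries above look like.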

I tried to kill 18674 (tcpdump) but it would not die.
kill -9 and sudo kill -9 didn't work.

BTW, one new thing is that before the run, I changed the ownership and
capabilities of tcpdump, as per the wiki page on testing.

What's going on?  What should I do at this point?  I assume that there
are interesting forensics possible if I don't just kill everything.