[Swan-dev] should final.sh shut down pluto?

Sun Jun 26 23:58:40 UTC 2016

On 26 June 2016 at 09:40, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
> | From: Paul Wouters <paul at nohats.ca>
>
> | > On Jun 26, 2016, at 03:49, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
> | >
> | > I'm quite unhappy about uncaptured core files.  At the moment, I'm
> | > not willing to say what a good solution would be.
> |
> | That's not quite was it is.
>
> I'm not sure what you mean by that sentence.

My "recollection" of the issue on IRC was wrong.  This e-mail has the
correct details.

> | Not all tests perform a shutdown and so some
> | errors in a shutdown that could cause a core would theoretically not be
> | detected". Any core files made are always detected.
>
> I don't know what you are claiming.
>
> When I ran all the tests Friday morning (using kvmrunner.py), several
> assertion failures showed up but no core files were preserved.  All
> failures were of the same assertion (which we know about).
>
> I'm a dumb user.  I don't know why this happened.
>
> But I'll make a sequence of guesses:
> - the assertion failed during shutdown of Pluto

The entire system was shutting down; pluto didn't quite make it and dumped core.

> - maybe shutdown was performed by final.sh (I don't know)

You shouldn't need to know.

However, to be specific, final.sh scripts sometimes do the following:
- shut down pluto
- check for core files and save them
but some skip one or both steps.  Regardless, it shouldn't matter.

> - the pluto log file captured the ASSERT failure

By luck we have this breadcrumb.  If pluto were to dump core for some
other reason, we'd have nothing.

> - the core file doesn't get captured in this case.

Right, per above.  Capturing core files is an optional part of
final.sh, and it can happen before pluto exits, so a core dumped
during shutdown is missed.

> | As antony said, it is VERY useful to run a single test and then ssh in
> | while the tunnel is still up.
>
> Yes.  So maybe we need to cleanly separate shutdown into a separate step
> and have the script-runner capable of stopping at any designated step.
> For an example of this kind of control, look at rpmbuild's -b parameter.

Yes, an explicit option like --stop-before <script>.
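
To make that concrete, here is a rough sketch of how such an option
might be parsed and applied.  It is only an illustration: the helper
names (build_parser, scripts_to_run) and the example script names are
hypothetical, and this is not how kvmrunner.py is currently written.

```python
import argparse


def build_parser():
    # Hypothetical command line; only --stop-before comes from this thread.
    parser = argparse.ArgumentParser(description="sketch of a test runner")
    parser.add_argument("--stop-before", metavar="SCRIPT", default=None,
                        help="stop before running SCRIPT, leaving the test up")
    parser.add_argument("directories", nargs="+",
                        help="test directories to run")
    return parser


def scripts_to_run(scripts, stop_before=None):
    """Return the scripts up to, but not including, stop_before."""
    if stop_before is None:
        return list(scripts)
    if stop_before not in scripts:
        raise ValueError("no such script: %s" % stop_before)
    return list(scripts[:scripts.index(stop_before)])
```

With "--stop-before final.sh" the runner would execute everything up
to final.sh and then leave the domains running, so you can ssh in
while the tunnel is still up.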

However, I view that as a "nice to have".

To me the "must have" is consistent default behaviour whether running
an individual test or a group of tests.  For instance:

./testing/utils/kvmrunner.py testing/pluto
./testing/utils/kvmrunner.py testing/pluto/basic-pluto-01
./testing/utils/kvmrunner.py testing/pluto/basic-pluto-*

should all run the tests the same way.
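
One way to get there is to reduce every form of argument to the same
flat list of test directories before anything runs.  A minimal
sketch, assuming (hypothetically) that a test directory can be
recognised by a marker file; the function name expand_tests and the
marker are my assumptions, not existing kvmrunner.py code:

```python
import os


def expand_tests(paths, marker="description.txt"):
    """Turn a mix of test directories and parent directories into a
    flat list of test directories.

    Assumption: a directory is a test if it contains the marker file.
    """
    tests = []
    for path in paths:
        path = os.path.normpath(path)
        if os.path.isfile(os.path.join(path, marker)):
            # The argument is itself a single test.
            tests.append(path)
        elif os.path.isdir(path):
            # The argument is a directory of tests; take its children.
            for name in sorted(os.listdir(path)):
                child = os.path.join(path, name)
                if os.path.isfile(os.path.join(child, marker)):
                    tests.append(child)
    return tests
```

Whatever then runs a test only ever sees one directory at a time, so
it has no way to behave differently for a single test than for a
group.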

> | We have a bunch of tests running shutdown
> | but don't for the majority of tests. I think that's fine.
>
> I don't.  The reason is that each test tests different paths through the
> system and each might cause different problems that linger undetected
> until shutdown.
>
> I come to that conclusion honestly: I don't have a core dump for any of
> these particular assertion failures.  But I might be mis-diagnosing.
>
> There is a chance that kvmrunner.py needs some added code to make me
> happy.  I know that it isn't an advertised component of our test system.
> Do you see core dumps for these assertion failures?

So far the best solution I've seen involves always shutting down pluto
_before_ shutting down the entire system (if systemd is causing pluto
to crash we have another problem).  Perhaps final.sh should be required
to run a new script "swan-destroy", or perhaps that should be run
outside of the *.sh scripts.
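
The ordering I have in mind looks roughly like the sketch below.  It
is hypothetical throughout: the function names, the "core*" file
pattern and the directories are assumptions for illustration, and
stop_pluto stands in for whatever actually shuts pluto down.

```python
import glob
import os
import shutil


def collect_cores(search_dir, output_dir):
    """Move any core* files from search_dir into output_dir.

    Returns the list of saved paths.  The "core*" pattern is an
    assumption about how core files are named.
    """
    saved = []
    for core in sorted(glob.glob(os.path.join(search_dir, "core*"))):
        if os.path.isfile(core):
            os.makedirs(output_dir, exist_ok=True)
            dest = os.path.join(output_dir, os.path.basename(core))
            saved.append(shutil.move(core, dest))
    return saved


def swan_destroy(stop_pluto, search_dir, output_dir):
    """Shut pluto down first, and only then look for core files."""
    stop_pluto()  # step 1: always shut down pluto
    return collect_cores(search_dir, output_dir)  # step 2: save any cores
```

The point is only the order: because the check runs after pluto has
exited, a core dumped during shutdown is still caught.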