[Swan-dev] Leaks when killing states during crypto; time to drop WIRE_*?

Wed Dec 6 23:12:38 UTC 2017

| From: Andrew Cagney <andrew.cagney at gmail.com>

| > Clearly the main thread needs the MD most of the time.  But probably
| > not during "suspension" of a state.  That's when the worker could have
| > ownership.  I'm guessing that the only worker-access needed is for
| > encryption/authentication of the packet itself.
| 
| Yes, the state transition code uses MD (as a parameter).  It doesn't
| need to be stuffing it in st_suspended_md or conversely having state
| stuffed into md.st.

Historically, md has a pointer to the current struct state.  I know
you've changed this, but I'm not sure why that's a win.

The idea was that md would carry all that was needed by a State
Transition Function.

It is true that the rest of the msg_digest was the sliced and diced
packet and that the state object is different.

The state pointer in md was kind of cheap: it did not "own" the state,
so that freeing the md did not require freeing the state.

On the other hand, if the state were freed (killed), the md needed to
be deleted because nothing else "owned" the md.

If a state transition were suspended, the md (if any) needs to be
preserved and remembered: for the restarted the state transition and
for not leaking (kind of related). The md is preserved by nulling out a
pointer to it in the framework so that the framework won't free it at
the end of the (incomplete) state transition.  It is remembered by
being placed in st_suspended_md.

So it a state object is killed, it is easy to find any md attached to
it by suspension.  The maintenance of the md's pointer to the state
falls easily out of that.

| My current hunch is to get pluto's responsiveness up, we'll be trying
| to off-load more and more work.  That would mean that pretty much
| every state transition will consist of:
|        state->in-progress->new-state
| while strictly speaking this should be implemented as separate states,
| adding all the extra states will get so tedious and redundant that
| we'll end up accepting the status quo.

I don't know what you are proposing.  It sounds like a path of most
resistance.  Most of the work that is not already off-loaded is really
modest, as far as I know.

One very important thing is to get Pluto to behave with grace under
load.  I don't know if it does.  That's probably more important than
handling a larger load.

I'm not aware of (haven't been paying attention to?) any reasonable
performance analysis of Pluto.  We do have hunches.  Hunches are
useful mostly for directing measurements.  There is no point in
redesigns for performance without measurements.

Now onto my hunches.

The event-loop is fundamentally single threaded.  The code depends on
variables not changing by outside forces while it is running.  It
is as if a lock were excluding all other actors while an event is
being processed.  That seriously simplifies the code.

If the event loop processing is the bottleneck, and it needs to be
threaded, most of the code needs to be rewritten.

There are current exceptions.  Threads or processes are used for PAM,
for crypto, and for parts of DNS lookup.  I don't know whether
certificate validation is threaded, but it should be.  What are other
operations that take a bunch of time?  Something that takes CPU time
could be a little different from something that takes real time.
Multiple CPUs are needed to improve CPU time consumers whereas
multitasking can deal with real time consumers.

Some parts of Pluto were designed for light duty.  No need to
overengineer.  But now, many years later, some of those mechanisms may
be bottlenecks.

You sped up serial number to state mapping.  No
longer a linear search.  That mechanism wasn't intended to be used
much.  But: are the uses that caused performance issues really
reasonable?  I don't know.

Scanning all existing states is linear.  There seem to be a fair
number of casess where that is used.  Can that be sped up by 
datastructuctures that let us only look at the relevant states?

Performance measurement showed that sanitizing the log messages took a
non-trivial amount of time.  You ditched the sanitization.  It
probably wasn't necessary, but it was part of defense in depth.
Alternatively, we could have found out why it was so slow.  My guess:
something to do with locales, something that should have been easy to
cut out of the processing.  If sanitizing the log messages took
noticeable time, we were doing too much logging -- something expensive
in itself.

| However, one thing that I do think should go is that extra code path
| handling STF_INLINE.  When there are no threads, use the event loop to
| offload the work.

I don't remember that clearly, but I think that's exactly what
STF_INLINE is actually doing.

My hunch is that marshalling and demarshalling data eats a surprising
amount of processing.  That needs to be backed up by measurement.

I've got to go, so this message is half-baked.