[Swan-dev] PAM thread cleanup (vs -fexceptions)

Mon Sep 18 20:40:25 UTC 2017

Hi,

We seem to have hit a problem with combining -fexceptions and
pthread_cleanup_push() - if you add an SOLIB to the mix things don't work.

So what is going on?

Libreswan uses a per-request thread to handle PAM authentication and,
occasionally, that needs to be canceled.  It should work roughly as follows:

main                 worker
allocates XAUTH struct
creates worker
                         sets up cleanup
                         starts PAM operation
                         access XAUTH struct
[sends cancel]
                         access XAUTH struct ...
                         [receives cancel]
                         runs cleanup which scedules "finish" event on
main-thread
                         [thread no longer accesses resources]
runs "finish" and frees MEM; if there is still a state, notify it of "done"
                         [thread eventually dies]

Several things to note:

- XAUTH struct is only ever allocated and destroyed on the main thread
- since XAUTH is only destroyed after the worker is done with it, a write
after free can't happen
- the old code (and a recent patch) assumed that pthread_cancel() was
synchronous and immediately freed XAUTH - this resulted in a
write-after-free
- similarly, the old code's worker thread liked to access the state object
directly, but that could have been deleted - resulting in another
write-after-free
- since cancelling the thread is unbounded, pthread_join() isn't used as
that would block the main thread (this shouldn't matter but I don't trust
pthreads)

(TODO is a second cleanup inside the pam wrappers proper to ensure that
resources allocated by pam also get freed; I've been sitting on this until
I was confident this part was robust)

there's just one "feature" - if the worker thread dies mysteriously and/or
the cleanup isn't run, then there's a resource leak.  Not so bad, given it
SHOULD NEVER HAPPEN ...

Unfortunately, it turns out it can.  There are at least two ways to
implement pthread_cleanup_push():

- using setjmp/longjmp, slow, but well understood
- using exceptions and an unwinder, faster and more hairy; enabled by
-fexceptions

libreswan has, unknowingly I suspect, being using the latter; and it turns
out that the latter isn't working for us - the cleanup isn't called and the
resource leaks.  For instance, for mypam.so, after the cancel is received,
the worker stack looks like:

#0  __GI___pthread_unwind (buf=0x7f5ac8a89da0) at unwind.c:112
#1  0x00007f5ad0e2e9f2 in __do_cancel () at ./pthreadP.h:297
#2  sigcancel_handler (sig=<optimized out>, si=0x7f5ac8a89870,
ctx=<optimized out>) at nptl-init.c:216
#3  <signal handler called>
#4  0x00007f5acf73743d in nanosleep () at
../sysdeps/unix/syscall-template.S:84
#5  0x00007f5acf73737a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#6  0x00007f5ac80884ea in pam_sm_authenticate () from
/usr/lib64/security/mypam.so
#7  0x00007f5ad1caae71 in _pam_dispatch_aux (use_cached_chain=<optimized
out>, resumed=<optimized out>, h=0x7f5ac00023e0, flags=32768,
pamh=0x7f5ac00023e0) at pam_dispatch.c:110
#8  _pam_dispatch (pamh=pamh at entry=0x7f5ac00008c0, flags=32768,
choice=choice at entry=1) at pam_dispatch.c:426
#9  0x00007f5ad1caa73b in pam_authenticate (pamh=0x7f5ac00008c0,
flags=flags at entry=32768) at pam_auth.c:34
#10 0x000055bc5db249cc in do_pam_authentication (arg=0x55bc5f3ab7b8) at
/source/programs/pluto/pam_conv.c:161
#11 0x000055bc5db24fa5 in xauth_thread (arg=0x55bc5f3ab238) at
/source/programs/pluto/xauth.c:141
#12 0x00007f5ad0e3036d in start_thread (arg=0x7f5ac8a8a700) at
pthread_create.c:456
#13 0x00007f5acf773bbf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:97

yet glibc manages to unwind no further than __sleep() (it finds that
unwinding it ends up back in __sleep()) so aborts and never runs our
cleanup.  Now the reason for the failure could be glibc's unwinder (GDB
gets it right), but it could be that we built mypam.so wrong.  Unlike
normal exceptions which, I believe, use GCC's unwinder, this uses GLIBC's.

Switching away from the exception handling code and using setjmp/longjmp
while a little slower is much more robust and avoids the problem.

If anyone wants to play with this, I've created:
https://github.com/cagney/pthread-cancel and will probably add this to
https://sourceware.org/bugzilla/show_bug.cgi?id=14266

Andrew

PS: even after this we've more problems vis:
https://sourceware.org/bugzilla/show_bug.cgi?id=12683
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.libreswan.org/pipermail/swan-dev/attachments/20170918/cfd2573a/attachment-0001.html>