<div dir="ltr"><br><br>On Wed, 3 Mar 2021 at 09:43, Andrew Cagney <<a href="mailto:andrew.cagney@gmail.com">andrew.cagney@gmail.com</a>> wrote:<br>><br>> Another update:<br>><br>> We're down to:<br>><br>> <a href="https://testing.libreswan.org/v4.3-86-g701b488bb9-main/ikev2-59-multiple-acquires-alias/OUTPUT/east.pluto.log.gz">https://testing.libreswan.org/v4.3-86-g701b488bb9-main/ikev2-59-multiple-acquires-alias/OUTPUT/east.pluto.log.gz</a><br>> Mar  2 15:56:47.592566: leak: heap logger, item size: 64<br>> Mar  2 15:56:47.592656: leak: heap logger prefix, item size: 26<br>> Mar  2 15:56:47.592737: leak: DH shared secret, item size: 104<br>> Mar  2 15:56:47.592807: leak: DH crypto, item size: 384<br>> Mar  2 15:56:47.592876: leak: dh, item size: 40<br>> Mar  2 15:56:47.592937: leak: calc_dh_local_secret, item size: 32<br>> Mar  2 15:56:47.593002: leak detective found 6 leaks, total size 650<br>><br>> but it only happens occasionally so presumably a race between revival,<br>> crypto helpers, and the main thread during shutdown<br>><br><br>Looks like the leak is in an IKEv1 background task that gets scheduled during shutdown.<div>(This matters because IKEv2 will need working background jobs for fragmentation.)<br><br>Having the post-mortem run refcnt.awk helped a lot:<br><a href="https://testing.libreswan.org/v4.4-178-gcd08e1a67e-main/ikev1-x509-18-id-notany/OUTPUT/east.console.diff">https://testing.libreswan.org/v4.4-178-gcd08e1a67e-main/ikev1-x509-18-id-notany/OUTPUT/east.console.diff</a><br>++ awk -f /testing/utils/refcnt.awk /tmp/pluto.log<br>+ERROR: : '0x7fb8c2480fb8' has count 1<br>+2780: | newref clone logger@0x7fb8c2480fb8(0->1) (in submit_task() at server_pool.c:354)<br><br>It let me track down the following sequence:<br><br>- shutting down (at this point, think of this more as a suggestion than an actual action)<div><br></div><div>- an IKEv1 state schedules a job (compute final DH) and puts it into the background<br><div>| job 4 helper 0 #2 DH shared secret (dh): added to pending queue<br>| 
#2 is idle; has background offloaded task<br><br>- shutdown stops the helper threads (task still on queue)<br>| one helper thread exited, 0 remaining<br>| scheduling callback all helper threads stopped (#0)<br><br>- states deleted<br>| FOR_EACH_STATE_... in foreach_state_by_connection_func_delete<br>"ikev2-westnet-eastnet-x509-cr" #2: deleting state (STATE_MAIN_R2) aged 0.117802s and NOT sending notification</div></div><div><br></div><div>Normally, deleting the state orphans the job and leaves it to the helper threads to clean up. Here, though, the helper threads are no longer running, so the orphaned job is never freed.</div><div><br></div><div>Several options come to mind:</div><div>- once the states are gone, free any still-outstanding jobs</div><div>- bravely trust locked refcounts and use them to detect when a job can be deleted ...</div><div><br></div><div><br></div></div></div>
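<div>A minimal sketch of the first option, freeing whatever is still on the pending queue once the helper threads have stopped and the states are gone. Every name here (struct job, struct job_queue, queue_job(), free_outstanding_jobs()) is hypothetical and only models the idea; none of it is pluto's actual server_pool.c code.</div>

```c
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-in for a queued background job; not pluto's real struct. */
struct job {
	unsigned serialno;	/* e.g. the "job 4" seen in the log */
	void *task;		/* heap data owned by the job (e.g. DH inputs) */
	struct job *next;
};

/* Hypothetical pending queue shared by the main thread and the helpers. */
struct job_queue {
	struct job *head;
	struct job *tail;
	unsigned helpers_running;
};

/* Append a job to the pending queue (what submitting a task would do). */
static void queue_job(struct job_queue *q, unsigned serialno, size_t task_size)
{
	struct job *j = calloc(1, sizeof(*j));
	if (j == NULL) {
		abort();
	}
	j->serialno = serialno;
	j->task = malloc(task_size);
	if (j->task == NULL) {
		abort();
	}
	if (q->tail == NULL) {
		q->head = j;
	} else {
		q->tail->next = j;
	}
	q->tail = j;
}

/*
 * Free every job still on the queue and return how many were freed.
 * Only safe once no helper thread can touch the queue, hence the assert.
 */
static unsigned free_outstanding_jobs(struct job_queue *q)
{
	assert(q->helpers_running == 0);
	unsigned freed = 0;
	while (q->head != NULL) {
		struct job *j = q->head;
		q->head = j->next;
		free(j->task);
		free(j);
		freed++;
	}
	q->tail = NULL;
	return freed;
}
```

<div>In this model the drain would run as the last shutdown step, after the "all helper threads stopped" callback and after the states have been deleted, so nothing else can still be holding a pointer to a queued job.</div>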