[Swan] network blip causes the VPN to be in broken state

Sun Jul 15 00:21:45 UTC 2018

On Fri, 13 Jul 2018, Xinwei Hong wrote:

> I have attached the log we were able to collect. There were two network blips; both were detected by DPD. After the first blip, at about Jul 11 22:03:21, the renegotiation took
> effect and everything looked fine. During the second blip (Jul 12 06:16:54), things went into a bad state. A main mode renegotiation was started at Jul 12 05:47:48, before
> the DPD error at Jul 12 06:16:54
>

> and also many
> 
> Jul 12 06:21:56 vvr-10-9-255-36 pluto[31586]: vpn-711360: initiate_ondemand_body() failed to install negotiation_shunt,
> Jul 12 06:21:56 vvr-10-9-255-36 pluto[31586]: vpn-711360: initiate on demand from 10.1.153.84:22 to 10.1.1.155:54187 proto=6 because: acquire
> Jul 12 06:21:58 vvr-10-9-255-36 pluto[31586]: vpn-711360: assign_holdpass() delete_bare_shunt() failed

This indicates a problem with a packet-triggered tunnel policy.
I guess there is confusion about an ongoing tunnel and a new
packet triggering that tunnel. It seems to have caused you to
be in a state where we partially think the tunnel should be up
and partially think it should be down.

> After some time, main mode passed. The VPN then tried to do a quick mode renegotiation every 50 minutes or so, as normal. However, the VPN was not working during this time.
> When we noticed the issue and checked “ip xfrm policy”, we only saw
> 
> :/proc/net# ip xfrm policy
> 
> src 0.0.0.0/0 dst 0.0.0.0/0
>         dir out priority 3136
>         mark 0x5/0xffffffff
>         tmpl src 199.204.218.76 dst 66.193.98.67
>                 proto esp reqid 16389 mode tunnel
> The in and fwd policies are missing.

Hmm, clearly that should never happen.
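
For monitoring, a rough sketch of how this condition could be detected automatically: the script below scans the text printed by "ip xfrm policy" and reports which of the out, in and fwd directions have no policy at all. It is not part of libreswan; the function names are hypothetical, and it assumes each policy prints a "dir <direction>" line as in the output quoted above. It does not match policies to a particular connection, so on a host with several tunnels it only catches the case where a direction is absent entirely.

```python
import re

# Directions a working tunnel-mode connection is expected to have policies for.
EXPECTED_DIRECTIONS = ("out", "in", "fwd")


def policy_directions(xfrm_output):
    """Return the set of directions that appear on 'dir ...' lines."""
    return set(
        re.findall(r"^\s*dir\s+(in|out|fwd)\b", xfrm_output, flags=re.MULTILINE)
    )


def missing_directions(xfrm_output, expected=EXPECTED_DIRECTIONS):
    """Return the expected directions that have no policy in the output."""
    seen = policy_directions(xfrm_output)
    return [d for d in expected if d not in seen]


# The policy dump quoted in the report above.
SAMPLE = """src 0.0.0.0/0 dst 0.0.0.0/0
        dir out priority 3136
        mark 0x5/0xffffffff
        tmpl src 199.204.218.76 dst 66.193.98.67
                proto esp reqid 16389 mode tunnel
"""

if __name__ == "__main__":
    print(missing_directions(SAMPLE))  # prints ['in', 'fwd']
```

In practice the input would be the captured output of "ip xfrm policy" rather than the SAMPLE string.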

> Can you help check what’s wrong here? What can we do to avoid this in future?

It would be helpful to get the complete logs covering everything from the
good state to the bad state.
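
One possible way to cut such a window out of a larger syslog file is sketched below. It assumes the traditional "Mon DD HH:MM:SS" timestamp at the start of each line, as in the excerpts above; since that format carries no year, the year has to be supplied. The function names are hypothetical, and month names are parsed using the current locale, so an English or C locale is assumed.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def log_window(lines, start, end, year):
    """Keep the lines whose timestamp falls within [start, end]."""
    kept = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None and start <= stamp <= end:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Two of the log lines quoted earlier in this thread.
    lines = [
        "Jul 12 06:21:56 vvr-10-9-255-36 pluto[31586]: vpn-711360: initiate on demand from 10.1.153.84:22 to 10.1.1.155:54187 proto=6 because: acquire",
        "Jul 12 06:21:58 vvr-10-9-255-36 pluto[31586]: vpn-711360: assign_holdpass() delete_bare_shunt() failed",
    ]
    # From shortly before the first blip to after the failures.
    start = datetime(2018, 7, 11, 22, 0, 0)
    end = datetime(2018, 7, 12, 6, 30, 0)
    for line in log_window(lines, start, end, year=2018):
        print(line)
```

Lines without a parsable timestamp (for example continuation lines) are dropped by this sketch; keeping them attached to the preceding line would need extra handling.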

Paul