[Swan-dev] ERROR: netlink response for Add SA ... included errno 3: No such process

Paul Wouters paul at nohats.ca
Sat Apr 11 20:07:24 EEST 2015


On Sat, 11 Apr 2015, Herbert Xu wrote:

>>> This is wrong because some updates do not
>>> contain keying material.
>>
>> I don't understand this. Can you explain what the problem is for those
>> SA's ?
>
> Updates are used in two places in pluto.  They're used for inbound
> SAs as part of the get_spi + update procedure, and they are used
> for NAT-T updates.  In the latter case there is no keying material
> so you must not replace the update with an add.
>
> The kernel will never delete any live SAs installed by pluto since
> pluto does not set hard life times on them.  So the NAT-T update
> should never fail anyway unless some third party is deleting SAs.
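
To make the rule above concrete, here is a minimal sketch in Python (not pluto's C). The enum, the function name and the "outbound SAs use add" branch are illustrative assumptions, not pluto code; only the two update cases come from the explanation quoted above.

```python
from enum import Enum


class SaChange(Enum):
    # Hypothetical labels for the situations discussed in this thread.
    INBOUND_AFTER_GET_SPI = "inbound SA whose SPI was reserved with get_spi"
    OUTBOUND_NEW = "new outbound SA (assumed to be installed with add)"
    NATT_ENDPOINT_UPDATE = "NAT-T update of an existing SA"


def choose_sa_message(change, has_keying_material):
    """Return the name of the xfrm netlink message to send for this change."""
    if change is SaChange.NATT_ENDPOINT_UPDATE:
        # A NAT-T update carries no keys, so it can never be turned into an add.
        return "XFRM_MSG_UPDSA"
    if not has_keying_material:
        raise ValueError("installing an SA requires keying material")
    if change is SaChange.INBOUND_AFTER_GET_SPI:
        # The kernel already holds a temporary SA for this SPI: update it.
        return "XFRM_MSG_UPDSA"
    return "XFRM_MSG_NEWSA"
```

The point of the sketch is that falling back from update to add is only even possible when keys are present, which is exactly the case where it hides an expired get_spi reservation.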

So the patch that switched it between add and update has quite a history
behind it, including a reverted revert.

Part of the problem is https://bugs.libreswan.org/show_bug.cgi?id=75

Some history can be seen in the git commits:

https://github.com/libreswan/libreswan/commit/b5fa5eb1033ee3b73f7121a8ba3e593be21f8226
https://github.com/libreswan/libreswan/commit/f81203faff29490157c6ef1cbc75d476a902bb63
https://github.com/libreswan/libreswan/commit/15d27b8ad4a2f0d1fb252e608cfeafe6b7121773
https://github.com/libreswan/libreswan/commit/39b7891e50fae053e8acebdc1f55af6408f8fdad

So first, the b5fa5eb commit changed from add to update, to fix:

errors on roadwarriors switching between internal IP's and reconnecting,
where NETKEY says a policy already exists (possibly because we do not
properly delete the policy when we delete the phase1, and the XP clients
delete their phase1 after 1 minute of idle time)

I reverted that, but sadly I didn't log why.

It was then reverted again by me with a comment:

 	* NEW will fail when an existing policy, UPD always works.
 	* This seems to happen in cases with NAT'ed XP clients, or
 	* quick recycling/resurfacing of roadwarriors on the same IP.
 	* req.n.nlmsg_type = XFRM_MSG_NEWPOLICY;

So that does relate to your NAT update comment.

But note that Tuomo also ran into a problem with connecting tunnels as
explained in bug 75:

 	On configuration where a talks with c via b eg. a == b == c where tunnels are
 	defined as a-c  on both a = b and b = c we are missing tunnels.

 	This is bug introduced by commit:
 	15d27b8ad4a2f0d1fb252e608cfeafe6b7121773


 	With that patch applied I get this error when starting ipsec:

 	#31: ERROR: netlink XFRM_MSG_NEWPOLICY response for flow
 	tun.10000 at 87.108.52.177 included errno 17: File exists

 	With patch reverted there are no errors and tunnels work as they should.
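
For reference, the two errno values seen in this thread (17 in the bug report above, 3 in the subject line) can be decoded with a few lines of Python. This is only a lookup helper with a hypothetical name; it says nothing about why the kernel returned them.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for a numeric errno from a netlink reply."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


for code in (3, 17):
    print(code, describe_errno(code))
```

On Linux, 3 is ESRCH and 17 is EEXIST.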

> When you do a get_spi the kernel generates a temporary SA to keep
> hold of the SPI so that nobody else gets it.  But this SA only
> lives until xfrm_acq_expires.

Oh, I did not realise that! That's good to know.
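
As a quick way to see what that lifetime is on a given box, here is a small Python sketch. It assumes the value is exposed as the net.core.xfrm_acq_expires sysctl under /proc; the helper names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the sysctl on a Linux host.
ACQ_EXPIRES_PATH = "/proc/sys/net/core/xfrm_acq_expires"


def parse_acq_expires(text):
    """Parse the sysctl file content into a number of seconds."""
    return int(text.strip())


def read_acq_expires(path=ACQ_EXPIRES_PATH):
    """Return how long the temporary SA created by get_spi lives, in seconds."""
    return parse_acq_expires(Path(path).read_text())


def window_is_safe(acq_expires, max_get_spi_to_update):
    """True if the update is always sent before the temporary SA expires."""
    return max_get_spi_to_update < acq_expires
```

For example, `window_is_safe(read_acq_expires(), 10)` checks whether a daemon that never waits more than 10 seconds between get_spi and update fits inside the kernel's window.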

> Therefore redoing the add after update might work but is simply
> wrong.  You might as well just pluck some random number out of
> thin air and use that as your SPI.

I understand now. I guess we need to look into the two tunnel problem
listed above and how to deal with the Win XP / NAT issue, and figure
out what is going wrong there.

>> Yes, current git has switched to libevent and subsecond retransmits
>> and timeouts, so we will fall within that 30 second time window as
>> well.
>
> OK if you can guarantee that you will not call update 30 seconds
> after the get_spi, then you should be fine.  In that case you can
> also revert the patch that retries the add after update because
> it is just papering over the xfrm_acq_expires problem and is no
> longer needed.

Right. I'll do that.

>>> For libreswan, I suggest that you increase this parameter to
>>> a more appropriate value.  I haven't done the calculations but
>>> strongswan sets it to 165 which seems to be appropriate.
>>
>> Almost 3 minutes? That seems very long.
>
> Well it just has to be longer than the maximum interval between
> pluto doing get_spi and calling update_sa.

Maybe pluto should explicitly track this timer and just fail when it
notices the time has expired.
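
A rough sketch of what such tracking could look like, in Python rather than pluto's C. The class, the function and the `send_update` callback are all hypothetical; the only thing taken from this thread is the idea of remembering when get_spi happened and refusing to update once xfrm_acq_expires seconds have passed.

```python
import time


class SpiReservation:
    """Remembers when an SPI was reserved with get_spi (hypothetical helper)."""

    def __init__(self, spi, acq_expires, clock=time.monotonic):
        self.spi = spi
        self.acq_expires = acq_expires  # seconds, e.g. the sysctl value
        self._clock = clock
        self.reserved_at = clock()

    def expired(self):
        # The kernel drops the temporary SA once acq_expires has elapsed.
        return self._clock() - self.reserved_at >= self.acq_expires


def update_sa(reservation, send_update):
    """Send the update only while the reservation is still live."""
    if reservation.expired():
        # Fail loudly instead of retrying with an add.
        raise TimeoutError(f"SPI {reservation.spi:#x} reservation expired")
    return send_update(reservation.spi)
```

Failing here means the negotiation has to start over with a fresh get_spi, rather than installing an SA on an SPI the kernel no longer holds for us.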

Paul

