[Swan-dev] ERROR: netlink response for Add SA ... included errno 3: No such process

Sat Apr 11 08:13:09 EEST 2015

On Sat, Apr 11, 2015 at 12:52:52AM -0400, Paul Wouters wrote:
>
> > This is wrong because some updates do not
> >contain keying material.
> 
> I don't understand this. Can you explain what the problem is for those
> SA's ?

Updates are used in two places in pluto.  They're used for inbound
SAs as part of the get_spi + update procedure, and they are used
for NAT-T updates.  In the latter case there is no keying material
so you must not replace the update with an add.

The kernel will never delete any live SAs installed by pluto since
pluto does not set hard lifetimes on them.  So the NAT-T update
should never fail anyway unless some third party is deleting SAs.

> > Moreover, the add too can fail if the
> >SPI has already been reallocated to another SA.
> 
> By whom? We assume we are the only IKE daemon running and the only
> entity requesting SPI's from the kernel. Anything else is madness.

When you do a get_spi the kernel generates a temporary SA to keep
hold of the SPI so that nobody else gets it.  But this SA only
lives for xfrm_acq_expires seconds.  So when it expires your SPI can be
allocated to someone else.  Since pluto can easily initiate two
connections at the same time and therefore call get_spi again,
there is no way you can guarantee that the expired SPI won't be
reused by another connection.

Recall that the SPI is used along with the dst/proto as the key
to the SA.  For inbound SAs the dst/proto never change so the
SPI must be unique.  This is why get_spi exists in the first
place as only the kernel can guarantee uniqueness.  You do not
want to find out after you have finished your negotiation that
the SPI you used during your negotiation cannot be used as it's
a duplicate.

Therefore redoing the add after update might work but is simply
wrong.  You might as well just pluck some random number out of
thin air and use that as your SPI.
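
To make the race concrete, here is a rough toy model of it in Python.
It is only a sketch and not kernel or pluto code: ToySaTable, its
methods, the connection names and the address are all made up for
illustration, time is a fake counter, and the allocator hands out the
lowest free SPI (an assumption chosen so the reuse is reproducible).
The only things taken from the discussion are the (dst, proto, spi)
key, the 30 second placeholder lifetime and the "No such process"
(ESRCH) failure of a late update.

```python
import errno

ACQ_EXPIRES = 30  # seconds; plays the role of xfrm_acq_expires


class ToySaTable:
    """Toy model of an SA table keyed by (dst, proto, spi). Not kernel code."""

    def __init__(self, first_spi=0x1000):
        self.now = 0                # fake clock, in seconds
        self.first_spi = first_spi  # hypothetical start of the SPI range
        self.entries = {}           # (dst, proto, spi) -> entry dict

    def advance(self, seconds):
        """Move the fake clock forward and drop expired placeholder entries."""
        self.now += seconds
        for key in [k for k, e in self.entries.items()
                    if e["larval"] and e["expires"] <= self.now]:
            del self.entries[key]

    def get_spi(self, dst, proto, owner):
        """Reserve the lowest free SPI for (dst, proto) with a placeholder."""
        spi = self.first_spi
        while (dst, proto, spi) in self.entries:
            spi += 1
        self.entries[(dst, proto, spi)] = {
            "owner": owner,
            "larval": True,
            "expires": self.now + ACQ_EXPIRES,
        }
        return spi

    def update(self, dst, proto, spi):
        """Turn a placeholder into a live SA, or fail with -ESRCH."""
        entry = self.entries.get((dst, proto, spi))
        if entry is None:
            return -errno.ESRCH
        entry["larval"] = False
        return 0


def demo():
    table = ToySaTable()
    dst, proto = "192.0.2.1", "esp"

    # Negotiation finishes inside the window: the update succeeds.
    spi_fast = table.get_spi(dst, proto, "conn-fast")
    table.advance(10)
    fast_result = table.update(dst, proto, spi_fast)

    # Negotiation takes too long: the placeholder has already expired.
    spi_slow = table.get_spi(dst, proto, "conn-slow")
    table.advance(ACQ_EXPIRES + 1)
    slow_result = table.update(dst, proto, spi_slow)

    # A second connection is now handed the very same SPI.
    spi_other = table.get_spi(dst, proto, "conn-other")
    return fast_result, slow_result, spi_slow == spi_other
```

In this model the late update fails with -ESRCH and the next get_spi
returns the SPI that the slow connection negotiated with, which is
why adding the SA under that SPI afterwards cannot be relied on.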

> Yes, current git has switched to libevent and subsecond retransmits
> and timeouts, so we will fall within that 30 second time window as
> well.

OK, if you can guarantee that you will never call update more than
30 seconds after the get_spi, then you should be fine.  In that case
you can also revert the patch that retries the add after update,
because it is just papering over the xfrm_acq_expires problem and is
no longer needed.

> >For libreswan, I suggest that you increase this parameter to
> >a more appropriate value.  I haven't done the calculations but
> >strongswan sets it to 165 which seems to be appropriate.
> 
> Almost 3 minutes? That seems very long.

Well it just has to be longer than the maximum interval between
pluto doing get_spi and calling update_sa.
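
As a sketch of that check in Python: as far as I know the value is
exposed as the sysctl net.core.xfrm_acq_expires, i.e. the file
/proc/sys/net/core/xfrm_acq_expires, but treat that path as an
assumption and verify it on your kernel.  The helper names and the
worst-case interval you pass in are hypothetical; the interval has
to come from pluto's own retransmit and timeout settings.

```python
from pathlib import Path

# Assumed location of the sysctl; verify on the target system.
ACQ_EXPIRES_PATH = Path("/proc/sys/net/core/xfrm_acq_expires")


def read_acq_expires(path=ACQ_EXPIRES_PATH):
    """Return the configured placeholder lifetime in seconds."""
    return int(Path(path).read_text().strip())


def window_is_safe(acq_expires, worst_case_interval):
    """True if the placeholder outlives the longest get_spi -> update gap."""
    return acq_expires > worst_case_interval
```

For example, window_is_safe(read_acq_expires(), 60) asks whether a
negotiation that may take up to 60 seconds still finds its
placeholder SA in place.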

Cheers,
-- 
Email: Herbert Xu <herbert at gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt