[Swan-dev] hostpairs and instances, was Re: ikev2-10-2behind-nat and 80d68d57a57

Fri Sep 20 12:03:36 UTC 2019

On Thu, 19 Sep 2019 at 22:06, Paul Wouters <paul at nohats.ca> wrote:
>
> On Thu, 19 Sep 2019, Andrew Cagney wrote:
>
> > seems to be a little too aggressive.  For instance in
> > ikev2-10-2behind-nat on east I see:
>
> I disagree.
>
> > - north establishes #1,#2 creating instance "pool-eastnet-ikev2"[1]
> > with cert E=user-north at testing.libreswan.org
> > - road starts establishing #3,#4 and grabs connection instance
> > "pool-eastnet-ikev2"[1]
> > - during authentication "pool-eastnet-ikev2"[1] is found unsuitable -
> > E=user-road at testing.libreswan.org !=
> > E=user-north at testing.libreswan.org I'm guessing
> >
> > the old code would then stumble on, eventually going back to the
> > original connection and re-refining it (the patch stops this):
>
> The core of the problem is that we use refine_host_connection() in
> IKEv2, even though that function is written for IKEv1. In IKEv1,
> you only have authentication material, so you cannot refine based
> on traffic selectors until after authentication is done and a new
> packet for quick mode appears.
>
> So the IKEv2 code, instead of talking IKE ID's plus Traffic Selectors
> into account for refining the connection, needs to continue whether or
> not there is an AUTH issue. This could be many kind of failures (wrong
> CA, wrong IKE ID, IKE ID not on certificate). Currently, I think if the
> certificate does not validate, it will fail without looking for other
> connections (with possibly other CA's that would make it succeed). If
> the certificate validates, it checks the IKE ID with the certificate IDs
> (SAN) but has to continue when failing. Then it hopefully switches once
> it hits the traffic selector code, and finds a better conn from a traffic
> selector point of view. But then when it switches to that conn, it has to
> retroactively check if the switched conn allows the same authentication
> properties. Eg the CERT could now have the right IKE ID on the SAN,
> but it could also be that this connection now has the wrong CA and the
> previous certificate validation is not valid for this conn.
>
> > "pool-eastnet-ikev2"[1] 192.1.2.254 #3: switched from
> > "pool-eastnet-ikev2"[1] 192.1.2.254 to "pool-eastnet-ikev2"
> > | rw_instantiate() instantiated "pool-eastnet-ikev2"[2] 192.1.2.254
> > for 192.1.2.254
> > "pool-eastnet-ikev2"[2] 192.1.2.254 #3: Authenticated using RSA
> >
> > (#include rant about how connection prefixes changing randomly and
> > meaninglessly)
>
> It does not. If the name changes, it means you switched connections, and
> that there be dragons.

This is exposing internal brokenness in the log file - is logging
against "pool-eastnet-ikev2"[1] which is bogus doomed and confusing.

Either not having the connection prefix, or having the connection root
would be more meaningful.

> > Presumably tweaking the connection would fix this?
> >
> > But here's my question.  Why was road assigned the instance
> > "pool-eastnet-ikev2"[1] to start with?  Would it be better to:
> > - in the responder, on first contact, assign a non-template connection
> > to the state
> > - during auth, when everything is known, switch to or refine the
> > connection instance as needed
>
> Yes, that is one of the proposals I've suggested in the past, but it has
> a problem too. If you don't have the real connection, you don't know
> what DH / KE is allowed, so you have to accept everything, and you
> cannot ever return INVALID_KE, because you accepted whatever they gave
> you in IKE_SA_INIT, and now in IKE_AUTH you realise that the current
> connection does not support the specific DH group.

Yes.  But tink of these as separate problems to solve:

- we should select a connection "base" and then refine it, not select
an instance and then try digging out of a hole
- we need a way to handle overlapping connections with different
crypto suites; for instance proposing the intersection; or only
allowing when their suites are the same

> So there is some merit in trying to find the best one in IKE_SA_INIT,
> at least one with a matching DH / KE. But it's a poor guarantee for
> knowing you are on the right conn.
>
> Furthermore, I have a pending change that I had hoped to talk with Hugh
> about before commiting related to this:

I'm not planning on going near this code.

> diff --git a/programs/pluto/ikev2_ts.c b/programs/pluto/ikev2_ts.c
> index 3ae176e8a3..d8470f1072 100644
> --- a/programs/pluto/ikev2_ts.c
> +++ b/programs/pluto/ikev2_ts.c
> @@ -780,8 +780,8 @@ bool v2_process_ts_request(struct child_sa *child,
>                  if (hp == NULL)
>                          continue;
>
> -               for (struct connection *d = hp->connections;
> -                    d != NULL; d = d->hp_next) {
> +               struct connection *d;
> +               for (d = connections; d != NULL; d = d->ac_next) {
>                          /* groups are templates instantiated as GROUPINSTANCE */
>                          if (d->policy & POLICY_GROUP) {
>                                  continue;
>
> Currently, the traffic selector code loops over the hostpair. But this
> fails when you have multiple connections with the remote as %any because
> you don't know the remote's dynamic IP (but your two connections are
> meant for the same dynamic entity). As you can see in ikev2-06-dual-net,
> without the above change, it will never find the second conn for the
> CREATE_CHILD_SA. It only finds the first template and the instance of it,
> but not the second template conn. Despite the fact that both conns have
> the same leftid/rightid but it has a right=%any. I think it is a
> limitation of the hostpair code, which assumes that you are talking to
> one entity on one IP address, so it is unclear how the second connection
> with %any would handle a connection from a different IP.
>
> The real question here is, should we change the hostpair concept to be
> keyed on ID instead of IP, or do we trust our dynamic trees and buckets
> enough now that the performance gain/loss from hostpair is so little, we
> should just remove the entire concept?
>
> Unfortunately, the reason this commit is pending and I wanted to
> investigate more is that doing this seems to result in a bogus "eroute
> already in use", despite that the two conns have different subnets. Of
> course, this is also an artifact of right=%any. It's not really required
> to instantiate, but it does because of the %any.
>
> We should have used right=%many and right=%dynamic which would have been
> clearer and that %dynamic would not instantiate, but that ship sailed
> in 1995.
>
> > it won't help when first contact has a choice of connections; but it
> > might help when the problem is finding the correct connection instance
> > - the initial decision hasn't dug a hole
>
> Related to the above. For "real" instances of remote access clients, you
> would never need to refine_host_connection() to a template instance. You
> would either find the template and create a new instance, or find
> something that didn't come from this already instantiated template. But
> because of the above, you have to check instances too :/
>
> > thoughts,
>
> Many thoughts are needed. Maybe some prayers too.
>
> Paul