[Swan-dev] analyses of regression in test ikev2-ike-rekey-03

Sat Jun 16 03:59:19 UTC 2018

Hi Paul,
a quick response in the interest of time.
though your analysis appear be detailed the direction of it and the scope puzzles me. It is only analyzing 
test with regression. I think you will find the answers yourself if you compare logs before the regression. 
Pay attention to the log line on east. I noticed this on Tuomo's system at the beginning of this bughunt.
| inserting event EVENT_SA_EXPIRE, timeout in 0.000 seconds for #4

Another suggestion is run the same test with uniqueid=no And compare the logs.

I don't understand the issue you are bringing up about short timers, 30seconds.
shouldn't that work too? It worked before this commit.

First rekey should work with 30second timer too? Second, we/Tuomo ran the 
code with 8hour timer and the issue are similar.  What timer would you like to see? Our test run is 17 hours,
could even create a 8hour test:)

After my fix I created ikev2-ike-rekey-05 which I left running for longer to see the dynamics and no issues shown.

also whack command is not very convincing to me, not on my wish list:) 
A side note, strongswan, has byte limit to rekey IPsec. Which I found it confusing so far.
These tests are complicated it takes a while to follow them.
Even if we have such whack command one would need timer based test too?
Or are you suggesting no timer tests?

I would like to bring your attention to the bug at hand instead of digressing into generic code questions
and timer questions. I would also remind rekey code has limitations focusing one issue/improvement thing at a time is probably better. 

On Fri, Jun 15, 2018 at 05:52:20PM -0400, Paul Wouters wrote:
> On Wed, 6 Jun 2018, Paul Wouters wrote:
> 
> > On Tue, 5 Jun 2018, Antony Antony wrote:
> > 
> > >  I noticed a regression in test ikev2-ike-rekey-03. I was hunting an issue
> > >  Tuomo found. While trouble shooting it I noticed this regression.
> > 
> > >  http://testing.libreswan.org/results/testing/v3.22-1500-g2efd341-master/ikev2-ike-rekey-03/
> 
> I looked at this test case. Clearly there is a bug now, but the
> situation is complicated by the fact that the test case seems
> flawed to begin with due to the extremely short rekey timers used.
> 
> This is my observation on east:
> 
> [initial setup of all tunnels]
> 
> westnet-eastnet-ikev2c #1 IKE SA
> westnet-eastnet-ikev2a #2 [192.0.2.0-192.0.2.255:0-65535 0] -> [192.0.1.0-192.0.1.255:0-65535 0]  ESP=>0x030eb7c4 <0xeb332c41
> westnet-eastnet-ikev2b #3 [192.0.200.0-192.0.200.255:0-65535 0] -> [192.0.100.0-192.0.100.255:0-65535 0]  ESP=>0xc8c58945 <0x71312be9
> westnet-eastnet-ikev2c #4 [192.0.201.0-192.0.201.255:0-65535 0] -> [192.0.101.0-192.0.101.255:0-65535 0]  ESP=>0x98a747b7 <0x4513d3f4
> 
> [rekey event as responder]
> westnet-eastnet-ikev2a #5: negotiated new IPsec SA [192.0.2.0-192.0.2.255:0-65535 0] -> [192.0.1.0-192.0.1.255:0-65535 0] ESP=>0xd09a6b80 <0x13cdc164
> 
> [timer event unrelated to anything of remote peer kicks in]
> | handling event EVENT_SA_EXPIRE for child state #4
> westnet-eastnet-ikev2c #4: deleting state (STATE_V2_IPSEC_R)
> [this means 192.0.201.0/24 - 192.0.101.0/24 tunnel is now BROKEN]
> 
> [Remote deletes old ipsec SA which was just rekeyed]
> westnet-eastnet-ikev2c #1: received Delete SA payload: delete IPSEC State #2 now
> westnet-eastnet-ikev2a #2: deleting other state #2 connection (STATE_V2_IPSEC_R) "westnet-eastnet-ikev2a"
> 
> [remote rekey event]
> "westnet-eastnet-ikev2b" #6: negotiated new IPsec SA [192.0.200.0-192.0.200.255:0-65535 0] -> [192.0.100.0-192.0.100.255:0-65535 0] ESP=>0x304bf497 <0x970d9cc0
> 
> [remote delete ipsec sa which was just rekeyed]
> "westnet-eastnet-ikev2c" #1: received Delete SA payload: delete IPSEC State #3 now
> 
> [next remote sends rekey ipsec sa for #4 which we already expired by timer!]
> "westnet-eastnet-ikev2c" #1: CREATE_CHILD_SA no such IPsec SA to rekey SA(0x98a747b7) Protocol PROTO_v2_ESP
> "westnet-eastnet-ikev2c" #7: responding to CREATE_CHILD_SA message (ID 8) from 192.1.2.45:500 with encrypted notification CHILD_SA_NOT_FOUND
> 
> [remote then tries to delete the thing we told it we already don't know anything about]
> #1: received delete request for PROTO_v2_ESP SA(0x98a747b7) but corresponding state not found
> 
> [remote replaces #4 which we already deleted]
> "westnet-eastnet-ikev2c" #8: negotiated new IPsec SA [192.0.201.0-192.0.201.255:0-65535 0] -> [192.0.101.0-192.0.101.255:0-65535 0] ESP=>0x63221c98 <0x3d570d8c
> [ this means 192.0.201.0/24 - 192.0.101.0/24 tunnel is now repaired]
> 
> [remote rekeys #5 with #9, but our end then mistakenly deletes wrong state]
> "westnet-eastnet-ikev2a" #9: negotiated new IPsec SA [192.0.2.0-192.0.2.255:0-65535 0] -> [192.0.1.0-192.0.1.255:0-65535 0] ESP=>0x10cb8a4c <0x9b32c05b
> "westnet-eastnet-ikev2c" #8: deleting state (STATE_V2_IPSEC_R)
> [note westnet-eastnet-ikev2a vs westnet-eastnet-ikev2c]
> 
> [remote deletes the replaced ipsec sa]
> "westnet-eastnet-ikev2c" #1: received Delete SA payload: delete IPSEC State #5 now
> 
> And it goes on.
> 
> So the first problem is that #4 is deleted before it is rekeyed on east
> and everything is a bit confused from there on. 

Why not compare what happens with uniqueid=no? Pay attention the log line mentioned below.

> But I think the real issue is that we have code that doesn't fully
> understand the concept of shared IKE SA. With the IKE SA being on
> "westnet-eastnet-ikev2c", it seems that it deletes another conn's
> IPsec SA (on westnet-eastnet-ikev2a). Maybe this is because it didn't
> find one on westnet-eastnet-ikev2c, because a timer just deleted it.
> 
> 
> I think that instead of this playing with timers, it is worth
> introducing a new whack command that rekeys an IKEv2 SA, so that we can
> do rekeying without using 30 second timers in our test cases.

are you saying you can't wait 30s? I don't undertand what wrong with a 
test with timers. More tests are ofcourse good idea. I think all timers should work. 
That is the real case. I know we have corner case and deadlocks. But this bug I reported
is not about those. It feels like you trying to get world peace:)

> Looking at if the IPsec SA's properly migrate to the new IKE SA I am
> observing the following:
> 
> [first incoming IKE SA rekey for #1]
> | "westnet-eastnet-ikev2c" #1 received IKE Rekey Request CREATE_CHILD_SA from 192.1.2.45:500 Child "westnet-eastnet-ikev2c" #11 in STATE_V2_REKEY_IKE_R will process it further
> 
> We know that 3 IPsec SA's should move from #1 to #11
> 
> | #11 inherit #10 from parent #1
> | #11 inherit #9 from parent #1
> | #11 inherit #7 from parent #1
> 
> So that seems to work properly (unlike my expectation)

since your initial hypothesis is , atleast in part, proven wrong and there
clearly one commit that causing this regression I suggest you focus there.

However not, while working on this bug I did find another one. If your further tests involve
keep in mind of the patch 91af9fa707.
> 
> But the next one confuses me. I would expcet a IKE REKEY Request for
> #11, but the next one I see is actually:
> 
> | "westnet-eastnet-ikev2c" #13 received IKE Rekey Request CREATE_CHILD_SA from 192.1.2.45:500 Child "westnet-eastnet-ikev2c" #17 in STATE_V2_REKEY_IKE_R will process it further
> 
> So where did #13 come from? Why is that not #11 ?
> 
> | switching from parent? #1 to child #11 in FSM processor

this is not new.

> 
> What does this mean? Surely it does not mean it switched from a parent
> state to a child state because this involves none of the child IPsec
> SA's. But the code indeed switches?

Generally I don't seey why not switch. IKE SA is used to decrypting and rest will be on the Child state.
However, while creating the response message, parent will be used also.

> 
>                 struct state *cst = process_v2_child_ix(md, st);
>                 if (cst == NULL) {
>                         /* no go. Could improve the status code? */
>                         complete_v2_state_transition(mdp, STF_FAIL);
>                         return;
>                 }
> 
>                 DBGF(DBG_CONTROL,
>                      "switching from parent? #%lu to child #%lu in FSM processor",
>                      st->st_serialno, cst->st_serialno);
> 
>                /* switch from parent state to child state */
>                 st = cst;
> 	}
> 
> 	md->st = st;
> 
> It even updated the state associated to the md? That seems wrong.

I do't agree. Again we are digressing. 
I recollect some discussion about the choice of storing st in md and switching it.
May be something discuss as a seperate topic, and not to hack this in haste.

> This is a CREATE_CHILD_SA not involving any IPsec SA, yet we moved
> state from IKE SA to IPsec SA? If that IPsec SA would have a different
> IKE SA, the resulting IKE message would be undecryptable by the remote
> endpoint. Also remember above we deleted #4, so what if there is no
> IPsec SA state? The above code just fails out when cst == NULL.

I think your assumtions are wrong:) Try the test without uniqueid

I noticed this line, is the uniqueid code introduced in

| inserting event EVENT_SA_EXPIRE, timeout in 0.000 seconds for #4

see the log https://swantest.libreswan.fi/results/blackswan/2018-06-11-swantest-3.22-1581-g1cbd37cf9-master/ikev2-ike-rekey-03/OUTPUT/east.pluto.log (this file should open in the browser).

If you look before the regression we do not have such line. And I suspect it is caused by 9bd57bb

https://swantest.libreswan.fi/results/blackswan/2018-05-15-swantest-3.22-1400-g39c2250c1-master/ikev2-ike-rekey-03/OUTPUT/east.pluto.log

after things things are wrong.

> Back to #11, it completes the new IKE SA. We now have two IKE SA's #1
> and #11. And we then receive a delete for #1. Note that I see:
> 
> | STF_OK but no state object remains
> 
> and processing stops. So we did not send an informational delete answer.
> (will this cause any msgid deadlock? Maybe not because it was the last
>  message to be send/received over that IKE SA anyway)
> 
> The next message is from the new IKE SA.
> 
> "westnet-eastnet-ikev2c" #11: CREATE_CHILD_SA no such IPsec SA to rekey SA(0x63221c98) Protocol PROTO_v2_ESP
> 
> This is the state we (wrongly) deleted before. The remote assumed it migrated.
> 
> So what about #11 vs #13? It seems we just derail even more:
> 
> "westnet-eastnet-ikev2c" #11: received Delete SA payload: delete IPSEC State #10 now
> "westnet-eastnet-ikev2b" #10: deleting other state #10 connection (STATE_V2_IPSEC_R) "westnet-eastnet-ikev2b"
> "westnet-eastnet-ikev2b" #10: ESP traffic information: in=336B out=336B
> "westnet-eastnet-ikev2c" #12: deleting other state #12 (STATE_CHILDSA_DEL)
> "westnet-eastnet-ikev2c" #12: ERROR: netlink response for Del SA esp.549ef532 at 192.1.2.45 included errno 3: No such process
> "westnet-eastnet-ikev2c" #12: ERROR: netlink response for Del SA esp.0 at 192.1.2.23 included errno 3: No such process
> "westnet-eastnet-ikev2c" #7: deleting other state #7 (STATE_CHILDSA_DEL) "westnet-eastnet-ikev2c" #7: ERROR: netlink response for Del SA esp.65d93d00 at 192.1.2.45 included errno 3: No such process
> "westnet-eastnet-ikev2c" #7: ERROR: netlink response for Del SA esp.0 at 192.1.2.23 included errno 3: No such process
> "westnet-eastnet-ikev2c" #11: deleting state (STATE_IKESA_DEL)
> packet from 192.1.2.45:500: proposal 1:IKE:ENCR=AES_GCM_C_256;PRF=HMAC_SHA2_512;DH=MODP2048 chosen from remote proposals
> "westnet-eastnet-ikev2c" #13: STATE_PARENT_R1: received v2I1, sent v2R1 {auth=IKEv2 cipher=aes_gcm_16_256 integ=n/a prf=sha2_512 group=MODP2048}
> 
> Things come crashing down, and a new IKE SA is started from scratch,
> with I presume fresh IPsec SA's to be generated as well. And I assume
> we do another loop with similar results.
> 
> I've stopped tracking it at this point. This specific test run has log
> files updated to vault:/tmp/ikev2-ike-rekey-03 if someone wants to have
> a look. But I think it would be better to not use short timers, and
> invoke an IKE rekey via whack, and then observe what is happening.
> 
> Paul