[Swan-dev] Calling all Message ID bugs

Andrew Cagney andrew.cagney at gmail.com
Tue Feb 19 16:12:16 UTC 2019

On Mon, 18 Feb 2019 at 18:41, Paul Wouters <paul at nohats.ca> wrote:
> On Tue, 19 Feb 2019, Antony Antony wrote:
> > Here a few corner cases.
> > what happens in case where
> > an admin type connection down in the middle of the rekey.
> > I mean the initial rekey message is lost and pluto is doing its retransmit
> > cycle.  While it is doing that admin type a "auto --down/delete conn"
> >
> > n=1 would suggest pluto can't initiate a delete informational message
> > until rekey message  is acknowledged. You have to go on retransmitting
> > rekey.  While at it also think the same case without liveness/dpd and with
> > liveness. They wold be different.
> Yes, i guess the "down" message would have to go on the "pending list".
> > I also think pluto jump message id when liveness is lost. Instead of
> > re-transmitting old informational. You may run into that if you strictly
> > implement n=1.

While testing the new code I found and hacked a few bugs in the old.
This might have been one of these cases, but I'm not sure.

> That probably requires code fixes. We should not bump the msgid and
> indeed just retransmit the old one. Although that does make weird
> for our dpddelay/dpdtimeout point of view. Since dpddelay would follow
> our regular retransmit backoff, and the only thing needed would be the
> dpdtimeout.

Right.  See my post about code doing a Record 'n' Send midway through
a state transition.

> > Another corner case is mobike. Which one could test what iphone also.
> >
> > As a thought experiment.
> > Say iphone is on LTE. IPsec connection is established. Say iphone is at
> > message id 5. And you join a WiFi with captive portal or some kind or
> > filtering. iphone would send out mobike message 6, mobike probe.The probe,
> > message id 6, is lost for some reason, filtering or captive portal.  Then
> > one of these happens say you disable wifi or leave the wifi, while the
> > message id 6 is unacknowledged.

I believe the tests are doing something like:

- end has request with Message ID 5 outstanding (for instance
keep-alive is going into a hole)
- mobile ike or similar kicks in and sends off a request with Message ID 6

I had to add code like:

        switch (sending) {
        case MESSAGE_REQUEST:
        case MESSAGE_RESPONSE:
                 * pluto is responding to MD.
                 * Since this is a response, the MD's Message ID
                 * trumps what ever is in responder.sent.  This way,
                 * when messages are lost, the counter jumps forward
                 * to the most recent received.

(the responder is SENDING it's MESSAGE_RESPONSE) to handle this.

> > Can iphone initiate any new message exchange 7?
> > One could argue create a new empty informational with message id 6 and send
> > it over lte link. Then it is not strictly retransmit of the previous message
> And is strictly not allowed per RFC. But I think in theory we could
> create a mobike payload that satisfies all networks, that does not
> rely on the IKE header for address update. That one in theory could
> be send once over one interface, and then later over another interface,
> as from a content point of view, the IKE payload is identical (although
> the source IP is not)

> > the content would vary. Source address would be different too, mobike
> > message include a NAT-T payload which is derived from IP address phone got
> > from the wifi network. Sending that payload over LTE sounds strange or
> > rather should not do. I am yet to figure what what RFC mean strict n=1.
> I think people agree at IETF this needs clarification.
> > I am wonder how strict is pluto's n=1 with one tpacket per ike_st.
> > I think instead of tpacket (which is only the last messge "send" -- which
> > could initiated or responded) we need
> > last_message_initiated and last_message_responded buffers for full
> > asynchronous exchange with message window n=1. Without we could run  into more
> > corner cases.


Like for fragments, a packet window will require a queue of inbound /
outbound messages so that they can be processed in order.  And then
there's the weird way partially constructed CHILD SA initiates go into
limbo while waiting for an opportunity to send their packet - perhaps
it is tied in with the interaction of our queued requests with those
from the other end - e.g., REKEY simultaneous to DELETE.

> I believe Andrew already did some of this?

Yes.  Strictly speaking I've been fixing our code so that w=1 works -
a precursor to w>1 :-) (FWIW, any commit with "v2 msgids.*hack" in the
subject was a case of the new code flagging the old code as broken).

So far I've switched the logic dispatching a response to the new code.
The old code path, that contained all sorts of 'enum state_kind'
checks has been reduced to:

                 * A response; find the SA (IKE or CHILD) that
                 * initiated the request.
                st = find_v2_sa_by_request_msgid(&md->hdr.isa_ike_spis,

The other things I'm aware of are:
- CHILD SA request queue (this logic scares me)
- the logic dispatching message requests (or discarding, or re-replying)
Perhaps, after that, the old code can be deleted.

That will then leave the big one - Record 'n' Send (see the post "the
problems with Record 'n' Send and delete_state()").


More information about the Swan-dev mailing list