[Swan-dev] retransmit-interval and retransmit-timeout

Sat Jan 24 00:56:52 EET 2015

We are constantly making libreswan more complicated, both internally and 
to the user.  The most serious additions are the ones that affect the
user interface.  We have way too many knobs that the user can turn.

We should resist this temptation when possible.  Every new knob means:

- more settings to get wrong

- more to understand (or find out that it can be ignored)

- more ways to drive libreswan into impossible or broken states

- more combinations to test

- more long-term commitment (support lifetime of stable distro
  releases?)

In any case, I think that adding anything to our interface should be
done very carefully, with much deliberation.

================

My reverse engineering of two new knobs:

retransmit-interval is meant to allow the user to specify that libreswan 
should be quicker off the mark in issuing the first retransmission packet.  
(In units of milliseconds.)

Why is this useful?

1) I think that Paul has said that iPhones lose the first packet when they 
   are asleep.  Apple users are impatient: our current retry isn't fast 
   enough.
   [How soon can the retry be and still be received in this case?]

2) the old initial retry delay (10 seconds?) was too sluggish in the 
   modern world.  Even 1 second is considered too slow [by whom?
   Why?].
   [In the real world, how commonly are packets lost by systems where
   1 second is too slow?]

Are these two reasons the same?

Are there more reasons?

If both are true, why not change the initial retry delay to 0.5 seconds 
for everyone?  Why make it configurable?

================

retransmit-timeout is meant to say how long (in seconds) libreswan should 
keep waiting for an answer to a particular IKE message.

The old code had the number of retransmissions it was willing to do 
wired in.  After that, it would (under user control) retry the whole 
negotiation.

Why is this new parameter useful?

Why is it usefully expressed in seconds, not retransmissions (supported by 
the existing code, but not configurable)?

What should happen if the previous retries add up t

Actually, the old code used a maximum of 2 retransmissions for most parts 
of the exchange but used 20 for the initial message.  This kind of nuance 
is either lost or is within overlapping/interfering mechanisms.
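
To make the seconds-versus-retransmissions question concrete, here is a
minimal sketch of a doubling retransmit schedule that can be bounded by
a count, by an overall timeout, or by both.  It is not Pluto's code: the
function name, its parameters, and the rule that a retransmission due at
or after the timeout is simply not sent are all assumptions made for
illustration.

```python
def retransmit_times(initial_delay, max_retransmits=None, timeout=None):
    """Return offsets (seconds after the first send) of each retransmission.

    Hypothetical model: the delay doubles after every retransmission.
    Sending stops when the count limit is reached, or when the next
    retransmission would fall at or after the overall timeout.
    """
    if max_retransmits is None and timeout is None:
        raise ValueError("need a count limit, a time limit, or both")
    times = []
    delay = initial_delay
    elapsed = 0.0
    while True:
        if max_retransmits is not None and len(times) >= max_retransmits:
            break
        elapsed += delay
        if timeout is not None and elapsed >= timeout:
            break
        times.append(elapsed)
        delay *= 2  # doubling backoff
    return times


# Count-limited: 2 retransmissions, starting at an assumed 10 s delay.
print(retransmit_times(10, max_retransmits=2))
# Time-limited: 0.5 s initial delay, give up after 60 s.
print(retransmit_times(0.5, timeout=60))
```

Under these assumptions, a time limit fixes how long the user waits but
lets the number of packets vary with the initial delay, while a count
limit does the opposite.  When both are set, whichever limit is hit
first wins, which is the kind of interaction questioned above.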

Summary: I'd like to see a stronger case for this extra interface 
complexity.

I also don't much care for the names.

retransmit-interval might be better named response-initial-deadline.  This 
name indicates that this is properly considered a deadline.

================

RFC 6298 covers a similar problem for TCP.  Maybe we should adopt their
policies or at least learn from them.  After all, their collective
wisdom is greater than ours.

In section 2.1, it says:

    Until a round-trip time (RTT) measurement has been made for a
    segment sent between the sender and receiver, the sender SHOULD
    set RTO <- 1 second, though the "backing off" on repeated
    retransmission discussed in (5.5) still applies.

(RTO == retransmission timeout)

That is, the initial retry SHOULD come after at least 1 second.

Given the TCP terminology (of which I'm not a total fan), our
"retransmit-timeout" is confusingly named: ours means when to stop
retransmitting but RTO means when to start retransmitting.
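
For reference, the TCP rule can be sketched as follows.  This only
models the pre-measurement case: an initial RTO of 1 second (2.1),
doubling on each retransmission (5.5), and the optional upper bound
that must be at least 60 seconds (2.5).  The function name and
parameters are hypothetical, and the RTT-based RTO computation in the
RFC is left out entirely.

```python
def rto_backoff(retransmissions, initial_rto=1.0, max_rto=60.0):
    """Return the RTO (seconds) armed before each successive retransmission.

    Hypothetical helper modelling only the no-RTT-sample case of RFC 6298:
    start at initial_rto, double after each timeout, clamp at max_rto.
    """
    if max_rto < 60.0:
        # RFC 6298 (2.5) allows a cap only if it is at least 60 seconds.
        raise ValueError("max_rto must be at least 60 seconds")
    rtos = []
    rto = initial_rto
    for _ in range(retransmissions):
        rtos.append(rto)
        rto = min(rto * 2, max_rto)
    return rtos


print(rto_backoff(8))
```

Note that nothing here says when to give up: RTO only governs the wait
before the next retransmission, which is why it maps more naturally
onto "retransmit-interval" than onto "retransmit-timeout".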

================

This paper suggests exponential backoff isn't such a great idea:
<http://www.cs.northwestern.edu/~akm175/docs/extr.pdf>
I admit that I've only glanced at it.

Intuitively, doubling seems a bit severe.  I admit that I introduced it
to Pluto.  To be honest, I don't know that it matters very much.

Why exactly are we mucking about with retransmission counts?  Are we
fixing an observed problem?  One that matters?  If so, what exactly is
the problem (not the solution!)?