[Swan-dev] retransmit-interval and retransmit-timeout

Sun Jan 25 00:03:42 EET 2015

| From: Paul Wouters <paul at nohats.ca>

Either these features are experimental or essentially forever.

Experiments are good.  Especially if there is an actual experimental
protocol proposed.

Experimental features should not be documented as if they are
permanent.  They should be flagged so users know not to invest in
them.

If they are not experimental, all my objections stand, as far as I can
see.

| This value has always been inside the code, hardcoded.

As much as we can, we should hide useless complexity from the user.
Hardcoding things that are reasonable and robust is not a mistake.

| What Antony and I
| wanted to do is make it at least a config setup option before we would
| go and dramatically change that to be much more aggressive than in the
| past.

| It allows us to set the initial period for further exponentiation.

Experiments are good.

| Think of it as a fail-safe switch.

That I don't get.

| > 1) I think that Paul has said that iPhones lose the first packet when they
| >   are asleep.  Apple users are impatient: our current retry isn't fast
| >   enough.
| >   [How soon can the retry be and still be received in this case?]
| 
| We did not think we had all the answers, and therefore wanted a little
| flexibility in the new system. Right now we have it at 500ms, but we
| really do plan to bring that down a lot before a release.

I would love to see the experimental hypotheses written up.  And the
testing protocol.

| > 2) the old initial retry delay (10 seconds?) was too sluggish in the
| >   modern world.  Even 1 second is considered too slow [by whom?
| >   Why?].
| 
| By every end user in the world :/
| 
| And it was 20 seconds even. In fact, some iphones would abort within 20
| seconds so any single packet loss would end up in failure before
| retransmit.

One second is the minimum RTO for TCP, which suggests to me that
1 second ought to be the minimum RTO for IKE too.  Or at least a good
starting point.

| >   [In the real world, how commonly are packets lost by systems where
| >   1 second is too slow?]
| 
| I've already found that some "hangs" I saw with pluto were in fact
| packet loss on my DSL link. I now see retransmits on my client
| while I see no duplicate packets on my server. This code has already
| proven that I was suffering from packet loss without knowing.

What would have helped you discover this?

How could it cause big issues?  (That question is not rhetorical.)

It has the problem that the two settings are not synchronized.  To me,
it makes sense to synchronize the "give up" with exactly when you were
about to do a retransmission.  If the previous retransmit was just
before you give up, that's silly.
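
To make the synchronization point concrete, here is a minimal sketch in
Python.  It assumes plain doubling from the initial interval; the
function names and the sample values are mine, for illustration only,
and are not taken from pluto.

```python
def send_times(initial_interval, timeout):
    """Offsets (seconds) at which the message is sent before giving up.

    Assumes the delay doubles after every transmission (hypothetical
    model, not pluto's actual code).
    """
    times = []
    t = 0.0
    delay = initial_interval
    while t < timeout:
        times.append(t)
        t += delay
        delay *= 2
    return times


def synchronized_timeout(initial_interval, timeout):
    """First instant at or after `timeout` when a retransmission is due.

    Giving up at this instant replaces a retransmission instead of
    landing at an arbitrary point between two of them.
    """
    t = 0.0
    delay = initial_interval
    while t < timeout:
        t += delay
        delay *= 2
    return t


# Illustrative values: 0.5 s initial interval, 32 s give-up time.
sends = send_times(0.5, 32.0)
# sends == [0.0, 0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
slack = 32.0 - sends[-1]
# slack == 0.5: the last retransmit goes out half a second before we give up
```

With these numbers the unsynchronized give-up fires 0.5 seconds after
the last retransmit, which is the silly case.  Picking 31.5 or 63.5
seconds (the two neighbouring due times) would line the give-up up with
the schedule.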

| > Intuitively, doubling seems a bit severe.  I admit that I introduced it
| > to Pluto.  To be honest, I don't know that it matters very much.
| 
| I think it is very good especially within the sub-second range. I agree
| that once you pass a second or two, it becomes way too slow in practice.
| But we're hoping to go down much lower than 500ms.

What's the logic?

500ms seems really fast unless you think we have stupidly lossy
networks.

Doubling seems like a reasonable approach to scale-searching, but I
don't think that's what we're doing.  On the other hand, blasting
every 500ms feels like the wrong scale to me.  ("Feelings" aren't good
engineering.  Experiments are a good idea.)
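
As a rough way to compare scales, the sketch below counts how many
copies of a message go out within a window, for doubling versus a fixed
interval.  The helper name and the 20 second window are assumptions
chosen for illustration; this is not pluto code.

```python
def count_sends(initial_interval, window, factor):
    """Count transmissions (the original included) sent before `window`.

    factor=2 models doubling; factor=1 models a fixed retransmit interval.
    """
    count = 0
    t = 0.0
    delay = initial_interval
    while t < window:
        count += 1
        t += delay
        delay *= factor
    return count


doubling_500ms = count_sends(0.5, 20.0, 2)   # sends at 0, 0.5, 1.5, 3.5, 7.5, 15.5
fixed_500ms = count_sends(0.5, 20.0, 1)      # one send every 0.5 s
doubling_1s = count_sends(1.0, 20.0, 2)      # sends at 0, 1, 3, 7, 15
```

In a 20 second window, doubling from 500ms sends 6 packets and doubling
from 1 second sends 5, while a fixed 500ms interval sends 40.  So the
initial interval matters much less to the load than whether we back off
at all.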

What's your model of what could be going on?

I think most retransmissions are due to interop problems (bad
configs).  Retransmission policy doesn't matter much there.  Give-up
policy does.

Some are due to transmission errors (not that many, in my opinion)
including congestion.

Some are due to dead peers (including network partitions).
Retransmission is a way of finding out when the peer is back.  Often
it is better to start negotiation again rather than retrying the
current message.

We could measure previous RTTs from the same peer and set RTO based on
that.  That doesn't work for the first message, the most critical.  So
it probably isn't worth the bother.  Besides RTT can be affected by
amount of crypto work in the particular message.
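
For reference, this is roughly what an RTT-based RTO would look like if
we borrowed TCP's estimator (the smoothing constants and the 1 second
floor follow RFC 6298; clock granularity is ignored).  The class and its
parameter names are hypothetical, not anything in pluto.

```python
class RtoEstimator:
    """TCP-style retransmission timeout estimator (sketch after RFC 6298)."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, initial_rto=1.0, min_rto=1.0):
        self.srtt = None        # no measurement yet
        self.rttvar = None
        self.initial_rto = initial_rto
        self.min_rto = min_rto

    def observe(self, rtt):
        """Fold one measured round-trip time (seconds) into the estimate."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def rto(self):
        """Current timeout; falls back to initial_rto before any sample."""
        if self.srtt is None:
            return self.initial_rto
        return max(self.min_rto, self.srtt + self.K * self.rttvar)


est = RtoEstimator()
first_message_rto = est.rto()   # 1.0: nothing measured yet, so only the default applies
est.observe(0.1)
later_rto = est.rto()           # still 1.0: the floor dominates a 100 ms RTT
```

The first message always gets the fixed default, which is the weakness
noted above, and on a typical path the 1 second floor dominates anyway.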

| > Why exactly are we mucking about with retransmission counts?  Are we
| > fixing an observed problem?  One that matters?  If so, what exactly is
| > the problem (not the solution!).
| 
| See above. I've suffered from regular packet loss and I restarted pluto
| because I thought it was hung and didn't want to wait 20s to find out.

So you lived for a long time with a disease and you want something to
cover it up :-)  Better to expose and fix it.

(PS: I think that my DSL isn't healthy at the moment.  I have to start
instrumenting it so I can figure the problem out.)

20 seconds is probably unreasonable.  It's quite a leap to .5 seconds
(almost two orders of magnitude).  It would have been trivial to
experiment with 1 second.

| Browser people who deal with user attention span talk about every single
| roundtrip. Their users care about every 10ms. Also, if we attempt to do
| OE and set up a handful of connections, we don't want the user to wait
| "a few seconds". We have to succeed or fail fast.

This isn't about all round-trips, it is about the rare case of
behaviour when the network is in distress (errors, congestion,
partition, ...).