Let's talk retries

Tue Aug 9 08:17:55 UTC 2016

On 9 August 2016 at 08:33, Andrew Wilkins <andrew.wilkins at canonical.com> wrote:
> On Tue, Aug 9, 2016 at 3:31 PM William Reade <william.reade at canonical.com>
> wrote:
>>
>> I feel obliged to note that we also have axw's operation queue, used in
>> storageprovisioner, and that it's the only one which doesn't make the
>> assumption that the code being retried is the only important thing that
>> could possibly be happening in the calling context.
>>
>> All the other approaches discussed will leave us blocking 99 good machines
>> behind a single failure -- and while that's harmless-ish in the current
>> provisioner, because it has a strategy of waiting no more than 30s in total,
>> it's going to bite us hard as soon as we switch to a strategy that actually
>> backs off and keeps retrying long enough to be useful.
>>
>> Andrew: as I recall, that code is actually pretty general. Is there a
>> strong reason it's tucked away in an /internal/ package?
>
>
> Nope. I did propose creating a new repo a while back, but it fell by the
> wayside: https://github.com/axw/juju-time. It should be a natural fit for
> the machine provisioner.

This looks like a nice primitive to have available when scheduling many
similar things at once. However, I don't think it's a replacement for a
more specific retry strategy when a goroutine has a single job to do but
needs to be resilient in the face of failure.

Personally, I often like the usual Go approach of assigning a goroutine
to each independent task. For example, in the provisioner code, we'd have
a goroutine individually responsible for each machine that
could use a single-threaded retry strategy without getting
in the way of all the others.

If we say that something like github.com/axw/juju-time/schedule is *the*
retry primitive we should be using everywhere and that,
by inference, the code being retried is never the most
important thing in the calling context, I think we end up with
a situation where it's never OK to block - there'd need to be a single
master control loop that pops items off the timer queue and
runs their associated operations.

There are indeed many popular languages where that's exactly what
you need to do, but I'm hoping that we can avoid heading in this direction
everywhere, and that we can still see goroutines as a simplifying
primitive. I certainly find that thinking about one sequential thing at a
time helps me design robust subsystems where I can easily understand
individual components and combine them reliably into a larger whole.
Isn't that the essence of Communicating *Sequential* Processes?

BTW I'm not saying that a timer queue is never the correct answer. In some
circumstances, it can be exactly the right thing to use.

  cheers,
     rog.