Let's talk retries

William Reade william.reade at canonical.com
Tue Aug 9 07:31:20 UTC 2016


I feel obliged to note that we also have axw's operation queue, used in
storageprovisioner, and that it's the only one which doesn't make the
assumption that the code being retried is the only important thing that
could possibly be happening in the calling context.

All the other approaches discussed will leave us blocking 99 good machines
behind a single failure -- and while that's harmless-ish in the current
provisioner, because it has a strategy of waiting no more than 30s in
total, it's going to bite us hard as soon as we switch to a strategy that
actually backs off and keeps retrying long enough to be useful.

Andrew: as I recall, that code is actually pretty general. Is there a
strong reason it's tucked away in an /internal/ package?

On Tue, Aug 9, 2016 at 9:02 AM, roger peppe <roger.peppe at canonical.com>
wrote:

> On 9 August 2016 at 07:28, roger peppe <roger.peppe at canonical.com> wrote:
> > On 9 August 2016 at 01:22, Katherine Cox-Buday
> > <katherine.cox-buday at canonical.com> wrote:
> >> Hey All,
> >>
> >> We currently have 3 ways we're performing retries in Juju:
> >>
> >> 1. Archaic, custom, intra-package retry code.
> >> 2. github.com/juju/utils::{Countdown,BackoffTimer}
> >> 3. github.com/juju/retry (current best practice)
> >
> > There's a fourth way that you haven't mentioned, which fits
> > somewhere in between 1 and 2 I think (it was the first
> > explicit general retry code in Juju AFAIK), which is
> > utils.Attempt.
> >
> > I'd suggest that the way it's used matches the pattern
> > you'd like to write quite well:
> >
> >      for attempt := strategy.State(); attempt.Next(); {
>
> Oops; s/State/Start/ of course.
>
> >           Do something
> >      }
> >
> > AFAIR this pattern was arrived at after some discussion between myself
> > and Gustavo. At the time I was thinking of some kind of
> > callback-based scheme like the others schemes you mention, but
> > I think that leaving the caller in control and the loop explicit
> > is actually nicer - the logic is clear to the reader and more
> > composable with other primitives (it is also a good match
> > for the pattern you propose).
> >
> > How about making AttemptStrategy a little more flexible?
> > It was always my intention to admit other kinds of strategy - if
> > a BackoffFunc field were added to AttemptStrategy and
> > a Stop argument added to AttemptStrategy.Start it would
> > become as general as retry.Call, while keeping the nice
> > usage pattern and general straightforwardness.
> >
> >   cheers,
> >     rog.
> >
> >> I have just touched some code which fits #1, and when I was attempting
> to retrofit it to #3, I found it a little obtuse. Here's why:
> >>
> >> In my case (and I assume most), I wanted something like this:
> >>
> >> for range retries { // Loop should break if our parent is cancelled
> >>         // Do something
> >>     // Success, retry, or fatal error
> >> }
> >>
> >> To implement this, I do something like this:
> >>
> >> args := retry.CallArgs{
> >>     Func: func() error {
> >>         // Body of my loop
> >>     },
> >>     BackoffFunc: func(delay time.Duration, attempt int) time.Duration {
> >>         if attempt == 1 {
> >>             return delay
> >>         }
> >>         return delay * factor
> >>     },
> >>     Stop: dying,
> >>     // etc...
> >> }
> >>
> >> Call(args)
> >>
> >> We attempt to encapsulate every scenario for retries in
> github.com/juju/retry/::CallArgs. We have variables to determine if
> errors are fatal, to notify things, how to back off, etc. This feels very
> heavy-weight to me, and a bit like the monolith model rather than the unix
> model. I would prefer the logic be left to the caller (i.e. do I break out
> of this loop? do I notify something?), and the interface into a retry
> strategy be much simpler, say a channel like in time.Tick(time.Duration)
> <-chan time.Time.
> >>
> >> How would folks feel about something like this?
> >>
> >> func BackoffTick(done <-chan struct{}, clock clock.Cock, delay
> time.Duration, factor int64) <-chan time.Time {
> >>         signalStream := make(chan time.Time)
> >>         go func() {
> >>                 defer close(signalStream)
> >>                 for {
> >>                         select {
> >>                         case <-done:
> >>                                 return
> >>                         case signalStream <- time.After(delay):
> >>                                 delay = time.Duration(delay.Nanoseconds()
> * factor)
> >>                         }
> >>                 }
> >>         }()
> >>         return signalStream
> >> }
> >>
> >> With this, the above becomes:
> >>
> >> for range BackoffTick(dying, myClock, 1*time.Second, 2) {
> >>     // Determination of fatality and notifications happen here
> >> }
> >>
> >> If you want a max retry concept, you do this:
> >>
> >> for range Take(done, 3, BackoffTick(dying, myClock, 1*time.Second, 2)) {
> >>     // Determination of fatality and notifications happen here
> >> }
> >>
> >> If you want a max duration you do this:
> >>
> >> done = Timeout(done, 5*time.Minute)
> >> for range Take(done, 3, BackoffTick(dying, myClock, 1*time.Second, 2)) {
> >>     // Determination of fatality and notifications happen here
> >> }
> >>
> >> Functionally, what's in juju/retry is fine; and I can stuff anything I
> want into the function. It just feels a little odd to use in that I must
> put the body of my loop in a function, and I dislike that the CallArgs
> struct attempts to encapsulate a bunch of different scenarios.
> >>
> >> Thoughts?
> >
> > I'm not keen on the channel-based approach because if you want to
> > terminate early, you need to shut down the goroutine - it's just
> > one more resource to clean up. Also, the goroutine that's sending
> > the events is one step removed from what's going on in the loop - the
> > delay is calculated one step before the current iteration is complete.
> >
> >   cheers,
> >     rog.
> >
> >>
> >> Also, is it on the roadmap to address the inconsitant retries
> throughout Juju? I almost used utils.BackoffTimer until I started looking
> deeper. I'm going to submit a PR to flag it as deprecated.
> >>
> >> --
> >> Katherine
> >>
> >> --
> >> Juju-dev mailing list
> >> Juju-dev at lists.ubuntu.com
> >> Modify settings or unsubscribe at: https://lists.ubuntu.com/
> mailman/listinfo/juju-dev
>
> --
> Juju-dev mailing list
> Juju-dev at lists.ubuntu.com
> Modify settings or unsubscribe at: https://lists.ubuntu.com/
> mailman/listinfo/juju-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.ubuntu.com/archives/juju-dev/attachments/20160809/daeccfdf/attachment.html>


More information about the Juju-dev mailing list