Current handling of failed upgrades is screwy

Tue Jul 15 09:33:15 UTC 2014

FWIW, we could set an error status on the affected agent (so users can
see there's a problem) and make it exit with status 0 (so that upstart
doesn't keep restarting it); but as jam points out, that's not helpful when
the error is transient. I'd suggest retrying a few times, with some delay
between attempts, before we do so. Reporting the error while we retry, and
making it clear that we'll retry automatically, is probably worthwhile.

And, really, I'm not very keen on the prospect of continuing to run when we
know upgrade steps have failed -- IMO this puts us in an essentially
unknowable state, and I'd much rather fail hard and early than limp along
pretending to work correctly. Manual recovery of a failed upgrade will
surely be tedious whatever we do, but a failed upgrade won't affect the
operation of properly-written charms -- it's a management failure, so you
can't scale/relate/whatever, but the actual software deployed will keep
running. However, I can easily imagine that continuing to run juju agents
against truly broken state could lead to services actually being shut
down/misconfigured, and I think that's much more harmful.

Cheers
William

On Thu, Jul 10, 2014 at 9:57 AM, John Meinel <john at arbash-meinel.com> wrote:

> I think it fundamentally comes down to "is the reason the upgrade failed
> transient or permanent?" If we can try again later, do so; otherwise log at
> Error level and carry on, because that is the only chance of
> recovery (from what you've said, at least).
>
> John
> =:->
>
>
> On Thu, Jul 10, 2014 at 11:18 AM, Menno Smits <menno.smits at canonical.com>
> wrote:
>
>> So I've noticed that the way we currently handle failed upgrades in the
>> machine agent doesn't make a lot of sense.
>>
>> Looking at cmd/jujud/machine.go:821, an error is created if
>> PerformUpgrade() fails but nothing is ever done with it. It's not returned
>> and it's not logged. This means that if upgrade steps fail, the agent
>> continues running with the new software version, probably with partially
>> applied upgrade steps, and there is no way to know.
>>
>> I have a unit-tested fix ready which causes the machine agent to exit (by
>> returning the error as a fatalError) if PerformUpgrade fails, but before
>> proposing it I realised that's not the right thing to do. The agent's
>> upstart script will restart the agent and probably cause the upgrade to
>> run and fail again, so we end up with an endless restart loop.
>>
>> The error could also be returned as a "non-fatal" (to the runner) error
>> but that will just cause the upgrade-steps worker to continuously restart,
>> attempting the upgrade and failing.
>>
>> Another approach could be to set the global agent-version back to the
>> previous software version before killing the machine agent but other agents
>> may have already upgraded and we can't currently roll them back in any
>> reliable way.
>>
>> Our upgrade story will be improving in the coming weeks (I'm working on
>> that). In the meantime, what should we do?
>>
>> Perhaps the safest thing to do is just log the error and keep the agent
>> running the new version and hope for the best? There is a significant
>> chance of problems but this is basically what we're doing now (except
>> without logging that there's a problem).
>>
>> Does anyone have a better idea?
>>
>> - Menno
>>
>> --
>> Juju-dev mailing list
>> Juju-dev at lists.ubuntu.com
>> Modify settings or unsubscribe at:
>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>
>>