Current handling of failed upgrades is screwy

Menno Smits menno.smits at canonical.com
Mon Jul 14 01:02:51 UTC 2014


On 10 July 2014 20:57, John Meinel <john at arbash-meinel.com> wrote:

> I think it fundamentally comes down to "is the reason upgrade failed
> transient or permanent", if we can try again later, do so, else log at
> Error level, and keep on with your life, because that is the only chance of
> recovery (from what you've said, at least).
>

This is a good approach, but I don't see any way for the machine agent to
know with any certainty whether an error is transient or permanent.

Tim has contributed some useful guidance. Given that we currently have no
reliable way of automatically rolling back upgrades, we should aim to just
stay on the new software version (this is what we silently do now anyway).
Instead of stopping on the first failed upgrade step, all upgrade steps
should be attempted, with every failure logged. The thinking is that the
more upgrade steps have run, the more likely the environment is to be
operational.
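
To make the "attempt everything, log every failure" idea concrete, here is a
minimal sketch. The Step type, runAllSteps and the demo steps are hypothetical
names invented for illustration; they are not the actual juju upgrades API.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Step is a hypothetical stand-in for a single upgrade step.
type Step struct {
	Description string
	Run         func() error
}

// runAllSteps attempts every step in order. A failing step is logged and
// recorded, but it never prevents the remaining steps from running.
func runAllSteps(steps []Step) []error {
	var failures []error
	for _, s := range steps {
		if err := s.Run(); err != nil {
			err = fmt.Errorf("upgrade step %q failed: %v", s.Description, err)
			log.Print(err)
			failures = append(failures, err)
		}
	}
	return failures
}

// demoSteps returns made-up steps: the first and third fail.
func demoSteps() []Step {
	return []Step{
		{"step one", func() error { return errors.New("disk full") }},
		{"step two", func() error { return nil }},
		{"step three", func() error { return errors.New("timeout") }},
	}
}

func main() {
	failures := runAllSteps(demoSteps())
	// The agent would keep running on the new version regardless.
	fmt.Printf("%d upgrade step(s) failed\n", len(failures))
}
```

The point of returning the failures rather than a single error is that the
caller can report them all without being tempted to treat the first one as
fatal.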

This approach will also be used for the upcoming upgrade changes when
backups aren't available (i.e. when upgrading from a version that doesn't
support the backup API). If backups are available, upgrades will be
aborted after the first failure, with the backup being used to roll back any
changes that may have been made.
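
Roughly, the two policies could sit side by side like this. This is only a
sketch under assumptions: the Backup interface, its Restore method,
performUpgrade and the fake types are all hypothetical and do not reflect
the real backup API or the planned implementation.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Step is a hypothetical stand-in for a single upgrade step.
type Step struct {
	Description string
	Run         func() error
}

// Backup is a hypothetical handle on a pre-upgrade backup.
type Backup interface {
	Restore() error
}

// performUpgrade chooses the failure policy from backup availability:
//   - backup != nil: abort at the first failure and restore the backup;
//   - backup == nil: attempt every step and log each failure.
func performUpgrade(steps []Step, backup Backup) error {
	failed := 0
	for _, s := range steps {
		err := s.Run()
		if err == nil {
			continue
		}
		log.Printf("upgrade step %q failed: %v", s.Description, err)
		if backup != nil {
			if rerr := backup.Restore(); rerr != nil {
				return fmt.Errorf("step %q failed and rollback failed: %v", s.Description, rerr)
			}
			return fmt.Errorf("upgrade aborted at step %q and rolled back: %v", s.Description, err)
		}
		failed++
	}
	if failed > 0 {
		return fmt.Errorf("%d upgrade step(s) failed; staying on the new version", failed)
	}
	return nil
}

// fakeBackup records whether Restore was called.
type fakeBackup struct{ restored bool }

func (b *fakeBackup) Restore() error {
	b.restored = true
	return nil
}

// demoSteps returns three made-up steps; the first fails. Each step
// increments *ran so the caller can see how many were attempted.
func demoSteps(ran *int) []Step {
	return []Step{
		{"step one", func() error { *ran++; return errors.New("boom") }},
		{"step two", func() error { *ran++; return nil }},
		{"step three", func() error { *ran++; return nil }},
	}
}

func main() {
	ran := 0
	fmt.Println(performUpgrade(demoSteps(&ran), &fakeBackup{}), "- steps attempted:", ran)
	ran = 0
	fmt.Println(performUpgrade(demoSteps(&ran), nil), "- steps attempted:", ran)
}
```

In both cases the error is something to log, not something to return to the
runner as fatal, since that would bring back the restart loop discussed in
the quoted message below.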

>
> John
> =:->
>
>
> On Thu, Jul 10, 2014 at 11:18 AM, Menno Smits <menno.smits at canonical.com>
> wrote:
>
>> So I've noticed that the way we currently handle failed upgrades in the
>> machine agent doesn't make a lot of sense.
>>
>> Looking at cmd/jujud/machine.go:821, an error is created if
>> PerformUpgrade() fails but nothing is ever done with it. It's not returned
>> and it's not logged. This means that if upgrade steps fail, the agent
>> continues running with the new software version, probably with partially
>> applied upgrade steps, and there is no way to know.
>>
>> I have a unit-tested fix ready which causes the machine agent to exit (by
>> returning the error as a fatalError) if PerformUpgrade fails, but before
>> proposing it I realised that's not the right thing to do. The agent's
>> upstart script will restart the agent and probably cause the upgrade to
>> run and fail again, so we end up with an endless restart loop.
>>
>> The error could also be returned as a "non-fatal" (to the runner) error
>> but that will just cause the upgrade-steps worker to continuously restart,
>> attempting the upgrade and failing.
>>
>> Another approach could be to set the global agent-version back to the
>> previous software version before killing the machine agent but other agents
>> may have already upgraded and we can't currently roll them back in any
>> reliable way.
>>
>> Our upgrade story will be improving in the coming weeks (I'm working on
>> that). In the meantime, what should we do?
>>
>> Perhaps the safest thing to do is just log the error and keep the agent
>> running the new version and hope for the best? There is a significant
>> chance of problems but this is basically what we're doing now (except
>> without logging that there's a problem).
>>
>> Does anyone have a better idea?
>>
>> - Menno
>>
>>
>>
>>
>>
>> --
>> Juju-dev mailing list
>> Juju-dev at lists.ubuntu.com
>> Modify settings or unsubscribe at:
>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>
>>
>