Current handling of failed upgrades is screwy

Thu Jul 10 07:18:12 UTC 2014

So I've noticed that the way we currently handle failed upgrades in the
machine agent doesn't make a lot of sense.

Looking at cmd/jujud/machine.go:821, an error is created if
PerformUpgrade() fails but nothing is ever done with it. It's not returned
and it's not logged. This means that if upgrade steps fail, the agent
continues running with the new software version, probably with partially
applied upgrade steps, and there is no way to know.
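
To illustrate the pattern, here is a minimal Go sketch. It is not the actual code at cmd/jujud/machine.go; performUpgrade, runUpgradesSwallowing and runUpgradesReporting are hypothetical names made up for this example. It only shows how an error that is created but neither returned nor logged makes a failed upgrade look like a success to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// performUpgrade is a hypothetical stand-in for PerformUpgrade().
// In this sketch it always fails.
func performUpgrade() error {
	return errors.New("upgrade step failed")
}

// runUpgradesSwallowing mirrors the problem described above: the error
// value is built but never returned or logged, so the caller sees nil.
func runUpgradesSwallowing() error {
	if err := performUpgrade(); err != nil {
		_ = fmt.Errorf("cannot perform upgrade: %v", err) // created, then dropped
	}
	return nil
}

// runUpgradesReporting returns the wrapped error so the caller can act on it.
func runUpgradesReporting() error {
	if err := performUpgrade(); err != nil {
		return fmt.Errorf("cannot perform upgrade: %v", err)
	}
	return nil
}

func main() {
	fmt.Println("swallowing:", runUpgradesSwallowing())
	fmt.Println("reporting: ", runUpgradesReporting())
}
```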

I have a unit-tested fix ready which causes the machine agent to exit (by
returning the error as a fatalError) if PerformUpgrade fails, but before
proposing it I realised that's not the right thing to do. The agent's upstart
script will restart the agent and probably cause the upgrade to run and
fail again, so we end up with an endless restart loop.

The error could also be returned as a "non-fatal" (to the runner) error but
that will just cause the upgrade-steps worker to continuously restart,
attempting the upgrade and failing.

Another approach could be to set the global agent-version back to the
previous software version before killing the machine agent, but other agents
may have already upgraded and we can't currently roll them back in any
reliable way.

Our upgrade story will be improving in the coming weeks (I'm working on
that). In the meantime, what should we do?

Perhaps the safest thing to do is to just log the error, keep the agent
running the new version, and hope for the best? There is a significant
chance of problems, but this is basically what we're doing now (except
without logging that there's a problem).

Does anyone have a better idea?

- Menno