<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On 10 July 2014 20:57, John Meinel <span dir="ltr"><<a href="mailto:john@arbash-meinel.com" target="_blank">john@arbash-meinel.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">I think it fundamentally comes down to "is the reason the upgrade failed transient or permanent?" If we can try again later, do so; otherwise log at Error level and keep on with your life, because that is the only chance of recovery (from what you've said, at least).</div>
</blockquote><div><br></div>This is a good approach, but I don't see any way for the machine agent to know with any certainty whether an error is transient or permanent.<div><br></div><div>Tim has contributed some useful guidance. Given that we currently have no reliable way of automatically rolling back upgrades, we should aim to just stay on the new software version (this is what we silently do now anyway). Instead of stopping at the first failed upgrade step, the agent should attempt every upgrade step and log each failure, the thinking being that the more upgrade steps that have run, the more likely the environment is to be operational.</div>
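<div><br></div><div>To make the "attempt everything, log every failure" policy concrete, here is a minimal sketch in Go. It is not the actual juju upgrade code: Step, RunAllSteps and NewStep are hypothetical names invented for illustration, and the real upgrade steps will have a different shape.</div>

```go
package main

import (
	"fmt"
	"log"
)

// Step is a hypothetical stand-in for a single upgrade step.
type Step struct {
	Name string
	Run  func() error
}

// RunAllSteps attempts every step in order. A failing step is logged and
// recorded, but it does not stop the remaining steps from running.
func RunAllSteps(steps []Step) []error {
	var failures []error
	for _, s := range steps {
		if err := s.Run(); err != nil {
			log.Printf("ERROR upgrade step %q failed: %v", s.Name, err)
			failures = append(failures, fmt.Errorf("%s: %w", s.Name, err))
		}
	}
	return failures
}

// NewStep builds an illustrative step that succeeds when msg is empty and
// fails with msg otherwise.
func NewStep(name, msg string) Step {
	return Step{Name: name, Run: func() error {
		if msg == "" {
			return nil
		}
		return fmt.Errorf("%s", msg)
	}}
}
```

<div>The caller gets back every failure rather than only the first, so the agent can log a summary and carry on running the new version.</div>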
<div><br></div><div>This approach will also be used in the upcoming upgrade changes when backups aren't available (i.e. when upgrading from a version that doesn't support the backup API). If backups are available, the upgrade will be aborted after the first failure and the backup will be used to roll back any changes that may have been made.</div>
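<div><br></div><div>The choice between the two behaviours could be sketched roughly as follows. Again, this is only an illustration under assumptions: Backup, Upgrade, FakeBackup and the helper constructors are hypothetical and do not correspond to the real backup API, whose shape hasn't been settled here.</div>

```go
package main

import "fmt"

// Step is a hypothetical stand-in for a single upgrade step.
type Step struct {
	Name string
	Run  func() error
}

// Backup is a hypothetical handle on a backup taken before the upgrade.
type Backup interface {
	Restore() error
}

// Upgrade runs the steps and returns the names of the steps that failed.
// With a backup, it stops at the first failure and restores the backup.
// With a nil backup, it keeps going and attempts every step.
func Upgrade(steps []Step, backup Backup) (failed []string, rolledBack bool, err error) {
	for _, s := range steps {
		stepErr := s.Run()
		if stepErr == nil {
			continue
		}
		failed = append(failed, s.Name)
		if backup == nil {
			continue // no way to roll back, so run the remaining steps
		}
		if restoreErr := backup.Restore(); restoreErr != nil {
			return failed, false, fmt.Errorf("step %q failed (%v); restore failed: %w", s.Name, stepErr, restoreErr)
		}
		return failed, true, nil
	}
	return failed, false, nil
}

// FakeBackup is an illustrative Backup that only records that it was restored.
type FakeBackup struct{ Restored bool }

// Restore marks the fake backup as restored and always succeeds.
func (b *FakeBackup) Restore() error { b.Restored = true; return nil }

// OKStep returns a step that always succeeds.
func OKStep(name string) Step {
	return Step{Name: name, Run: func() error { return nil }}
}

// FailingStep returns a step that always fails.
func FailingStep(name string) Step {
	return Step{Name: name, Run: func() error { return fmt.Errorf("%s failed", name) }}
}
```

<div>Passing a nil backup gives the keep-going behaviour; passing a real one gives abort-and-restore. A failed restore is returned as an error because at that point the environment is in an unknown state.</div>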
<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">
<div>
<br></div><div>John</div><div>=:-></div></div><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div class="h5">On Thu, Jul 10, 2014 at 11:18 AM, Menno Smits <span dir="ltr"><<a href="mailto:menno.smits@canonical.com" target="_blank">menno.smits@canonical.com</a>></span> wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">So I've noticed that the way we currently handle failed upgrades in the machine agent doesn't make a lot of sense.<div>
<br></div><div>Looking at cmd/jujud/machine.go:821, an error is created if PerformUpgrade() fails but nothing is ever done with it. It's not returned and it's not logged. This means that if upgrade steps fail, the agent continues running with the new software version, probably with partially applied upgrade steps, and there is no way to know.</div>
<div><br></div><div>I have a unit-tested fix ready that causes the machine agent to exit (by returning the error as a fatalError) if PerformUpgrade fails, but before proposing it I realised that's not the right thing to do. The agent's upstart script will restart the agent and probably cause the upgrade to run and fail again, so we end up with an endless restart loop.</div>
<div><br></div><div>The error could also be returned as a "non-fatal" (to the runner) error but that will just cause the upgrade-steps worker to continuously restart, attempting the upgrade and failing.</div><div>
<br></div><div>Another approach could be to set the global agent-version back to the previous software version before killing the machine agent, but other agents may have already upgraded and we can't currently roll them back in any reliable way.</div>
<div><br></div><div>Our upgrade story will be improving in the coming weeks (I'm working on that). In the meantime, what should we do?</div><div><br></div><div>Perhaps the safest thing to do is just log the error, keep the agent running the new version, and hope for the best? There is a significant chance of problems, but this is basically what we're doing now (except without logging that there's a problem).</div>
<div><br></div><div>Does anyone have a better idea?<span><font color="#888888"><br></font></span></div><span><font color="#888888"><div><br></div><div>- Menno</div><div><br></div><div><br></div>
<div><br></div><div><br></div></font></span></div>
<br></div></div><span class=""><font color="#888888">--<br>
Juju-dev mailing list<br>
<a href="mailto:Juju-dev@lists.ubuntu.com" target="_blank">Juju-dev@lists.ubuntu.com</a><br>
Modify settings or unsubscribe at: <a href="https://lists.ubuntu.com/mailman/listinfo/juju-dev" target="_blank">https://lists.ubuntu.com/mailman/listinfo/juju-dev</a><br>
<br></font></span></blockquote></div><br></div>
</blockquote></div><br></div></div>