agent upgrading

Mon Jun 11 14:46:36 UTC 2012

On 11 June 2012 15:18, Kapil Thangavelu <kapil.thangavelu at canonical.com> wrote:
>> The idea is that there's a symlink for each tool which points to the
>> current version of the tool. A tool upgrades itself by atomically
>> redirecting that symlink to the new version of itself (that way we don't
>> need to rewrite the upstart file each time the version is changed).
>
> so a failure of the new release would also entail reverting this symlink?

Perhaps the responsibility for changing the symlink should rest with the
newly started tool. Or perhaps we could give the upgrader process
that responsibility (it would change it after a successful upgrade).
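
As a minimal sketch of the atomic redirect (not the actual implementation;
the function name and the temporary-name scheme are made up for
illustration), the usual POSIX approach is to create the new symlink under
a temporary name and rename it over the old one. Reverting after a failed
upgrade would be the same call with the old version as the target.

```python
import os


def redirect_symlink(link_path, new_target):
    """Atomically repoint link_path at new_target (POSIX only)."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Hypothetical temporary name, kept in the same directory so that the
    # final rename never crosses a filesystem boundary.
    tmp_path = os.path.join(
        link_dir, ".%s.tmp.%d" % (os.path.basename(link_path), os.getpid()))
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # stale leftover from an interrupted attempt
    os.symlink(new_target, tmp_path)
    try:
        # rename(2) replaces the old link itself rather than following it,
        # so readers see either the old target or the new one.
        os.replace(tmp_path, link_path)
    except OSError:
        os.remove(tmp_path)
        raise
```

Whichever process ends up owning the symlink (the new tool or the
upgrader), it would only need this one operation.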

>> > Which states are being recorded in zk. ISTM, that we should have
>> > running versions, and proposed versions
>>
>> +1. I sketched over this bit, but that was part of the plan.
>> Then you can see in the state when agents have successfully
>> upgraded.
>>
>> > and proposed timestamp recorded for all agents, so we can detect
>> > deltas from outside the context of a given agent.
>>
>> I'm not sure what you mean by "proposed timestamp" here.
>
>
> it's hard to distinguish an error from an op in progress without
> explicitly tracking the operation state or, failing that, a timestamp to
> differentiate persistent errors.

I think that if each agent had an "upgrade-status" node, showing
the status of the most recent upgrade attempt, that should be
sufficient.

I am assuming here that a given version will fail deterministically
on a given machine, so there's no need to re-attempt an
upgrade to a particular version. Perhaps that's not a good assumption
though, in which case some kind of a counter to force another
upgrade attempt after an initial failure might be a good idea.
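
To make that concrete, here is a rough sketch of the decision an agent
might make from such a node. Everything here is hypothetical: the field
names, the state strings and the idea of storing the counter alongside the
proposed version are assumptions for illustration, and no ZooKeeper calls
are shown.

```python
from dataclasses import dataclass


@dataclass
class UpgradeStatus:
    """Assumed contents of an agent's "upgrade-status" node."""
    version: str  # version of the most recent upgrade attempt
    state: str    # "in-progress", "succeeded" or "failed"
    attempt: int  # value of the retry counter when the attempt was made


def should_attempt(proposed_version, proposed_attempt, status):
    """Decide whether the agent should try to upgrade to proposed_version."""
    if status is None:
        return True  # no upgrade has ever been attempted
    if status.version != proposed_version:
        return True  # a different version is being proposed
    if status.state == "failed":
        # Same version failed before: retry only if the counter was bumped.
        return proposed_attempt > status.attempt
    return False  # already in progress or already succeeded
```

With this shape, an outside observer can tell "failed" from "in-progress"
without a timestamp, and bumping the counter is the only way to force a
retry of a version that has already failed.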

>> > sounds like you're saying
>> > that such other init systems might not have enough process management
>> > to relaunch the agents if they die?
>>
>> No, I assume that we'll have something that can automatically start
>> an executable on reboot, and if the agent crashes unexpectedly.
>
> old school init.d systems don't necessarily restart automatically, their
> original intention was boot. as a quick example:
>
> $ sudo apt-get install redis-server
> $ sudo pkill redis-server
> $ ps aux | grep redis-server
>
> afaik, the common init-like tools that support auto restart are
> upstart, daemontools, systemd, and inittab.

If we don't have one of these things, I think we should write/provide
one. Something that runs an executable, waits for it to exit
and re-runs it (with some heuristics around whether and how often
to retry) would probably be sufficient.
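
Something along these lines, perhaps. This is only a sketch: the limits
(five starts per minute, one second between runs) are arbitrary
placeholders, and the run/sleep/clock parameters exist only so the loop
can be exercised without spawning real processes.

```python
import subprocess
import time


def supervise(argv, max_starts=5, window=60.0, delay=1.0,
              run=subprocess.call, sleep=time.sleep, clock=time.monotonic):
    """Run argv, re-running it each time it exits.

    Gives up and returns the last exit status once the program has been
    started max_starts times within the last `window` seconds.
    """
    recent = []    # start times that still fall inside the window
    status = None
    while True:
        now = clock()
        recent = [t for t in recent if now - t < window]
        if len(recent) >= max_starts:
            return status  # restarting too often; stop retrying
        recent.append(now)
        status = run(argv)  # blocks until the executable exits
        sleep(delay)        # crude back-off before the next run
```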

>> > its not clear why you need a separate up-grader process, the two agent
>> > processes can coordinate between each other so the new version can be
>> > verified and the old one shutdown.
>>
>> That was my original plan, but I realised that AFAICS that's not
>> compatible with the way that upstart (and presumably other supervision
>> systems) work, as when the old process exits, upstart will think that
>> it's terminated and start a new instance, even though a new instance has
>> already started.
>> It was not my plan to eliminate the use of upstart completely.
>>
>
>
> it's unclear how using an upgrader process gets around that. upstart/init
> sys still wants to track the new process pid.

That's why the upgrader process remains around - it does not
normally upgrade when the agents upgrade, so upstart
can monitor its pid as usual. When an agent crashes, the
upgrader exits, letting our fallback scheme take over
(which will usually just restart the upgrader and its associated
executable, as specified in the upstart config file). That's how
we upgrade the upgrader itself BTW.
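
Very roughly, the upgrader's main loop could look like the sketch below.
Note the big assumption: how an agent tells the upgrader that it exited
in order to be upgraded (rather than crashed) has not been decided, so the
special exit code here is purely a placeholder.

```python
import subprocess

# Hypothetical convention: an agent exits with this status when it wants to
# be replaced by the new version that the symlink now points at.
UPGRADE_EXIT_CODE = 42


def run_upgrader(agent_argv, run=subprocess.call):
    """Keep one agent running across upgrades; return when it crashes.

    agent_argv[0] is expected to be the symlink, so each relaunch picks up
    whatever version the link currently points to.
    """
    while True:
        status = run(agent_argv)  # blocks until the agent exits
        if status == UPGRADE_EXIT_CODE:
            # Planned upgrade: relaunch without exiting ourselves, so the
            # pid that upstart is tracking never changes.
            continue
        # Anything else is treated as a crash: return so the caller can
        # exit and let upstart's own restart logic take over.
        return status
```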

>> Another potential (though currently unrealised) advantage of a separate
>> upgrader process is that it might allow us to automatically rewind to a
>> previous version if a new version starts failing for some reason after
>> succeeding initially. That may well be a crackful idea however.
>
> the rewind should be possible until the new agent starts its activities.
> even after that it's possible, even without the upgrader (i.e. i've
> restarted 10 times in 5m, so revert to the old release), but that's dicey.
> say you do a schema upgrade as part of the upgrade; reverting backwards
> is a hoser.

Yeah, it's definitely something that might not be feasible in practice.
We've got a direct line to the upgrader though, so we can in theory
tell it which upgrades may or may not be revertible.
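
For illustration only, the "10 restarts in 5 minutes" heuristic combined
with a per-upgrade flag might be sketched like this. The function and
parameter names are invented, and where the `revertible` flag would come
from (presumably whatever tells the upgrader about the new version) is
left open.

```python
def should_revert(restart_times, now, revertible,
                  max_restarts=10, window=300.0):
    """Return True if the new version should be rolled back.

    restart_times: times (in seconds) at which the new version restarted.
    revertible: False for upgrades that cannot safely be undone, for
    example ones that included a schema upgrade.
    """
    if not revertible:
        return False
    recent = [t for t in restart_times if 0 <= now - t <= window]
    return len(recent) >= max_restarts
```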