Implement system reboot via juju hooks

Tue Aug 12 09:57:02 UTC 2014

On 11 August 2014 21:14, William Reade <william.reade at canonical.com> wrote:
> On Mon, Aug 11, 2014 at 3:00 PM, Stuart Bishop <stuart.bishop at canonical.com>
> wrote:
>>
>> On 11 August 2014 18:20, William Reade <william.reade at canonical.com>
>> wrote:
>>
>> > I'd like to explore your use cases a bit more to see if we can find a clean
>> > solution to your problems that doesn't go too far down the (2) road that I'm
>> > nervous about. (The try-again-later mechanism is much smaller and cleaner
>> > and I think we can accommodate that one pretty easily, fwiw -- but what are
>> > the other problems you want to solve?)
>>
>> Memory related settings in PostgreSQL will only take effect when the
>> database is bounced. I need to avoid bouncing the primary database:
>>  1) when backups are in progress.
>>  2) when a hot standby unit is being rebuilt from the primary.
>>
>> Being able to have a hook abort and be retried later would let me
>> avoid blocking.
>
> Hmm. The trouble here is that releasing the execution lock would *also* free
> up the machine agent to be rebooted -- the benefits of being able to run
> other hooks while you wait don't feel quite so compelling to me now.

Yes, I guess a subordinate could request a reboot even if my charm
takes care not to. It sounds like long-running operations will
need to continue to block.

>> A locking service would be useful too for units to signal certain
>> operations (with locks automatically released when the hooks that took
>> them exit). The in-progress update to the Cassandra charm has
>> convoluted logic in its peer relation hooks to do rolling restarts of
>> all the nodes, and I imagine MongoDB, Swift and many others have the
>> same issue to solve.
>
> I see -- to get rolling restarts you'd need to spread an awful lot of
> finicky logic across the peer relation hooks. I'm expecting to address this
> issue by allowing leaders to run actions on their minions. ie as leader, you
> can just run the action and wait for it to succeed or fail before
> continuing, all inside a single hook. Sane/helpful?

It helps a lot with rolling restarts, yes. The leader would be able to
restart nodes (rather than just give them permission), and if
necessary rerun the hook that needed to be aborted.
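
As a rough illustration of the leader-driven approach, here is a sketch
of a rolling restart as a single loop. It assumes a hypothetical
run_action(unit, action) callable that blocks until the action finishes
and returns True on success. Juju exposes nothing like this to hooks
today, so the name and signature are invented for the example.

```python
def rolling_restart(units, run_action, action="restart"):
    """Restart units one at a time, stopping at the first failure.

    run_action is a hypothetical callable (unit, action) -> bool that
    blocks until the action has completed on that unit.
    Returns (restarted, failed_unit); failed_unit is None on success.
    """
    restarted = []
    for unit in units:
        if not run_action(unit, action):
            # Leave the remaining units untouched so the cluster keeps
            # as many healthy nodes as possible.
            return restarted, unit
        restarted.append(unit)
    return restarted, None
```

Stopping at the first failure is a design choice for the sketch; a real
charm might prefer to retry the failed unit before giving up.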

With the PG charm, I am using a PostgreSQL advisory lock on the
primary database to signal that a hot standby unit is rebuilding
and the primary shouldn't restart right now. I suspect this can
disappear too with leadership, so it isn't a compelling use case for a
general lock service.
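
For anyone unfamiliar with the advisory lock trick, a minimal sketch
follows. It is not the charm's actual code: the lock key and function
names are made up, and it assumes a DB-API cursor (psycopg2 style, with
%s placeholders) connected to the primary. The standby holds a shared
lock while rebuilding; the primary probes with a non-blocking exclusive
lock before restarting.

```python
from contextlib import contextmanager

# Hypothetical lock key, chosen only for this example.
REBUILD_LOCK_KEY = 4242


@contextmanager
def rebuild_in_progress(cursor, key=REBUILD_LOCK_KEY):
    """Held by a standby, via a connection to the primary, while rebuilding."""
    cursor.execute("SELECT pg_advisory_lock_shared(%s)", (key,))
    try:
        yield
    finally:
        cursor.execute("SELECT pg_advisory_unlock_shared(%s)", (key,))


def safe_to_restart(cursor, key=REBUILD_LOCK_KEY):
    """Return True if no standby currently holds the rebuild lock."""
    cursor.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    (acquired,) = cursor.fetchone()
    if acquired:
        # This was only a probe, so release the lock straight away.
        cursor.execute("SELECT pg_advisory_unlock(%s)", (key,))
    return bool(acquired)
```

Session-level advisory locks are dropped when the session ends, so a
standby that dies mid-rebuild does not block restarts forever.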

-- 
Stuart Bishop <stuart.bishop at canonical.com>