watching related units

Sat Jun 9 17:42:53 UTC 2012

Hi all

I have a bit of a problem, and I'd appreciate advice (and preloading
context into your brains so the MPs don't look too crackful when they
land).

Context:

* Unit relations have 2 important nodes: settings, and presence.

* When a unit joins a relation, it needs to keep track of both presence
and settings for every related unit.

* Changes to related presence and settings nodes translate into hook
executions as follows:

  * When a related unit becomes present (i.e. its agent knows it's in the
relation and fires up a pinger on the presence node), the watching unit
must run relation-joined and relation-changed hooks.

  * When the settings of a present related unit change, the watching
unit must run the relation-changed hook.

  * When a related unit becomes absent, the watching unit must run the
relation-departed hook.

  * When an absent related unit's settings change, nothing should
happen.

* While it is not *incorrect* to rerun hooks -- hooks are meant to be
idempotent -- I'd prefer an implementation that was somewhat meticulous
about avoiding wanton rerunning of redundant hooks.

* Processes can die, including unit agent processes, and we need to
handle sudden process death gracefully. That requires (1) persisting our
best guess at the "current" unit-local state of the relation; (2)
persisting the "future" unit-local changes we've detected and queued but
not yet executed; and (3) reconciling that persisted state with the
actual new state when a unit agent process comes up after an arbitrary
length of time.
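
To make the hook rules above concrete, here's a rough sketch in Go. The
names (Tracker, Observe, HookKind) are invented for illustration and are
not from the real code, and it assumes each settings node carries an
integer version we can compare. It only covers the in-memory mapping
from observations to hooks, not persistence.

```go
package main

// HookKind names one of the relation hooks listed above.
type HookKind string

const (
	RelationJoined   HookKind = "relation-joined"
	RelationChanged  HookKind = "relation-changed"
	RelationDeparted HookKind = "relation-departed"
)

// Tracker is a hypothetical record of what the watching unit believes
// about its related units: it maps each *present* related unit to the
// settings version we last ran relation-changed for.
type Tracker struct {
	present map[string]int
}

// NewTracker returns a Tracker that knows about no related units.
func NewTracker() *Tracker {
	return &Tracker{present: make(map[string]int)}
}

// Observe records the latest presence and settings version seen for a
// related unit, and returns the hooks the watching unit should run.
func (t *Tracker) Observe(unit string, present bool, version int) []HookKind {
	last, known := t.present[unit]
	switch {
	case present && !known:
		// Unit became present: joined, then changed.
		t.present[unit] = version
		return []HookKind{RelationJoined, RelationChanged}
	case present && last != version:
		// Present unit's settings changed.
		t.present[unit] = version
		return []HookKind{RelationChanged}
	case !present && known:
		// Unit became absent.
		delete(t.present, unit)
		return []HookKind{RelationDeparted}
	}
	// Either an absent unit's settings changed, or nothing changed for
	// a present unit: no hooks, which avoids redundant reruns.
	return nil
}
```

Calling Observe twice with the same arguments returns no hooks the
second time, which is the "avoid wanton rerunning" property.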

Beguiling notion:

Given the use of presence nodes rather than ephemeral nodes, we're no
longer bound by the hassles of the python implementation [0]: rather
than converting arbitrary sets of added/removed children into individual
joined/departed events, the raw data we get back from the various
watchers is near-enough trivial to convert directly into
join/depart/change events.

Complication:

To do this, I'd need to start a RelatedUnitsWatcher and pass in the list
of currently known related units, and do some icky hackery in there to
figure out what events to send to do the reconciliation. Specifically,
without something like this, I can't see a reliable way to detect a unit
that was entirely deleted during process downtime. But this seems like
the wrong place for this sort of logic, and would make the API icky, and
I'd just generally prefer not to do so.

Alternative:

Make the RelatedUnitsWatcher send events much more like the existing
python ones: that is, "the current list of active relations is X" and
"the settings version of unit Y is Z". (The existing python hook
scheduler doesn't deal with *exactly* this input, but the description is
close enough for this discussion I think.)
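
As a rough Go sketch of what the client side of this might look like
(Event and Reconcile are made-up names, and I'm assuming the persisted
state and the watcher's snapshot can both be boiled down to a map from
unit name to settings version):

```go
package main

import "sort"

// Event is a hypothetical joined/changed/departed notification for a
// single related unit.
type Event struct {
	Kind string // "joined", "changed" or "departed"
	Unit string
}

// Reconcile compares the persisted view of the relation (known) with a
// fresh snapshot (current), each mapping unit name to settings version,
// and returns the events needed to get from one to the other. Neither
// map is modified.
func Reconcile(known, current map[string]int) []Event {
	var events []Event
	// Units we knew about that are no longer there, including any that
	// were deleted entirely while the process was down.
	for _, unit := range sortedKeys(known) {
		if _, ok := current[unit]; !ok {
			events = append(events, Event{"departed", unit})
		}
	}
	// Units that are new, or whose settings version has moved.
	for _, unit := range sortedKeys(current) {
		old, ok := known[unit]
		switch {
		case !ok:
			events = append(events, Event{"joined", unit}, Event{"changed", unit})
		case old != current[unit]:
			events = append(events, Event{"changed", unit})
		}
	}
	return events
}

// sortedKeys returns the map's keys in sorted order, so that the event
// order is deterministic.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

The same function serves both normal operation and restart: after
downtime, known is whatever was persisted and current is the first
snapshot from the watcher.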

But I'm a touch reluctant to do this, because it feels frankly perverse
to turn nice clear "unit X arrived" and "unit Y went away" messages into
"the current units are A, B, C" messages only to have the client unpack
those into almost exactly the original form. The more I think about it,
the more this approach seems correct [1]... but I'd greatly appreciate
thoughts, guidance, mockery, whatever springs to mind. Anyone?

Cheers
William

[0] not that there aren't other ways we could have done it in the first
place... I can think of at least one that looks more like what I just
suggested, but it suffers from exactly the same problems and would (I
think) still lead to a stream-becomes-states-becomes-stream solution,
for the same reasons.

[1] it is currently more complex than I'd prefer (ofc, I expect there's
some fat that could be trimmed) but it does naturally lead to a clean
separation between the "generate a stream of events" bit at the source
and the "persistence" bit in the middle, which IMO should really not be
mixed; the fact that it essentially turns back into a stream of very
similar events at the end is, I think, a red herring.