MongoDB Races, Locks, and Transactions

Tue Jul 3 04:13:19 UTC 2012

On Wed, Jun 27, 2012 at 10:41 PM, Gustavo Niemeyer
<gustavo.niemeyer at canonical.com> wrote:
> This is an issue indeed. I have a slightly different idea, but will
> polish it a bit further before posting.

So, after pondering on this for a good while, I have a proposal which
is hopefully not only simple, but also addresses a problem we've
been postponing for a while: proper termination of resources.

I apologize in advance, though, for the length of this email. Rest
assured it doesn't reflect the complexity of the solution.

The issue of termination is that when we have, for example, a unit
that is running on a machine, it's pretty bad to simply say
RemoveUnit(u) and take down all the data that the unit had associated
with it. The unit won't be able to magically stop working instantly
and clean up after itself in all possible relationships it has.
Instead, it should shut down politely, doing whatever it needs to do
for cleaning up after itself.

I've been postponing getting into this for a while, but it turns out
that if we change the termination mechanism to solve this issue now,
such associations "at the edge" become a lot simpler to get right
too, so it may well be a double win.

In more concrete terms, this is the proposal: we add a Lifecycle field
to all types and respective documents for which we care to track
existence in such form (services, units, machines, etc). This field
may have three values: Alive, Dying, or Dead (1, 2 or 3 in the
database). Every entity gets in the database as Alive. When its
termination is requested via Remove, its Lifecycle is changed to
Dying. This is a hint to the agent representing the entity itself
(e.g. the unit agent) that it's supposed to shut itself down ASAP.
Right before exiting, the agent informs the system it has ack'd its
death by setting its own Lifecycle to Dead. At this point, whoever is
responsible for the resources used by such entity (e.g. if it's a unit,
the machine agent) can clean up the resources, finish the job of
terminating the use of those resources (e.g. the machine agent removes
the container), and finally garbage collect the database document
itself, closing its lifecycle. At some point, we can also detect
abnormal situations and have the resource manager (e.g. machine agent)
kill the managed resource after a grace period.
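
To make the three states concrete, here is a minimal sketch in Go. The
names Life and Advance are made up for illustration and are not part of
the state package; the only things taken from the proposal are the
three values and their numbers. I'm assuming a lifecycle only ever
moves forward, and that skipping straight to Dead is allowed for the
abnormal case.

```go
package main

import "fmt"

// Life mirrors the proposed Lifecycle field. The numeric values match
// the ones suggested for the database (1, 2 or 3).
type Life int

const (
	Alive Life = iota + 1
	Dying
	Dead
)

func (l Life) String() string {
	switch l {
	case Alive:
		return "Alive"
	case Dying:
		return "Dying"
	case Dead:
		return "Dead"
	}
	return fmt.Sprintf("Life(%d)", int(l))
}

// Advance is a hypothetical helper: it only lets a lifecycle move
// forward. Asking for the current value again is a no-op, so a caller
// may retry safely.
func Advance(cur, next Life) (Life, error) {
	if next < Alive || next > Dead {
		return cur, fmt.Errorf("unknown lifecycle value %d", int(next))
	}
	if next < cur {
		return cur, fmt.Errorf("cannot go from %v back to %v", cur, next)
	}
	return next, nil
}

func main() {
	l := Alive
	l, _ = Advance(l, Dying) // Remove was requested
	l, _ = Advance(l, Dead)  // the agent ack'd right before exiting
	_, err := Advance(l, Alive)
	fmt.Println(l, err)
}
```

Garbage collecting the document after Dead is left out on purpose; it
belongs to whoever manages the resource.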

So, in practice, this is pretty straightforward to implement right
now, because we don't have to implement the full scheme at once. We
can start by having just equivalent functionality to what we have in
Python and in Go with the current state package. That is:

1) Introduce Lifecycle in all docs (not visible outside yet)
2) Change insertion points so the doc goes in with Lifecycle == Alive
3) Change query points so they don't return docs with Lifecycle != Alive
4) Change removal points so they set Lifecycle to Dying rather than removing
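
As a rough sketch of those four steps, the following uses an in-memory
map as a stand-in for the units collection, so nothing here is the
real state package or a real MongoDB call. State, unitDoc, AddUnit,
Unit and RemoveUnit are all hypothetical names picked for the example.

```go
package main

import (
	"errors"
	"fmt"
)

type Life int

const (
	Alive Life = iota + 1
	Dying
	Dead
)

// unitDoc stands in for a document in the units collection.
type unitDoc struct {
	Name string
	Life Life // step 1: lives in the doc, not exposed outside yet
}

var errNotFound = errors.New("not found")

// State is a hypothetical in-memory stand-in for the database.
type State struct {
	units map[string]*unitDoc
}

func NewState() *State {
	return &State{units: make(map[string]*unitDoc)}
}

// AddUnit is an insertion point (step 2): the doc goes in as Alive.
func (st *State) AddUnit(name string) error {
	if _, ok := st.units[name]; ok {
		return fmt.Errorf("unit %q already exists", name)
	}
	st.units[name] = &unitDoc{Name: name, Life: Alive}
	return nil
}

// Unit is a query point (step 3): docs that are not Alive are hidden.
func (st *State) Unit(name string) (*unitDoc, error) {
	doc, ok := st.units[name]
	if !ok || doc.Life != Alive {
		return nil, errNotFound
	}
	return doc, nil
}

// RemoveUnit is a removal point (step 4): it flips the doc to Dying
// instead of deleting it.
func (st *State) RemoveUnit(name string) error {
	doc, ok := st.units[name]
	if !ok || doc.Life != Alive {
		return errNotFound
	}
	doc.Life = Dying
	return nil
}

func main() {
	st := NewState()
	st.AddUnit("wordpress/0")
	st.RemoveUnit("wordpress/0")
	_, err := st.Unit("wordpress/0")
	// The unit is invisible to queries, but its doc is still stored.
	fmt.Println(err, len(st.units))
}
```

From the outside this behaves like plain removal, which is the
equivalence we want for now, while the document stays around for the
later steps of the scheme.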

That should be it. That should give us at least equivalence for the moment.

But! Now we can get back to the original point of this thread (hint:
good time for your preferred hot drink).

To recapitulate after such a long run, the original issue was that we
had the following procedures happening concurrently:

A) Remove service s
B) Add unit s/N

So, with the proposed world view, to remove service s we:

A1) Set s.Lifecycle to Dying.
A2) Set all units of s to Dying.

At the same time, to add unit s/N we:

B1) If s.Lifecycle is not Alive, error with s not found
B2) Insert s/N into units

So, your perceptive eyes will certainly have noticed that we have not
one, but two problems above, in case A2 or B2 never happen. That said,
they are in fact the same problem: one or more units of a Dying
service are still Alive. This is a trivial problem to solve with a
corrective agent that goes around looking for inconsistencies and
takes action on them, and is several orders of magnitude simpler to
get right than any kind of transaction mechanism that has to take
into account concurrency across several agents. We can easily have
multiple of those corrective agents running concurrently, if we police
ourselves to only do corrective actions idempotently, which the fix
for the described problem really is by nature.
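
Here is a small sketch of what one pass of such a corrective agent
could look like, again over in-memory maps rather than the real
database. State, unit and Correct are hypothetical names. The main
function replays the bad interleaving by hand: B1 passes, A1 and A2
run, and only then B2 inserts the new unit.

```go
package main

import "fmt"

type Life int

const (
	Alive Life = iota + 1
	Dying
	Dead
)

type unit struct {
	Service string
	Life    Life
}

// State is a hypothetical in-memory stand-in for the database.
type State struct {
	services map[string]Life  // service name -> lifecycle
	units    map[string]*unit // unit name ("s/N") -> unit doc
}

// Correct is one pass of the corrective agent: every Alive unit whose
// service is not Alive (or is missing altogether) is set to Dying.
// It reports how many units it changed. Running it again, or running
// several copies of it, ends in the same state.
func (st *State) Correct() int {
	fixed := 0
	for _, u := range st.units {
		if u.Life == Alive && st.services[u.Service] != Alive {
			u.Life = Dying
			fixed++
		}
	}
	return fixed
}

func main() {
	st := &State{
		services: map[string]Life{"s": Alive},
		units:    map[string]*unit{"s/0": {Service: "s", Life: Alive}},
	}

	// B1: the service looks Alive, so adding a unit may proceed.
	ok := st.services["s"] == Alive

	// A1 and A2 run in between.
	st.services["s"] = Dying
	for _, u := range st.units {
		if u.Service == "s" {
			u.Life = Dying
		}
	}

	// B2: the insert lands after the service started dying.
	if ok {
		st.units["s/1"] = &unit{Service: "s", Life: Alive}
	}

	first := st.Correct()  // picks up s/1
	second := st.Correct() // nothing left to do
	fmt.Println(first, second, st.units["s/1"].Life == Dying)
}
```

The same pass also covers the case where A2 never ran at all, since
it doesn't care how the unit ended up Alive under a Dying service.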

So, there we go. In a few words, rather than implementing
transactions, we take corrective actions.

Is anyone still with me? :-)

gustavo @ http://niemeyer.net