Using subdocument _id fields for multi-environment support

Wed Oct 1 12:31:34 UTC 2014

It feels a little strange to use a mutable object for an immutable field.
That said, it does seem functional. The immutability does speak to the
first disadvantage noted for the separate fields, namely becoming out of
sync, which afaics isn't possible with the current model, i.e. a change of
name needs to generate a new doc. Names (previously the _id) are unique in
usage, minus the extant bug that unit ids are reused. Even without that,
avoiding the duplicate doc data and the manual parse on every _id seem
like clear wins for subdoc _ids. I'm curious, though, what effect this
data structure has on mongo resource requirements at scale vs the compound
string: mongo tries to keep the _id index in memory, and when it doesn't
fit, perf becomes unpredictable (aka bad), as there are two IOs per doc
fetch (id and doc) and extra IO on insert to verify uniqueness.
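
As a rough way to put numbers on the size question, here is a minimal
sketch that applies the BSON element layout (type byte, NUL-terminated
key, int32 length, payload, trailing NUL) to both models. The UUID and
helper names are made up for illustration, and this only measures the
encoded BSON elements; it says nothing about how mongo lays out _id index
keys on disk or in memory.

```go
package main

import "fmt"

// bsonStringElem returns the encoded size in bytes of one BSON string
// element: type byte + key cstring + int32 length + value bytes + NUL.
func bsonStringElem(key, value string) int {
	return 1 + len(key) + 1 + 4 + len(value) + 1
}

// bsonSubdocElem returns the encoded size of an embedded-document element
// whose element list occupies bodySize bytes:
// type byte + key cstring + int32 document length + elements + NUL.
func bsonSubdocElem(key string, bodySize int) int {
	return 1 + len(key) + 1 + 4 + bodySize + 1
}

func main() {
	envUUID := "6983ac70-b0aa-45c5-80fe-9f207bbb18d6" // hypothetical UUID
	name := "wordpress"

	// Current model: prefixed string _id plus the two duplicated fields.
	prefixedID := bsonStringElem("_id", envUUID+":"+name)
	duplicated := bsonStringElem("name", name) +
		bsonStringElem("env-uuid", envUUID)

	// Proposed model: a single subdocument _id holding both values.
	body := bsonStringElem("env-uuid", envUUID) + bsonStringElem("name", name)
	subdocID := bsonSubdocElem("_id", body)

	fmt.Println("prefixed _id element:", prefixedID)
	fmt.Println("prefixed _id plus duplicate fields:", prefixedID+duplicated)
	fmt.Println("subdocument _id element:", subdocID)
}
```

With these inputs the subdocument _id element is larger than the prefixed
string on its own (it carries the inner field names), but smaller than the
prefixed string plus the two duplicated fields, so the per-document total
shrinks while the _id value itself grows.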

cheers,

Kapil

On Wed, Oct 1, 2014 at 12:25 AM, Menno Smits <menno.smits at canonical.com>
wrote:

> Team Onyx has been busy preparing for multi-environment state server
> support. One piece of this is updating almost all of Juju's collections to
> include the environment UUID in document identifiers so that data for
> multiple environments can co-exist in the same collection even when they
> otherwise have same identifier (machine id, service name, unit name etc).
>
> Based on discussions on juju-dev a while back[1] we have started
> doing this by prepending the environment UUID to the _id field and adding
> extra fields which provide the environment UUID and old _id value
> separately for easier querying and handling.
>
> So far, services and units have been migrated. Where previously a service
> document looked like this:
>
>     type serviceDoc struct {
>          Name          string `bson:"_id"`
>          Series        string
>          ...
>
> it now looks like this:
>
>     type serviceDoc struct {
>          DocID         string `bson:"_id"`       // "<env uuid>:wordpress/0"
>          Name          string `bson:"name"`      // "wordpress/0"
>          EnvUUID       string `bson:"env-uuid"`  // "<env uuid>"
>          Series        string
>          ...
>
> Unit documents have undergone a similar transformation.
>
> This approach works but has a few downsides:
>
>    - it's possible for the local id ("Name" in this case) and EnvUUID
>    fields to become out of sync with the corresponding values that make up the
>    _id. If that ever happens very bad things could occur.
>    - it somewhat unnecessarily increases the document size, requiring
>    that we effectively store some values twice
>    - it requires slightly awkward transformations between UUID prefixed
>    and unprefixed IDs throughout the code
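
To make the last downside concrete, the conversions in question look
roughly like the sketch below. The function names are hypothetical and are
not taken from the Juju code base; the ":" separator is the one shown in
the DocID comment above.

```go
package main

import (
	"fmt"
	"strings"
)

// docID builds the prefixed _id value used by the prefixing scheme.
// (Hypothetical helper, for illustration only.)
func docID(envUUID, localID string) string {
	return envUUID + ":" + localID
}

// splitDocID reverses docID. It splits on the first ":" only, which is
// safe as long as the environment UUID itself never contains a colon.
func splitDocID(id string) (envUUID, localID string, ok bool) {
	parts := strings.SplitN(id, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

func main() {
	// Hypothetical environment UUID.
	id := docID("6983ac70-b0aa-45c5-80fe-9f207bbb18d6", "wordpress/0")
	fmt.Println(id)

	envUUID, localID, ok := splitDocID(id)
	fmt.Println(envUUID, localID, ok)

	// An unprefixed id is rejected rather than silently misparsed.
	_, _, ok = splitDocID("wordpress/0")
	fmt.Println(ok)
}
```

Every caller that holds one form and needs the other has to go through a
pair like this, and has to decide what to do when the parse fails.
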
>
> MongoDB allows the _id field to be a subdocument so Tim asked me to
> experiment with this to see if it might be a cleaner way to approach the
> multi-environment conversion before we update any more collections. The
> code for these experiments can be found here:
> https://gist.github.com/mjs/2959bb3e90a8d4e7db50 (I've included the
> output as a comment on the gist).
>
> What I've found suggests that using a subdocument for the _id is a better
> way forward. This approach means that each field value is only stored once
> so there's no chance of the document key being out of sync with other
> fields and there's no unnecessary redundancy in the amount of data being
> stored. The fields in the _id subdocument are easy to access individually
> and can be queried separately if required. It is also possible to create
> indexes on specific fields in the _id subdocument if necessary for
> performance reasons.
>
> Using this approach, a service document would end up looking something
> like this:
>
>     type serviceDoc struct {
>          ID            serviceId `bson:"_id"`
>          Series        string
>          ...
>     }
>
>     type serviceId struct {
>          EnvUUID string `bson:"env-uuid"`
>          Name    string
>     }
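
The behaviour being relied on here can be illustrated without a database.
The sketch below is only an in-memory stand-in, assuming a Go map keyed by
the composite id plays the role of the unique _id index; the collection
type and its methods are hypothetical and no MongoDB driver is involved.

```go
package main

import (
	"fmt"
	"sort"
)

// serviceId mirrors the composite key from the structs quoted above.
type serviceId struct {
	EnvUUID string
	Name    string
}

type serviceDoc struct {
	ID     serviceId
	Series string
}

// collection is a hypothetical stand-in for a MongoDB collection. Keying
// the map by the whole serviceId imitates uniqueness of the _id.
type collection map[serviceId]serviceDoc

// insert rejects a document whose composite id is already present.
func (c collection) insert(doc serviceDoc) error {
	if _, exists := c[doc.ID]; exists {
		return fmt.Errorf("duplicate _id %+v", doc.ID)
	}
	c[doc.ID] = doc
	return nil
}

// namesInEnv returns the sorted service names in one environment. In
// MongoDB the comparable filter would use the dotted path "_id.env-uuid".
func (c collection) namesInEnv(envUUID string) []string {
	var names []string
	for id := range c {
		if id.EnvUUID == envUUID {
			names = append(names, id.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	c := collection{}
	// "env-a" and "env-b" are placeholder environment UUIDs.
	fmt.Println(c.insert(serviceDoc{serviceId{"env-a", "wordpress"}, "trusty"}))
	fmt.Println(c.insert(serviceDoc{serviceId{"env-b", "wordpress"}, "precise"}))
	fmt.Println(c.insert(serviceDoc{serviceId{"env-a", "mysql"}, "trusty"}))

	// Same name in the same environment is refused.
	fmt.Println(c.insert(serviceDoc{serviceId{"env-a", "wordpress"}, "trusty"}))

	fmt.Println(c.namesInEnv("env-a"))
}
```

The same service name coexists across environments, each part of the key
is read directly as a field with no string parsing, and nothing is stored
twice.
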
>
> There was some concern in the original email thread about whether
> subdocument style _id fields would work with sharding. My research and
> experiments suggest that there is no issue here. There are a few types of
> indexes that can't be used with sharding, primarily "multikey" indexes, but
> I can't see us using these for _id values. A multikey index is used by
> MongoDB when a field used as part of an index is an array - it's highly
> unlikely that we're going to use arrays in _id fields.
>
> Hashed indexes are a good basis for well-balanced shards according to the
> MongoDB docs so I wanted to be sure that it's OK to create a hashed index
> for subdocument style fields. It turns out there's no issue here (see
> TestHashedIndex in the gist).
>
> Using subdocuments for _id fields is not going to prevent us from using
> MongoDB's sharding features in the future if we need to.
>
> Apart from having to rework the changes already made to the services and
> units collections[2], I don't see any downsides to this approach. Can
> anyone think of something I might be overlooking?
>
> - Menno
>
>
> [1] - subject was "RFC: mongo "_id" fields in the multi-environment juju
> server world"
>
> [2] - this work will have to be done before 1.21 has a stable release
> because the units and services changes have already landed.
>
>
>