[RFC] dirstate design update

Martin Pool mbp at canonical.com
Mon Feb 19 03:42:57 GMT 2007


On 19 Feb 2007, Robert Collins <robertc at robertcollins.net> wrote:
> So, dirstate is now provisionally working, but I've realised theres a
> flaw in the current design. Its one that John and I discussed without
> really getting detailed enough last year, and I've now got a good handle
> on it.
> 
> I think we should modify dirstate to be roughly the following for each
> line in the state file: Make the tuple (dirname, basename, fileid) be
> the unique key (rather than (dirname, basename, fileid).

As Andrew said, the previous proposal had just (dirname, basename.)

I found it easier to think of this as a table with the key down the left
and one column for each parent tree.  For a tree with one parent (no
merges):

key                  working            basis
=================    ===============    =========================
src foo.py foo-id    file abcd 120 x    file abcd 120 x "mbp-123"

These cells can be empty if no file exists at that path in that tree.

> Secondly, we should make the data for each tree more consistent,
> probably by factoring out the path for each parent; this will make the
> state file smaller in the common case without reducing our ability to
> bisect for paths in each tree, because, rather than storing the details
> for parent entries under the path in the current tree, we should move
> them to be at the path in their own tree, giving us a per-treeid
> path->id mapping that we can bisect into. To keep the unique key unique,
> we need a record in a tree that can act as a pointer to the alternative
> path. Currently we have the following structure:
> 
> rows = dirname basename KIND fileid_utf8 size packed_stat fingerprint {parent_details} 
> parent_details = revision_utf8 KIND dirname basename size executable fingerprint
> 
> I propose that it become (all fields in utf8):
> row = key current_details {parent_details}
> key = dirname basename fileid
> current_details = common_details working_details
> common_details = minikind fingerprint size executable
> working_details = packed_stat
> parent_details = common_details history_details
> history_details = revisionid

To be clear, the fingerprint is the sha1 of a plain file, or the
contents of a symlink, etc.

> In any tree, a kind of 'moved' indicates that the fingerprint field
> (which we treat as opaque data specific to the 'kind' anyway) has the
> details for the id of this row in that tree.

Since the other fields will be irrelevant maybe we should have a special
packing for moved files.

To expand a bit on handling of moved files:

There can be several rows with the same file id and different paths,
each with a cell for each tree.  If there's no file with that id in a
particular tree, then all of the cells in rows with that id are blank.
(With a kind indicating 'not present'.)

If there is a file with that id, then the row with the corresponding
path and id contains its details.  There may be other rows with that id
and different paths, describing where it is present in other trees.  The
cells in those other rows all contain references to the place where it
is really found.

> I'm strongly tempted to add a id->path index as well, but I think that
> where we need id->path mapping; we also usually read the whole file, so
> I'm going to skip that for the moment, as we have the ability to locate
> via bisect any path in any tree, and if we lookup things by path, we can
> accumulate a id->path mapping as we go, which will tend to match what we
> looked for.
> 
> I plan to implement this asap, so please speak up now to alter/tweak the
> design - and once we stabilise on this, I'll update the wiki page for
> it.
> 
> The rationale for all this is that we want fast operations for the
> common case (diff/status/commit/merge on all files) and extremely fast
> operations for the less common but still occurs a lot status/diff/commit
> on specific files). Operations on specific files involve a scan for all
> the children of a path, *in every involved tree*, which the current
> format did not accommodate. 

-- 
Martin



More information about the bazaar mailing list