RFC: split-inventories design: children-of-node/path-traversal

Wed Oct 29 21:43:15 GMT 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> Opening this up for discussion; I don't have a solid answer yet, so this
> is a bit of a brain-dump at this pont.
> 
> Doing path traversal - a/b/c could be done a few ways.
> 
> basename index w/value list:
> here we would have a index on the basename, which isn't unique, so we'd
> need to be able to treat the value as a list, or allow repeated keys, or
> some such. Path traversal would then be a matter of doing set reduction
> - to get b, we could just lookup b, grab all the fileids with a basename
> of b, and walk up from them to the root - one will resolve to a/b/c.

Do you mean "b" here, or do you mean "c"?

> 
> fullpath index:
...
> fullpath index w/hashed keys:
...
> 
> parentid->childid index w/value list:

This seems like the same way we would do it now in the in-memory inventory.

> 
> parentid,child_basename->childid index:
> here we would have an index on (fileid, child_basename). This requires a
> index which supports things like 'foo bar' as keys, or encoding of
> basenames before using it.
> 
> Now, directory objects in memory have a children attribute listing all
> the paths within the directory; call this path discovery or 'ls'.
> 
> So doing ls: ls a/*:
> 
> basename index :- doesn't help

Well, if I understand correctly, means that you would do a basename
lookup to find the file-id corresponding to 'a', and then you would look
by file-id to find the children of a. (I suppose you could start at
root, and do everything by file-id.) I guess it depends on what other
indexes you have available.

> 
> fullpath index :- if we can iterate on keys starting with 'a/' we can
> find a/b and a/b/c, but this would need some logic to avoid finding out
> about c when we are only interested in children of a.

we could always make it a 'dirblock' style index, right? So that all
children of 'a' would be clumped together, and all children of 'a/b'
would be clumped together.

It is a little bit harder to compare keys in dirblock order, but we
pretty much have that worked out, I believe.

But your comment that we would need to support whitespace, etc, in the
index still applies. Though I think btree indexes pretty much only use
'\0' as the key separator, so they are a lot more flexible with what
they would allow.

> 
> fullpath hash-keys :- doesn't help
> 
> parentid-> childid :- works fine, just read everything.
> 
> parentid,child_basename :- works well, if we can get everything with the
> parent's value.

so you would start at root, find its file-id, then look for
(root_id,"a") to find its file-id, and then look for (a_id,*) to find
the rest of the entries?

> 
> I think the parentid,child_basename -> childid approach has legs, and I
> intend to make sure the CHKMap will support it.
> 
> -Rob

I would lean towards fullpath index. Because if you are using btree
index the keys end up compressed anyway, so the duplication is mostly
removed by the time we've writted pages to disk.

Also, there is the same duplication in parent_id,child_basename, except
a "parent_id" is a string that is 60 characters long which is longer
than most of the paths are going to be.

The average path is 30 characters in bzr.dev, and that includes the
basename. The average path length in mysql is 37 characters, also
including basename.

So while I think "parentid,child_basename" is interesting, I think a
fullpath index is much more what we want. Possibly a dirblock sorted
fullpath index, but fullpath nonetheless.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkI2PMACgkQJdeBCYSNAAOUBgCeKzBZLmNbRYtDukHPf2doHKFK
5FgAoLCkxYTJEHHAHTA2eXvfbHjArc9k
=Tdtd
-----END PGP SIGNATURE-----