[RFC] Strawman replacement local directory state

Fri Jun 16 11:40:35 BST 2006

On Wed, 2006-06-14 at 11:15 +1000, Michael Ellerman wrote:
> On Tue, 2006-06-13 at 11:45 -0500, John Arbash Meinel wrote:
> > Does anything scale well for a 2 million file repository? I suppose git
> > uses a per-directory approach. Though you must have a hell of a lot of
> > churn at the top level directory entry. (Since each dir changes when the
> > dirs below it change)
> 
> That's not the point though. A data structure like this makes it
> impossible/hard to speedup partial-tree operations.

Not at all. Its structured specifically to allow random access (with the
dual delimiters in place, this is now ready for use)

> If you want to do "status" on 2 million files then it's going to take a
> while. I think that's something users accept, because the operation the
> user is thinking of "status of 2 million files" is =~ what bzr is doing.
> 
> The problem is when a user runs "status just-one-file". They're thinking
> "status of 1 file should be fast", but with a data structure like this
> bzr is doing a similar amount of work to the full status. I think that
> has the potential to annoy people.

The optimal work to do status on one file with this format is:
 * read the dirstate header to determine the number of parents.
 * bisect search into the file to find the record for the file, or its
absence.
 * if we did not get the asked for file, report it as deleted.
 * parse the record if we found one.
 * stat the file on disk
 * calculate the sha1 or symlink target if it exists and the stat cache
failed to match.
 * print the output.

Thats log n seeks into the file, and parsing of (say) 300 bytes at each
seek to find the record we have landed on. Call it a read of 1K at each
seek point for nicety. Its even possible to consider a C extension for
this that operates on a mmapped file with the delimiters for data
structured to make the strings usable in situ.

That said, we'll probably start with read the whole file, split on '\0',
then bisect search to find the start point, and parse from there on.

we'll also want a multi-bisect to search for N supplied paths
concurrently.

-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 191 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20060616/cb3d4b07/attachment.pgp