[plugin] dirstate experiments
John Arbash Meinel
john at arbash-meinel.com
Wed Jun 14 07:38:43 BST 2006
After reading Robert's discussion of possible dirstate requirements, I
went ahead and implemented it in a few possible formats.
The plugin is available here:
http://bzr.arbash-meinel.com/plugins/dirstate-tests/
Basically it creates a few new commands:
build-sql-dirstate
read-sql-dirstate
build-flat-dirstate
read-flat-dirstate
build-blob-dirstate
read-blob-dirstate
The build* commands create a dirstate from the contents of the current
working directory, and take an argument naming where to store the
result.
The read* commands time how long it takes to read the dirstate in
directory order (everything in a directory, then first child, etc).
I basically run this by grabbing a Kernel tree (19k files, 1201 dirs).
And then running:
\rm ../dirstate.sql; bzr build-sql-dirstate ../dirstate.sql
bzr read-sql-dirstate --check ../dirstate.sql
etc. for each format type. Only the sql one really needs the file wiped
first; the others overwrite automatically.
These are the times that I get:
        build    read
sql     4.5s     500-800ms
flat    1.3s     290ms
blob    1.1s     350ms
These are not optimized with a fine-tooth comb, but I thought someone
else might be interested in playing with them.
My 'flat' format is newline-delimited, with one record (\0-separated
fields) per line. I chose that format because it can be parsed without
any if conditionals to figure out where you are in the record.
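To make that concrete, here is a minimal sketch of parsing such a flat
layout. The field names and their order are my own illustration, not the
plugin's actual schema:

```python
# Hypothetical sketch of a "flat" dirstate layout: one record per line,
# fields NUL-separated.  The five fields here are assumptions for
# illustration only.
def parse_flat(text):
    entries = []
    for line in text.splitlines():
        # A single unconditional split per line -- no branching to
        # figure out where each field starts.
        kind, path, file_id, size, mtime = line.split('\0')
        entries.append((kind, path, file_id, int(size), int(mtime)))
    return entries
```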
Now, these create dirstate files that are much larger than hg's
equivalent, mostly because they store the full fingerprint, hash, and
file_id, rather than just the minimal st_mtime and st_size.
It is also worth noting that the blob format is about 2/3rds the size of
the flat or sql formats.
I'm pretty happy with the results. I think we can improve things.
As an approximate point of comparison, hg's dirstate.read() seems to
take about 139ms for the kernel tree that I'm processing. And there are
a couple of commits in 'crew' which claim to extract a little bit more
performance. Though I can see one big glaring issue:

    if '\0' in f:
        f, c = f.split('\0')
That '\0' in f check should be pretty expensive, since it requires
iterating over every character in f (or even worse, building up a dict
to do that). I would think it would be faster to just always split, and
then handle whether one or two items came back.
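The always-split idea above would look something like this (a sketch,
not hg's actual code; the 'copied-from' second field is an assumption
based on the snippet):

```python
def parse_entry(f):
    # Always split, then branch on how many fields came back, instead of
    # scanning the string twice with "'\0' in f" followed by a split.
    parts = f.split('\0')
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], None
```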
These formats could also be tweaked to decrease the number of
'.decode()' calls. I think my 'flat' format would work just fine with a
single .decode() call; however, that would have to decode all of the
numbers, etc. as well, so it might be an overall wash.
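The single-decode variant would be roughly this (again a hypothetical
sketch, with the encoding assumed to be UTF-8):

```python
def read_flat_single_decode(raw):
    # Decode the whole file once, rather than once per field.  Note that
    # the numeric fields (sizes, mtimes) get decoded along with the
    # paths, which is why this might be an overall wash.
    text = raw.decode('utf-8')
    return [line.split('\0') for line in text.splitlines()]
```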
I think being around 250ms is pretty darn good, considering we are
handling unicode, and file ids, and full stat information. (Though I'm
not converting everything into integers just yet, so the actual cost
may be a bit higher.)
Another point of comparison:
flat bzr.dev with 702 entries is created in 60ms, and read in 8ms.
flat kernel with 20755 entries is created in 1300ms, read in 245ms.
blob bzr.dev with 702 entries is created in 45ms, and read in 13ms.
blob kernel with 20755 entries is created in 1180ms, read in 350ms.
According to these numbers, writing gets slightly faster per entry as
the record count grows (roughly 85us/entry down to 63us), while reading
stays about flat (11us versus 11.8us).
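The per-entry figures above come from simple division over the flat
timings; here is the arithmetic spelled out:

```python
# Back-of-the-envelope per-entry costs from the flat-format timings.
entries_small, entries_large = 702, 20755
write_small = 60.0 / entries_small * 1000     # us/entry, bzr.dev build
read_small = 8.0 / entries_small * 1000       # us/entry, bzr.dev read
write_large = 1300.0 / entries_large * 1000   # us/entry, kernel build
read_large = 245.0 / entries_large * 1000     # us/entry, kernel read
print(round(write_small), round(read_small))  # -> 85 11
print(round(write_large), round(read_large))  # -> 63 12
```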
People are welcome to follow suit, or tweak the hell out of this code,
or these formats.
I don't have all the pieces that were in the spec. Specifically, I'm
missing the base entries (path in base, etc.).
Also, the write time includes the time to generate all new ids.
John
=:->