Dirstate bisecting
John Arbash Meinel
john at arbash-meinel.com
Fri Feb 23 23:48:08 GMT 2007
One of the features of the dirstate format is that it is designed so
that we can efficiently bisect into the file to find a set of records
without having to read the entire file.
So I have resurrected my old multi-way bisect and updated it to handle
the fact that the dirstate can have repeated (dir, name) pairs (since
(dir, name, file_id) is the unique key).
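To make the idea concrete, here is a minimal sketch of the technique.
This is not the real dirstate layout or the real _bisect(), which
handles many keys in one pass; it assumes newline-terminated records
of the form "dir\0name\0file_id\0...", sorted by (dir, name, file_id):

import os

def _record_at_or_after(f, pos):
    """Return the first full record starting at or after byte pos."""
    if pos == 0:
        f.seek(0)
        return f.readline()
    f.seek(pos - 1)
    f.readline()          # consume the tail of the previous record
    return f.readline()

def bisect_one(f, dir_name):
    """Find all records matching a (dir, name) key using seeks
    instead of reading the whole file."""
    size = os.fstat(f.fileno()).st_size
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        record = _record_at_or_after(f, mid)
        if record and tuple(record.split(b'\0', 2)[:2]) < dir_name:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the lowest offset whose next record sorts >= dir_name.
    # Scan forward collecting every adjacent match, because the
    # (dir, name) key can repeat when only the file_id differs.
    matches = []
    record = _record_at_or_after(f, lo)
    while record and tuple(record.split(b'\0', 2)[:2]) == dir_name:
        matches.append(record)
        record = f.readline()
    return matches

So something like bisect_one(open('HEAD.dirstate', 'rb'),
(b'netwerk', b'Makefile.in')) does O(log(file size)) seeks plus one
short forward scan, rather than reading and parsing everything.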
So I thought I would give some timings on the result. It basically
boils down to how many seeks need to be done to find the files that
you care about. The actual testing looks something like:
time PYTHONPATH=/home/jameinel/dev/bzr/experimental/dirstate \
python -m timeit -s "import bzrlib.workingtree" \
-s "tree=bzrlib.workingtree.WorkingTree.open('HEAD.dirstate')" \
"tree.lock_read(); tree._dirstate._bisect([('dir', 'name')]); \
tree.unlock()"
To find the root entry takes 2.5ms; it is one of the more expensive
lookups because the root sorts first in the file, while bisect always
starts by jumping to the middle.
To find all of the entries for the path
netwerk/streamconv/src/Makefile.in (5 entries, including the root)
takes 5.03ms.
To find all of the top-level paths in the Mozilla tree (82 files and
directories) takes 5ms. (Finding all entries in a single directory
should be fast, because they are all grouped together in the file, so
it is really just one bisect to find that block.)
To find all files underneath 'netwerk' (699) takes 23ms.
Unfortunately it is difficult to do a bisect for all files in the
Mozilla tree, because the combined length of all of the paths exceeds
the maximum length of a command line. But driving it with something
like "find . ! -path './.bzr*' | xargs" is pretty consistent:
extracting 2000 files takes around 100ms.
Now, there is still quite a bit of tuning that could be done. But for
comparison, it takes 530ms to read all of the blocks in the dirstate
file, and in that same amount of time I can bisect for approximately
10,000 files. So the break-even point is around 1/5th of the tree: if
you want fewer than about one entry in five, bisecting is the faster
route.
But if we can implement this in a reasonable way, it means that we can
cut the time for "bzr diff foo" to a tiny fraction of the old cost,
since we can use bisect to quickly pull up the path => (file_id,
cur_info, old_info) mapping. There are other things we still need to
work out, like getting rid of the hash-cache, since right now reading
it takes approximately 1.5s.
I would like to work on a version of bisect that acts like 'paths2ids()'
so that it can recurse when it finds renamed records and directories,
returning everything that is contained underneath.
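As a rough sketch of the shape of that recursion (a toy model, not
bzrlib code: the dirblocks dict stands in for the real file, each
.get() stands in for one bisect, and rename chasing is left out):

import posixpath

def expand_paths(dirblocks, paths):
    """Return every path at or below the requested paths.

    dirblocks maps a directory to a list of (name, is_dir) entries;
    each .get() here models one bisect into the dirstate file, which
    is cheap because a directory's entries are adjacent on disk.
    """
    found = []
    pending = list(paths)               # directories still to expand
    while pending:
        dirname = pending.pop()
        for name, is_dir in dirblocks.get(dirname, []):
            path = posixpath.join(dirname, name)
            found.append(path)
            if is_dir:
                pending.append(path)    # recurse into subdirectories
    return found

The real version would also have to notice renamed records and chase
their other location, which is what makes it more than a plain loop.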
Robert has also mentioned implementing it with a different structure,
and binding it more completely into the rest of dirstate. Basically,
we would split the dirstate file into pages, read and parse only the
pages we need, and remember which ones have already been read. It is
an interesting way of keeping track of what you have already read
versus what you still need to read.
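A toy model of that bookkeeping (the names here are invented for
illustration, not Robert's design):

PAGE_SIZE = 4096

class PagedFile(object):
    """Read and cache a file one fixed-size page at a time."""

    def __init__(self, f):
        self._file = f
        self._pages = {}            # page number -> raw bytes

    def _read_page(self, n):
        if n not in self._pages:    # hit the disk only once per page
            self._file.seek(n * PAGE_SIZE)
            self._pages[n] = self._file.read(PAGE_SIZE)
        return self._pages[n]

    def read_bytes(self, offset, length):
        """Return data, faulting in only the pages the range spans."""
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        data = b''.join(self._read_page(n)
                        for n in range(first, last + 1))
        start = offset - first * PAGE_SIZE
        return data[start:start + length]

The bisecting code would then go through read_bytes(), so repeated
probes into the same region stop costing disk reads, and a full-tree
operation naturally ends up with the whole file paged in.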
Another possibility is to read the dirstate file from the top down:
read the root dirblock, then seek to the dirblock for the next
directory in the path, and so on until you get as deep as you need. By
reading in this way, you have an idea of where the next jump should
be. But I think you would end up doing a lot more reading and parsing
than the current bisecting code does.
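In code the top-down walk might look something like this, where
bisect_between() is a hypothetical helper that searches only a given
byte range and reports where the block it found ends:

def find_top_down(bisect_between, size, path):
    """Walk from the root dirblock toward path, one directory deep
    at a time."""
    parts = path.split('/')
    lo, hi = 0, size
    entries = []
    for depth in range(len(parts)):
        dirname = '/'.join(parts[:depth])   # '', then 'netwerk', ...
        entries, block_end = bisect_between(lo, hi, dirname)
        # A child directory's dirblock sorts after its parent's, so
        # the next search only has to look at [block_end, hi).
        lo = block_end
    return entries

That narrowing is the "idea of where the next jump should be", but
every path component costs a full block parse, which is why I suspect
plain bisecting still wins.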
Anyway, just some small dirstate updates.
John
=:->