dirstate bisect update
John Arbash Meinel
john at arbash-meinel.com
Thu Jul 6 04:30:49 BST 2006
As some of you may have been following, I've been working on a prototype
of a possible dirstate format that is meant to replace our current
'inventory' and 'basis-inventory' files.
Well, I've done a little bit more work on it, and I'm pretty pleased
with where I've gotten. I implemented the bisect search for multiple
paths at once. I've played with a lot of different implementations in
the last couple of days, and I have to say, it's pretty damn fast.
After adding support for plugins to provide benchmarks, I added tests
that take a dirstate file for an actual kernel tree (bzip2-compressed in
the source dir) and time how long it takes to 1) read the whole file,
and 2) bisect out just a few files.
After watching how the bisect was jumping around the file, I ended up
with more of a page-based reader. It now reads a 'page' (4k block),
checks the first and last entries on it, compares anything that should
fall on that page, and then jumps onward from there.
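As a rough illustration, here is a minimal sketch of that page-based bisect in Python. It assumes a sorted file of newline-terminated records with a NUL between path and data, which is not the real dirstate serialization; `bisect_paths`, `_page_entries`, and the record layout are all invented for the example.

```python
import bisect
import io

PAGE_SIZE = 4096

def _page_entries(f, page_no):
    # Return the complete records that *start* inside this page.  We read
    # two pages so a record straddling the page boundary can be finished.
    # (A record longer than two pages is not handled by this sketch.)
    if page_no == 0:
        f.seek(0)
        data = f.read(2 * PAGE_SIZE)
        start = 0
    else:
        f.seek(page_no * PAGE_SIZE - 1)
        chunk = f.read(2 * PAGE_SIZE + 1)
        data = chunk[1:]
        # If the byte just before the page is '\n', the page begins on a
        # record boundary; otherwise skip the tail of the previous record.
        start = 0 if chunk[:1] == b'\n' else data.find(b'\n') + 1
    entries = []
    pos = start
    while pos < PAGE_SIZE:
        end = data.find(b'\n', pos)
        if end == -1:
            break
        entries.append(data[pos:end])
        pos = end + 1
    return entries

def bisect_paths(f, size, wanted):
    # For each wanted key, bisect over *pages* rather than records: read a
    # page, compare the key against its first and last entries, and only
    # scan within the page when the key falls inside its range.
    n_pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
    found = {}
    for key in wanted:
        lo, hi = 0, n_pages - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            entries = _page_entries(f, mid)
            if not entries:
                break
            keys = [e.split(b'\0', 1)[0] for e in entries]
            if key < keys[0]:
                hi = mid - 1
            elif key > keys[-1]:
                lo = mid + 1
            else:
                i = bisect.bisect_left(keys, key)
                if i < len(keys) and keys[i] == key:
                    found[key] = entries[i].split(b'\0', 1)[1]
                break
    return found

if __name__ == '__main__':
    buf = b''.join(b'path-%04d\x00value-%04d\n' % (i, i) for i in range(1000))
    found = bisect_paths(io.BytesIO(buf), len(buf), [b'path-0042'])
    print(found[b'path-0042'])  # b'value-0042'
```

Each lookup touches only O(log n_pages) pages instead of the whole file, which is where the 6ms-vs-325ms gap comes from.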
Anyway, when all is said and done, I can extract 40 records (2
directories) out of a 21k dirstate file (with 2 parent entries), in 6ms.
(reading in the complete file takes 325ms). It's hard to get it much
faster, especially with the benchmark resolution being 1ms.
I've gone back and forth on whether the file should contain split paths
or relpaths, and I've finally decided on split paths. It takes slightly
longer to process, but I also tested join versus split speed: a custom
join function can take ~1us, while os.path.split takes 3-4us.
(os.path.join is also pretty slow at 3-4us.)
With 20k records, even if split paths cost us 10ms extra to read, we
save 20k*2us = 40ms in split time later on. And really it is
20k*2*2us = 80ms, because each record carries a parent relpath as well.
(In the bisect testing, the number of join/split calls is small enough
to barely register).
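That back-of-envelope figure can be checked directly (the ~2us/call cost is taken from the timeit runs in the PPS below):

```python
records = 20_000
splits_per_record = 2    # each record has its own path plus a parent relpath
cost_per_split = 2e-6    # ~2 us saved per avoided os.path.split call

saved = records * splits_per_record * cost_per_split
print('%.0f ms saved' % (saved * 1e3))  # 80 ms saved
```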
So with the bisect stuff in place, it should be possible to do partial
diff/status operations extremely fast. At least dirstate should not be a
bottleneck.
We still have to figure out how we want to do all of this for the basis
inventories versus the current inventory. The biggest problem is that
when you rename or delete a file, where do you put the records?
I'm tempted to just have multiple files. I know there is some overhead
if you have to seek between two files, which is why we would like to
have everything combined in a single file.
The only other thing to work out is being able to run a generator over
these files, so you can start doing something after you've read a little
bit, rather than waiting until everything is finished.
Right now the whole file reader parses everything up front, and then
returns a big list. Conceptually it could be turned into a generator for
the chunks. The only tricky point is that the parent ids are also in the
file, along with all the other data. So they might have to be the first
things returned by the generator. And it could return the number of
entries, since it knows that before it parses the rest of the file.
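For illustration, here is a minimal sketch of such a generator in Python. The layout is a made-up simplification (NUL-separated parent ids on line 1, the entry count on line 2, then one NUL-separated record per line); the real dirstate format differs, and `iter_dirstate` is a hypothetical name.

```python
import io

def iter_dirstate(f):
    # Yield the header first, so callers get the parent ids and the entry
    # count before any records are parsed.
    parents = f.readline().rstrip(b'\n').split(b'\0')
    num_entries = int(f.readline())
    yield parents, num_entries
    # Then yield each record as it is read, one line at a time.
    for _ in range(num_entries):
        path, data = f.readline().rstrip(b'\n').split(b'\0', 1)
        yield path, data

if __name__ == '__main__':
    f = io.BytesIO(b'parent-1\x00parent-2\n2\na\x00data-a\nb\x00data-b\n')
    it = iter_dirstate(f)
    parents, count = next(it)   # header is available before any record
    print(parents, count)
```

Because the parent ids and entry count come out first, a caller can start reporting progress or preallocating before touching any records.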
The current bisect reader keeps a dictionary, because I didn't want to
force the api to require the input list to be in directory-sorted order,
which is the order we have to bisect through the file in.
If we change the api, then it could definitely return each record as it
is found.
John
=:->
PS> These are the current benchmark results:
running tests...
...ate.BenchDirstate.test_bisect_kernel_1_parent OK 5ms/ 548ms
...nchDirstate.test_bisect_kernel_1_parent_dense OK 2ms/ 773ms
...nchDirstate.test_bisect_kernel_2_parent_dense OK 6ms/ 1228ms
...enchDirstate.test_bisect_simple_0_parent_utf8 OK 0ms/ 6ms
...enchDirstate.test_bisect_simple_1_parent_utf8 OK 0ms/ 5ms
...enchDirstate.test_bisect_simple_2_parent_utf8 OK 0ms/ 6ms
...state.BenchDirstate.test_read_kernel_0_parent OK 107ms/ 431ms
...state.BenchDirstate.test_read_kernel_1_parent OK 222ms/ 747ms
...state.BenchDirstate.test_read_kernel_2_parent OK 356ms/ 1074ms
...state.BenchDirstate.test_read_kernel_3_parent OK 434ms/ 1160ms
...nchDirstate.test_read_simple_1_parent_unicode OK 0ms/ 6ms
....BenchDirstate.test_read_simple_1_parent_utf8 OK 0ms/ 5ms
PPS> These are my timing tests for join vs split:
$ python -m timeit -s "
def myjoin(base, path):
    if base:
        return base + '/' + path
    else:
        return path
x = ''
" "myjoin(x, 'foo')"
1000000 loops, best of 3: 0.671 usec per loop
$ python -m timeit -s "
def myjoin(base, path):
    if base:
        return base + '/' + path
    else:
        return path
x = 'bar'
" "myjoin(x, 'foo')"
1000000 loops, best of 3: 1.06 usec per loop
$ python -m timeit -s 'import os' "os.path.join('', 'foo')"
100000 loops, best of 3: 2.46 usec per loop
$ python -m timeit -s 'import os' "os.path.join('bar', 'foo')"
100000 loops, best of 3: 3.65 usec per loop
$ python -m timeit -s 'import os' "os.path.split('foo/bar')"
100000 loops, best of 3: 3.84 usec per loop
$ python -m timeit -s 'import os' "os.path.split('foo/bar/baz')"
100000 loops, best of 3: 4.09 usec per loop