index/dirstate bisection notes

John Arbash Meinel john at arbash-meinel.com
Sun Oct 7 16:01:40 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I've been putting bisection into the toy GraphIndex logic.
> 
> I wrote a generic bisect_multi_bytes routine, that may be usable for
> both GraphIndex and dirstate. One thing I realised after doing that
> interface is that its a bit limiting - after bisection for one key, we
> often have enough info to bisect a much smaller region of the object, if
> we use the knowledge from the prior bisection to seed the upper and
> lower bounds for that key. But its trickier for generic driver to do
> this, so I'm probably going to send up a slightly less efficient initial
> patch than is possible - but the good news is that the iniefficiency
> will be purely in memory, not externally.
> 
> -Rob
> 
> 

Having written the initial dirstate one, I would certainly agree. I would
consider writing it as a Bisector class, that could be preserved between calls.

Assuming you are working with a page-based version (which is what I did,
because it had the best local performance), you could just cache the first and
last entry per-page => offset in file.

I believe the functions you need to make it "generic" is:

1) A function to take a 'page' and turn it into 'possible records'. For
dirstate this was simply "page.split('\n')". It may be the same for you index code.

2) A function to take 'possible records' and inspect them enough to get the key
out. The dirstate code actually did "rows = record.split('\0')" because it new
that all fields were null separated. However, it only needed the first 2 'rows'
to be able to answer:

3) A comparison function to see whether the searched for key matches, comes
before, or comes after the extracted key.

4) A 'row to object' function. Something that takes the parsed stuff, and turns
it into your final representation.

Some interesting observations...

The dirstate bisect code is actually written to search for prefixes of the
correct keys. This is because we are bisecting based on path, but the key is
actually ((dir-, -path), name, file_id)

I was actually wanting to write a directory recursive version of bisect_multi,
which when it found a directory, would find all the children of that entry. I
have one, but it is inefficient because it "forgets" everything between each
search. However, with the page-caching it would probably be just as good as a
fully customized loop.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHCPTUJdeBCYSNAAMRAiSTAJ9PwLoNR2Ws5cC41cX/qRV/yE9a5wCgxJRU
pIi8fT+mA3nYE4kTwf1NxGU=
=mWAR
-----END PGP SIGNATURE-----



More information about the bazaar mailing list