dirstate _bisect_recursive and paths2ids

John Arbash Meinel john at arbash-meinel.com
Sun Feb 25 16:25:28 GMT 2007


Over the weekend I finished implementing a recursive bisect function,
which is capable of finding all records for a set of paths (just like
paths2ids). It grabs all records underneath directories, and returns
both sides of a rename.
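
For the curious, the core idea looks roughly like the sketch below.
It's a toy model with hypothetical names, not the real bzrlib code:
dirstate rows are sorted by (dirname, basename), so a binary search
lands on any path directly, and all of a directory's direct children
sit in one contiguous run keyed by that directory.

import bisect

def bisect_recursive(rows, paths):
    """Return every row at or below the given paths.

    rows: ((dirname, basename), entry) tuples, sorted by key.
    paths: '/'-joined paths relative to the tree root, e.g. ['netwerk'].
    """
    keys = [key for key, _ in rows]
    found = []
    pending = []
    for path in paths:
        dirname, _, basename = path.rpartition('/')
        idx = bisect.bisect_left(keys, (dirname, basename))
        if idx < len(keys) and keys[idx] == (dirname, basename):
            found.append(rows[idx])
            pending.append(path)
    while pending:
        dirpath = pending.pop()
        # Direct children of dirpath form one contiguous run of keys
        # whose dirname is exactly dirpath; files simply have no run.
        i = bisect.bisect_left(keys, (dirpath,))
        while i < len(rows) and keys[i][0] == dirpath:
            found.append(rows[i])
            child_dir, child_base = keys[i]
            pending.append(child_base if not child_dir
                           else child_dir + '/' + child_base)
            i += 1
    return found

The real code does the equivalent bisection over the raw file contents
on disk, parsing only the lines it actually touches; the in-memory list
above is just to show the key ordering and the recursion.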

I'm not convinced that the API is perfect, because the data requires a
bit of massaging to work with the other APIs. And while it returns the
entries it reads, those APIs would really expect the DirState object
itself to hang on to the records. So I need to come up with a way to
track which records have been read, and where they were found, so that
we can use them in future calls. That will also make the recursive
bisect function a little faster, since it won't bisect the same rows
over and over again.
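
One possible shape for that bookkeeping (again just a sketch, with
hypothetical names) is a per-DirState cache of rows keyed by
(dirname, basename), consulted before ever bisecting again:

class _BisectCache(object):
    """Remember rows as bisect discovers them (hypothetical helper)."""

    def __init__(self, bisect_one):
        self._bisect_one = bisect_one  # function: key -> parsed row
        self._rows = {}                # (dirname, basename) -> row

    def lookup(self, key):
        # Later calls, and later steps of the recursion, hit the dict
        # instead of re-bisecting the same region of the file.
        if key not in self._rows:
            self._rows[key] = self._bisect_one(key)
        return self._rows[key]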

However, it is potentially a great win. As a point of comparison, I
can use it to find all entries underneath the 'netwerk' directory,
which comes to 699 records.

time PYTHONPATH=$HOME/dev/bzr/dirstate python -m timeit \
  -s "import bzrlib.workingtree" \
  -s "wt = bzrlib.workingtree.WorkingTree.open('HEAD.dirstate')" \
   "bt = wt.basis_tree(); wt.lock_read(); \
    wt.paths2ids(['netwerk'], [bt]); wt.unlock()"


'netwerk'
paths2ids: 560 msec per loop
paths2ids_using_bisect: 34.8 msec per loop

The big savings comes from not having to parse all 50,000 records just
to get at the 699 we care about.

Another comparison is to look for a single file, "netwerk/Makefile.in",
which should return only a single record.

'netwerk/Makefile.in'
paths2ids: 560 msec per loop
paths2ids_using_bisect: 5 msec per loop

(Roughly a 100x speedup :)


Now, there are cases where performance is probably worse: if the data
has already been read into memory, or if you are looking for all
records ('bzr status' versus 'bzr status foo').

'' (all files)
paths2ids: 922 msec per loop
paths2ids_using_bisect: 2.5 sec per loop

(Roughly a 2.7x slowdown)

My thought is to counter this with something like:

        # Only bisect when the dirblocks have not already been parsed,
        # and the caller is not asking for the whole tree anyway.
        if (state._dirblock_state == dirstate.DirState.NOT_IN_MEMORY
            and '' not in paths):
            paths2ids = self._paths2ids_using_bisect
        else:
            paths2ids = self._paths2ids_in_memory

Where _paths2ids_in_memory is the current implementation. It is just an
imperfect heuristic, but I think it is a decent place to start.


So if we have already loaded the data into memory, here are the
performance comparisons:

'' (all files)
paths2ids: 320 msec per loop
paths2ids_using_bisect: 2.64 sec per loop

'netwerk' (699 files)
paths2ids: 3.79 msec per loop
paths2ids_using_bisect: 35.8 msec per loop

'netwerk/Makefile.in' (1 file)
paths2ids: 178 usec per loop
paths2ids_using_bisect: 1.71 msec per loop


So while _bisect_recursive won't help yet (because everything else will
want to load the whole dirstate first), it has the potential to save us
a *lot* of time when handling small subsets of large trees.

John
=:->

PS> I'm trying hard to stop working on bisect and just get DirState
passing all tests, so I'm not planning to work on bisect for a while.
But I thought I would get it to basic functionality, which also gives
us an idea of the performance we could get out of it.


