some more indexing thoughts..

Robert Collins robertc at robertcollins.net
Sun Jul 15 09:01:01 BST 2007


This isn't formulated enough for me to want to drop it in the
documentation yet.

I've been thinking about the data access patterns that are shaping up
from the use-case repository-level requirements.

We do a lot of topological related activity.

Specifically file text reconstruction and graph walking.

So the questions I have are:
 - whats the win by sorting file texts topologically (latest is a full
text, older ones are deltas)
 - whats the win by having a topologically sorted revision graph index
e.g. make finding a revid a linear scan then have topological data from
there on in. This would give great locality of reference for 'recent'
data in any index for determining merge bases and the like.

I like the idea of starting with a simple but 'good enough' index and
sorting out the other challenges with changing the repo layer
(naturally, or I wouldn't be hacking on this)... but there seems to be
room for some heavy tuning too.

BTW, quick figures on the GraphIndex based revisions.knit performance:

(its still not bisecting)
- initial pull to rev 1000
New index:
real    0m8.413s
user    0m6.880s
sys     0m0.348s

regular:
real    0m6.710s
user    0m4.888s
sys     0m0.364s

- incremental pull to 2000
real    9m42.330s
user    8m46.457s
sys     0m9.157s
...and a crash :|.

I think the slowdown on the initial pull is the overhead of reading the
index after the revisions are inserted to pull file texts out as the
InMemoryGraph should be fast to query.

The second pull is clearly slower and that is most certainly index
lookups. Building a cache of the index and using that will help;
possibly caching the decoded values in the GraphKnitIndex class may help
too. I'm not sure where to introducing caching at the moment. I want to
make it a little faster for testing; without getting distracted from the
repo refactoring needed to make repos support a wider keyspace and thus
split inventories etc etc etc.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070715/076d37a7/attachment.pgp 


More information about the bazaar mailing list