some more indexing thoughts..
Aaron Bentley
aaron.bentley at utoronto.ca
Mon Jul 16 14:32:18 BST 2007
Robert Collins wrote:
> This isn't formulated enough for me to want to drop it in the
> documentation yet.
>
> I've been thinking about the data access patterns that are shaping up
> from the use-case repository-level requirements.
>
> We do a lot of topological related activity.
>
> Specifically file text reconstruction and graph walking.
>
> So the questions I have are:
> - what's the win by sorting file texts topologically (latest is a full
> text, older ones are deltas)
I think for actual file contents, it has a lot of promise. It means we
probably don't need to read multiple ranges of a file-- all the data to
reconstruct a given version should be nearby.
I would imagine topological sorting could be harmful if the "file texts"
were revisions or inventories, since they may be in unrelated projects.
A topological sort could cause all revisions for project foo to come
before any revisions of project bar, which would unfairly penalize
access speed for revisions or inventories of bar.
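To make the ordering concrete, here is a minimal sketch of topologically sorting a version graph (the graph shape and revision names are invented for illustration, not bzr's actual data structures). A topological order lists every parent before its children; reversing it puts the newest versions (the full texts in the scheme above) first.

```python
# Minimal topological sort of a version graph (Kahn's algorithm).
# `parents` maps each version id to its parent ids.  In the emitted
# order, every parent appears before any of its children, so data
# for related versions of one file ends up contiguous.
from collections import deque

def topo_sort(parents):
    children = {v: [] for v in parents}
    pending = {}  # number of parents not yet emitted
    for v, ps in parents.items():
        pending[v] = len(ps)
        for p in ps:
            children[p].append(v)
    ready = deque(v for v, n in pending.items() if n == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for c in children[v]:
            pending[c] -= 1
            if pending[c] == 0:
                ready.append(c)
    return order

# Versions of one file: rev1 <- rev2 <- rev3
print(topo_sort({'rev1': [], 'rev2': ['rev1'], 'rev3': ['rev2']}))
# → ['rev1', 'rev2', 'rev3']
```

Note that two independent projects mixed into one graph would each form a connected component, and nothing in the sort prevents one component from being emitted entirely before the other, which is the penalty described above.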
If all subsequent file texts are deltas, there's certainly a win for
generating bundles.  I'm not sure we can abandon the notion of
occasional snapshots, though-- we don't want retrieval of old texts to
be painfully slow.  It would be nice to have a delta *and* a snapshot
in those cases.
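The snapshot idea can be sketched as follows.  This is a toy store, not bzr's actual format: the snapshot interval, the append-only "delta", and the function names are all assumptions made for illustration.  The point is just that a snapshot every N versions bounds how many deltas must be applied to rebuild any text.

```python
# Sketch of delta chains with periodic snapshots.  Each stored entry
# is either a full snapshot of the text or a (naive, append-only)
# delta against the previous version.
SNAPSHOT_EVERY = 4  # assumed policy, not a real bzr parameter

def add_version(store, order, text_lines):
    vid = len(order)
    if vid % SNAPSHOT_EVERY == 0:
        store[vid] = ('snapshot', list(text_lines))
    else:
        basis = vid - 1
        basis_lines = get_text(store, basis)
        # toy delta: just the lines appended beyond the basis
        store[vid] = ('delta', basis, text_lines[len(basis_lines):])
    order.append(vid)
    return vid

def get_text(store, vid):
    """Rebuild a text: walk back to the nearest snapshot, apply deltas."""
    kind, *rest = store[vid]
    if kind == 'snapshot':
        return list(rest[0])
    basis, appended = rest
    return get_text(store, basis) + list(appended)

store, order = {}, []
for i in range(6):
    add_version(store, order, ['line %d' % j for j in range(i + 1)])
print(get_text(store, 5))  # rebuilt from the snapshot at version 4 plus one delta
```

With this layout, no reconstruction ever applies more than SNAPSHOT_EVERY - 1 deltas, so old texts stay reasonably cheap to retrieve while most versions are still stored compactly.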
> - what's the win by having a topologically sorted revision graph index
> e.g. make finding a revid a linear scan then have topological data from
> there on in. This would give great locality of reference for 'recent'
> data in any index for determining merge bases and the like.
Our normal usage patterns are heavily-weighted towards recent data. If
we can get fast results on recent data, that's almost as good as having
fast results on arbitrary data. And since it's much simpler to
implement than fast access to arbitrary data, it's definitely worth
trying first.
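The "linear scan, recent-first" idea can be sketched like this (the record layout and revision ids are hypothetical, not bzr's index format): if index records are written newest-first, a plain forward scan finds recent revids after touching only a few entries, which matches our access pattern.

```python
# Sketch: an index whose records are ordered newest-first, so a
# linear scan resolves recent revision ids almost immediately.
def find(records, revid):
    """Return (position, value) by scanning from the newest record."""
    for pos, (rid, value) in enumerate(records):
        if rid == revid:
            return pos, value
    raise KeyError(revid)

# newest first: rev-103 is the branch tip
records = [('rev-103', 'offset 9000'),
           ('rev-102', 'offset 8000'),
           ('rev-001', 'offset 0')]
print(find(records, 'rev-103'))  # found on the first probe
```

Old revisions still cost a full scan in the worst case, but as argued above, fast access to recent data covers most real usage and is far simpler to build.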
> The second pull is clearly slower and that is most certainly index
> lookups. Building a cache of the index and using that will help;
> possibly caching the decoded values in the GraphKnitIndex class may help
> too. I'm not sure where to introduce caching at the moment.
I find that the kcachegrind display is very useful for analysing
performance, particularly the callee map, which is a tree map
(http://en.wikipedia.org/wiki/Treemapping).
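On the caching question, one cheap option is to memoize decoded index entries so repeated lookups skip re-parsing.  The sketch below is a generic stand-in: the class, the raw-entry format, and `_decode` are invented for illustration and are not GraphKnitIndex's actual API.

```python
# Sketch of caching decoded index values: parse each raw entry at
# most once, then serve later lookups from an in-memory dict.
class CachingIndex:
    def __init__(self, raw_entries):
        self._raw = raw_entries   # key -> undecoded bytes
        self._decoded = {}        # key -> decoded tuple

    def get(self, key):
        if key not in self._decoded:
            self._decoded[key] = self._decode(self._raw[key])
        return self._decoded[key]

    @staticmethod
    def _decode(raw):
        # stand-in for real index parsing: space-separated parent ids
        return tuple(p.decode('ascii') for p in raw.split(b' '))

idx = CachingIndex({'rev-2': b'rev-1 rev-0'})
print(idx.get('rev-2'))  # decodes once
print(idx.get('rev-2'))  # served from the cache
# → ('rev-1', 'rev-0') both times
```

Whether the cache belongs at this layer or lower down (e.g. caching the raw index pages) is exactly the open question in the quoted text; profiling with kcachegrind should show which decode path dominates.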
Aaron