Optimizing tree building

Fri Jun 8 02:56:14 BST 2007

On Thu, 2007-06-07 at 21:13 -0400, Aaron Bentley wrote:

> >> I figure if we do get a container-based repository, we'll still need
> >> indexes, no?  That makes me think it might be more productive to work on
> >> a new way of handling inventories than to optimize the current one.
> > 
> > We'll still need indexes. What of, and how many, and what the keys are
> > are still open questions. Changes to inventories - their content and
> > representation - will also impact what indexing is needed, but I think
> > the indexing layer should be implementable separately. One important
> > thing is whether indexes contain graph data or not. Specifically, in
> > knits the index and the graph are punned. This is good in some respects
> > and bad in others; I hope we can address this during the design of
> > indexing.
> 
> Oops.  I meant "indexs", not "inventories".

:).
Yes we'll need indices for the containers. Some queries that we need
today:
 * find the disk reads required to reconstruct a series of texts[text
including revisions, signatures and inventories]
 * is the text T present?
And punned with the current knit index:
 * Give me the graph data for text T

> In a design like our current one, keeping the size of the indices low is
> vital, so stripping the graph data would be a win in that regard.  It
> would still be nice to be able to quickly get a list of
> build-dependencies, though, so perhaps the primary index could point at
> a secondary index, which would give a list of build-dependencies (and
> their locations), which could then be requested from the text storage
> all at once.

Thats a possible approach. I think the irreducible data inside the core
container is:
 * Names of the text
When delta'd:
 * What it is delta'd against;
 * and the delta used if its not self-identifying.

(In knits this is obtained by punning: the name is the key in the knit;
the delta source is the left most parent; the delta format is fixed)

For the use case 'is the text T present' our goal is a small number of
IO's to determine presence/absence. A simple bisect into a simple
serialised list can deliver log(keys) performance. Fixed size records
may help a little but don't make a significant difference from what I
can tell. They matter if you want to address record X, but not if you
are binary searching based on byte count. (John has written code to do
this for dirstate, and it does not have fixed width records).

For the use case 'find the disk reads required to reconstruct a series
of texts[which can be revisions, signatures and inventories]' our goal
is to figure out what disk reads are needed to reconstruct the texts. If
we wanted to we could consider mixing disk reads from multiple locations
- e.g. read from local disk to point X, then from remote network through
to Y. To do that we probably need to consider the figure-out step as one
which works backwards from the desired text figuring out what IO is
needed. Each time it stops we should know the next text it wants to be
able to compose on top of; then we can switch to a better source. On the
other hand small indexes are good so if we can avoid knowing each
individual step when many are bundled together, that would be good. E.g.
for the new container format you could do this by the index saying
'start at byte N, read and parse until you find what you want'. Then if
there we 32KB of deltas run-together with the text T at the end, the
index could point at the beginning of the run, and an optimistic IO read
would get everything in one hit.

With knits our indexing problem is that we store all the deltas for a
related-over-time object together, so regenerating an index for the knit
is expensive - a single group of data is O(history). If we were to
upload an additional index per revision that might be interesting
though.

e.g. create a text construction index for all the data in a revision -
pulling all the knit offsets from across the repository into one place.
Point at the beginning of the continuous run for any part of a delta
chain. Name the index based on the revision name. Then for a file
changed in any given revision we can possibly avoid reading the knit
index completely. We could build large composed indexes like this too,
and only upload one when we do a push, and when uploading one only
include the revisions we pushed, not the whole repository. E.g. apply
the management techniques we're talking about for container-based
repositories to knit based repos. I think this could be an interesting
excercise, and if done right the code would be immediately relevant for
container based repos too.

Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070608/46aff37d/attachment.pgp