RFC: scaling-inventory rework current thoughts

Robert Collins robertc at robertcollins.net
Tue Aug 12 00:45:23 BST 2008


On Mon, 2008-08-11 at 17:38 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1

> > Going back to basics then, for serialisation scaling, we want something
> > that:
> >  1 allows commit to write less data than the full tree most of the time
> >  2 allows the data to be written to be calculated without processing
> >    size(tree) information [this lets importers and named-file commits
> >    be faster, and allows a nice coupling between workingtree (all-data
> >    by nature) and repositories].
> >  3 generates the same exact representation for a given inventory
> >    regardless of the amount of history available
> 
> I think there is some flexibility here between the logical form, and the
> stored form on disk. For example, if we have deltas be optional in the
> stream. So converters are welcome to generate deltas directly, but
> inserting full versions in their place would be acceptable as well. (As
> I understand your "journaling" system, this was not the case, as the
> delta text was the canonical form.)

I agree that the storage subsystem might do a binary delta to store
things; but actually I think we want to control this. Let's say I have a
100K-path tree split into 5K-path elements - so 20 separate serialised
components. If the storage layer then does deltas which add latency,
like knits do - 100 elements on average - that's 200 IOs to read in the
inventory, rather than the 20 which the higher-level code will be
anticipating.

So we should analyse the full stack and be clear about the impact of
such things. Additionally, if the storage layer *needs* to do binary
deltas to have efficient storage, then we undo the goal of doing less
work to some degree. So it needs to be considered holistically.
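To make the latency concern concrete, here is a back-of-envelope sketch
of the read counts involved; the function and the delta-chain depth are
illustrative assumptions for this mail's figures, not bzr code.

```python
def inventory_read_ios(paths, fragment_size, delta_chain_length=1):
    """Estimate the IOs needed to load a fragmented inventory.

    With fulltext fragments (chain length 1) we pay one IO per fragment;
    if the storage layer delta-chains each fragment behind our back,
    every fragment read turns into a chain walk.
    """
    fragments = -(-paths // fragment_size)  # ceiling division
    return fragments * delta_chain_length

# The mail's figures: a 100K-path tree in 5K-path fragments.
print(inventory_read_ios(100_000, 5_000))      # 20 IOs with fulltext fragments
print(inventory_read_ios(100_000, 5_000, 10))  # 200 IOs if each fragment sits
                                               # behind an (assumed) 10-deep chain
```

The point being that the multiplier is invisible to the higher-level
code, which only knows about the 20 fragments it asked for.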

> >  4 allows a delta to be generated *from the serialised form* with work 
> >    independent of the actual size of the tree - and ideally ~= the
> >    actual changes involved. (The emphasis on the serialised form is
> >    simply that upcasting to an in memory structure requiring the entire
> >    inventory data to be read would defeat the point, even if the in
> >    memory logic was very cheap).
> 
> Are you requesting compositing here? (Given 4 deltas, combine them into
> an overall delta).

No. Given two serialised inventories, get a delta.

> Perhaps a better way of phrasing this, would be that you can map between
>  logical/serialized and fulltext/delta without going to the other.

I haven't suggested in this document that we use deltas per se for
storage at all - rather that we need to take deltas in and output
deltas, or something similar anyhow. (Because we don't want to work on
size-of-tree datasets.)
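As a sketch of what "given two serialised inventories, get a delta"
could look like if fragments are content-named; the function name and
the {key: bytes} fragment-map shape are hypothetical, not a bzrlib API.

```python
def serialised_diff(basis_fragments, target_fragments):
    """Delta between two serialised inventories held as
    {content_key: fragment_bytes} maps.

    Unchanged fragments share keys, so they drop out on key comparison
    alone; only the differing fragments ever need to be read.
    """
    basis_keys = set(basis_fragments)
    target_keys = set(target_fragments)
    removed = sorted(basis_keys - target_keys)
    added = sorted(target_keys - basis_keys)
    return removed, added

basis = {'sha1:aaa': b'dir: src', 'sha1:bbb': b'file: src/old.c'}
target = {'sha1:aaa': b'dir: src', 'sha1:ccc': b'file: src/new.c'}
print(serialised_diff(basis, target))  # (['sha1:bbb'], ['sha1:ccc'])
```

The shared `sha1:aaa` fragment is never opened - which is what lets the
work track the size of the change rather than the size of the tree.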

> LF ↔ LD
>  ↕   ↕
> SF ↔ SD
> 
> So if you had 2 serialized fulltexts, you could generate a serialized
> delta without converting to logical fulltext and then logical deltas,
> and then back down to serialized deltas.

This implies we have serialised deltas; I don't think we can do that
sanely given my constraint 3.

> >  5 lets bzr determine what we need to pull without having to probe for
> >    size-of-tree index keys to determine if we have the referenced data
> >    locally. (This is something that the current format allows).
> >  6 lets bzr map paths to fileids without reading the entire serialised
> >    form (this is used by e.g. diff -r X PATH)
> >  7 lets bzr map fileids to paths without reading the entire serialised 
> >    form (this is used by e.g. loggerhead, bzr search, and anything
> >    doing UI output of historical partial trees.)
> > 

> > Note that currently inventories can be split amongst up to 200 text
> > deltas regardless of tree size, so while we can't ignore the impact of
> > needing to do more separate IOs to deal with the full inventory of
> > large trees, it's not a foregone conclusion that it's a problem.
> 
> ^- I don't think I follow this statement. It sounds like you have a
> design in mind already, but haven't actually stated what it is.

I have thoughts toward a design; my mail contained all the ones I am
sure of. The statement you can't follow can be rephrased as:
"it's probably OK to need to do multiple reads to get a full inventory,
because we do up to 200 reads today to get an inventory"

> > 
> > Aaron's concept of using the dirstate fast-diff facilities to get a delta
> > from the working tree to the basis tree, and then using purely
> > historical diffs to work from there to other trees is a very useful one
> > for increasing performance, because it allows us to have two different
> > storage systems - one optimised for cheap low-latency high bandwidth IO,
> > and one optimised for transmission and network use.
> 
> This is patch composition, correct? (Being able to determine the delta
> from A=>C by going along A=>B=>C.)

Yes and no. It's similar, but it's not linear the way patch composition is.

Say we have three trees:
W: working
B: Basis
O: Any other tree

One common operation is W.iter_changes(O).

We have fast iter_changes from W<->B thanks to dirstate.
And the design goals above give us fast iter_changes from B<->O if they
are both in the same serialised form.
So we can get fast W<->O by bouncing through B. 

That is, it's always 2 steps long, never 500, because ancestry is not
part of the logic.
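The two-step bounce can be sketched as delta composition over per-file
change maps. This is a simplification: real iter_changes results carry
paths, kinds and more, and the dict shape here is purely illustrative.

```python
def compose_changes(a_to_b, b_to_c):
    """Compose change maps {file_id: (old_value, new_value)} for A->B and
    B->C into a single A->C map, dropping files that ended up unchanged."""
    composed = {}
    for file_id in set(a_to_b) | set(b_to_c):
        # A's value: the 'old' side of A->B if present, else unchanged into B.
        old = a_to_b[file_id][0] if file_id in a_to_b else b_to_c[file_id][0]
        # C's value: the 'new' side of B->C if present, else unchanged from B.
        new = b_to_c[file_id][1] if file_id in b_to_c else a_to_b[file_id][1]
        if old != new:
            composed[file_id] = (old, new)
    return composed

w_to_b = {'f1': ('v1', 'v2')}                    # fast, via dirstate
b_to_o = {'f1': ('v2', 'v1'), 'f2': ('x', 'y')}  # fast, via serialised form
print(compose_changes(w_to_b, b_to_o))  # {'f2': ('x', 'y')} - f1 reverted
```

Note the cost is the union of the two change sets, not the tree size and
not the length of any ancestry path.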

> > Accordingly, I think the most important thing to discuss is how strictly
> > we want to satisfy 3; and if we go for 'very strictly' what we should do
> > with the derived data (which amongst other things is used to generate
> > per-file graphs to do [relatively] cheap 'log FILENAME').
> 
> Well, we currently have it serve double duty. One is to provide a
> per-file graph, which is useful for annotate, weave + lca merge (both do
> per-file merges, and I might make --merge3 do it as well), log, and a
> few other places.
> 
> The other is just a handle to access the text (file_id, revision_id) is
> our "text_key".

We can't assume that a (fileid, revid) key exists a priori; we could
easily change to a different key system for text keys and have that work
- we should already always be redirecting through the inventory for the
revision we're examining anyhow, except when we're doing per-file graph
traversal. And for that we'd need to push the actual reference into the
tree data if we wanted to change the text key definition. (I don't.)


> > In my view it may be desirable to split the derived data out, but it's
> > definitely more work, and we can achieve a looser form of 3 above without
> > making the issues in that area that we have today any *worse*, so it is
> > definitely a choice rather than a requirement. As it's a choice I think
> > we should not couple it to fixing the scaling bugs we have in the
> > commands mentioned above.
> > 
> 
> This would be nice, and I think is a "correct" way to go. I'm not sure
> about before or after getting the other work done. So I agree it
> shouldn't be strictly coupled, but *could* be an opportune time to make
> changes that require "expensive" format upgrades.

I think it's massively harder to do the decoupling, because we'd have to
solve three or four additional deep performance problems - more
scatter-gather, updatable local caches, testament rework, maybe more.

> > One possible approach to storing-and-serialisation is to write some very
> > low level code that can store many texts with common regions with
> > partial access to a given text, in an efficient manner. I think this
> > would be hard to do and that there are probably easier ways. Another
> > approach is to split the overall document up as part of the
> > serialisation step, and then store all the fragments - this is
> > conceptually very similar to our current storage facilities and as such
> > likely easy to accomplish. The key thing I can see being an issue there
> > is how to ensure that the names of the fragments honour condition 3
> > above - because the names are most likely embedded as references.
> > (Alternate: write something to store many copies of an identical text
> > with different names efficiently). Roughly speaking, the more we can
> > push complexity about this up towards the application, the simpler the
> > storage layer can be - and up to a point this is always good :). (The
> > point where it's not is when you make other things unreasonably complex
> > by doing this).
> 
> I think the combination of 2 & 3 make it difficult to do a radix tree.

I'm quite certain we could update a radix tree given an inventory delta
to create a new radix tree (satisfies 2). And I'm quite certain we could
make packing of internal nodes into named byte sequences that we can
point to deterministic, both on initial creation and update (satisfies
3). I don't know that a radix tree is the right structure, but certainly
I don't see any reason why we couldn't use one.
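A minimal sketch of the two properties claimed, assuming a one-level
split on the leading path component and sha1-based page names. Both
choices are illustrative; a real radix tree would split and rebalance
pages further, and none of these names are bzr code.

```python
import hashlib

def split_inventory(entries):
    """Bucket {path: entry_bytes} deterministically by leading path
    component; the result depends only on the inventory content."""
    pages = {}
    for path in sorted(entries):
        pages.setdefault(path.split('/', 1)[0], []).append((path, entries[path]))
    return pages

def fragment_key(page):
    """Name a page by hashing its canonical serialisation: ('sha1:<hex>',)."""
    payload = b'\n'.join(p.encode('utf-8') + b'\x00' + d for p, d in page)
    return ('sha1:' + hashlib.sha1(payload).hexdigest(),)
```

Because pages are built from sorted content and named by their own
bytes, the same inventory always produces the same fragment keys no
matter what history led to it (goal 3), and applying an inventory delta
only needs to rebuild the pages whose entries changed (goal 2).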

> > A useful thing to note is that while we need access to parts of an
> > inventory, we don't need that keyed by revision id - that is, we can be
> > sure that we'll start at a revision object, which has the property of
> > having a normal revision id; we can follow from the into whatever
> > storage/serialisation is used for inventories.
> 
> "follow from there" ?

Yes.

 
> > In terms of generating fragments for storage and naming them - I'm
> > inclined to propose using CHK - content hash keys - to name them. For
> > instance, we might call fragments ('sha1:123456789....',). I suggest
> > this because such keys are dependent purely on the inventory content
> > rather than how we arrived at it - so it's part of the answer for goal 3.
> > Note that this is not the thin edge of the wedge to be the same as GIT - it
> > is not changing the naming scheme for revisions, simply some of the
> > internals for how inventories are stored.
> 
> Note also that splitting into fragments makes things like file_id <=>
> path mapping a bit harder. For example, if split up based on path you
> could pass a hint about where the file is located in *this* inventory as
> a likely place that you would want to look for it in *that* inventory.

I don't think we can satisfy 'do less than size-of-tree work' without
splitting the document into smaller pieces - fragments. So it might make
things a little harder, but that shouldn't matter.

> It should be noted that one problem with CHK is that you don't know the
> key until you've finished serializing, which is one of the reasons to
> make it mostly a storage detail.

The entry key to the system has always been the revision id - and I
explicitly noted that I don't want to change that. So I don't see any
reason that CHKs would be a problem for us.

> Oh, and it seems a bit inefficient if you actually had to generate the
> serialized form for every fragment, try to insert it, and then have a
> no-op on insertion.

That would violate goal 2.

> There is also the question of what does a delta look like, when you are
> using fragments. Do you get a delta-text per fragment? Are the deltas a
> logical whole and then fragmented (rather than fragment and delta, you
> delta and fragment)? Is the top level of a delta just the list of
> fragments that were affected? (Or doesn't it matter because the
> fragments would be 'chained'; the intermediates will always be updated
> when a child is updated.)

As I said above, deltas are the inventory deltas we have today - the
in-memory list of changes.
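For concreteness, an in-memory inventory delta here is a list of
(old_path, new_path, file_id, new_entry) changes, in the spirit of
bzrlib's deltas; the apply function and the flat {path: ...} tree model
below are simplified illustrations, not the real implementation.

```python
def apply_delta(tree, delta):
    """Apply (old_path, new_path, file_id, new_entry) changes to a
    {path: (file_id, entry)} mapping, returning a new mapping.

    old_path is None for additions; new_path is None for removals.
    """
    result = dict(tree)
    for old_path, new_path, file_id, new_entry in delta:
        if old_path is not None:
            result.pop(old_path, None)               # drop the old location
        if new_path is not None:
            result[new_path] = (file_id, new_entry)  # add at the new location
    return result

tree = {'a.txt': ('id-a', 'file')}
delta = [('a.txt', 'b.txt', 'id-a', 'file'),   # rename
         (None, 'c.txt', 'id-c', 'file')]      # add
print(apply_delta(tree, delta))
```

The delta's size tracks the change, not the tree, which is what makes it
a workable currency between the working tree, commit, and storage.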

> > There are many ways we can split up an inventory - the most obvious in
> > some respects is a tree formed around the directory structure of the
> > tree. This is problematic in several regards for me. The primary problem
> > I have is that we cannot balance such a tree - the user determines the
> > tree shape and if they want 100K directories with files in the leaf;
> > we'll be stuck trying to make that work - or equally very deep trees, or
> > very broad trees with tens of thousands of files in a single directory.
> 
> I agree that we can't force balancing, I'm not sure it is strictly
> problematic. It can simply be an edge case that we don't support yet,
> because the 90% most common use cases don't trigger it, and are thus not
> penalized (either by performance, or just simply complexity).
> 
> As long as it would still *work*, I don't think it is critical that it
> also be *optimal* in all possible cases.

We've said that about many things in the past, and then they have
bitten us. I'd rather not put a bunch of effort into creating and
optimising something with *known flaws*.

> I think having it at least be recommended to be able to go from
> serialized delta <=> logical delta directly would be a nice property
> (which I don't feel you explicitly defined). I think you defined
> serialized fulltext <=> serialized delta.

I don't think we want a serialised delta concept - it will violate goal
3. Trivially: you have revisions X and Y:[X]; I have just Y, with X a
ghost (because you elided patent-affected code or something when you
pushed). My Y cannot have a serialised delta against X. So serialised
delta <=> logical delta would be a misfeature.

Having efficient 'give me a delta from X and Y in serialised form' is a
key goal though: goal 4 above. And it means you don't need to be able to
go from serialised delta <=> logical delta.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.