RFC: scaling-inventory rework current thoughts

Mon Aug 11 23:38:42 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I've tried to capture a bunch of things that are floating around in my
> head about inventories with regards to performance. I'd appreciate
> comments - critical review if you will - of this; going forward I expect
> that I or someone else will propose concrete designs, and I hope that
> this document can provide a way of assessing those designs.
> 
> One of the key things with bzr today is that our internal model for a
> tree - the inventory - scales poorly; twice as many files means twice as
> much data to consider for the simplest operation such as diff or commit.
> Added to that, our serialisation format also scales in the same manner,
> so we end up slowing down disproportionately on large trees.
> 
> I proposed a journal rather than a single document as a way to improve
> serialisation without necessarily addressing the in-memory model, but
> missed an important consideration :- that when the journals are
> different we have to do quite a bit more work [e.g. during fetch we
> would have to transcode], and that reconcile, or parallel conversions
> from partial-history systems, or upgrades, might generate different
> journals.
> 
> Going back to basics then, for serialisation scaling, we want something
> that:
>  1 allows commit to write less data than the full tree most of the time
>  2 allows the data to be written to be calculated without processing
>    size(tree) information [this lets importers and named-file commits
>    be faster, and allows a nice coupling between workingtree (all-data
>    by nature) and repositories.
>  3 generates the same exact representation for a given inventory
>    regardless of the amount of history available

I think there is some flexibility here between the logical form, and the
stored form on disk. For example, if we have deltas be optional in the
stream. So converters are welcome to generate deltas directly, but
inserting full versions in their place would be acceptable as well. (As
I understand your "journaling" system, this was not the case, as the
delta text was the canonical form.)

>  4 allows a delta to be generated *from the serialised form* with work 
>    independent of the actual size of the tree - and ideally ~= the
>    actual changes involved. (The emphasis on the serialised form is
>    simply that upcasting to an in memory structure requiring the entire
>    inventory data to be read would defeat the point, even if the in
>    memory logic was very cheap).

Are you requesting compositing here? (Given 4 deltas, combine them into
an overall delta).

Perhaps a better way of phrasing this, would be that you can map between
 logical/serialized and fulltext/delta without going to the other.

LF ↔ LD
 ↕   ↕
SF ↔ SD

So if you had 2 serialized fulltexts, you could generate a serialized
delta without converting to logical fulltext and then logical deltas,
and then back down to serialized deltas.

I think it is equally important to be able to generate a logical delta
directly from the serialized delta.

At the moment, we take a serialized fulltext, apply several serialized
deltas to get another serialized fulltext, and then convert to a logical
fulltext, and use iter_changes to get a logic delta.

>  5 lets bzr determine what we need to pull without having to probe for
>    size-of-tree index keys to determine if we have the referenced data
>    locally. (This is something that the current format allows).
>  6 lets bzr map paths to fileids without reading the entire serialised
>    form (this is used by e.g. diff -r X PATH)
>  7 lets bzr map fileids to paths without reading the entire serialised 
>    form (this is used by e.g. loggerhead, bzr search, and anything
>    doing UI output of historical partial trees.)
> 
> Achieving all these points will let us preserve performance as trees
> scale up in size for operations that need or can be considered in terms
> of deltas - e.g.:
>  * log -v
>  * merge (only consider changes)
>  * commit (don't have to process a huge inventory)
>  * diff -r (both local-historical and historical-historical
>  * loggerhead (which is essentially a combination of cat, annotate
>    and and ls)
>  * status -r
> 
> Note that currently inventories can be split amongst up to 200 text
> deltas regardless of tree size so while we can't ignore the impact of
> needing to do more separate IO's to deal with the full inventory of
> large trees its not a forgone conclusion that its a problem.

^- I don't think I follow this statement. It sounds like you have a
design in mind already, but haven't actually stated what it is.

> 
> Aarons concept of using the dirstate fast-diff facilities to get a delta
> from the working tree to the basis tree, and then using purely
> historical diffs to work from there to other trees is a very useful one
> for increasing performance, because it allows us to have two different
> storage systems - one optimised for cheap low-latency high bandwidth IO,
> and one optimised for transmission and network use.

This is patch composition, correct? (Being able to determine the delta
from A=>C by going along A=>B=>C.)

I agree that *could* be a route to go. I think you can also argue for
something like recursive hashes making comparing A & C directly much
cheaper.

If you look at bzr's history, it does happen occasionally that a patch
gets delayed for several releases before it gets merged. Such as Daniel
Watkin's 'bzr check' refactoring. He branched from bzr.dev 3015, and was
merged as 3583. (Started ~2-28, merged 7-28).

In those cases, you end up chaining back 500+ deltas, and then follow up
another 20-50. And if we were using multi-parent deltas, things could
get even more tricky. (Though it might let us shortcut through the
'merge bzr.dev' shortly before his patch was merged.)

I'm mostly giving that as a counter example that patch composition is
always cheaper than something like recursive hashes. The extreme case is
a patch modifying a single top-level file, that takes 6 months to land.
Or a *long* lived integration branch that hasn't had a common left-hand
parent in a long time. (Such as the mysql 5.0 versus 5.1 and 6.0 branches.)

I *do* think it is worth investigating, I'm just not convinced it is
universally better. As it becomes O(total_changes_along_path) and not
O(changes_this_vs_other).

I guess it depends on how smart our "path" selection can be.

> 
> We can do minor tweaks to the inventory serialiser like stripping out
> xml and so on, but they won't make a dramatic difference compared to a
> deeper set of changes because they don't change the basic problem.

I certainly agree here.

> 
> I see several implications from these goals; I think the most
> significant one is that we have to remove all derived data to satisfy 3
> in a strict sense: while the last-changed revision is very useful and
> important to have, it depends on derived data - literally on 'when the
> last change occurred' - and that calculation could be buggy, or we could
> decide there are better ways to implement it in the future.

Sure, especially with incomplete history.

> 
> Accordingly, I think the most important thing to discuss is how strictly
> we want to satisfy 3; and if we go for 'very strictly' what we should do
> with the derived data (which amongst other things is used to generate
> per-file graphs to do [relatively] cheap 'log FILENAME'.

Well, we currently have it serve double duty. One is to provide a
per-file graph, which is useful for annotate, weave + lca merge (both do
per-file merges, and I might make --merge3 do it as well), log, and a
few other places.

The other is just a handle to access the text (file_id, revision_id) is
our "text_key".

> 
> I think we can either say that the inventory contents is a higher layer
> than the serialiser, and this its ok to keep the last-changed field
> as-is - to defer rework on that to another time; or we can say that it
> needs to be decoupled and put into a separate document which (unlike the
> inventory itself) is permitted or even expected to have different values
> for a given revision between repository instances.
> 
> In my view it may be desirable to split the derived data out, but its
> definitely more work and we can achieve a looser form of 3 above without
> making the issues in that area that we have today any *worse*, so it is
> definitely a choice rather than a requirement. As its a choice I think
> we should not couple it to fixing the scaling bugs we have in the
> commands mentioned above.
> 

This would be nice, and I think is a "correct" way to go. I'm not sure
about before or after getting the other work done. So I agree it
shouldn't be strictly coupled, but *could* be an opportune time to make
changes that require "expensive" format upgrades.

> One possible approach to storing-and-serialisation is to write some very
> low level code that can store many texts with common regions with
> partial access to a given text, in an efficient manner. I think this
> would be hard to do and that there are probably easier ways. Another
> approach is to split the overall document up as part of the
> serialisation step, and then store all the fragments - this is
> conceptually very similar to our current storage facilities and as such
> likely easy to accomplish. The key thing I can see being of issue there
> is how to ensure that the names of the fragments honours condition 3
> above - because the names are most likely embedded as references.
> (Alternate: write something to store many copies of an identical text
> with different names efficiently). Roughly speaking, the more we can
> push complexity about this up towards the application, the simpler the
> storage layer can be - and up to a point this is always good :). (The
> point that its not is when you make other things unreasonably complex by
> doing this).

I think the combination of 2 & 3 make it difficult to do a radix tree.

> 
> A useful thing to note is that while we need access to parts of an
> inventory, we don't need that keyed by revision id - that is, we can be
> sure that we'll start at a revision object, which has the property of
> having a normal revision id; we can follow from the into whatever
> storage/serialisation is used for inventories.

"follow from there" ?

> 
> In terms of generating fragments for storage and naming them - I'm
> inclined to propose using CHK - content hash keys - to name them. For
> instance, we might call fragments ('sha1:123456789....',). I suggest
> this because such keys are dependent purely on the inventory content
> rather than how we arrived at it - so its part of the answer for goal 3.
> Note that this is not the thin edge of the wedge to be same as GIT - it
> is not changing the naming scheme for revisions, simply some of the
> internals for how inventories are stored.

Note also that splitting into fragments makes things like file_id <=>
path mapping a bit harder. For example, if split up based on path you
could pass a hint about where the file is located in *this* inventory as
a likely place that you would want to look for it in *that* inventory.

It should be noted that one problem with CHK is that you don't know the
key until you've finished serializing. Which is one of the reasons to
make mostly a storage detail.

Oh, and it seems a bit inefficient if you actually had to generate the
serialized form for every fragment, try to insert it, and then have a
no-op on insertion.

There is also the question of what does a delta look like, when you are
using fragments. Do you get a delta-text per fragment? Are the deltas a
logical whole and then fragmented  (rather than fragment and delta, you
delta and fragment). Is the top level of a delta just the list of
fragments that were effected? (Or doesn't it matter because the
fragments would be 'chained', the intermediates will always be updated
when a child is updated.)

> 
> There are many ways we can split up an inventory - the most obvious in
> some respects is a tree formed around the directory structure of the
> tree. This is problematic in several regards for me. The primary problem
> I have is that we cannot balance such a tree - the user determines the
> tree shape and if they want 100K directories with files in the leaf;
> we'll be stuck trying to make that work - or equally very deep trees, or
> very broad trees with 10's of thousands of files in a single directory.

I agree that we can't force balancing, I'm not sure it is strictly
problematic. It can simply be an edge case that we don't support yet,
because the 90% most common use cases don't trigger it, and are thus not
penalized (either by performance, or just simply complexity).

As long as it would still *work*, I don't think it is critical that it
also be *optimal* in all possible cases.

> 
> I'd like to frame the problem in two parts: defining a canonical form
> for an inventory, and the amount of work, both physical and logical, to
> go from an existing canonical-form inventory to a new one given an
> inventory delta (for now, our existing inventory delta definition is
> sufficient).
> 
> So the canonical form is simply what we get when we serialise an
> inventory. If it's done by a tree-per-directory then thats simply one
> fragment per directory, and some simple serialiser within each fragment.
> 
> What we want is the work to update a canonical form - to generate the
> new canonical form from the prior form and a delta - to be closely tied
> to the size of the delta.
> 
> There are three key things here - the amount of CPU/memory required to
> calculate the new canonical form, the amount of input needed, and the
> number of fragments created which are not shared with the original
> canonical form. If we imagine a trivial canonical form of 'sort the
> inventory by file id, serialise one id per line, and split into 64K
> fragements by the first-line break before 64K', then adding a new file
> id will on average require reading all the fragments after it, and
> generating new ones. And this is much more work than a simple delta
> adding one row should cause :).

> 
> I'm continuing to think about this with the goal of creating a design;
> but would really appreciate hole-poking in my logic to date - have I
> made incorrect asssumptions, proposed bad issues, missed stuff in my
> analysis [or perhaps just not mentioned it - I realise just now I didn't
> mention what to do when SHA1 becomes unsuitable(*)]. I also didn't want
> to DOS all the folk reading the list with a huge missive; I've been on
> lists where that happens routinely and its a bit offputting. Sorry if
> I've done that :(.
> 
> *: (When SHA1 stops being safe, our revision ids don't need to change,
> we just issue a format bump that updates the serialiser to use sha256
> CHK's).
> 
> -Rob

I think having it at least be recommended to be able to go from
serialized delta <=> logical delta directly would be a nice property
(which I don't feel you explicitly defined). I think you defined
serialized fulltext <=> serialized delta.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkigv3IACgkQJdeBCYSNAANE2ACeJLrKp7BqgiKCXqkEP71u39i3
3EsAnREZZfbNjrmDpT4Pz1Y/wbjQml9t
=ZDy5
-----END PGP SIGNATURE-----