RFC: scaling-inventory rework current thoughts

Martin Pool mbp at canonical.com
Wed Aug 13 07:56:45 BST 2008


On Mon, Aug 11, 2008 at 4:59 PM, Robert Collins
<robertc at robertcollins.net> wrote:

> I've tried to capture a bunch of things that are floating around in my
> head about inventories with regards to performance. I'd appreciate
> comments - critical review if you will - of this; going forward I expect
> that I or someone else will propose concrete designs, and I hope that
> this document can provide a way of assessing those designs.

That was quite nicely expressed, thanks.



>
> One of the key things with bzr today is that our internal model for a
> tree - the inventory - scales poorly; twice as many files means twice as
> much data to consider for the simplest operation such as diff or commit.
> Added to that, our serialisation format also scales in the same manner,
> so we end up slowing down disproportionately on large trees.
>
> I proposed a journal rather than a single document as a way to improve
> serialisation without necessarily addressing the in-memory model, but
> missed an important consideration: that when the journals are
> different we have to do quite a bit more work [e.g. during fetch we
> would have to transcode], and that reconcile, or parallel conversions
> from partial-history systems, or upgrades, might generate different
> journals.
>
> Going back to basics then, for serialisation scaling, we want something
> that:
>  1 allows commit to write less data than the full tree most of the time
>  2 allows the data to be written to be calculated without processing
>   size(tree) information [this lets importers and named-file commits
>   be faster, and allows a nice coupling between workingtree (all-data
>   by nature) and repositories].
>  3 generates the same exact representation for a given inventory
>   regardless of the amount of history available
>  4 allows a delta to be generated *from the serialised form* with work
>   independent of the actual size of the tree - and ideally ~= the
>   actual changes involved. (The emphasis on the serialised form is
>   simply that upcasting to an in memory structure requiring the entire
>   inventory data to be read would defeat the point, even if the in
>   memory logic was very cheap).
>  5 lets bzr determine what we need to pull without having to probe for
>   size-of-tree index keys to determine if we have the referenced data
>   locally. (This is something that the current format allows).
>  6 lets bzr map paths to fileids without reading the entire serialised
>   form (this is used by e.g. diff -r X PATH)
>  7 lets bzr map fileids to paths without reading the entire serialised
>   form (this is used by e.g. loggerhead, bzr search, and anything
>   doing UI output of historical partial trees.)
>
> Achieving all these points will let us preserve performance as trees
> scale up in size for operations that need or can be considered in terms
> of deltas - e.g.:
>  * log -v
>  * merge (only consider changes)
>  * commit (don't have to process a huge inventory)
>  * diff -r (both local-historical and historical-historical)
>  * loggerhead (which is essentially a combination of cat, annotate
>   and ls)
>  * status -r
>
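
To make goals 1 and 4 concrete, here's a toy sketch of one possible
shape (my own illustration, not a proposal): serialise the sorted
inventory rows into content-addressed pages, so that commit only
writes pages whose content changed, and a delta between two
inventories falls out of comparing page sets without parsing them.

```python
import hashlib

PAGE_SIZE = 4  # tiny for illustration; a real format would use far larger pages

def pages(rows):
    """Split sorted (path, file_id, text_sha1) rows into fixed-size
    pages, each addressed by the SHA-1 of its own content."""
    rows = sorted(rows)
    out = {}
    for i in range(0, len(rows), PAGE_SIZE):
        body = "\n".join("\0".join(r) for r in rows[i:i + PAGE_SIZE])
        out[hashlib.sha1(body.encode("utf-8")).hexdigest()] = body
    return out

def pages_to_write(old, new):
    """Commit writes only pages not already stored (goal 1), found by
    comparing keys of the serialised forms alone (goal 4)."""
    return {key: body for key, body in new.items() if key not in old}
```

The obvious flaw - deliberately ignored here - is that paging by
fixed offsets means one inserted row shifts every later page; a real
design would need stable split points so unrelated edits keep
sharing pages.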

> Note that currently inventories can be split amongst up to 200 text
> deltas regardless of tree size, so while we can't ignore the impact of
> needing to do more separate IOs to deal with the full inventory of
> large trees, it's not a foregone conclusion that it's a problem.

In other words, one concern about having the stored inventory split
up at all is that it will cause more latency following pointers,
either as we traverse down to get information on one file, or as we
retrieve the whole thing.  However, if we manage to have shorter or
zero delta chains to get the text of each element, this may cancel
out.  For large inventories split into many parts this may start to
intrude again, though we might be able to arrange that the number of
pointers to walk to reach any particular node (the depth) stays well
under 100.
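
On the depth question, a quick back-of-envelope (my numbers, not
anything from Robert's mail): a balanced split with even a modest
fanout keeps the pointer chain very short.

```python
import math

def page_depth(n_entries, fanout):
    """Pointer hops from the root page down to a leaf in a balanced
    tree holding n_entries with the given fanout per page."""
    return max(1, math.ceil(math.log(n_entries, fanout)))
```

A million-entry inventory with a fanout of 16 needs about 5 hops -
well under the 100 mentioned above.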

> I see several implications from these goals; I think the most
> significant one is that we have to remove all derived data to satisfy 3
> in a strict sense: while the last-changed revision is very useful and
> important to have, it depends on derived data - literally on 'when the
> last change occurred' - and that calculation could be buggy, or we could
> decide there are better ways to implement it in the future.
>
> Accordingly, I think the most important thing to discuss is how strictly
> we want to satisfy 3; and if we go for 'very strictly' what we should do
> with the derived data (which amongst other things is used to generate
>  per-file graphs to do [relatively] cheap 'log FILENAME').
>
> I think we can either say that the inventory contents is a higher layer
> than the serialiser, and thus it's ok to keep the last-changed field
> as-is - to defer rework on that to another time; or we can say that it
> needs to be decoupled and put into a separate document which (unlike the
> inventory itself) is permitted or even expected to have different values
> for a given revision between repository instances.

So it seems like here you're saying that file last-modified revisions
are not really part of the inventory.  In some ways they are not -
the testament omits them.  But at present the inventory does promise
to store them, and since they're fairly widely used - both in how we
store files and by the rest of the code - it may be intrusive to pull
them out.

This reminds me of another desideratum, which is that it would be
good if a hash of the serialised form could efficiently stand in for
a testament hash.

In this kind of space you could say that many mappings back from file
id to name are unnecessary and not really where we want to take file
ids in the future, but we should go one step at a time.  Having two
indexes pointing to the same data may somewhat complicate this design
and/or introduce derived data into the format.  If we were going to
work towards removing some of those uses in the medium term we might
decide that loading and scanning the whole thing would be acceptable,
and no worse than we do at present.  (That is to say, we could weaken
condition 7.)
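
To see what weakening condition 7 would mean in practice, a trivial
sketch (toy code, not bzr's actual API): keeping both mappings makes
both lookups cheap, but the reverse index is entirely derivable from
the forward one; dropping it trades away a second index in the format
for an O(tree) scan per file_id lookup.

```python
class DualIndex:
    """Toy model (not bzr code) of conditions 6 and 7 side by side."""

    def __init__(self):
        self._by_path = {}   # condition 6: path -> file_id
        self._by_id = {}     # condition 7: file_id -> path (derived)

    def add(self, path, file_id):
        self._by_path[path] = file_id
        self._by_id[file_id] = path

    def file_id(self, path):
        return self._by_path[path]

    def path(self, file_id):
        # Strict condition 7: one lookup in the second index.
        return self._by_id[file_id]

    def path_by_scan(self, file_id):
        # Weakened condition 7: no second index, scan everything.
        for path, fid in self._by_path.items():
            if fid == file_id:
                return path
        raise KeyError(file_id)
```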

> There are many ways we can split up an inventory - the most obvious in
> some respects is a tree formed around the directory structure of the
> tree. This is problematic in several regards for me. The primary problem
> I have is that we cannot balance such a tree - the user determines the
> tree shape, and if they want 100K directories with files in the
> leaves, we'll be stuck trying to make that work - or equally very deep
> trees, or very broad trees with tens of thousands of files in a single
> directory.

I definitely agree we should not just assume this is sufficient.  In
particular it doesn't answer how we will store file_id->name mappings.

I don't find the cases of very large or very skewed trees enormously
convincing because that shape is going to tend to bite the user during
other operations such as working on the checkout.
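
For contrast, a balanced split could ignore directory shape entirely
(a hypothetical sketch, not a concrete proposal): assign each entry
to a page by hashing its file id, so the fan-out is uniform whether
the tree is very deep, very flat, or has 100K directories.

```python
import hashlib

def bucket_for(file_id, n_buckets=256):
    """Place an inventory entry in a page chosen by hashing its
    file_id; the split stays balanced regardless of the user's
    directory layout."""
    digest = hashlib.sha1(file_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets
```

The trade-off is that this loses path locality, so condition 6 would
then need its own path index on top.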


-- 
Martin <http://launchpad.net/~mbp/>


