RFC: scaling-inventory rework current thoughts

Robert Collins robertc at robertcollins.net
Mon Aug 11 07:59:05 BST 2008


I've tried to capture a bunch of things that are floating around in my
head about inventories with regards to performance. I'd appreciate
comments - critical review if you will - of this; going forward I expect
that I or someone else will propose concrete designs, and I hope that
this document can provide a way of assessing those designs.

One of the key things with bzr today is that our internal model for a
tree - the inventory - scales poorly; twice as many files means twice as
much data to consider for even the simplest operations, such as diff or
commit. Added to that, our serialisation format also scales in the same
manner, so we end up slowing down disproportionately on large trees.

I proposed a journal rather than a single document as a way to improve
serialisation without necessarily addressing the in-memory model, but
missed an important consideration: when the journals are different we
have to do quite a bit more work [e.g. during fetch we would have to
transcode], and reconcile, parallel conversions from partial-history
systems, or upgrades might each generate different journals.

Going back to basics then, for serialisation scaling, we want something
that (a rough sketch of the interface this implies follows the list):
 1 allows commit to write less data than the full tree most of the time
 2 allows the data to be written to be calculated without processing
   size(tree) information [this lets importers and named-file commits
   be faster, and allows a nice coupling between workingtree (all-data
   by nature) and repositories].
 3 generates the same exact representation for a given inventory
   regardless of the amount of history available
 4 allows a delta to be generated *from the serialised form* with work 
   independent of the actual size of the tree - and ideally ~= the
   actual changes involved. (The emphasis on the serialised form is
   simply that upcasting to an in memory structure requiring the entire
   inventory data to be read would defeat the point, even if the in
   memory logic was very cheap).
 5 lets bzr determine what we need to pull without having to probe for
   size-of-tree index keys to determine if we have the referenced data
   locally. (This is something that the current format allows).
 6 lets bzr map paths to fileids without reading the entire serialised
   form (this is used by e.g. diff -r X PATH)
 7 lets bzr map fileids to paths without reading the entire serialised 
   form (this is used by e.g. loggerhead, bzr search, and anything
   doing UI output of historical partial trees.)
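
To make goals 4, 6 and 7 a bit more concrete, here is a rough sketch of
the operations I think the serialised form needs to support. The class
and method names are mine, for illustration only - this is not an
existing bzrlib API:

  class SerialisedInventory(object):
      """Hypothetical read-only view over the fragments of one inventory."""

      def delta_to(self, other):
          """Return an inventory delta between self and other.

          The work done should be proportional to the number of changed
          entries, not to the size of either tree (goal 4).
          """

      def path2id(self, path):
          """Map a path to a file id, reading only the fragments on the
          lookup route for that path (goal 6)."""

      def id2path(self, file_id):
          """Map a file id to a path without loading the whole serialised
          inventory (goal 7)."""

Nothing here says how the fragments are stored; it just pins down what
'without reading the entire serialised form' has to mean in practice.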

Achieving all these points will let us preserve performance as trees
scale up in size, for operations that need deltas or can be expressed in
terms of them - e.g.:
 * log -v
 * merge (only consider changes)
 * commit (don't have to process a huge inventory)
 * diff -r (both local-historical and historical-historical)
 * loggerhead (which is essentially a combination of cat, annotate
   and ls)
 * status -r

Note that currently inventories can be split amongst up to 200 text
deltas regardless of tree size, so while we can't ignore the impact of
needing to do more separate IOs to deal with the full inventory of
large trees, it's not a foregone conclusion that it's a problem.

Aaron's concept of using the dirstate fast-diff facilities to get a delta
from the working tree to the basis tree, and then using purely
historical diffs to work from there to other trees is a very useful one
for increasing performance, because it allows us to have two different
storage systems - one optimised for cheap low-latency high bandwidth IO,
and one optimised for transmission and network use.

We can do minor tweaks to the inventory serialiser like stripping out
XML and so on, but they won't make a dramatic difference compared to a
deeper set of changes because they don't change the basic problem.

I see several implications from these goals; I think the most
significant one is that we have to remove all derived data to satisfy 3
in a strict sense: while the last-changed revision is very useful and
important to have, it depends on derived data - literally on 'when the
last change occurred' - and that calculation could be buggy, or we could
decide there are better ways to implement it in the future.

Accordingly, I think the most important thing to discuss is how strictly
we want to satisfy 3; and if we go for 'very strictly', what we should do
with the derived data (which amongst other things is used to generate
per-file graphs to do [relatively] cheap 'log FILENAME').

I think we can either say that the inventory contents are a higher layer
than the serialiser, and thus it's ok to keep the last-changed field
as-is - deferring rework on that to another time; or we can say that it
needs to be decoupled and put into a separate document which (unlike the
inventory itself) is permitted or even expected to have different values
for a given revision between repository instances.

In my view it may be desirable to split the derived data out, but it's
definitely more work, and we can achieve a looser form of 3 above without
making the issues we have in that area today any *worse*, so it is
definitely a choice rather than a requirement. As it's a choice, I think
we should not couple it to fixing the scaling bugs we have in the
commands mentioned above.

One possible approach to storing-and-serialisation is to write some very
low level code that can store many texts with common regions, while
allowing partial access to a given text, in an efficient manner. I think
this would be hard to do and that there are probably easier ways. Another
approach is to split the overall document up as part of the
serialisation step, and then store all the fragments - this is
conceptually very similar to our current storage facilities and as such
likely easy to accomplish. The key issue I can see there is how to
ensure that the names of the fragments honour condition 3 above -
because the names are most likely embedded as references. (Alternative:
write something to store many copies of an identical text with different
names efficiently.) Roughly speaking, the more we can push complexity
about this up towards the application, the simpler the storage layer can
be - and up to a point this is always good :). (The point at which it's
not is when you make other things unreasonably complex by doing this.)

A useful thing to note is that while we need access to parts of an
inventory, we don't need that keyed by revision id - that is, we can be
sure that we'll start at a revision object, which has the property of
having a normal revision id, and follow from there into whatever
storage/serialisation is used for inventories.

In terms of generating fragments for storage and naming them - I'm
inclined to propose using CHKs - content hash keys - to name them. For
instance, we might call fragments ('sha1:123456789....',). I suggest
this because such keys depend purely on the inventory content rather
than how we arrived at it - so it's part of the answer for goal 3. Note
that this is not the thin edge of the wedge towards becoming the same as
Git - it is not changing the naming scheme for revisions, simply some of
the internals for how inventories are stored.
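
As a minimal sketch of what I mean - assuming a plain sha1 over the
fragment bytes, with made-up fragment content purely for illustration:

  import hashlib

  def chk_key(fragment_bytes):
      """Name a fragment purely by its content."""
      return ('sha1:' + hashlib.sha1(fragment_bytes).hexdigest(),)

  # A parent fragment refers to its children by these keys, so the whole
  # structure is determined by the inventory content alone, however much
  # history the repository happens to have:
  child = 'file-id-1 ...\nfile-id-2 ...\n'
  parent = 'child: %s\n' % chk_key(child)[0]

Two repositories serialising the same inventory arrive at the same keys
without consulting each other, which is exactly the property goal 3 asks
for.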

There are many ways we can split up an inventory - the most obvious in
some respects is a tree formed around the directory structure of the
tree. This is problematic in several regards for me. The primary problem
I have is that we cannot balance such a tree - the user determines the
tree shape, and if they want 100K directories with files in the leaves,
we'll be stuck trying to make that work - or equally very deep trees, or
very broad trees with tens of thousands of files in a single directory.

I'd like to frame the problem in two parts: defining a canonical form
for an inventory, and the amount of work, both physical and logical, to
go from an existing canonical-form inventory to a new one given an
inventory delta (for now, our existing inventory delta definition is
sufficient).
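
For reference, the existing delta definition I have in mind is roughly a
list of per-file changes, along these lines (written from memory, so
treat the exact shape as illustrative rather than normative):

  from bzrlib.inventory import InventoryFile

  # One tuple per changed file id:
  #   (old_path, new_path, file_id, new_entry)
  # old_path is None for an add; new_path and new_entry are None for a
  # delete.
  new_file = InventoryFile('tutorial-id', 'tutorial.txt', 'docs-id')
  delta = [
      (None, 'docs/tutorial.txt', 'tutorial-id', new_file),  # add
      ('old.txt', None, 'old-id', None),                     # delete
  ]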

So the canonical form is simply what we get when we serialise an
inventory. If it's done by a tree-per-directory then that's simply one
fragment per directory, and some simple serialiser within each fragment.

What we want is the work to update a canonical form - to generate the
new canonical form from the prior form and a delta - to be closely tied
to the size of the delta.

There are three key things here - the amount of CPU/memory required to
calculate the new canonical form, the amount of input needed, and the
number of fragments created which are not shared with the original
canonical form. If we imagine a trivial canonical form of 'sort the
inventory by file id, serialise one id per line, and split into 64K
fragments by the first line break before 64K', then adding a new file
id will on average require reading all the fragments after it, and
generating new ones. And this is much more work than a simple delta
adding one row should cause :).
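
A tiny sketch of that strawman form, just to make the cascade concrete
(the helper below is made up for illustration, not proposed code):

  def strawman_fragments(rows, limit=64 * 1024):
      """Pack newline-terminated rows, sorted by file id, into fragments
      of at most limit bytes, breaking at line boundaries."""
      fragments, current, size = [], [], 0
      for row in sorted(rows):
          if current and size + len(row) > limit:
              fragments.append(''.join(current))
              current, size = [], 0
          current.append(row)
          size += len(row)
      if current:
          fragments.append(''.join(current))
      return fragments

  # Adding one row near the front of the sort order shifts every later
  # row down, so almost every fragment after the insertion point gets
  # new content - and hence a new name - even though the delta was a
  # single line.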

I'm continuing to think about this with the goal of creating a design,
but would really appreciate hole-poking in my logic to date - have I
made incorrect assumptions, raised spurious issues, missed stuff in my
analysis [or perhaps just not mentioned it - I realise just now I didn't
mention what to do when SHA1 becomes unsuitable(*)]? I also didn't want
to DOS all the folk reading the list with a huge missive; I've been on
lists where that happens routinely and it's a bit off-putting. Sorry if
I've done that :(.

*: (When SHA1 stops being safe, our revision ids don't need to change,
we just issue a format bump that updates the serialiser to use sha256
CHKs).

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.