[RFC] store inventory in tab-separated file

Mon Jan 29 16:31:38 GMT 2007

Alexander Belchenko wrote:
> I wrote draft implementation of new serializer format that use tab-separated
> text instead of XML. John Meinel often says that our weakness point is
> inventory. So I make some experiment to rewrote our serializer.
> 
> After converting current bzr.dev inventory to new format I have reducing
> in size from 73619 to 56476 bytes, i.e. 23%. I expect that on kernel tree
> this effect will be much bigger. I also expect that inventory.knit
> in repository also will be reduced.

I would expect it to be approximately a fixed percent difference.
Basically, for each line you are removing a specific amount of text
which XML stores. (file_id=, revision=, ...).

As such, I don't think it will change based on the size of the
inventory. It is still useful, but it will be a 'fixed per line'
difference. And since filenames are generally the same size, and file
ids are fairly fixed in size, it would stay a uniform 23%.

I'm happy to see you interested in this area, it is certainly a place
that I've spent a lot of time in. I have some concerns with this
specific implementation, but I'd be happy to work with you on dealing
with serialization.

I would be curious to see how 'csv' compares, since I haven't profiled
it at all. But in my dirstate testing, I found that you can do some
pretty big improvements by having a fixed character that you can split on.

I tried a few possibilities in this plugin:
http://bzr.arbash-meinel.com/plugins/dirstate-tests/

And ended up with something that should work really well for a working
inventory format. And I think it could be adapted for use as a stored
format.

There are a few things going on right now which will probably change our
inventory structure.

There will end up a small difference between stored form and working
form. Because in the working form we need good path => file_id mappings,
while in stored form we tend to need more file_id => path mapping. More
importantly, though, is that for storage I think we want each
directory's inventory to be stored separately, because it changes some
of our O(N) work to become O(log N).
For the working tree, I don't think we want to have many separate files.
My dirstate format was designed such that you can bisect through the
format and extract only a few records. And in testing with a kernel
tree, it was quite fast.

I think using '\0' is a little bit safer than using '\t'. '\0' isn't
valid in any of our strings, while '\t' might be allowed in filenames.
(IIRC cElementTree has a bug with \t, but that is a bug)

We really want to get away from building up an InventoryEntry until we
really need it. Creating a tuple is 30x faster than creating an object

python -m timeit -s "class Foo(object):
  def __init__(self, x, y):
    self.x = x
    self.y = y
" "z = Foo(1,2)"

3.49 usec per loop

__slots__ helps:
python -m timeit -s "class Foo(object):
  __slots__ = ['x', 'y']

  def __init__(self, x, y):
    self.x = x
    self.y = y
" "z = Foo(1,2)"
2.99 usec per loop

but tuples dominate:
python -m timeit "z = (1,2)"
0.116 usec per loop

Because of how string parsing works in python, dirstate actually uses
lists, which aren't quite as fast:
python -m timeit "z = [1,2]"
0.758 usec per loop

But still better than full-blown objects. (And it is faster than parsing
into a list, and then turning that into a tuple).

Anyway, we need to restructure our code so that we can take advantage of
these things.

> 
> I'm also make a one step towards implementing versioned properties
> (http://bazaar-vcs.org/VersionedProperties).

This spec has never been fully accepted, as mentioned by Aaron. I had
looked into adding them to XML a long time ago (1.5yr?). People would
like to think that they solve a lot of problems, but they don't solve as
many as people think.

Specifically, there will always be a set of properties that 'bzr' the
program must honor. (For example if we add line ending properties, then
having a client that doesn't honor them would break the assertions).
Which means that adding a new property will often require a new
repository format, because you don't want a client to be violating the
constraints of the new properties.

...

> I have some questions and need some guidance for next steps on this.
> 
> 0) Does my work have sense to continue?

It is a great way to get a feel for profiling, and tweaking python code.
However, I'm not sure that it makes sense to have a lot of redundancy
between what you are doing, and what Martin and I are working on.

> 1) How to benchmark speed with using new inventory format? I expect it shoud be
> faster but I can't predict real value.

There are a 2 things you can do. I would generally set up a test for
taking an in-memory inventory and writing it out to disk, and taking an
on-disk format and reading it into memory. (My 'dirstate-tests' does
some of that).

I would write specific functions that just do that, so that you can time
them under "real-world" conditions, as well as time them under 'bzr
--lsprof'. --lsprof can be very nice to find places to tune. But I've
also found that it isn't 100% accurate. So making sure that you can do a
test with:

start = time.time()
do_something()
done = time.time()

print 'that took %.3fs' % (done - start,)

> 2) Why we are using 2 similar formats v5 and v6? Why for working inventory
> used v5 -- for speed-up reasons? Does I need implement v7 and v8 formats,
> or I need one rich format a-la v6?

Current storage is v5. Knit2 will use v6 (which pretty much only adds a
last modified revision for the tree root, but there are a couple subtle
changes).

As Aaron said, we initially thought we would switch to Knit2 earlier.
But there hasn't been enough changes to convince us to "bzr upgrade" all
of their repositories. I think when his nested-trees work lands, there
will be real benefits to using the new format, and we will upgrade
around that time.

> 3) I don't understand how versioned properties should be extended? Does I need
> simply throw away unrecognized properties? Can I add to specific InventoryEntry
> classes (InventoryDirectory, InventoryFile, etc) some support for packing/unpacking
> of versioned properties? Specification http://bazaar-vcs.org/VersionedProperties
> says that "Inventory and InventoryEntry will get proplist attributes, that will hold the
> properties". Does it means that we need shine new inventory2.py file with new
> implementation?
> 4) What tests I need to write for new serializer?

Because it is a disk format, you need some very strict "this in memory
==> this on disk". Because otherwise our serializer could drift with
time, which is very bad considering not everyone is using 'bzr.dev'.

We actually have run into this 1 time (part of why we didn't upgrade to
v6 formats). Aaron upgraded the code to properly support unique tree
roots. But it turned out that older bzr (<0.11??) would fail on the new
trees. Fortunately we caught it before we made a release, because I
happened to use a different version of bzr on different machines.

> 5) How to write converter for upgrade?
> 

There is quite a bit already in bzrlib/upgrade.py Which has a framework
for converting from version A => version B.

I haven't spent a lot of time in that code, but it amounts to:

1) Changing .bzr/branch-format to make sure clients don't try to do
anything with this branch/repository. Essentially, this is taking the
branch "offline".

2) Backing up whatever was important. If you are making a lot of
changes, this is usually done by creating .bzr.backup. For the
conversion to prefixed locations, we just did it in place because it was
just moving a few files. In this specific case, we might get by with
just creating inventory.knit.backup

3) You are changing the inventory text for every revision. So you need
to regenerate the serialized form for each one, and the save it into the
knit. There are some potential tricks for doing this, like just opening
the old Knit, and iterating over the versions, extracting the full text,
re-serializing the new inventory, and storing it into the new Knit.

4) Updating all of the 'format' markers (.bzr/branch-format,
.bzr/branch/format, .bzr/repository/format) to bring everything up to date.

John
=:->