[RFC] store inventory in tab-separated file

Robert Collins robertc at robertcollins.net
Thu Apr 5 05:12:41 BST 2007


This seems to have gone quiet. I'd love to see a more compact repository
format: every byte we have to send on the wire costs us.

On Mon, 2007-01-29 at 05:53 +0200, Alexander Belchenko wrote:
> I wrote draft implementation of new serializer format that use tab-separated
> text instead of XML. John Meinel often says that our weakness point is
> inventory. So I make some experiment to rewrote our serializer.

Cool.

> After converting current bzr.dev inventory to new format I have reducing
> in size from 73619 to 56476 bytes, i.e. 23%. I expect that on kernel tree
> this effect will be much bigger. I also expect that inventory.knit
> in repository also will be reduced.
> 
> I'm also make a one step towards implementing versioned properties
> (http://bazaar-vcs.org/VersionedProperties).

I'm pretty keen on doing things in small steps; the specification in
question is https://launchpad.net/bzr/+spec/versioned-properties and
it is marked there as 'needs guidance'. My suggestion is to focus on a
lossless transform of the current knit3 repository format contents, to
give you a single goal.
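
A round-trip check is one way to pin down "lossless": whatever goes into
the writer must come back unchanged from the reader. What follows is only
a minimal sketch under assumptions. The five-field layout and the names
FIELDS, write_inventory and read_inventory are made up for illustration;
they are not bzrlib APIs, and the real knit3 inventory carries more
attributes than this.

```python
# Hypothetical field layout, for illustration only.
FIELDS = ("file_id", "kind", "parent_id", "name", "revision")

_UNESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}


def _escape(value):
    # Backslash must be escaped first so the later replacements stay unambiguous.
    return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def _unescape(value):
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            # Consume the character following the backslash.
            out.append(_UNESCAPES[next(chars)])
        else:
            out.append(ch)
    return "".join(out)


def write_inventory(entries):
    """Serialise a list of entry dicts, one tab-separated line per entry."""
    return "".join(
        "\t".join(_escape(entry[field]) for field in FIELDS) + "\n"
        for entry in entries
    )


def read_inventory(text):
    """Parse text produced by write_inventory back into entry dicts."""
    entries = []
    # Every line is newline-terminated, so the final split element is empty.
    for line in text.split("\n")[:-1]:
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError("wrong field count: %r" % (line,))
        entries.append(dict(zip(FIELDS, (_unescape(v) for v in values))))
    return entries
```

The escaping is there because file names may contain tabs or newlines; a
format that silently mangles those would not be lossless.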

> In attachment you can find my module tabseparated.py with new serializer,
> test_tans.py script to manual testing and inventory.xls -- bzr.dev inventory
> that converted to new format. This file easily opened with OpenOffice.
> 
> I have some questions and need some guidance for next steps on this.
> 
> 0) Does my work have sense to continue?

I think so. Getting a smaller inventory capable of our current
operations does not prevent us coming back later and writing a more
capable format.

> 1) How to benchmark speed with using new inventory format? I expect it shoud be
> faster but I can't predict real value.

The benchmark suite should help; adding benchmarks there for 'pull'
operations, and 'file_ids_affected_by_revisions' (if it's not there
already) will help. We can also do real-world tests by converting e.g.
mozilla to this format and testing. I'm in favour of using the benchmark
suite primarily, with reference to real-world data to craft test data.
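
As a rough illustration of crafting test data and timing both parsers,
here is a standalone sketch using only the standard library. It does not
use the bzr benchmark suite, the XML it produces is a simplified stand-in
rather than the real inventory XML, and all function names are
hypothetical. Treat any numbers it produces as indicative at best.

```python
import timeit
import xml.etree.ElementTree as ET


def make_rows(n):
    # Synthetic entries: (file_id, kind, parent_id, name, revision).
    return [("id-%d" % i, "file", "root-id", "name-%d" % i, "rev-1")
            for i in range(n)]


def to_xml(rows):
    root = ET.Element("inventory")
    for file_id, kind, parent_id, name, revision in rows:
        ET.SubElement(root, kind, file_id=file_id, parent_id=parent_id,
                      name=name, revision=revision)
    return ET.tostring(root)


def to_tsv(rows):
    return "".join("\t".join(row) + "\n" for row in rows)


def parse_xml(data):
    return [(e.get("file_id"), e.tag, e.get("parent_id"),
             e.get("name"), e.get("revision"))
            for e in ET.fromstring(data)]


def parse_tsv(text):
    return [tuple(line.split("\t")) for line in text.split("\n") if line]


def compare(n=1000, repeat=5):
    """Return parse timings and sizes for both encodings of the same rows."""
    rows = make_rows(n)
    xml_data, tsv_data = to_xml(rows), to_tsv(rows)
    # Both parsers must recover the same rows before timing means anything.
    assert parse_xml(xml_data) == rows
    assert parse_tsv(tsv_data) == rows
    return {
        "xml_seconds": min(timeit.repeat(lambda: parse_xml(xml_data),
                                         number=10, repeat=repeat)),
        "tsv_seconds": min(timeit.repeat(lambda: parse_tsv(tsv_data),
                                         number=10, repeat=repeat)),
        "xml_bytes": len(xml_data),
        "tsv_bytes": len(tsv_data.encode("utf-8")),
    }
```

Calling compare() returns a dict of timings and byte counts; taking the
minimum over several repeats reduces noise from other activity on the
machine.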

> 2) Why we are using 2 similar formats v5 and v6? Why for working inventory
> used v5 -- for speed-up reasons? Does I need implement v7 and v8 formats,
> or I need one rich format a-la v6?

Historical reasons. I suggest just writing a replacement for what's in
knit3, as that is planned to be recommended as the default soon.

> 3) I don't understand how versioned properties should be extended? Does I need
> simply throw away unrecognized properties? Can I add to specific InventoryEntry
> classes (InventoryDirectory, InventoryFile, etc) some support for packing/unpacking
> of versioned properties? Specification http://bazaar-vcs.org/VersionedProperties
> says that "Inventory and InventoryEntry will get proplist attributes, that will hold the
> properties". Does it means that we need shine new inventory2.py file with new
> implementation?

I'd skip this for now - we'll get real wins by 'just' shrinking
inventory sizes by 23%.
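
For reference, the 23% figure follows directly from the two sizes quoted
above for the bzr.dev inventory:

```python
xml_size = 73619   # bytes, current XML inventory (figure from the quoted mail)
tsv_size = 56476   # bytes, tab-separated inventory (figure from the quoted mail)

# Fraction of the original size that the new format saves.
saving = (xml_size - tsv_size) / float(xml_size)
print("%.1f%%" % (saving * 100))  # prints 23.3%
```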

> 4) What tests I need to write for new serializer?

I'd expect it to be fairly well covered by the repository implementation
tests, but make sure there are tests that cover all the
permutations you can think of:
- inventory revision flag
- serialising of each inventory entry type, with all permitted value
types in each attribute
- serialising of invalid entries (e.g. a dir with a symlink_target
should error)
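
The last two items in the list above could look roughly like this. It is a
sketch only: serialise_entry and the dict-based entries are hypothetical
stand-ins, not the InventoryEntry classes or the serializer from the
attached module, and the field order is an assumption.

```python
import unittest

KINDS = ("directory", "file", "symlink")


def serialise_entry(entry):
    """Serialise one hypothetical entry dict to a tab-separated line."""
    kind = entry["kind"]
    if kind not in KINDS:
        raise ValueError("unknown kind: %r" % (kind,))
    # Attributes that only make sense for one kind are rejected elsewhere.
    if kind != "symlink" and entry.get("symlink_target") is not None:
        raise ValueError("%s cannot have a symlink_target" % kind)
    if kind != "file" and entry.get("text_sha1") is not None:
        raise ValueError("%s cannot have a text_sha1" % kind)
    fields = [
        entry["file_id"],
        kind,
        entry["name"],
        entry.get("symlink_target") or "",
        entry.get("text_sha1") or "",
    ]
    return "\t".join(fields)


class TestSerialiseEntry(unittest.TestCase):

    def test_each_kind_serialises(self):
        for kind in KINDS:
            line = serialise_entry(
                {"file_id": "id-1", "kind": kind, "name": "x"})
            self.assertEqual(line.split("\t")[1], kind)

    def test_symlink_keeps_target(self):
        line = serialise_entry({"file_id": "l-1", "kind": "symlink",
                                "name": "link", "symlink_target": "dest"})
        self.assertEqual(line.split("\t")[3], "dest")

    def test_directory_with_symlink_target_errors(self):
        with self.assertRaises(ValueError):
            serialise_entry({"file_id": "d-1", "kind": "directory",
                             "name": "src", "symlink_target": "oops"})

    def test_unknown_kind_errors(self):
        with self.assertRaises(ValueError):
            serialise_entry({"file_id": "q-1", "kind": "fifo", "name": "p"})
```

The point of the invalid-entry tests is that the serializer refuses bad
data at write time, rather than storing something that cannot be read back
sensibly.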


> 5) How to write converter for upgrade?

The repository copy-converter should mean that you don't need to write an
explicit converter.



-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.