[RFC] store inventory in tab-separated file

Martin Pool mbp at canonical.com
Thu Feb 1 06:46:53 GMT 2007


On 29 Jan 2007, Alexander Belchenko <bialix at ukr.net> wrote:
> I wrote a draft implementation of a new serializer format that uses tab-separated
> text instead of XML. John Meinel often says that our weak point is the
> inventory, so I experimented with rewriting our serializer.
> 
> After converting the current bzr.dev inventory to the new format, its size
> dropped from 73619 to 56476 bytes, i.e. by 23%. I expect the effect to be much
> bigger on the kernel tree. I also expect that inventory.knit in the
> repository will shrink.
> 
> This is also one step towards implementing versioned properties
> (http://bazaar-vcs.org/VersionedProperties).
> 
> Attached you can find my module tabseparated.py with the new serializer, the
> test_tans.py script for manual testing, and inventory.xls -- the bzr.dev inventory
> converted to the new format. This file opens easily in OpenOffice.
> 
> I have some questions and need some guidance on the next steps.
> 
> 0) Does it make sense to continue this work?
> 1) How do I benchmark speed with the new inventory format? I expect it should be
> faster but I can't predict by how much.
> 2) Why are we using 2 similar formats, v5 and v6? Why is v5 used for the working
> inventory -- for speed? Do I need to implement v7 and v8 formats,
> or just one rich format a la v6?

> 3) I don't understand how versioned properties should be extended. Should I
> simply throw away unrecognized properties? Can I add support for packing/unpacking
> versioned properties to the specific InventoryEntry classes (InventoryDirectory,
> InventoryFile, etc.)? The specification http://bazaar-vcs.org/VersionedProperties
> says that "Inventory and InventoryEntry will get proplist attributes, that will hold the
> properties". Does that mean we need a shiny new inventory2.py file with a new
> implementation?

They're basically string->string dictionaries attached to the entry.
(Or, according to that spec, we might let the values be lists of
strings...)  I don't think the inventory or inventory serialization
should care what is in them.  You can simply add this to the
InventoryEntry class, and refuse to store it into old formats.
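
A minimal sketch of that idea, not the real bzrlib code: the class
InventoryEntrySketch, the exception UnsupportedProperties and the helper
check_storable are all hypothetical names made up for illustration.  It
only shows an opaque property dictionary on the entry and a serializer-side
check that refuses formats which cannot hold it.

```python
class InventoryEntrySketch:
    """Hypothetical stand-in for InventoryEntry; not the real bzrlib class."""

    def __init__(self, kind, file_id, name, parent_id, properties=None):
        self.kind = kind
        self.file_id = file_id
        self.name = name
        self.parent_id = parent_id
        # Opaque string -> string mapping; the inventory never interprets it.
        self.properties = dict(properties or {})


class UnsupportedProperties(Exception):
    """Raised when a format cannot represent an entry's properties."""


def check_storable(entry, format_supports_properties):
    """Refuse to write an entry that has properties into an old format."""
    if entry.properties and not format_supports_properties:
        raise UnsupportedProperties(entry.file_id)
    return True
```

Entries without properties still pass the check, so old formats keep
working for trees that don't use the feature.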

> """Tab separated file to hold inventory data
> 
> Each inventory entry has 4 mandatory parameters:
> 
>     kind, file_id, name, parent_id
> 
> All these parameters can be written in a tab-separated text file,
> easy to read and easy to write. Each line starts with kind,
> then the other parameters delimited by TAB.
> 
> Additional properties are written on subsequent lines,
> sorted alphabetically (for diff purposes).
> Each property line starts with TAB, followed by the property
> name, TAB, then the value (or multiple values separated
> by TAB).
> 
> XXX Does bzr allow filenames with TAB inside?
> 
> So a typical inventory of a working tree will look like:
> 
> # bzr inventory format 7
> 	revision_id	pqm at pqm.ubuntu.com-20070125194626-4ded330415b7276d
> file	bzrignore-20050311232317-81f7b71efa2db11a	.bzrignore	TREE_ROOT
> 	revision	xxxx
> 	text_sha1	xxxx
> 	text_size	xxxx
> 	executable	yes
> 
> Such a file should open easily in Excel or OOo (especially if the file
> has an .xls extension)
> """

Thanks for including the example and docstring, and for working on this.
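
To check my reading of the docstring, here is a minimal sketch of a reader
for that layout.  It is not taken from the attached tabseparated.py; the
function name parse_tab_inventory and the dictionary shapes are assumptions,
and it treats property lines that appear before any entry as belonging to
the inventory itself, which is how I read the revision_id line in the example.

```python
def parse_tab_inventory(text):
    """Parse the tab-separated layout described in the docstring (sketch)."""
    inventory_props = {}
    entries = []
    # Split on "\n" only; str.splitlines() would also split on other
    # characters that might legitimately appear in values.
    for line in text.split("\n"):
        if not line or line.startswith("#"):
            continue  # skip blank lines and the format header
        fields = line.split("\t")
        if line.startswith("\t"):
            # fields[0] is the empty string before the leading TAB.
            name, values = fields[1], fields[2:]
            target = entries[-1]["properties"] if entries else inventory_props
            target[name] = values
        else:
            kind, file_id, name, parent_id = fields
            entries.append({
                "kind": kind,
                "file_id": file_id,
                "name": name,
                "parent_id": parent_id,
                "properties": {},
            })
    return inventory_props, entries
```

Note that a TAB inside a name or value silently changes the field count
here, which is the risk with an unquoted tab-delimited format.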

Being able to open it in a spreadsheet is probably useful but tab is a
somewhat risky character.  Even if it's not common in filenames it might
well be wanted in property values.

One option I was thinking about was using bencode for each entry, with
newlines in between.  (Google for a Python implementation.)  That avoids
issues of quoting and would make it easy to put versioned file
properties in as a dictionary.  I'm not sure how it would perform
compared to what you already have.

  http://en.wikipedia.org/wiki/Bencode
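
A rough sketch of what I mean, written from the bencode rules on that page
rather than taken from any existing library; the function names bencode,
bdecode and read_entries are just for illustration.  Strings are
length-prefixed, so tabs and even newlines inside a value need no quoting,
as long as the reader decodes by length instead of splitting on newlines.

```python
def bencode(value):
    """Encode ints, strings, lists and dicts using the bencode rules."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])  # keys are sorted as raw bytes
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % (value,))


def bdecode(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            val, pos = bdecode(data, pos)
            result[key] = val
        return result, pos + 1
    # Otherwise it is a byte string: <length>:<bytes>
    colon = data.index(b":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


def read_entries(data):
    """Read bencoded entries that are separated by single newlines."""
    pos = 0
    entries = []
    while pos < len(data):
        entry, pos = bdecode(data, pos)
        entries.append(entry)
        if data[pos:pos + 1] == b"\n":
            pos += 1  # skip the separator between entries
    return entries
```

Decoded strings come back as bytes, so a real serializer would still have
to decide where to decode names and property values to unicode.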

Since 'revision', 'sha1', etc. occur for every file I think they should
be implied and on the same line.
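
Something like the following sketch, where the column order in FIXED_FIELDS
and the two helper names are only assumptions for illustration: the field
names are implied by position, so each entry is a single line with no
repeated keys.  Missing values (say, text_sha1 on a directory) are written
as empty strings here, which is also just one possible choice.

```python
# Hypothetical fixed column order; the names are implied by position.
FIXED_FIELDS = ("kind", "file_id", "name", "parent_id",
                "revision", "text_sha1", "text_size", "executable")


def format_entry_line(entry):
    """Write one entry as a single TAB-delimited line (no trailing newline)."""
    return "\t".join(str(entry.get(field, "")) for field in FIXED_FIELDS)


def parse_entry_line(line):
    """Split a line back into a dict keyed by the implied field names."""
    values = line.split("\t")
    if len(values) != len(FIXED_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIXED_FIELDS), len(values)))
    return dict(zip(FIXED_FIELDS, values))
```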

This sort of format should be fine when split up per directory in a
later change but let's by all means put it in all-in-one for now.


-- 
Martin


