Small note for serialized inventories

John Arbash Meinel john at arbash-meinel.com
Fri Feb 9 16:07:18 GMT 2007


Alexander Belchenko wrote:
> John Arbash Meinel ?8H5B:

...

>> #start block ''
>> a file record
>> b file record
>> c dir record
>> d file record
>> e dir record
>> #end block ''
>> # start block 'c'
>> f record (e/f)
>> g record (e/g)
>> # end block 'c'
>> # start block 'e'
>> # end block 'e'
> 
>> You wouldn't really need both start and end, and if you assumed a
>> certain order of things, you wouldn't need to label the blocks
>> You just need to have a "start block" record for every possible directory.
> 
>> However, this file is really hard to partially parse. As information for
>> a given record is scattered all over the file
> 
> The scheme above is near to my thoughts but I still think that each block
> should be for items in directory without subdirectory. And for each
> directory block of course it's better to write full path. So I end up
> with file like
> 
> directory '' TREE_ROOT
> file a
> file b
> symlink e
> directory d d-XXXXXXX-YYYYYYYY
> file aa
> file bb
> directory d/foo
> directory d/foo/bar
> ...
> 
> etc.
> 
> So start of block is easily detected. End of block -- it's a start of next
> block.

The problem with that is what happens when a directory block is empty? I
might be missing something in what you wrote.
But you need to have an entry in the parent location (to say the
directory is a child of the other directory). For the above, I would say
you need:

directory '' TREE_ROOT
dblock ''
file a
file b
directory d
symlink e
dblock d
file aa
file bb
directory foo
dblock d/foo
file cc
directory d/foo/bar

So this would add an explicit "directory block" section.

One problem with that, is that while parsing you now need an if/then
statement, to tell what you are parsing, rather than being able to parse
each record without needing context.

Generally parsing without context will be a lot faster, and is
frequently worth the size tradeoff.

Also, if you are working on decreasing the size, consider switching to
'd', 'f', 'l' for the entry names. Since that removes at least 3 and as
many as 8 characters from each line. (4*50,000 == 200KB saved)


> And with index file that give me offset and size of each directory block
> I could read those inventory partially. And this index is easily regenerated
> from the whole inventory file.
> 
> This scheme also has one big advantage: we can drop parent_id.
> And resulting inventory file for Mozilla tree then highly reduced
> (converting to nul-separated file and drop parent_id give by 47% smaller
> file size).

I'm still wondering if you are missing the ability to properly generate
a full inventory. And I would like to avoid an index if we can get away
with it. (Since now we need 2 files rather than 1).

> 
>> Like if you want to read 1 file's full path. You generally have to read
>> everything to get there, because there are no back pointers. And no
>> redundant information.
> 
> Then I need to know in what directory this file resides. And then get
> full directory path as prefix.
> 

If you are going from a working tree, you pretty much always know the
full path.

If you are going from a revision tree, then you may only know the
file-id, and be wanting to translate that to a path. Which generally
involves reading the whole inventory.

>>>> Just something to keep in mind.
>>> I try to figure out what the best way to deal with moving file to another directory
>>> and/or rename? Keep file by directory block means that in one place it will be deleted
>>> and in another it will be added, and only with unique fileid it's will be detected
>>> as move/rename.
>>>
>> You could mark the deleted record as "renamed to" and the added record
>> as "renamed from"
> 
> In inventory itself? Hmmm. It make sense.

It makes sense for the working inventory, not as much for a committed
inventory. Since it is mixing delta information (moved relative to
parent) with what is generally considered a snapshot.

John
=:->



More information about the bazaar mailing list