Small note for serialized inventories

John Arbash Meinel john at arbash-meinel.com
Thu Feb 8 19:33:49 GMT 2007


Alexander Belchenko wrote:
...

> I thinking about keep inventory in one file but split by directories
> and have fast access to any directory via separate index file.
> Because Mozilla sources inventory has near 5000 directories,
> so keep each sub-inventory in separate file could be lose in speed.

That depends on a lot of things. One thing we are hoping to improve is
to be able to know "That directory isn't under consideration, don't load
it".

Which means that when we split 50,000 files into 5,000 directories, and
we only need to read 10 directory blocks == 100 file records.

So yes, splitting means reading everything is slower. But the goal is to
avoid that whenever possible.

Dirstate would still have an all-in-one because it is common to need the
whole tree for working operations.

But commit, pull, push, diff, etc can already have culled the list down
into a few entries, and now you can read just a few files.

> 
>> If we do directory-block all-in-one (like dirstate is), then I would
>> recommend including the full path to the file.
> 
> Why for? Inside directory block we could switch to inner diff loop with the same
> behaviour, but using info about path prefix implicitly.
> 
> Don't writing full path of each file is big save in size of inventory file.
> And therefore it's a big win to read/write operations.
> 

I don't know if you noticed what I told you in the other email, but one
way or another you have to know what your path prefix is. And a lot of
implicit schemes fail because you have to jump in and out of records.

It would be possible, I guess to use a logical structure that uses a
separate identifier for "end of directory block", and always has 1
directory block entry for every directory even if that directory is
empty. Something like

#start block ''
a file record
b file record
c dir record
d file record
e dir record
#end block ''
# start block 'c'
f record (e/f)
g record (e/g)
# end block 'c'
# start block 'e'
# end block 'e'

You wouldn't really need both start and end, and if you assumed a
certain order of things, you wouldn't need to label the blocks
You just need to have a "start block" record for every possible directory.

However, this file is really hard to partially parse. As information for
a given record is scattered all over the file

Like if you want to read 1 file's full path. You generally have to read
everything to get there, because there are no back pointers. And no
redundant information.



>> Just something to keep in mind.
> 
> I try to figure out what the best way to deal with moving file to another directory
> and/or rename? Keep file by directory block means that in one place it will be deleted
> and in another it will be added, and only with unique fileid it's will be detected
> as move/rename.
> 

You could mark the deleted record as "renamed to" and the added record
as "renamed from"

That has some very nice properties for working inventories. For
long-term serialization it is a limitation that you have a snapshot,
when sometimes you want a delta.

Partly I've advocated having a separate "changes" file, which mentions
all the things that have happened (at least what file-ids/paths have
been effected). So you can look up the delta quickly, and then jump into
the storage.

The last time I proposed it, the redundancy was considered a negative,
because you now have the possibility of disagreeing. The changes file
could be considered a cache of extracting to Inventories and doing a
delta against them. If you record this, you may end up disagreeing if
you did the delta again.

However, since our current operations are fairly slow, and expect to be
able to do that exact action on the contents of inventory.knit. And that
delta is known to be incomplete for deletions (which is a general
deficiency about all of our indices as .kndx does not record when a
delete occurs).

John
=:->



More information about the bazaar mailing list