Small note for serialized inventories

Alexander Belchenko bialix at ukr.net
Thu Feb 8 18:11:22 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel пишет:
> One small thing I wanted to mention. As you are both looking into
> possible new formats for serialized inventories. I would recommend
> explicitly writing out the text for whatever sort key is used.
> 
> I tried implementing a custom diff algorithm for inventory texts, since
> we know that the are in sorted order. Which theoretically means we can
> process them in O(N) time, rather than O(N^2) (O(NlogN)?). Simply
> because we can iterate through both sides, and if we have a match, go to
> next. If A > B, B++, B > A, A++. It is single pass, and each node is
> only evaluated 1 time.
> 
> However, our current inventory form is sorted by path blocks
> path/to/file is sorted by ['path', 'to', 'file'], but we only store the
> 'name' and 'parent-id' to be able to figure out what that is. Which
> means I need to do a lot of processing to track the current ['path',
> 'to']. So for at least bzr sized inventories (and even a little bit
> bigger) there is a net loss because the constant factors override the
> O(N). It might be again for 50,000 entry inventories, but I haven't
> really wanted to spend too much effort on it.
> 
> Anyway, if we switch to directory split inventories, and then sort by
> filename, we have that property quite cheaply. So I think a simple diff
> algorithm could be easily created.

I thinking about keep inventory in one file but split by directories
and have fast access to any directory via separate index file.
Because Mozilla sources inventory has near 5000 directories,
so keep each sub-inventory in separate file could be lose in speed.

> 
> If we do directory-block all-in-one (like dirstate is), then I would
> recommend including the full path to the file.

Why for? Inside directory block we could switch to inner diff loop with the same
behaviour, but using info about path prefix implicitly.

Don't writing full path of each file is big save in size of inventory file.
And therefore it's a big win to read/write operations.

> 
> Just something to keep in mind.

I try to figure out what the best way to deal with moving file to another directory
and/or rename? Keep file by directory block means that in one place it will be deleted
and in another it will be added, and only with unique fileid it's will be detected
as move/rename.

- --
Alexander

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFy2fKzYr338mxwCURAjTCAJ9Q4/WzD7bpgmFdPy+iKX/L3aPQnQCbBTa2
GmFQCwATaxwrwK7FuyHVMJQ=
=cHDP
-----END PGP SIGNATURE-----



More information about the bazaar mailing list