Small note for serialized inventories

John Arbash Meinel john at arbash-meinel.com
Thu Feb 8 16:07:26 GMT 2007


One small thing I wanted to mention. As you are both looking into
possible new formats for serialized inventories. I would recommend
explicitly writing out the text for whatever sort key is used.

I tried implementing a custom diff algorithm for inventory texts, since
we know that the are in sorted order. Which theoretically means we can
process them in O(N) time, rather than O(N^2) (O(NlogN)?). Simply
because we can iterate through both sides, and if we have a match, go to
next. If A > B, B++, B > A, A++. It is single pass, and each node is
only evaluated 1 time.

However, our current inventory form is sorted by path blocks
path/to/file is sorted by ['path', 'to', 'file'], but we only store the
'name' and 'parent-id' to be able to figure out what that is. Which
means I need to do a lot of processing to track the current ['path',
'to']. So for at least bzr sized inventories (and even a little bit
bigger) there is a net loss because the constant factors override the
O(N). It might be again for 50,000 entry inventories, but I haven't
really wanted to spend too much effort on it.

Anyway, if we switch to directory split inventories, and then sort by
filename, we have that property quite cheaply. So I think a simple diff
algorithm could be easily created.

If we do directory-block all-in-one (like dirstate is), then I would
recommend including the full path to the file.

Just something to keep in mind.

John
=:->



More information about the bazaar mailing list