[MERGE] Add an InventoryEntry cache for xml deserialization

John Arbash Meinel john at arbash-meinel.com
Thu Dec 11 20:05:51 GMT 2008

Hash: SHA1

Robert Collins wrote:
> Robert Collins has voted approve.
> Status is now: Approved
> Comment:
>> 1) First an assumption. For a given (file_id, revision_id) pair, the
>> InventoryEntry must be the same.
> This is only guaranteed true for the same model repo - E.g. tree roots
> are different between plain and rich-root. Also there was a very short
> lived bug back around 0.0.8 that misrecorded the exec bit for about 2
> commits; but I think this is ignorable.

I thought they we actually the same because we force non-rich-root
repositories to return (root_id, tree_revision_id) for the root node, to
prevent any problems.

> split-inv might benefit from a cache too; and for split-inv the source
> string is a safe cache key (it is safe in xml inventories too).

This is certainly possible. I'm not sure where to fit it in around the
existing "_entry_cache" etc.

>> 5) A FIFO cache is a lot better for this reason, but we have to take
>> more care when we actually fill the cache. This is because 90% of what
>> is in the cache we want to re-use, the records come to us in "sorted"
>> order. So once we start pushing off a couple of the records, we tend to
>> push out all of the interesting ones, and restart the cache.
> This is usual for FIFO caches; though your fifo cache looked more like a
> LRU cache implemented with a FIFO queue to me (because each use popped
> and readded the item (unless I misread it)).

This doesn't happen on *access* this happens when you *replace* an item.
Which I don't have any other reasonable answer for. For InventoryEntries
we won't be changing them. Other users of the cache might.

I'll mention that replacing an existing entry is expensive, because
popping an item out of the middle of a deque is expensive.

> A dual queue (e.g.
> frequency/lru) can perform a lot better in terms of hit rate, though at
> a computation cost.

Yeah, I thought out a structure using a dict + linked-list that would
allow an LRU without as much overhead. Every access would just pop the
item to the back of the linked-list queue. So still more overhead that
dict.__getitem__ but far less than the current implementation.

In the short term... we have this and it works well. Resizing it allows
a reasonable amount of work before it gets flushed, while still allowing
memory to be limited.

>> 8) One other really big win that I didn't do is to get rid of calling
>> InventoryEntry.copy() for files and symlinks. This is safe to do if the
>> Inventories are treated as readonly, or if "apply_delta" would create
>> new File records, but I'm not sure if we can trust Inventory users to
>> not mess with the entries in our cache.
> We can't :)
>> It *is* a fairly large win (25s => 20s, or about 20% of remaining time).
>>  So I would consider adding a flag somewhere to tell the deserializer
>> that "I know I have to treat these as readonly".
> I'd introduce immutable versions of the objects; CHKInventory wants this
> too.

So my idea was to have a flag for the code to indicate that it promises
not to mutate them. Converting the first 1k revisions of MySQL drops
from about 2.5m-3m to 2m with non-copied file records.

I agree immutable would be nice, but I don't see how to do that without

1) Going down to C/Pyrex where you can easily declare an attribute readonly
2) Replacing the attributes with properties, but I imagine that will add
a certain amount of overhead. I'm not sure how much versus (1).

Anyway, in the short term we can get most of the benefit, and have an
obvious place to toggle if we want to get the rest.

And again, I don't want to spend a huge amount of effort on the existing
inventory code, because no matter what we do, it is still an O(tree)
operation. But it would be nice to be able to convert mysql in say an
hour, instead of it taking overnight.

Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org


More information about the bazaar mailing list