[MERGE] _generate_text_key_index
John Arbash Meinel
john at arbash-meinel.com
Mon Nov 19 19:23:25 GMT 2007
Robert Collins wrote:
> Thanks for the review. I'm going to reply to it as an action list for
> me to consult later when I make the changes. (I'll let all the component
> branches be reviewed and then merge one resulting larger patch.)
>
...
>> ...
>> + # Set a cache with a size of 10 - this suffices for bzr.dev but may be
>> + # too small for large or very branchy trees. However, for 55K path
>> + # trees, it would be easy to use too much memory trivially. Ideally we
>> + # could gauge this by looking at available real memory etc, but this is
>> + # always a tricky proposition.
>> + inventory_cache = lru_cache.LRUCache(10)
>>
>> You could instead use an LRUSizeCache for the inventories. Inventories
>> already have __len__ defined, and we know that a single InventoryEntry
>> is approximately a fixed size (a couple of shared file ids, a sha
>> string, a size, etc.). That would be a lot more accurate than a fixed
>> count of 10; instead you could say "store approximately 10,000
>> InventoryEntries".
>
> Ok. What does that look like? Separately and speculatively, I've been
> wondering about using the tail-joining strategies some functional
> languages use to optimise extracting multiple inventories: that is,
> reusing common inventory entries by holding them in multiple dicts, and
> copying an entry only when one is actually altered.
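On the copy-on-write idea: a minimal sketch of that kind of entry sharing
might look like the following (purely illustrative, with made-up names -
none of this is existing bzrlib code):

import copy

class SharedInventory(object):
    """Share entry objects between inventories; copy on first mutation."""

    def __init__(self, entries=None):
        # file_id -> entry; the entry objects may be shared with other
        # SharedInventory instances.
        self._entries = dict(entries or {})
        # file_ids whose entries we own outright (safe to mutate in place).
        self._owned = set()

    def copy(self):
        # Cheap copy: share every entry, own none of them.
        return SharedInventory(self._entries)

    def modify(self, file_id, **changes):
        entry = self._entries[file_id]
        if file_id not in self._owned:
            # First mutation through this inventory: copy the shared
            # entry before changing it, so sibling inventories that
            # still reference the old object are unaffected.
            entry = copy.copy(entry)
            self._entries[file_id] = entry
            self._owned.add(file_id)
        for name, value in changes.items():
            setattr(entry, name, value)

As for what the LRUSizeCache version looks like: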
# Cache inventories, but hold no more than 1M InventoryEntries in memory.
# Inventory.__len__ returns the number of entries, so we can simply use
# len(inventory) as our size metric
inventory_cache = lru_cache.LRUSizeCache(max_size=1024*1024)
Basically, LRUSizeCache calls a function on each object to figure out how
big it is, and it tracks the total size in use so it knows when to garbage
collect. You can also supply 'after_cleanup_size=XX' to control how much it
removes when it decides to GC (the default is to GC down to 1/2 of
max_size). Also, any individual entry bigger than "after_cleanup_size"
won't be cached at all.
I was using it for file texts, and I didn't want a 50MB file to suddenly
wipe out my cache, and I also wanted to avoid running the GC code
excessively. So I try to keep a bit of buffer free with each GC.
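For example, the file-text setup was roughly like this (the parameter
values here are illustrative, not the exact numbers I used):

from bzrlib import lru_cache

# Cache raw file texts, keyed by (file_id, revision_id). The default
# compute_size is len(), which for a string is its size in bytes.
text_cache = lru_cache.LRUSizeCache(
    max_size=20*1024*1024,            # hold at most ~20MB of text
    after_cleanup_size=15*1024*1024)  # on GC, trim back down to ~15MB

text_cache[('file-id', 'rev-id')] = 'file contents...'
# A single 50MB text exceeds after_cleanup_size, so it simply isn't
# cached, rather than evicting everything else to make room for it.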
...
>> + parent_heads = text_graph.heads(candidate_parents)
>> + new_parents = list(parent_heads)
>> + new_parents.sort(key=lambda x: candidate_parents.index(x))
>>
>> ^- Interesting, I wonder if this is better than:
>> parent_heads = text_graph.heads(candidate_parents)
>> new_parents = sorted(parent_heads, key=...)
>
> This might be a tiny bit faster.
>
>> or my preference:
>> new_parents = [p for p in candidate_parents if p in parent_heads]
>
> This is wrong:
> candidate_parents = ['x', 'x']
> parent_heads = set(['x'])
> =>
> new_parents = ['x', 'x']
>
Good catch.
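For the record, a self-contained illustration of the difference:

candidate_parents = ['x', 'x']
parent_heads = set(['x'])   # heads() collapsed the duplicate

# sorted() yields exactly the heads, ordered by first appearance:
sorted(parent_heads, key=candidate_parents.index)
# -> ['x']

# The list comprehension re-introduces the duplicate:
[p for p in candidate_parents if p in parent_heads]
# -> ['x', 'x']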
...
John
=:->