[MERGE] _generate_text_key_index

John Arbash Meinel john at arbash-meinel.com
Mon Nov 19 19:23:25 GMT 2007



Robert Collins wrote:
> Thanks for the review. I'm going to reply to it as an action list for
> me to consult later when I make the changes. (I'll let all the component
> branches be reviewed and then merge one resulting larger patch.)
> 

...

>> ...
>> +        # Set a cache with a size of 10 - this suffices for bzr.dev but may be
>> +        # too small for large or very branchy trees. However, for 55K path
>> +        # trees, it would be easy to use too much memory trivially. Ideally we
>> +        # could gauge this by looking at available real memory etc, but this is
>> +        # always a tricky proposition.
>> +        inventory_cache = lru_cache.LRUCache(10)
>>
>> You could instead use an LRUSizeCache of the inventories. Inventories
>> already have __len__ defined, and we know that a single Inventory entry
>> is approximately a fixed size (a couple of shared file ids, a sha
>> string, a size, etc.). That would be a lot more accurate than a fixed
>> size of 10. Instead you could, say, store approx 10,000 Inventory
>> entries.
> 
> Ok. What does that look like? Separately and speculatively, I've been
> wondering about using the tail-joining strategies some functional
> languages use for optimising multiple inventory extraction. That is,
> reusing common inventory entries by having them in multiple dicts, and
> doing a copy when they are actually altered.

# Cache inventories, but hold no more than 1M InventoryEntries in memory.
# Inventory.__len__ returns the number of entries, so we can simply use
# len(inventory) as our size metric
inventory_cache = lru_cache.LRUSizeCache(max_size=1024*1024)

Basically, the LRUSizeCache just calls a function on each object to figure out
how big it is, and tracks the total in use so it knows when to garbage collect.
You can also supply 'after_cleanup_size=XX' to control how much it removes when
it decides to GC (the default is to GC down to 1/2 of max_size). Also, any
individual entry won't be cached at all if it is bigger than
"after_cleanup_size".

I was using it for file texts: I didn't want a 50MB file to suddenly wipe out
my cache, and I wanted to avoid running the GC code excessively, so I try to
clear out a bit of a buffer with each GC.
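
Roughly, that setup looked something like this (the sizes here are made up for
the example, not the actual numbers I used):

# a single 50MB text is bigger than after_cleanup_size, so it simply won't
# be cached at all, and can't evict everything else in one go
text_cache = lru_cache.LRUSizeCache(max_size=20*1024*1024,
                                    after_cleanup_size=10*1024*1024)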


...

>> +                    parent_heads = text_graph.heads(candidate_parents)
>> +                    new_parents = list(parent_heads)
>> +                    new_parents.sort(key=lambda x:candidate_parents.index(x))
>>
>> ^- Interesting, I wonder if this is better than:
>> parent_heads = text_graph.heads(candidate_parents)
>> new_parents = sorted(parent_heads, key=...)
> 
> This might be a tiny bit faster.
> 
>> or my preference:
>> new_parents = [p for p in candidate_parents if p in parent_heads]
> 
> This is wrong:
> candidate_parents = ['x', 'x']
> parent_heads = set(['x'])
> =>
> new_parents = ['x', 'x']
> 

good catch.
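
To spell that out with a quick interpreter session (same names as above):

candidate_parents = ['x', 'x']
parent_heads = set(['x'])

# the list comprehension keeps the duplicate:
[p for p in candidate_parents if p in parent_heads]
# -> ['x', 'x']

# sorting the heads by first occurrence does not, because heads() already
# returned a de-duplicated set:
sorted(parent_heads, key=lambda p: candidate_parents.index(p))
# -> ['x']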

...
John
=:->



