[MERGE] Add an InventoryEntry cache for xml deserialization

John Arbash Meinel john at arbash-meinel.com
Mon Dec 15 15:15:35 GMT 2008



Vincent Ladeuil wrote:
>>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:
> 
>     jam> Vincent Ladeuil wrote:
>     jam> ...
> 
> <snip/>
> 
>     jam> Anyway, I'm guessing you missed my other email where I implemented (4)
>     jam> anyway.
> 
> Yes, that and the comments from Martin and Robert, they came in a
> later batch in my workflow that morning.
> 
>     >> 
>     jam> Which helps for us to know that the caching rules won't
>     jam> be violated. The main downside is that something that
>     jam> ends up dealing with two mostly-identical repositories
>     jam> will not benefit.
>     >> 
>     >> Should be rare enough to neglect. At worst, and only if really
>     >> worth it, the two repos can negotiate to share their caches.
>     >> 
>     jam> The original use-case I was trying to handle is the
>     jam> "extract all revision trees from the repository"
>     jam> which works just fine here, but there are other use
>     jam> cases where an entry cache would be helpful.
>     >> 
>     >> I'd love to have one for "bzr log -v"... but I'm pretty sure that
>     >> in that case I'd love to be able to clear it (or at least purge
>     >> it more aggressively), at various points.
> 
>     jam> There is always entry_cache.clear() for all of the
>     jam> caches I've written.
> 
> I was wondering about an intermediate purge (not the default one
> nor the full one).

So dict.clear() is always available, as is FIFOCache.clear() and
LRUCache.clear().

There is also FIFOCache.cleanup() and LRUCache.cleanup(), which purge
the cache down to "after_cleanup_size".
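For concreteness, a minimal sketch of that FIFO-with-cleanup behaviour
follows; the class name, sizes, and OrderedDict-based implementation are
illustrative only, not the actual bzrlib FIFOCache code:

```python
from collections import OrderedDict

class SimpleFIFOCache(OrderedDict):
    """Sketch of a FIFO cache with a bzrlib-style cleanup().

    When the cache grows past max_size, cleanup() purges the oldest
    entries down to after_cleanup_size.  clear() (from dict) is always
    available to drop everything.
    """

    def __init__(self, max_size=100, after_cleanup_size=80):
        super().__init__()
        self._max_size = max_size
        self._after_cleanup_size = after_cleanup_size

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        if len(self) > self._max_size:
            self.cleanup()

    def cleanup(self):
        # Purge oldest insertions until we are at after_cleanup_size.
        while len(self) > self._after_cleanup_size:
            self.popitem(last=False)  # FIFO: drop the oldest entry
```

An intermediate purge, as discussed above, would just be a cleanup()
call with a different target size.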

> 
>     jam> But I'm wondering what your specific use case is. Or are
>     jam> you thinking this is more something that is being
>     jam> written to disk?
> 
> Something along the lines of: 
> - I need the inventories for one revision and all of its parents
> - I want to keep the first parent entries in the cache more than
>   the other parents entries
> 
> So that I get the cache effect while processing the merged
> revisions but keep my mainline entries still in the cache.
> 
> I'm thinking aloud here (I still don't have code to support such
> a case, and I may never have), maybe two caches will work better
> (but can the serializer handle several caches at once ?).
> 
> <snip/>

At that point, you are caching Inventory objects (collections of
InventoryEntry objects) and not the IE objects directly.

I'm a bit curious why you want mainline more than others. Do you *know*
that you need it, or are you guessing?

The serializer just does a __getitem__ lookup, so you could write an
object that splits requests across multiple other objects. Though where
do you put it when it comes time to add a new one? Both? First? etc.
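Since the serializer only needs __getitem__, a wrapper like the
following would work; the class and the "insert into the first cache"
policy are hypothetical, chosen just to illustrate the open question:

```python
class SplitCache:
    """Sketch: present several backing caches as one mapping.

    Lookups try each cache in order.  Where a *new* entry should land
    (all caches? the first? the last used?) is exactly the unresolved
    policy question -- here we arbitrarily pick the first cache.
    """

    def __init__(self, *caches):
        self._caches = list(caches)

    def __getitem__(self, key):
        for cache in self._caches:
            try:
                return cache[key]
            except KeyError:
                pass
        raise KeyError(key)

    def __setitem__(self, key, value):
        # Arbitrary policy: new entries go into the first cache only.
        self._caches[0][key] = value
```

With a mainline cache first and a more aggressively purged cache
second, mainline entries would survive longer, per Vincent's use case.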

A better way for *inventories* (IMO) is how I did it for annotate: you
build up the graph of all objects and their actual dependencies, and
keep a 'reference counter' for each inventory (based on how many
children will need it). Then as you iterate back the other way, you
know what actually needs to be cached, and what can be thrown away
because nothing else references it.

That would be a "perfect" cache, in that everything would get built
once, and any other time you need it, it would still be there. The
expense is having no upper bound on how much gets cached, though you
could add "and keep no more than 100 of these" logic. You'd probably
want to figure out which one is going to be used the most, and keep that
over the one that is only needed by 1 other child. Or you could change
your decoding logic to process that child earlier than it would have
otherwise, because you want to expire its parent from your cache.

Anyway, there are options.

> 
> On the tests front, far less failures but still some in
> fast-import:
> FAIL: bzrlib.plugins.fastimport.tests.test_processor.TestRename.test_move_to_new_dir
>     [] is renamed
> not equal:
> a = 0
> b = 1
> 
> 
> FAIL: bzrlib.plugins.fastimport.tests.test_processor.TestRename.test_rename_in_root
>     [] is renamed
> not equal:
> a = 0
> b = 1
> 
> 
> FAIL: bzrlib.plugins.fastimport.tests.test_processor.TestRename.test_rename_in_subdir
>     [] is renamed
> not equal:
> a = 0
> b = 1
> 
> Since they are around fast-import using an inventory for its own
> purposes and apparently being tricked by the cache when trying to
> handle renames, I'm not sure where the bug should be fixed, but
> since I saw the test failures, I thought I should at least report
> them.
> 
>   Vincent
> 

bzr-fastimport reaches very directly into the Inventory internals and
IIRC does some things that we wouldn't recommend a generic client do.

I have the feeling the test isn't written properly, and is doing
something like renaming a file without updating its
last-modified-revision property, which is a big no-no anyway.

The cache should be quite insulated from people modifying it (in that I
do .copy() before inserting the object, and .copy() when pulling the
object out).
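That insulation amounts to copying in both directions, roughly like
this (the helper name and mapping-based interface are illustrative, not
the actual serializer code):

```python
def cached_copy_get(cache, key, build):
    """Hand out and store copies so callers can't mutate cached state.

    ``cache`` can be any mapping (dict, FIFOCache, ...) and ``build``
    any factory; values just need a .copy() method, as
    InventoryEntry objects have.
    """
    try:
        return cache[key].copy()   # copy on the way out
    except KeyError:
        value = build(key)
        cache[key] = value.copy()  # copy on the way in
        return value
```

So a caller mutating what it got back (e.g. renaming an entry) should
not corrupt the cached copy.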

John
=:->
