[MERGE] Add an InventoryEntry cache for xml deserialization

Sat Dec 13 09:36:35 GMT 2008

>>>>> "jam" == John Arbash Meinel <john at arbash-meinel.com> writes:

<snip/>

    jam> Anyway, I ended up running into some cache-lifetime
    jam> issues, and figured it would be reasonable to open it up
    jam> for discussion how we want to handle it.

    jam> 1) Many, many, many tests re-use the same revision ids
    jam> and file ids with different actual values.

Because we rely on the assumption that we don't use any shared
fixture in TDD parlance.

    jam> This is pretty trivial to fix by adding a
    jam> "_entry_cache.clear()" as part of TestCase.setUp()

Deadly sin (I'm sure you know that or you wouldn't bring that up
to discussion, don't you ? :-).

Are you really suggesting that all global caches should be
cleared there ? TestCase.setUp() should remain minimal IMHO.

The last time we introduced a global cache we end up getting rid
of it (sftp connections).

And we get rid of it mostly for test purposes (if I remember
right) on the basis that global caches are evil.

    jam> 2) There are a couple of tests that re-use file_ids and
    jam> revision_ids *within* the test, but in a different
    jam> tree.

And a different repository ?

    jam> I can fix this by clearing the cache in the inner loop
    jam> before the next permutation is run, by changing it to
    jam> not re-use the revision ids, etc.  But it did make me
    jam> wonder if we would do better with a different fix.

And you're right, this is clearly an indication that our policy
to *avoid* global caches has merits.

And it also explain why so many tests were failing. I wanted to
review your patch yesterday and was scared by the number of test
failures... I'm far less scared now that you mention the root
cause.

And to me, the fact that introducing a *cache* breaks so many
tests is a clear indication that something is wrong in the cache
implementation or usage, *not* in the test suite. 

    jam> 3) At one point I switched to having each XMLSerializer
    jam> have its own entry cache, in case that was actually
    jam> necessary. ATM I think it is actually a YAGNI.

    jam> 4) I've been considering restructuring things so that
    jam> the repository passes the entry_cache to
    jam> self._serializer.deserialize_inventory(), as this makes
    jam> it obvious that the lifetime of the cache is bound to
    jam> the Repository.

The heart of the problem as you rightly point it out here is the
lifetime of the cache.

By making it global you cheaply maximize the reuse rate at the
expense of memory consumption but doesn't really address the
lifetime problem. Who will clear the cache ? When ?

My vote would go to either 3) or 4) depending on who is really
responsible for the lifetime and what make managing (clearing,
resizing, sharing) the cache the most natural

    jam> Which helps for us to know that the caching rules won't
    jam> be violated. The main downside is that something that
    jam> ends up dealing with two mostly-identical repositories
    jam> will not benefit.

Should be rare enough to neglect. At worst, and only if really
worth it, the two repos can negotiate to share their caches.

    jam>    The original use-case I was trying to handle is the
    jam>    "extract all revision trees from the repository"
    jam>    which works just fine here, but there are other use
    jam>    cases where an entry cache would be helpful.

I'd love to have one for "bzr log -v"... but i"m pretty sure that
in that case I'd love to be able to clear it (or at least purge
it more aggressively), at various points.

Don't we have cases where we'd want two (or more) different
caches targeted at different history points ?

In that case maybe we'd want to be able to attach/detach caches
to the serializers themselves ?

And what if bzrlib is used for several repositories ? 

And what if we want to take the working tree size into account
when dimensioning the cache ?

Said otherwise: you just find a case when a change in the cache
policy was a net improvement, it's obvious to me that more cases
will appear in the future. The solution then will not be to
multiply global caches.

I want your patch to land quickly so I'm not asking for a perfect
solution *now*, use any solution you want without falling into
YAGNIs, but please avoid the global cache approach.

    jam> I've gone down a couple different routes, and right now
    jam> I'm leaning towards having a single entry cache, and
    jam> just clearing it out at appropriate times during the
    jam> test suite.

    jam> In the "real-world" if we had revision_id collisions, we
    jam> would have much more serious problems than just
    jam> inventory entry caching,

Sure.

    jam> so I'm considering the problem as mostly just a test
    jam> suite not being properly isolated between test cases
    jam> issue.

I disagree there, the test suite relies on the assumption that we
don't use shared fixtures[1], let's keep that assumption.

      Vincent

[1] http://xunitpatterns.com/Shared Fixture.html