RFC: caching in CHKInventory

Fri Oct 3 00:06:42 BST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm not 100% sure I understand the implications, but I'll try to give
some feedback.

Robert Collins wrote:
> I'm building up a CHKInventory class, which is a demand-loadable
> inventory representing a serialised inventory only. (So its not used, or
> intended to be used during e.g. commit).

So this is an in-memory structure, similar to Inventory? So that
RevisionTree.inventory = CHKInventory(some-filesystem-info)?

I guess I'm trying to understand what you will be caching. Are you
caching inventory pages? In serialized or unserialized form? What key
are you caching them under? Are you thinking of caching *between*
CHKInventory objects, or are you caching within an instance?

> 
> I'm wondering whether to cache, or not. The CHKMap class will cache
> (because writing nodes out before all changes are made is expensive, and
> we need a way to make-and-update the maps, and our experience with BTree
> suggests a small cache is really needed). The CHKMap is currently
> mutable to allow creation, but I may make it immutable and create it by
> 'update-from-empty' using the 'apply delta' operation. There will be 1
> or 2 maps per CHKInventory. (path->id, id->path, are the two maps, but
> only one is needed to be 'functional' - and I'm at 'make it work').

I guess we cache in an in-memory dictionary as you "add" items to BTree,
and then we flush them out to disk in sorted order. (With a bit extra to
spill early nodes to disk when we get memory constrained, and then
"merge sort" the indexes back together at finish() time.)

Or are you thinking about the "read" phase, where we do cache the
deserialized pages.

Is CHKMap the best name for what you are thinking of? (mapping paths =>
file-ids.) It is just being hard for me to directly understand what the
class does. I suppose it is similar to a BTreeIndex only where the page
keys are CHK's?

Note that I think the final key is going to be a path or a file id,
which *isn't* a CHK. Which is probably my confusion. CHK seems like an
intermediate and internal-only detail.

> 
> The apply-delta operation at the inventory level will generate a new
> CHKInventory rather than mutating the existing one; which is why I am
> pondering caching at all at this level: accessing individual entries is
> done by a lookup into a CHKMap, and then a parse operation.
> 
> Any strong opinions?
> 
> -Rob
> 
> 

So I guess the question is... should we be caching the internal pages of
the CHKInventory (from Root on down to the leaves which have the actual
inventory contents).

I would tend to say yes, for at least the structure nodes. (For btree,
this would be the _InternalNodes but not necessarily the _LeafNodes.)

Also, note that this cache is perfectly reasonable to keep between
instances, because they CHK means that it is the same page in both. I
suppose the stricter problem is knowing the lifetime for entries in the
cache. Though if you make it size constrained, then they expire based on
memory-pressure either way.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkjlVAIACgkQJdeBCYSNAAPcQQCeJKKEzWKoUsJd6Eiq4yFDUoQK
oL4An14jDY/Jgt4cKWerp/ryHbdlHjO2
=jxKW
-----END PGP SIGNATURE-----