index paging and caching

Thu Jul 26 14:13:52 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I plan to test the performance again with john's 'difference_update
> sucks' patch. Early next week I hope to put in place a caching layer for
> my index's. This will operate on a page basis, with 4K pages on
> LocalTransports and 64/128K pages on remote transports.
> 
> There are some open questions in my mind:
>  - how big should the page cache be
>  - when should it be discarded
>  - when should it be invalidated
> 
> These indices are write-once, so I don't think it needs invalidation or
> dirty marking at all.

Agreed.

> 
> I'm inclined to discard the cache when the repository becomes unlocked,
> though if we were not worried about memory we could keep it until the
> repository object is discarded.

To date, we have always followed the "maintain a lock to maintain the cache".
So I would be fine with continuing that.
However, we have discussed caching with a way to detect out-of-date (such as by
stat'ing the dirstate file, etc). So that we don't have quite the same expense
when someone forgets to hold the lock.

If you do maintain it between locks, then I think we will want a way to clear
it. (Think Olive where it is a long running process that may access lots of
repositories).

I would tend to do a maximum cache size. Like 1MB or maybe up to 10MB.
(Naturally there should be an easy way to change this).

> 
> In terms of size, we have several different indices:
> signatures (texts, no deltas, no graph)
> inventories (texts, deltas, no graph)
> file texts (texts, deltas, graph)
> revisions (texts, no deltas, graph)
> 
> we will see a massively larger file text index than the rest combined -
> consider mozilla with 15K file ids, thats up to a 15K:1 ratio to the
> inventory index.

um, Mozilla has 55k file ids not 15k.

Doing a quick cat | wc -l, I get 177k lines in inventory.kndx, and 640140 lines
in knits/*.kndx

Now, wc -l has 1 extra line versus the actual number of revisions, and there
are 64087 file ids.

So we have (640140-64087) / (177,489) = 576053/177489 = 3.24:1

Which is actually more reasonable. Because any given revision only changes a
few files. (I found that for the Kernel about 95+% of revisions changed <10 files).

So it may not be as many more as you think. It *is* more, and you wouldn't
really want to mix the indicies (a 3:1 ratio is quite a bit of stuff to sift
through).

One of the big wins (IMO) is that inventories and revisions should be able to
share an index (since it uses the same keys). At least, it would be nice if
they could.

> 
> So I'm inclined to have separate caches for each type of index, but
> share the cache amongst the different component indicies of each type,
> and finally size it by a heuristic based on the number of different
> index files being used. e.g. if there are 100 index files, then allow
> 100 pages to be used.

That would be okay, but it would penalize having 1 large index versus lots of
small ones. Unless your plan is to always have them laid out in an exponential
backoff. Having it based on log(total revisions) seems like a good layout, and
exponentially sized indexes would do that.

But I think it is also reasonable to just have 1MB of cache space per
repository, use it as you will.

> 
> Thoughts?
> 
> -Rob

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGqJ4PJdeBCYSNAAMRAiZEAJ9CMcWlb9E416JINOJXiv9pfFO+XwCfZZLi
nKnn7j2Y87L6v3/Ae5Kce1I=
=2p5M
-----END PGP SIGNATURE-----