tortoise caching and crawling

Thu Apr 10 04:44:04 BST 2008

I've been looking in detail at the TortoiseSVN "cache" process (TSVNCache),
as described briefly in the document I've recently been sending to the list.
We are planning to build something similar for TortoiseBzr, and although we
plan a custom RPC mechanism, the rest of the program should be suitable for
use in a future, general-purpose RPC mechanism as mooted at the recent
London Sprint and on this list.  Therefore, what is discussed here is
relevant outside the narrow confines of Tortoise (indeed, most of what
TSVNCache does isn't actually Subversion-specific).

I've included details of what TSVNCache does below, but in summary, much of
the complexity relates to the ability of TSVN to show the recursive status
of a folder without blocking while that folder is walked (i.e., the folders
are walked in the background).  This requires TSVNCache to have a "crawler",
which may occasionally walk very large trees, and to cache the status of
these trees, even though many of the items are never actually viewed.

In light of this, and given that some people have previously indicated that
the crawling of sub-trees is a "mis-feature", I thought I'd solicit some
thoughts.  Is crawling a problem in itself, or only when it chews too many
resources/takes too long?  Is this something most people would disable if
given that option?  In other words, is this something we want to have, and
is it worth the complexity?  Personally, I think having the recursive status
for folders *is* valuable, but I may be in the minority (and will happily
avoid that complexity if I possibly can ;)

FYI, I'm taking 2 weeks holiday starting tomorrow evening - but I'll catch
up with any replies that I miss when I return.

Cheers,

Mark

A general overview of what TSVNCache does is:

* Accepts remote requests for the status of an SVN file or folder.
* If the status is cached and valid, return it.
* If the request is for a file, synchronously request the status of the
parent directory, cache all items and return the child's status (need to
confirm this).
* If the request is for a folder, return "no status" and queue crawling of
the directory (i.e., such a folder will initially show as unversioned in
Explorer.)
* For every item that is stored in a cache, a "file watcher" is potentially
created (see below)
* The cache persists itself to disk so it need not be rebuilt on restart,
but I believe the entire cache lives in memory.
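
To make the request handling above concrete, here is a minimal Python
sketch.  It is not TSVNCache's code: the names StatusCache, request_status
and get_dir_status are all made up for illustration, and the on-disk
persistence and cache validity checks are left out.

```python
import os
from collections import deque

NO_STATUS = "no status"


class StatusCache:
    """Hypothetical sketch of the request handling described above."""

    def __init__(self, get_dir_status):
        # get_dir_status(path) -> {child_name: status}.  This stands in for
        # the real version-control status call and is an assumption here.
        self._get_dir_status = get_dir_status
        self._cache = {}            # path -> status
        self.crawl_queue = deque()  # folders waiting for the crawler

    def request_status(self, path, is_dir):
        # Cached entries are returned immediately.
        if path in self._cache:
            return self._cache[path]
        if is_dir:
            # Folders are never walked synchronously: answer "no status"
            # now and let the background crawler fill in the real value.
            if path not in self.crawl_queue:
                self.crawl_queue.append(path)
            return NO_STATUS
        # Files: fetch the whole parent directory once and cache every
        # child, so sibling requests are answered from the cache.
        parent = os.path.dirname(path)
        for name, status in self._get_dir_status(parent).items():
            self._cache[os.path.join(parent, name)] = status
        return self._cache.get(path, NO_STATUS)
```

With this shape, a burst of requests for the files in one directory (which
is what Explorer tends to generate) costs a single status call.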

The crawler works as follows:

* The crawler recurses directory trees to determine status.  It runs at idle
priority - but does end up crawling fairly large trees occasionally.  While
this generally shouldn't degrade performance too much, it does still cause
the process to consume 100% CPU and thrash the disks when the machine is
otherwise idle, which may alarm some users.  It also has lots of other
performance tweaks, for example to avoid crawling while the shell is asking
for other items (but it's not clear why this is necessary when the thread is
at idle priority...)

* As the crawler completes a sub-directory (ie, as it knows the complete
recursive status of a directory), it sends a "shell notification" for the
directory.  Any explorer windows currently showing that item will refresh
that item's details, including the icon overlays.  This causes TSVN to
perform a remote request for the folder, but this time the status is cached,
so a true recursive status for the folder is shown.

* The crawler has lots of smarts to avoid crawling directories multiple
times, etc, and has methods available for explicitly invalidating items
(which are also smart - e.g., invalidating an item also invalidates the
statuses of all parent folders)
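
As an illustration of the last point, here is a rough Python sketch of how a
recursive folder status and parent invalidation could fit together.  The
SEVERITY ordering and both function names are assumptions of mine, not
something taken from TSVNCache; the cache is just a dict of path -> status.

```python
import os

# Hypothetical ordering: the higher number "wins" when statuses combine.
SEVERITY = {"normal": 0, "added": 1, "modified": 2, "conflicted": 3}


def recursive_status(own_status, child_statuses):
    """Fold a folder's own status with its children's recursive statuses."""
    return max([own_status, *child_statuses], key=SEVERITY.__getitem__)


def invalidate(cache, path):
    """Drop `path` and every ancestor folder from the cache dict.

    A folder's recursive status depends on everything beneath it, so a
    change to one item makes all of its parents stale as well.  Returns
    the removed paths, which a caller could queue for re-crawling.
    """
    removed = []
    while True:
        if cache.pop(path, None) is not None:
            removed.append(path)
        parent = os.path.dirname(path)
        if not parent or parent == path:
            break  # reached the top of the path
        path = parent
    return removed
```

Note that siblings are left alone: only the chain from the changed item up
to the root loses its cached status.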

The "file watcher":

* This uses a win32 mechanism in which the OS allows you to watch an NTFS
directory tree and be asynchronously notified of changes.  Note that this is
not guaranteed to be reliable - too many concurrent changes will cause
notifications to be lost.  It is possible to watch trees recursively (see
MSDN for ReadDirectoryChangesW)

* The file watcher is smart enough to avoid watching multiple directories by
simply watching a parent directory recursively.  This generally means there
will be a single "watcher" for a single subversion checkout, rather than for
each directory in a checkout.  It's not clear if a common non-versioned
parent will be watched (meaning multiple checkouts can share the same
watcher), or if the "root" of each watcher is limited to versioned folders.

* As change notifications are processed, the cache is invalidated, and the
folder is marked again for crawling.  It is important that as little as
possible is done in the notification itself, which is how TSVN implements
things.

* The watcher also registers for "device removal" notifications so removable
devices are handled gracefully.
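
The watcher bookkeeping described above might be sketched in Python like
this.  It is only an illustration: WatchSet and its methods are hypothetical
names, no real OS watch is registered (a real implementation would call
ReadDirectoryChangesW where add() returns True), and device removal is
ignored.

```python
import os
import queue


def is_under(path, root):
    """True if `path` equals `root` or lies somewhere beneath it."""
    root = root.rstrip(os.sep)
    return path == root or path.startswith(root + os.sep)


class WatchSet:
    """Hypothetical bookkeeping for recursive directory watchers."""

    def __init__(self):
        self.roots = set()            # directories watched recursively
        self.pending = queue.Queue()  # changed paths not yet processed

    def add(self, path):
        """Watch `path`; return True only if a new watcher is needed."""
        if any(is_under(path, r) for r in self.roots):
            return False  # already covered by a recursive parent watch
        # A new parent watch makes watches on its sub-directories redundant.
        self.roots = {r for r in self.roots if not is_under(r, path)}
        self.roots.add(path)
        return True

    def on_change(self, changed_path):
        # Called from the notification itself, so do as little as
        # possible: just record the path for later.
        self.pending.put(changed_path)

    def drain(self, cache, crawl_queue):
        """Run outside the notification: invalidate and re-queue folders."""
        while True:
            try:
                path = self.pending.get_nowait()
            except queue.Empty:
                return
            cache.pop(path, None)
            folder = os.path.dirname(path)
            if folder not in crawl_queue:
                crawl_queue.append(folder)
```

Keeping on_change() down to a single queue put is the point: the slower
work happens in drain(), which would run on another thread.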