cssc vs bzr weave

Tue Oct 11 09:46:00 BST 2005

> Anyway, what I found is that bzr weave is much faster than cssc, even
> though it is written in python.

That's nice to see. Unfortunately comparison to GNU CSSC is probably
faint praise; tridge reports that some of the code is very
inefficient.  A comparison to SourcePuller might give a better measure
of a really fast weave algorithm.

I'm not sure what the licence of original SCCS may be; it might be
possible to find a copy on a proprietary unix box.

> I've tried to think of ways to try and improve bzr's O(weavelines)
> behavior. It isn't that bad, but because inventory is now weaved, we
> will always see pathological behavior, because the inventory is modified
> *every* revision.
>
> With the current implementation, even "bzr status" is O(num changes to
> all files in entire history), because we have to extract the last
> committed inventory in order to do a comparison. (Which on my machines
> dwarfs the stat cost, especially with the hashcache).

Right, this is a definite ceiling on the current format.  To some
extent the inventory is a special case, but as you say it's an
unavoidably important special case.  Few other files will
be so large with so many revisions, though it may occur in for
example projects that choose to keep a manually maintained Changelog.

One tactical optimization is to keep a cache of recently accessed
full texts, but that doesn't really tackle the problem so let's
not worry for now.  Another is to have a special extraction method
that extracts as it reads from disk, rather than building everything
into memory.

> With inventory, one could certainly just "start over" every 100 revs or
> so, since we don't actually care about the weavelines (we won't weave
> merge), we just care about a decent method for compacting the history.
> This might be all we need to make bzr work with reasonable speed again.
> (Probably this would mean a slightly different weave format, maybe
> splitting of the revision list into a separate file, so that you could
> determine what texts were labeled with what number).

Right, this is perhaps a decent option for the inventory as a special
case.  It's a bit similar to splitting files across directories on
systems that can't have too many in a single one.

A related approach would be to keep track of where in the weave a
particular version starts and ends, so we don't have to scan the whole
thing but only a relevant subset.  This would tend to make the most
recent revisions faster to access (because they have less insertions
and deletions layered on top), which is nice.  But you'd still need to
read and write the whole thing, so it's not totally satisfactory.

> But this doesn't help weaves in general, it just helps bzr in a specific
> case. So I was wondering what tricks SCCS might have implemented in the
> past, rather than trying to rediscover everything.

As far as I know SCCS, and everything based on that format, has always
been O(weave_lines) and they've just tried to keep the constant
factors down.

> I thought about changing the in-memory representation. Potentially
> keeping some sort of tree data structure, which would let you skip
> regions that do not involve the lines you care about. Though I have some
> difficulty coming up with a structure that doesn't end up just unweaving
> every revision into memory. (Fast access, but you pay for it dearly).
> I don't know if there might be something similar to my append-only
> format, since you could keep a store of texts, and then just the
> add/delete operations. Then rather than scanning linearly through
> memory, you could just rebuild the text from those operations. (You
> would probably need weave independent, not my weave-dependent,
> operations.) Maybe Aaron Bentley's WeaveDiff format would be a better
> starting point.

I think something like weavediff would be a good idea, and we should
try to go to it fairly soon.  I'll write more on aaron's draft in
another thread.

--
Martin