[MERGE/RFC] partial delta generation

Wed Feb 4 11:49:12 GMT 2009

John Arbash Meinel wrote:
> Ian Clatworthy wrote:
>> For logging directories and multiple files, the key
>> technical challenge is fast generation of partial deltas.

> My initial feeling is that this is a bit of a hack, and time would be
> better spent finishing split inventory. Looking at the code changes, it
> is actually rather invasive, as you have a lot of layers you have to go
> through before you can get down to what you care about.

I agree. There are two things I don't like about this patch:

1. It's a pretty nasty violation of the current layering
2. The number of layers that need to change to pass down the new parameter.

The second issue is pretty easy to solve - the top layer could call a
new routine rather than pass a parameter down lots of layers. Each layer
is very thin currently so the new routine would still be quite small.
And we wouldn't need to change lots of *public* APIs and implicitly
commit to supporting them in extended form in the future.
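
As a rough sketch of the "new routine" idea (all class and method names
here are made up for illustration, not existing bzrlib APIs, and the
filtering is deliberately naive):

```python
# Hypothetical layers for illustration only; not real bzrlib classes.

class Store:
    """Lowest layer: holds full deltas keyed by revision id."""

    def __init__(self, deltas):
        # deltas: dict mapping revision_id -> list of (file_id, change)
        self._deltas = deltas

    def get_delta(self, revision_id):
        return list(self._deltas[revision_id])


class Repository:
    """Upper layer: the existing public method keeps its signature."""

    def __init__(self, store):
        self._store = store

    def get_delta(self, revision_id):
        # Unchanged public API: no new parameter threaded through.
        return self._store.get_delta(revision_id)

    def get_partial_delta(self, revision_id, file_ids):
        # New, separate routine called only by the top layer (e.g. log).
        # This sketch filters after loading; a real version would want
        # to filter while reading to get the performance benefit.
        wanted = set(file_ids)
        full = self._store.get_delta(revision_id)
        return [(fid, change) for fid, change in full if fid in wanted]
```

The point is only that get_delta() stays as it is, so no existing
public API is extended.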

The first issue is much harder. Quite frankly, we know we need to change
the lower layers to make the top layers perform and we're doing that via
a new format. Part of my motivation here is to weed out the requirements
of the lower layers further so that the new format is supported by the
necessary APIs to minimise inter-layer friction. We can't make the next
format perfect but it would be sad to introduce it and learn afterwards
that a small change could have made a big difference to (log) features
we hadn't implemented and fully understood.

> I *do* think we want to have an iter_changes() that can pre-filter, as
> that can benefit all formats (workingtree comparison, and split-inventory).

I'd like to know more about your thoughts on this. Part of my experience here
is that things like iter_changes() and changes_from() operate on trees and
it's the tree building which is half the performance problem. Right now, my
profiling of 2a is suggesting that the algorithm is working well - the
bottleneck is reading xml off disk, not the processing thereafter.

> I'm also mostly concerned about how you track what parent entries need
> to be returned, etc.

It's a little tricky, yes. We certainly need higher level parents in
order to build a partial inventory in order to build a partial tree.
But if a higher level parent changes its name, is that relevant when
logging files and directories below it? (I don't have an answer as
to what's correct semantically but the code will match that as a
relevant change right now.)
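
To make the parent-tracking concrete, here is a minimal sketch of
collecting the entries a partial inventory would need. It assumes a
plain dict mapping each file id to its parent id (None for the root);
the function name and data layout are hypothetical, not what the patch
actually does:

```python
def ids_with_parents(parent_of, wanted_ids):
    """Return wanted_ids plus every ancestor directory id.

    parent_of: dict mapping file_id -> parent file_id (None at the root).
    This mapping is an assumption made for the sketch.
    """
    needed = set()
    for file_id in wanted_ids:
        current = file_id
        # Walk up towards the root, stopping early once we reach an
        # ancestor that an earlier walk has already collected.
        while current is not None and current not in needed:
            needed.add(current)
            current = parent_of.get(current)
    return needed
```

Whether a rename of one of those ancestors should then count as a
change to the files below it is exactly the open semantic question.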

> All that said, you have implemented this, and it is potentially a win
> *today* rather than a theoretical win tomorrow (requiring an upgrade in
> the process).

That nicely summarises my work on log: deliver what we need today,
make it as fast as I can, and understand precisely what we need to
do in the future to remove the remaining bottlenecks. I'm conscious of
not spending time on current stuff if and when that time would be better
spent on brisbane-core. In this case, my initial cut (2a) took a day
and my follow-on exploration (2b) took another day and a half. I felt
that was time well spent for a 50% performance gain on a feature
(multiple file and directory logging) that was blocking adoption by
the Emacs developer community. I have *no* doubt as to the importance of
getting the new format running well. I also feel pretty strongly though
that delivering a feature in the 1.12 timeframe on existing formats
is good for our users vs waiting several months until a new format
lands, becomes a default and gets migrated to.

> Again, you've done the work, so it is a sunk cost. It just doesn't feel
> like something we want to continue to maintain as time goes forward. But
> if the code is clear enough for now, I certainly wouldn't block it being
> merged.
> 
> BB:abstain

Thanks for the quick and objective feedback. "abstain with rationale" is
10x more valuable than silence. :-) FWIW, I don't feel the sunk cost was
high here and the development has helped *me* learn what I'd like to see
going forward. I'm very concerned about introducing code we don't want to
maintain though: that's always expensive in terms of both time and reduced
implementation flexibility going forward.

Ian C.