[MERGE/RFC] partial delta generation

John Arbash Meinel john at arbash-meinel.com
Wed Feb 4 14:45:40 GMT 2009


Ian Clatworthy wrote:
> John Arbash Meinel wrote:
>> Ian Clatworthy wrote:
>>> For logging directories and multiple files, the key
>>> technical challenge is fast generation of partial deltas.
> 
>> My initial feeling is that this is a bit of a hack, and time would be
>> better spent finishing split inventory. Looking at the code changes, it
>> is actually rather invasive, as you have a lot of layers you have to go
>> through before you can get down to what you care about.
> 
> I agree. There are two things I don't like about this patch:
> 
> 1. It's a pretty nasty violation of the current layering
> 2. The # of layers that need to change to pass down the new parameter.
> 
> The second issue is pretty easy to solve - the top layer could call a
> new routine rather than pass a parameter down lots of layers. Each layer
> is very thin currently so the new routine would still be quite small.
> And we wouldn't need to change lots of *public* APIs and implicitly
> commit to supporting them in extended form in the future.
> 
> The first issue is much harder. Quite frankly, we know we need to change
> the lower layers to make the top layers perform and we're doing that via
> a new format. Part of my motivation here is to weed out the requirements
> of the lower layers further so that the new format is supported by the
> necessary APIs to minimise inter-layer friction. We can't make the next
> format perfect but it would be sad to introduce it and learn afterwards
> that a small change could have made a big difference to (log) features
> we hadn't implemented and fully understood.
> 
>> I *do* think we want to have an iter_changes() that can pre-filter, as
>> that can benefit all formats (workingtree comparison, and split-inventory).
> 
> I'd like to know more about your thoughts on this. Part of my experience here
> is that things like iter_changes() and changes_from() operate on trees and
> it's the tree building which is half the performance problem. Right now, my
> profiling of 2a is suggesting that the algorithm is working well - the
> bottleneck is reading xml off disk, not the processing thereafter.
> 

Which is why we made CHKRevisionTree in brisbane-core *not* load its
whole inventory, and what we should have done a while ago to RevisionTree.

More notably, we don't currently have a "give me the changes between
revision X and Y" api which doesn't require loading the whole
inventories of both revisions. (iter_changes lives on Tree, and
RevisionTree seems to expect to always have a valid, fully populated
self.inventory.)

So the more difficult, but more correct, solution is to upgrade the
intelligence of RevisionTree, so that it can have a lazily loaded
inventory, etc.
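
As a strawman, the lazy loading could look something like this
(LazyRevisionTree is hypothetical; Repository.get_inventory() is the
existing call that does the actual read-and-parse):

class LazyRevisionTree(object):
    """Sketch: a RevisionTree that defers reading its inventory."""

    def __init__(self, repository, revision_id):
        self._repository = repository
        self._revision_id = revision_id
        self._inventory = None

    @property
    def inventory(self):
        # Only pay the read-and-parse cost when someone actually
        # asks for the full inventory; cheap queries never trigger it.
        if self._inventory is None:
            self._inventory = self._repository.get_inventory(
                self._revision_id)
        return self._inventory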

An easier fix is to add Repository.iter_changes(rev1, rev2, ...), though
if that isn't the api we want to continue using once brisbane-core
lands, then it is a bit of a dead end.
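
Something like this (the helper name and keyword are assumptions;
revision_trees() and iter_changes(specific_files=...) exist today):

def repository_iter_changes(repo, rev1, rev2, specific_files=None):
    """Yield iter_changes tuples between two revisions of repo.

    Today this still loads both full inventories; the point of a
    repository-level entry point is that the implementation can be
    swapped later (e.g. for brisbane-core page filtering) without
    another public api change.
    """
    tree1, tree2 = repo.revision_trees([rev1, rev2])
    return tree2.iter_changes(tree1, specific_files=specific_files)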

Another reasonable solution is to have a new class "PartialRevisionTree"
which realizes that it is not a complete solution, and refuses to act
like a full RevisionTree. For example, it shouldn't allow direct access
to self.inventory because we know it is not a complete inventory.
(prt.inventory[random_file_id] may fail because it isn't loaded, not
because the file id really isn't in the revision).

Then change the api so you use:

repo.partial_revision_tree(revid, file_list)
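
Roughly (all names hypothetical, just to show the "refuse to act like
a full tree" behaviour):

class PartialInventoryError(KeyError):
    """The entry may exist, but was never loaded."""


class PartialRevisionTree(object):

    def __init__(self, revision_id, loaded_entries):
        self._revision_id = revision_id
        # file_id -> InventoryEntry: only the requested files plus
        # the parent directories needed to build their paths.
        self._entries = loaded_entries

    def get_entry(self, file_id):
        try:
            return self._entries[file_id]
        except KeyError:
            # We cannot say "not present in this revision"; we only
            # know the entry was never loaded.
            raise PartialInventoryError(file_id)

    @property
    def inventory(self):
        # Refuse direct access: a partial inventory that silently
        # drops entries is worse than an explicit error.
        raise AttributeError(
            "PartialRevisionTree does not expose a full inventory")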

Then again, we don't want to do any of this when brisbane-core lands.
Instead we just want:

tree1, tree2 = repo.revision_trees([rev1, rev2])
tree1.iter_changes(tree2, file_ids=XXX/paths=YYY)

Because the brisbane-core iter_changes can easily be written to filter
while searching the pages, rather than after computing everything.

For example:
  1) If we use hash-tries (likely), then if we have a set of file_ids
     that we actually care about, we can restrict our hash-trie
     comparison to only those pages that could be mapped by
     hash(file_id). (A toy sketch follows this list.)

  2) If we have paths, brisbane-core already has a convenient way to map
     those back into file-ids without having to load the whole
     inventory, and we can then apply them via step (1).

  3) If you want to track a path back over history, it is possible for
     brisbane-core to know if the tree shape under a given directory has
     changed without having to read anything. (if the sha1 hash of the
     root parent_id_basename=>file_id map is the same, then we know that
     there have been no adds, deletes, renames, etc.)

     So even if we want a 2-pass algorithm, the first pass can be done
     extremely quickly in the common cases by just looking at the one
     map. (In a common case, the root sha1 will be the same for all
     revisions, and we will only need to look at one tree-shape to map
     paths => file_ids)
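
Here is the toy illustration of point (1). Real CHK maps use a
different hash and fan out adaptively, so the fixed one-level, 16-way
fan-out below is purely for clarity:

from hashlib import sha1

def page_for(file_id, fan_out=16):
    """Map a file id to the single top-level page that can hold it."""
    return int(sha1(file_id).hexdigest(), 16) % fan_out

def interesting_pages(file_ids, fan_out=16):
    """The only pages that need reading for this filter set."""
    return set(page_for(fid, fan_out) for fid in file_ids)

def compare_tries(pages1, pages2, file_ids):
    """Yield (file_id, old, new) where the mapped value differs.

    pages1/pages2 map page index -> {file_id: value} and stand in
    for lazily read trie pages; only the pages in
    interesting_pages(file_ids) are ever consulted.
    """
    for fid in sorted(file_ids):
        idx = page_for(fid)
        old = pages1.get(idx, {}).get(fid)
        new = pages2.get(idx, {}).get(fid)
        if old != new:
            yield fid, old, new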

Though in the end, this tells me that a 2-tree api
(Tree.iter_changes(other_tree)) is insufficient for something like the
"give me all trees that modified file X" sort of query.

Which, IMO, means that it should probably be a function on Repository,
something akin to "get_deltas_for_revisions" (or possibly that function
with extra arguments).
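
Sketched against today's api (the wrapper is hypothetical;
get_revisions(), get_deltas_for_revisions() and the TreeDelta lists
are existing bzrlib):

def revisions_modifying(repo, revision_ids, file_id):
    """Yield the ids from revision_ids whose delta touches file_id.

    This still computes whole deltas; a brisbane-core version could
    answer the same question from the hash-trie pages instead.
    """
    revisions = repo.get_revisions(revision_ids)
    deltas = repo.get_deltas_for_revisions(revisions)
    for rev, delta in zip(revisions, deltas):
        if _delta_touches(delta, file_id):
            yield rev.revision_id

def _delta_touches(delta, file_id):
    # TreeDelta items carry the file id at index 1 for added, removed
    # and modified, and at index 2 for renamed (old_path, new_path,
    # file_id, ...).
    for item in delta.added + delta.removed + delta.modified:
        if item[1] == file_id:
            return True
    return any(item[2] == file_id for item in delta.renamed)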

>> I'm also mostly concerned about how you track what parent entries need
>> to be returned, etc.
> 
> It's a little tricky, yes. We certainly need higher level parents in
> order to build a partial inventory, and in turn a partial tree.
> But if a higher level parent changes its name, is that relevant when
> logging files and directories below it? (I don't have an answer as
> to what's correct semantically, but the code will match that as a
> relevant change right now.)
> 
>> All that said, you have implemented this, and it is potentially a win
>> *today* rather than a theoretical win tomorrow (requiring an upgrade in
>> the process).
> 
> That nicely summarises my work on log: deliver what we need today,
> make it as fast as I can, and understand precisely what we need to
> do in the future to remove the remaining bottlenecks. I'm conscious of
> not spending time on current stuff if and when that time would be better
> spent on brisbane-core. In this case, my initial cut (2a) took a day
> and my follow-on exploration (2b) took another day and a half. I felt
> that was time well spent for a 50% performance gain on a feature
> (multiple file and directory logging) that was blocking adoption by
> the Emacs developer community. I have *no* doubt as to the importance of
> getting the new format running well. I also feel pretty strongly, though,
> that delivering a feature in the 1.12 timeframe on existing formats
> is good for our users vs waiting several months until a new format
> lands, becomes the default and gets migrated to.
> 
>> Again, you've done the work, so it is a sunk cost. It just doesn't feel
>> like something we want to continue to maintain as time goes forward. But
>> if the code is clear enough for now, I certainly wouldn't block it being
>> merged.
>>
>> BB:abstain
> 
> Thanks for the quick and objective feedback. "abstain with rationale" is
> 10x more valuable than silence. :-) FWIW, I don't feel the sunk cost was
> high here and the development has helped *me* learn what I'd like to see
> going forward. I'm very concerned about introducing code we don't want to
> maintain, though: that's always expensive in terms of both time and reduced
> implementation flexibility.
> 
> Ian C.
> 

So it is abstain primarily because of the invasiveness of the change,
and the likelihood that it isn't what we'll want in 6 months. If we can
refine it a bit to be closer to a 6-month goal, then I think I would be
happier about it. (Also, I haven't yet looked closely enough at the
actual code to give it a proper review.)

John
=:->
