[bug 277537] [REVIEW] A highly specific reconcile

Vincent Ladeuil v.ladeuil+lp at free.fr
Fri Oct 17 14:15:08 BST 2008


Thanks for the review! I hope my remarks below clarify things,
but I need some explanations in return.

>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

    john> Vincent Ladeuil wrote:
    >> Hi all,
    >> 
    >> This needs to be reviewed without BB support (it's a plugin, not
    >> a patch against bzr itself), please keep that in mind if you want
    >> to track it.
    >> 
    >> Bug https://bugs.launchpad.net/bzr/+bug/277537 is about a
    >> repository in which some file revisions have inconsistent
    >> parents.
    >> 
    >> This can be addressed by bzr check/bzr reconcile, except that
    >> in that case bzr check requires 32 hours to run and bzr
    >> reconcile 20 hours (jam provided a patch to cache a bit more
    >> aggressively for both check and reconcile (thanks for that)
    >> but in the end it made my VM thrash to the point I had to
    >> kill it (lucky/unlucky, who knows?)).

    john> Well the specific problem for "bzr reconcile" is the function
    john> "_generate_text_key_index".

Especially for that bug, yes. That's why the intent was to
totally avoid it.

    john> It is a problem because it has to extract the inventory
    john> for every revision in history, sometimes more than
    john> once.  We end up deserializing() the same xml entry
    john> many many times. And this does not scale well at
    john> all. Something like O(num_files * num_revisions)

    john> It would have been nice if we could have snuck in a
    john> form that used something more minimal, like
    john> "_find_text_key_references_from_xml_inventory_lines". I
    john> honestly think we *could* build something up on top of
    john> iter_lines_added_or_present_in_keys().
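
For concreteness, here is how I read that suggestion; a rough
sketch from memory only, so the exact bzrlib calls may need
checking:

    from bzrlib import branch

    repo = branch.Branch.open('.').repository
    repo.lock_read()
    try:
        # One key per revision whose inventory we want to scan.
        inv_keys = [(rev_id,) for rev_id in repo.all_revision_ids()]
        # Stream the raw inventory XML lines instead of
        # deserializing every inventory object...
        line_iter = repo.inventories.iter_lines_added_or_present_in_keys(
            inv_keys)
        # ... and extract the (file_id, revision_id) text keys each
        # inventory refers to.
        text_refs = repo._find_text_key_references_from_xml_inventory_lines(
            line_iter)
    finally:
        repo.unlock()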

While working on check/reconcile and discussing with Martin, I
came to the following conclusion:

check/reconcile are slow *by design* and I don't think we should
even *try* to address that. The emphasis is on correctness; any
optimization will look for shortcuts, at the risk of missing
errors.

As with every rule, there can be exceptions...

But in this case I find it more reasonable to address specific
problems with specific solutions, so that the effort goes into
solving the root causes that introduced the problem instead of
into making check/reconcile smarter and faster (which is by
definition at odds with correctness and clarity).

    >> Since check/reconcile has to be done for any repository in the
    >> wild, I thought it may be wise to provide a quicker way.
    >> 

    john> I think your plugin is very reasonable for the mysql
    john> case, I'm not sure it is a great general case solution.

100% agreement, and I'll go further: this is certainly not a
great general-case solution (the plugin name is not random, it
carries the idea that nobody should even think about using it
outside the scope of that bug). At best, one can argue that it
can address inconsistent parents of all kinds (not only swapped
pairs like here). But I'm not even sure about that, and I hope
the review process can make that clear.

Whether or not it can be used elsewhere, what I care about here
is fixing the bug. This requires a fix that can be deployed
without wasting resources. If that means crappy design and crappy
code, I don't care as long as it *fixes* the issue.

I chose to write a plugin for two main reasons:

1) Since the bug is in the data, I wanted a fix that can be
   applied without requiring an updated bzr (i.e. people should
   be able to fix the problem with their current bzr)

2) I wanted to be free to be as dirty as needed to be fast.

    >> An important point to understand is that this bug is about
    >> data produced during a conversion and not a bug about how
    >> bzr uses that data. Even if one can argue about the latter,
    >> what seemed more important to me was to fix the data, at
    >> least because users encountering that bug may not want to
    >> upgrade bzr to get it fixed.
    >> 
    >> So I wrote lp:~vila/bzr/fix277537, a simple plugin that can fix
    >> the inconsistent parents by using a description of which file
    >> revisions need to be fixed.
    >> 
    >> These descriptions are acquired from bzr check which
    >> should still be run once (32 hours, but only once), then,
    >> the fix can be applied to all repositories:
    >> 
    >> time -p bzr fix277537 ~/.bazaar/plugins/fix277537/buggy-file-revids-081011 
    >> Will try to fix 46229 wrong parents
    >> Fixed 46229 wrong parents
    >> 0 wrong parents were already fixed
    >> real 312.96
    >> user 249.93
    >> sys 30.98
    >> 
    >> So 73919 (reconcile) / 312 (fix277537) gives a 23961% improvement
    >> in time :-)

    john> 312s... how long does a plain 'bzr pack' take?

230s, same ballpark (when compared to reconcile times): that
plugin is nothing more than a repack with a twist (a bit costly
compared to 'bzr pack' but quite cheap compared to 'bzr
reconcile'). 'bzr pack' was the time limit I knew I couldn't beat
but was hoping to approach.

    john> Probably about right since 'bzr pack' on bzr.dev takes
    john> about 40s for 100MB here, and they have at least 600MB.

Yup, 630M.

    >> 
    >> I was a bit concerned about modifying a repository that
    >> way until I realized that I was, indeed, doing the same
    >> work as reconcile itself: updating meta-data, not
    >> revisions themselves.

    john> Yeah, I think the basic overview is fine.

Ok. I'll update the plugin doc string then.

    >> 
    >> 312 seconds seems an acceptable delay for any user of
    >> these repositories during which he can't commit in any of
    >> his branches, it also sounds quick enough to be applied on
    >> shared repositories on servers without getting them
    >> off-line.
    >> 
    >> I checked that the annotations are now correct but I'd like that
    >> code to be reviewed before releasing it.
    >> 
    >> The most important revisions to be reviewed are:
    >> 
    >> - http://bazaar.launchpad.net/~vila/bzr/fix277537/revision/6
    >> which gets the needed time from 20 hours down to 2h10,
    >> 
    >> - http://bazaar.launchpad.net/~vila/bzr/fix277537/revision/10
    >> which gets it down to 312 seconds.
    >> 
    >> But an overall review will be welcome too regarding the way the
    >> plugin can be used (some of the broken repositories are hosted in
    >> launchpad and if some LOSA can fix them locally, that will be
    >> greatly appreciated).
    >> 
    >> Vincent
    >> 
    >> 

    john> Here are some comments:

    john>     def _copy_text_texts(self):
    john>         """generate what texts we should have and then copy."""
    john>         self.pb.update("Copying content texts", 3)
    john>         # we have three major tasks here:
    john>         # 1) generate the ideal index
    john>         repo = self._pack_collection.repo
    john>         ancestors = dict([(key[0], tuple(ref[0] for ref in refs[0])) for
    john>             _1, key, _2, refs in
    john>             self.new_pack.revision_index.iter_all_entries()])

    john> ^- You don't actually use this anymore, so you don't need to create it.

Yes, sorry, I should have been more careful in my cleanup; on the
other hand, I wanted the trail back to the original
_copy_text_texts to stay obvious for review.

I'll fix that.

    john> Fundamentally it comes down to having the right data here:
    john>         ideal_index = {}
    john>         wrong_index = {}
    john>         for key, correct, wrong in self._file_rev_ids:
    john>             ideal_index[key] = correct
    john>             wrong_index[key] = wrong

    john> Whatever is put in _file_rev_ids is pretty critical,
    john> but I assume you generated that from the
    john> "_generate_text_key_index()" and comparing it to the
    john> actual text key index.

Yes, the data is acquired from a 'bzr check' .bzr.log by a
script; I will add that to the plugin doc string.
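
For illustration, a parser along these lines would do; note that
the log line format below is invented (the real .bzr.log wording
differs), it only shows the shape of the (key, correct, wrong)
triples the plugin consumes:

    import re
    from ast import literal_eval

    # Invented format, NOT the real .bzr.log wording:
    #   wrong parents for ('f-id', 'rev-2'): ['rev-1', 'rev-0'] instead of ['rev-0', 'rev-1']
    _WRONG_PARENTS_RE = re.compile(
        r"wrong parents for \('([^']+)', '([^']+)'\): "
        r"(\[[^\]]*\]) instead of (\[[^\]]*\])")

    def parse_check_log(log_path):
        """Yield ((file_id, rev_id), correct, wrong) for each bad text."""
        log = open(log_path)
        try:
            for line in log:
                match = _WRONG_PARENTS_RE.search(line)
                if match is None:
                    continue
                file_id, rev_id, wrong, correct = match.groups()
                yield ((file_id, rev_id),
                       tuple(literal_eval(correct)),
                       tuple(literal_eval(wrong)))
        finally:
            log.close()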

    john>         # 2) generate a text_nodes list that contains
    john>         #    all the deltas that can be used as-is, with
    john>         #    corrected parents.
    john>         ok_nodes = []
    john>         bad_texts = []
    john>         discarded_nodes = []
    john>         NULL_REVISION = _mod_revision.NULL_REVISION
    john>         text_index_map, text_nodes = self._get_text_nodes()

    john> ^- You don't actually use "bad_texts" or "discarded_nodes" anymore.
    john> And ok_nodes is a bit of a misnomer, as it is just the nodes you are
    john> going to end up processing, with parents fixed.

Same as above, I will fix that too. But some ok nodes are not
touched at all, so I may need to do better than just renaming.

    john> There is some concern here:
    john>                         ok_nodes.append((node[0], node[1], node[2],
    john>                                          (ideal_parents, node[3][1])))

    john> ^- Basically, you've now changed our previous
    john> invariant, and line deltas will now possibly be against
    john> a non left-hand parent. Am I correct in guessing that?

Yes and no. 

The bug is indeed that parents were swapped (I should have
mentioned that *all* inconsistent parents identified for that
bug are tuples of size 2 and swapped).

So the fix restores the correct order, relying on valid data in
_file_rev_ids (that's for the 'yes').

Now you can use it to create more instances of this bug (that's
for the 'no').

But since the optimization is based on executing
_generate_text_key_index() only once (during check) so that the
produced data can be re-used several times, there is a strong
requirement to provide the right data.

The safeguards I put in place are (see the sketch after the
list):

- the input data describes both the expected (wrong) parents and
  the correct ones (as determined by check), so that we fix the
  parents only if we find the wrong ones,

- an additional check that we are not trying to fix it twice.
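
In code, the check looks roughly like this (illustrative names,
not the plugin's actual code):

    def fixed_parents(key, node_parents, wrong_index, ideal_index):
        """Return corrected parents for key, or None if already fixed."""
        wrong = wrong_index[key]
        correct = ideal_index[key]
        if node_parents == wrong:
            # First safeguard: rewrite only when we see the
            # expected wrong parents.
            return correct
        elif node_parents == correct:
            # Second safeguard: already fixed, don't touch it again.
            return None
        else:
            # Anything else means the input data does not describe
            # this repository: refuse to guess.
            raise AssertionError('unexpected parents %r for %r'
                                 % (node_parents, key))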

    john> I would guess that a change like this has not permeated
    john> the rest of our codebase. We would *like* it to, but I
    john> don't know if it has yet.

    john> I'm thinking of stuff like "annotate" passing in
    john> 'left_matching_blocks', only now those actually match
    john> the right-hand-parent.

I think that's what happens *now*, and that it is the key thing
confusing annotate.

    john> I think we can fix annotate with a line like:

    john>    fulltext = self._add_fulltext_content(rev_id, fulltext_content)
    john>    if compression_parent == parent_ids[0]:
    john>      blocks = KnitContent.get_line_delta_blocks(delta, parent_fulltext,
    john>                   fulltext)
    john>    else:
    john>      # We can only re-use blocks if they match the left parent
    john>      blocks = None

    john> One way to check this would be to inspect the .tix
    john> files directly, or to debug 'ideal_parents' versus
    john> 'node[3][1]')

    john> Also, a more correct fix would probably be:

    john>     ok_nodes.append((node[0], node[1], node[2],
    john>                      (ideal_parents,) + node[3][1:]))
    john> Just in case there are ever more than 2 ref lists.

Gee... That one needs to be back-ported to reconcile!

    john> So I'm not opposed to allowing non-left-hand
    john> compression parents, as it can help shake out any other
    john> bugs we might have in the code. However, we need to at
    john> least think about where bugs might crop up.

I'm not sure I follow you there; we need to discuss that. Chat
sounds more appropriate, see you there.

    john> And back to the original topic about fixing
    john> _generate_text_key_index, consider this... We already
    john> know what nodes we want *present* in the graph, because
    john> we have the 'text_refs'. We already know the overall
    john> shape of the graph, because it is the whole-tree
    john> revision graph.

    john> So what we want is to have an algorithm that takes a
    john> whole tree graph, disables all but certain nodes, and
    john> leaves the topology intact. Couldn't that be done
    john> something like:

Maybe, but keep in mind that for *that* bug, we're talking about
~46,000 inconsistent parents spread among 11066 (out of 62471)
revisions for 3495 (out of 18940) files.

So whatever trick we find to speed up _generate_text_key_index(),
I'm afraid it will still need *some* time.

Now, considering that not a single one of those 46229
inconsistent parents was produced by bzr itself but by the
converter (setting aside the fact that bzr being tricked by the
converter can itself be considered a bug), I would very much
prefer that we spend the effort on fixing the converter (or on
making sure current bzr cannot be tricked) instead of on making
check/reconcile smarter, as said above.
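
For reference, my reading of "disable all but certain nodes and
leave the topology intact" is something like the sketch below.
Illustrative only: it is recursive for clarity, and a real
version would need to be iterative to survive a graph as deep as
this one.

    def contract_graph(parent_map, keep):
        """Restrict a revision graph to 'keep', preserving ancestry.

        parent_map maps each node to a tuple of its parents; the
        new parents of a kept node are its nearest kept ancestors.
        """
        cache = {}

        def nearest_kept(node):
            if node in cache:
                return cache[node]
            found = []
            for parent in parent_map.get(node, ()):
                if parent in keep:
                    found.append(parent)
                else:
                    found.extend(nearest_kept(parent))
            # Deduplicate while preserving order.
            seen = set()
            result = tuple(p for p in found
                           if not (p in seen or seen.add(p)))
            cache[node] = result
            return result

        return dict((node, nearest_kept(node)) for node in keep)

    # contract_graph({'b': ('a',), 'c': ('b',)}, set(['a', 'c']))
    # gives {'a': (), 'c': ('a',)}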

<snip/>

    john> I'm stopping this short because Martin and I just
    john> chatted briefly and I'd like to start a new email for
    john> those ideas.

          Vincent


