[bug 277537] [REVIEW] A highly specific reconcile

John Arbash Meinel john at arbash-meinel.com
Thu Oct 16 23:19:32 BST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Vincent Ladeuil wrote:
> Hi all,
> 
> This needs to be reviewed without BB support (it's a plugin, not
> a patch against bzr itself), please keep that in mind if you want
> to track it.
> 
> Bug https://bugs.launchpad.net/bzr/+bug/277537 is about a
> repository in which some file revisions have inconsistent
> parents.
> 
> This can be addressed by bzr check/bzr reconcile except that in
> that case bzr check requires 32 hours to run and bzr reconcile 20
> hours (jam provided a patch to cache a bit more aggressively for
> both check and reconcile (thanks for that) but in the end, it
> made my VM trashed to the point I had to kill it (lucky/unlucky
> who knows ?)).

Well the specific problem for "bzr reconcile" is the function
"_generate_text_key_index". It is a problem because it has to extract
the inventory for every revision in history, sometimes more than once.
We end up deserializing() the same xml entry many many times. And this
does not scale well at all. Something like O(num_files * num_revisions)

It would have been nice if we could have snuck in a form that used
something more minimal, like
"_find_text_key_references_from_xml_inventory_lines". I honestly think
we *could* build something up on top of
iter_lines_added_or_present_in_keys().

> 
> Since check/reconcile has to be done for any repository in the
> wild, I thought it may be wise to provide a quicker way.
> 

I think your plugin is very reasonable for the mysql case, I'm not sure
it is a great general case solution.

> An important point to understand is that this bug is about data
> produced during a conversion and not a bug about how bzr use that
> data. Even the if one can argue about the the later, what seemed
> more important to me was to fix the data at least because users
> encountering that bug may not want to upgrade bzr to get it
> fixed.
> 
> So I wrote lp:~vila/bzr/fix277537, a simple plugin that can fix
> the inconsistent parents by using a description of which file
> revisions need to be fixed.
> 
> These descriptions are acquired from bzr check which should still
> be run once (32 hours, but only once), then, the fix can be
> applied to all repositories:
> 
> time -p bzr fix277537 ~/.bazaar/plugins/fix277537/buggy-file-revids-081011 
> Will try to fix 46229 wrong parents
> Fixed 46229 wrong parents
> 0 wrong parents were already fixed
> real 312.96
> user 249.93
> sys 30.98
> 
> So 73919 (reconcile) / 312 (fix277537) gives a 23961% improvement
> in time :-)

312s... how long does a plain 'bzr pack' take?

Probably about right since 'bzr pack' on bzr.dev takes about 40s for
100MB here, and they have at least 600MB.

> 
> I was a bit concerned about modifying a repository that way until
> I realized that I was, indeed, doing the same work as reconcile
> itself: updating meta-data, not revisions themselves.

Yeah, I think the basic overview is fine.

> 
> 312 seconds seems an acceptable delay for any user of these
> repositories during which he can't commit in any of his branches,
> it also sounds quick enough to be applied on shared repositories
> on servers without getting them off-line.
> 
> I checked that the annotations are now correct but I'd like that
> code to be reviewed before releasing it.
> 
> The most important revisions to be reviewed are:
> 
> - http://bazaar.launchpad.net/~vila/bzr/fix277537/revision/6
>   which get the needed time from 20 hours down to 2h10,
> 
> - http://bazaar.launchpad.net/~vila/bzr/fix277537/revision/10
>   which get it down to 312 seconds.
> 
> But an overall review will be welcome too regarding the way the
> plugin can be used (some of the broken repositories are hosted in
> launchpad and if some LOSA can fix them locally, that will be
> greatly appreciated).
> 
>      Vincent
> 
> 

Here are some comments:

    def _copy_text_texts(self):
        """generate what texts we should have and then copy."""
        self.pb.update("Copying content texts", 3)
        # we have three major tasks here:
        # 1) generate the ideal index
        repo = self._pack_collection.repo
        ancestors = dict([(key[0], tuple(ref[0] for ref in refs[0])) for
            _1, key, _2, refs in
            self.new_pack.revision_index.iter_all_entries()])

^- You don't actually use this anymore, so you don't need to create it.

Fundamentally it comes down to having the right data here:
        ideal_index = {}
        wrong_index = {}
        for key, correct, wrong in self._file_rev_ids:
            ideal_index[key] = correct
            wrong_index[key] = wrong

Whatever is put in _file_rev_ids is pretty critical, but I assume you
generated that from the "_generate_text_key_index()" and comparing it to
the actual text key index.


        # 2) generate a text_nodes list that contains all the deltas
that can
        #    be used as-is, with corrected parents.
        ok_nodes = []
        bad_texts = []
        discarded_nodes = []
        NULL_REVISION = _mod_revision.NULL_REVISION
        text_index_map, text_nodes = self._get_text_nodes()

^- You don't actually use "bad_texts" or "discarded_nodes" anymore.
And ok_nodes is a bit of a misnomer, as it is just the nodes you are
going to end up processing, with parents fixed.


There is some concern here:
                        ok_nodes.append((node[0], node[1], node[2],
                                         (ideal_parents, node[3][1])))

^- Basically, you've now changed our previous invariant, and line deltas
will now possibly be against a non left-hand parent. Am I correct in
guessing that?

I would guess that a change like this has not permeated the rest of our
codebase. We would *like* it to, but I don't know if it has yet.

I'm thinking of stuff like "annotate" passing in 'left_matching_blocks',
only now those actually match the right-hand-parent.

I think we can fix annotate with a line like:

   fulltext = self._add_fulltext_content(rev_id, fulltext_content)
   if compression_parent == parent_ids[0]:
     blocks = KnitContent.get_line_delta_blocks(delta, parent_fulltext,
                  fulltext)
   else:
     # We can only re-use blocks if they match the left parent
     blocks = None

One way to check this would be to inspect the .tix files directly, or to
debug 'ideal_parents' versus 'node[3][1]')

Also, a more correct fix would probably be:

    ok_nodes.append((node[0], node[1], node[2],
                     (ideal_parents,) + node[3][1:]))

Just in case there are ever more than 2 ref lists.


So I'm not opposed to allowing non-left-hand compression parents, as it
can help shake out any other bugs we might have in the code. However, we
need to at least think about where bugs might crop up.




And back to the original topic about fixing _generate_text_key_index,
consider this... We already know what nodes we want *present* in the
graph, because we have the 'text_refs'. We already know the overall
shape of the graph, because it is the whole-tree revision graph.

So what we want is to have an algorithm that takes a whole tree graph,
disables all but certain nodes, and leaves the topology intact. Couldn't
that be done something like:

1) Start with whole graph, nodes which modified the file in question are
CAPS.

 A
 |\
 b c
 | |\
 D e F
 | |/
 g H
 |/
 I

2) Derive the child information for each node. So we now have A => (b,
c) as well as b => (A,), c => (A,)

3) For each node, if it is not in the per-file set, remove it by
replacing references in its children with references to its parents. The
final per-file graph needs to look like:

 A
 |\
 D F
 | |
 | H
 |/
 I

Note that there is a small issue for 'H'. The whole tree graph has 2
parents, but there is only 1 direct parent for (file, H). So we probably
 need to do something like a 'heads()' check, so we realize that A
should not be included as a parent.

3-alt) Alternatively, we could start with the final nodes (since it is
likely to be a smaller set). And then track back into ancestors until we
get to nodes that are in the file set.


I'm stopping this short because Martin and I just chatted briefly and
I'd like to start a new email for those ideas.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkj3vfQACgkQJdeBCYSNAAPFYgCdE1Bps+c56BIHcvhvQaf8d2s9
ObIAn1M+NGgtm/aPB/zVoYa9ADn6gQfu
=infx
-----END PGP SIGNATURE-----



More information about the bazaar mailing list