[MERGE] Reconcile can fix bad parent references

Aaron Bentley aaron.bentley at utoronto.ca
Mon Aug 27 01:58:12 BST 2007


Robert Collins wrote:
> The NEWS entry could be a little more informative. Perhaps add in that
> these were caused back in the bzr 0.8 timeframe and that any branches
> created back then should run 'bzr check' asap.

OK

> This idiom:
> Won't work with packs, because there is no write group. If you insert
> repo.lock_write()
> repo.start_write_group()
> ...
> repo.commit_write_group()
> repo.unlock()
> 
> around your manual data insertion it will work fine.

I can do that, but I thought we wouldn't need this for packs.
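
For reference, expanded around the manual insertion that would look roughly
like this (the helper name is just a stand-in for the insertion code):

    repo.lock_write()
    try:
        repo.start_write_group()
        try:
            # the manual data insertion goes here
            insert_corrected_texts(repo)
        except:
            repo.abort_write_group()
            raise
        repo.commit_write_group()
    finally:
        repo.unlock()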

> A note here that the revisions are checked first so that XXXX would be
> good. I'm guessing that the reason is so that the data in the revision
> index is trustable.

It is so that the revision_versions dict is pre-populated.
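
Roughly, revision_versions maps each revision to the text version recorded
for each file in that revision's inventory, e.g. (ids made up):

    revision_versions = {
        'rev-2': {'file-a': 'rev-1',   # file-a's text last changed in rev-1
                  'file-b': 'rev-2'},  # file-b's text introduced by rev-2
    }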

>>      def plan_revisions(self):
>>          repository = self.repository
>> -        self.planned_revisions = set(repository.all_revision_ids())
>> +        self.planned_revisions = repository.all_revision_ids()
>>          self.progress.clear()
>>          inventoried = set(self.inventory_weave.versions())
>> -        awol = self.planned_revisions - inventoried
>> +        awol = set(self.planned_revisions) - inventoried
>>          if len(awol) > 0:
>>              raise BzrCheckError('Stored revisions missing from inventory'
>>                  '{%s}' % ','.join([f for f in awol]))
>> -        self.planned_revisions = list(self.planned_revisions)
> 
> What do the changes to plan_revisions accomplish?

They reduce silly conversions.  In the old code, we start with a list,
convert it to a set, and then convert it back to a list.

This just leaves it as a list.
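
In other words, the old flow was roughly:

    revs = repository.all_revision_ids()    # a list
    planned = set(revs)                     # converted to a set
    awol = planned - inventoried            # the only place the set is needed
    planned = list(planned)                 # ...and converted back to a list

whereas now the list is kept as-is and a set is built only for the awol
subtraction.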

> This uses the "nodes of the file graphs are nodes of the revision graph"
> property that John and I are debating whether we can use - because the
> edges are different.

> Until we decide that that will give identical results we
> should stay with the per file graph.

If I didn't know the knit data was untrustworthy, I could use that.  But
I don't see how I can use corrupt parent data to repair itself.  I can't
even know whether I'm iterating in topological order...
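
Any ordering I could trust would have to come from the revision graph
itself, along these lines (illustrative):

    from bzrlib.tsort import topo_sort
    # The revision parents come from the revision texts, which check has
    # already verified, so this ordering is trustworthy; the per-file knit
    # graph is exactly the data under suspicion.
    order = topo_sort(repository.get_revision_graph().items())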

>> === modified file 'bzrlib/repository.py'
>> --- bzrlib/repository.py        2007-08-22 05:28:32 +0000
>> +++ bzrlib/repository.py        2007-08-24 18:15:50 +0000
>> @@ -1112,6 +1112,66 @@
>>                  [parents_provider,
>>                   other_repository._make_parents_provider()])
>>          return graph.Graph(parents_provider)
>>  
>> +    def _add_revision_text_version(self, tree, revision_versions):
>> +        inv_revisions = {}
>> +        revision_versions[tree.get_revision_id()] = inv_revisions
>> +        for path, entry in tree.iter_entries_by_dir():
>> +            inv_revisions[entry.file_id] = entry.revision
>> +        return inv_revisions
> 
> This doesn't access self at all: What made repository seem like the best
> place to put it ?

It was a helper for Repository.find_bad_ancestors.

> This is what, a cache of revisions -> text version
> mappings ? I'd be inclined to either have a standalone function

I never thought I'd see the day.

>, or an
> object to represent the cache if you have more than one function related
> to this data structure. Looking at find_bad_ancestors, the cache is used
> and added to there as well, so I would really prefer to see a class for
> the cache, probably constructed with the repository as a data source:
> 
> FooCache:
>     init(repository)
>     cache(tree)
>     get_text_version(file_id, revision_id)
> 
> This gives a home for the inner method in find_bad_ancestors

Sigh.  Alright.
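
Something along these lines, then (a sketch of the interface you describe;
names are not final):

    class _TextVersionCache(object):
        """Cache of revision_id -> {file_id: text version} from a repository."""

        def __init__(self, repository):
            self._repository = repository
            self._cache = {}

        def cache(self, tree):
            """Record the text versions in a revision tree's inventory."""
            inv_revisions = {}
            for path, entry in tree.iter_entries_by_dir():
                inv_revisions[entry.file_id] = entry.revision
            self._cache[tree.get_revision_id()] = inv_revisions
            return inv_revisions

        def get_text_version(self, file_id, revision_id):
            """Return the text version recorded for file_id in revision_id."""
            if revision_id not in self._cache:
                self.cache(self._repository.revision_tree(revision_id))
            return self._cache[revision_id].get(file_id)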

>> +    def find_bad_ancestors(self, revision_ids, file_id, versionedfile,
>> +                          revision_versions, parents_provider=None):
>> +        """Search the versionedfile for ancestors that are not 
> 
> Just speculating here:
> likewise find_bad_ancestors doesn't seem to fit the general Repository
> API to me. Specifically, for packs we won't want to do it by versioned
> file anyway, rather by revision - check all the texts introduced by
> revision id X for their parent details.

find_bad_ancestors was designed to fit into Check._check_weaves.  So it
needs to work on a single versionedfile at a time.

I thought for packs this would be an irrelevant issue, because we'd use
a fetch that won't tolerate this kind of brokenness.
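
Roughly, _check_weaves already visits each file's versionedfile, so the
call slots in per file like this (sketch, not the exact patch):

    for file_id in weave_ids:
        vf = self.repository.weave_store.get_weave(
            file_id, self.repository.get_transaction())
        bad = self.repository.find_bad_ancestors(
            self.planned_revisions, file_id, vf, revision_versions)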

> Is :
> 
>> +class _RevisionParentsProvider(object):
> 
> Really needed? If we check the revision is cached correctly at the start
> of check

We would also need to do the same in reconcile, and make sure that they
never get out of sync.  I thought it was just simpler to use the data
that we actually trust, instead of a proxy.
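
The provider just answers from the revision texts (sketch; the real class
also has to cope with missing revisions):

    class _RevisionParentsProvider(object):
        """Supply revision parents from the revision texts, which we trust."""

        def __init__(self, repository):
            self._repository = repository

        def get_parents(self, revision_ids):
            return [self._repository.get_revision(r).parent_ids
                    for r in revision_ids]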

> I think it's better to check for an attribute like I did in my recent
> reconcile patch - _reconcile_does_inventory_gc is what I used. This is
> better than a subclass check because it allows foreign formats to be
> tested without requiring them to spuriously subclass Knits.

As we discussed, only knits would support this, so I check for knits.
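
For the record, the two styles contrast roughly like this (the attribute
name below is made up, in the spirit of _reconcile_does_inventory_gc, and
fix_text_parents is a stand-in):

    # capability attribute on the format:
    if getattr(repo._format, 'reconcile_fixes_text_parents', False):
        fix_text_parents(repo)

    # versus checking for knits directly, which is what the patch does now:
    if isinstance(repo, KnitRepository):
        fix_text_parents(repo)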

Aaron