[MERGE] Update to bzr.dev.
Robert Collins
robertc at robertcollins.net
Wed Jun 18 09:33:32 BST 2008
On Wed, 2008-06-18 at 17:57 +1000, Martin Pool wrote:
> On Tue, Jun 17, 2008 at 5:05 PM, Andrew Bennetts <andrew at canonical.com> wrote:
> > Here's a review of knit.py. Lots of deleted and moved code, which makes it a
> > large and awkward diff, but that's life. Deleted code is a good thing, of
> > course :)
>
> I've done most of these but need answers from Robert on some of them:
> >> + def annotate(self, knit, key):
> >> + content = knit._get_content(key)
> >> + # adjust for the fact that serialised annotations are only key suffixes
> >> + # for this factory.
> >> + if type(key) == tuple:
> >> + prefix = key[:-1]
> >> + origins = content.annotate()
> >> + result = []
> >> + for origin, line in origins:
> >> + result.append((prefix + (origin,), line))
> >> + return result
> >> + else:
> >> + return content.annotate()
> >
> > This smells a bit. Why would key ever be a non-tuple here? Aren't keys defined
> > to be tuples?
They shouldn't be -- keys are defined to be tuples. The non-tuple branch
was compatibility code from the transition, I think, so removing it
should not affect the tests.
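With the compatibility branch gone, the remaining logic is just re-attaching the key prefix to each serialised origin. A standalone sketch of that core step (`annotate_with_prefix` is a hypothetical name, not the real bzrlib method):

```python
def annotate_with_prefix(key, origins):
    """Re-attach a key prefix to serialised annotation origins.

    Serialised annotations store only the final element of each key, so
    the prefix (all but the last element of ``key``) is prepended to each
    origin to rebuild full keys.  Sketch only; not the bzrlib API.
    """
    prefix = key[:-1]
    return [(prefix + (origin,), line) for origin, line in origins]
```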
> >> +def cleanup_pack_knit(versioned_files):
> >> + versioned_files.stream.close()
> >> + versioned_files.writer.end()
> >
> > Docstring please. (And again, mention that it is a test helper).
> >
> > Or just move make_file_factory, make_pack_factory and cleanup_pack_knit into a
> > separate module... (bzrlib.tests.knit_helpers maybe?)
>
> In general I would like if the code intended only for use in testing
> was only inside the test module. I think Robert may disagree.
So the question is: 'is this intended *only* for testing?'
I think that when an abstraction/helper is useful, it's good not to
arbitrarily restrict where it can be used.
Testing often wants either simpler-than-usual interfaces, or
richer-than-usual ones... either way, unless they are dangerous or
problematic for normal use I see no reason to consider them test-only.
> >> +class KnitVersionedFiles(VersionedFiles):
> > [...]
> >> + def add_lines(self, key, parents, lines, parent_texts=None,
> >> + left_matching_blocks=None, nostore_sha=None, random_id=False,
> >> + check_content=True):
> >> + """See VersionedFiles.add_lines()."""
> >> + self._index._check_write_ok()
> >> + self._check_add(key, lines, random_id, check_content)
> >> + if parents is None:
> >> + # For no-graph knits, have the public interface use None for
> >> + # parents.
> >> + parents = ()
> >
> > I don't understand that comment. What does "have the public interface use
> > None" mean? If we're here, then None was already passed, so the future tense
> > doesn't make sense to me.
get_parent_map([('foo',)]) on a pack-based VersionedFiles, where the
GraphIndex has been parameterised to have no references to other nodes
(e.g. signatures, and hopefully inventories in some future format), will
return {('foo',): None}. For knits you cannot configure this, so you get
{('foo',): ()} instead.
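To make the two shapes concrete, here is a toy illustration (stand-in functions, not the real index classes):

```python
# Toy stand-ins for the two behaviours described above; not bzrlib code.

def pack_get_parent_map(keys):
    # A GraphIndex parameterised with no reference lists cannot report
    # parents at all, so the public interface uses None per key.
    return dict((key, None) for key in keys)

def knit_get_parent_map(keys):
    # Knits always record a parent list, so a parentless key comes back
    # with an empty tuple rather than None.
    return dict((key, ()) for key in keys)
```

This is why add_lines coerces None to () on the way in: it maps the public "no graph" value onto the value knits actually store.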
> >> + # Technically this could be avoided if we are happy to allow duplicate
> >> + # id insertion when other things than bzr core insert texts, but it
> >> + # seems useful for folk using the knit api directly to have some safety
> >> + # blanket that we can disable.
> >> + ##if not random_id and self.has_version(key):
> >> + ## raise RevisionAlreadyPresent(key, self)
> >
> > What's the status here? Are you going to delete the commented out code (and the
> > comment), or reinstate the check?
Oh. Uhm. Let's discuss. I think that the comment and lines should be
deleted. Instead, I think that inserting a duplicate key (without
random_id) should trigger a check that the value of the inserted text is
identical, and if not, an exception should be raised. This makes a lot
of code, such as the bundle code, much simpler.
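Roughly, the proposed behaviour would be something like this (a sketch with a plain dict standing in for a real index lookup; names hypothetical):

```python
def check_duplicate_insert(existing_texts, key, lines):
    """Proposed rule: re-inserting a key is fine iff the text matches.

    ``existing_texts`` is a dict mapping key -> stored list of lines,
    standing in for a real index/store lookup.  Returns True if the key
    was already present (and identical); raises ValueError on a
    mismatched re-insert; returns False for a genuinely new key.
    """
    if key not in existing_texts:
        return False
    if existing_texts[key] != lines:
        raise ValueError('key %r already present with different text' % (key,))
    return True
```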
> >> - delta_parents = self._index.get_parent_map([parent])[parent]
> >> + # No exception here because we stop at first fulltext anyway, an
> >> + # absent parent indicates a corrupt knit anyway.
> >> + # TODO: This should be asking for compression parent, not graph
> >> + # parent.
> >> + parent = self._index.get_parent_map([parent])[parent][0]
> >
> > No exception where?
>
> I think he means he's not explicitly going to check that the parent is present?
Yes. It's in an inner loop, and we've had performance issues in that
region before.
> > I'm not sure why the cache dict might be invalid... but this is strictly better
> > than the previous code (which did the same sort of thing without any comment),
> > so I'm not complaining :)
> >
> > Hmm, there is one small change that might help, though: you've replaced
> > "has_version(foo)" calls with "get_parent_map([foo])". I think the code would
> > be a little easier to read in various places if we had an equivalent of
> > has_version, maybe "has_graph_key(foo)". I know you don't want to encourage
> > people to write inefficient code that uses one key at a time by not providing
> > APIs that work on single keys rather than collections of keys. But it seems
> > there's enough legitimate reason for checking for the existence of a
> > single key at a time that this might be worth it?
>
> I know Robert and Andrew talked about this a bit but I'm not sure what
> the final result was. I'd be ok to do that as a later cleanup.
I think that later would be fine. We did agree that possibly just
tweaking the formatting of the code would be enough.
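For the record, the single-key presence test people keep writing inline reduces to a one-liner over the batch API; a `has_graph_key` helper (the hypothetical name from the review) would just name it:

```python
def has_graph_key(versioned_files, key):
    """Return True if ``key`` is present in ``versioned_files``.

    Sketch of the helper discussed above: a named wrapper around the
    inline ``key in vf.get_parent_map([key])`` idiom, so single-key
    existence checks still go through the batch API.
    """
    return key in versioned_files.get_parent_map([key])
```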
> >> + for key in keys:
> >> + # the delta chain
> >> + try:
> >> + chain = [key, positions[key][2]]
> >
> > For those of us that don't have the layout of positions tuples memorised, a
> > comment in this function to remind the reader what [2] in a positions tuple is
> > would be helpful. Especially seeing as it's not the only sequence being indexed
> > in this function. Or perhaps it would be even better to immediately generate a
> > dict of dict((key, next_key) for (method, index_memo, next_key, parents) in
> > positions.iteritems()) before this loop. After all, you only want the one
> > element from items of the position map if include_delta_closure == True.
> >
> >> + try:
> >> + chain.append(positions[chain[-1]][2])
> >> + except KeyError:
> >
> > This would be clearer as:
> >
> > try:
> > position = positions[chain[-1]]
> > except KeyError:
> > ...
> > else:
> > chain.append(position[2])
> >
> > But if take my earlier suggestion to trim the position dict down to something
> > more appropriate then this wouldn't matter.
>
> This seems like a good idea but the code is a bit hard for me to
> reinterpret. Maybe that means your comment is very good but I should
> call it a day :-)
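The trimmed-dict suggestion above amounts to something like this (sketch; assumes positions values are (method, index_memo, next_key, parents) tuples as the chain-walking code implies):

```python
def delta_chain_map(positions):
    """Trim a positions map down to key -> compression parent.

    ``positions`` maps key -> (method, index_memo, next_key, parents);
    only next_key (index 2) is needed for walking delta chains, so build
    a smaller dict of just that.  Sketch of the review suggestion.
    """
    return dict((key, value[2]) for key, value in positions.items())
```

With that in hand, the chain loop indexes a dict of scalars instead of pulling [2] out of a 4-tuple each time.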
>
> >
> > [...]
> >> + if include_delta_closure:
> >> + # XXX: get_content_maps performs its own index queries; allow state
> >> + # to be passed in.
> >> + text_map, _ = self._get_content_maps(present_keys)
> >> + for key in present_keys:
> >> + yield FulltextContentFactory(key, parent_map[key], None,
> >> + ''.join(text_map[key]))
> >
> > Huh. So get_record_stream(..., include_delta_closure=True) will return only
> > FulltextContentFactories (well, and AbsentContentFactories)? That seems odd to
> > me. Is that right?
It was the simplest way to implement it, and in the local-data case it
is approximately optimal.
For remote network calls this may well want to ship deltas over the
wire and allow optional fulltext expansion at the endpoint, or
variations on that theme.
> >> + def iter_lines_added_or_present_in_keys(self, keys, pb=None):
> > [...]
> >> + for key in keys:
> >> + key_records.append((key, build_details[key][0]))
> >
> > Again, give the [0] a name to help the reader:
> >
> > for key in keys:
> > index_memo = build_details[key][0]
>
>
> > [...]
> >> + prefix = key[:-1]
> >> + try:
> >> + suffix_parents = self._kndx_cache[prefix][0][key[-1]][4]
> >> + except KeyError:
> >> + pass
> >
> > Ow. Which of the 5 separate getitem operators on that line is that KeyError
> > meant to be guarding?
> >
> > Also, what the heck is the structure of _kndx_cache? I tried to find it, but
> > the best I could find was:
> >
> > def _reset_cache(self):
> > # Possibly this should be a LRU cache. A dictionary from key_prefix to
> > # (cache_dict, history_vector) for parsed kndx files.
> > self._kndx_cache = {}
> > ...
> >
> > Unfortunately, that doesn't tell me anything about the structure or meaning of
> > the cache_dict or history_vector. The KndxIndex docstring talks about the
> > _cache and _history instance variables, which I assume are probably related in
> > some way, but I don't know how. Please update the KndxIndex docstring.
>
> robert, halp! :)
>
> Here is my incremental patch addressing the other issues from Andrew and myself.
_kndx_cache is a dict from prefix to the old state of KnitIndex objects,
where prefix is e.g. (fileid,) for .texts instances, or () for
constant-mapped things like .revisions, and the old state is a
tuple (history-vector, cache_dict). This is used to avoid an ABI
change with the C extension that reads .kndx files (partly due to time,
and partly the high cost that that would incur).
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.