[MERGE] Update to bzr.dev.
Andrew Bennetts
andrew at canonical.com
Tue Jun 17 08:05:17 BST 2008
Here's a review of knit.py. Lots of deleted and moved code, which makes it a
large and awkward diff, but that's life. Deleted code is a good thing, of
course :)
Robert Collins wrote:
> === modified file 'bzrlib/knit.py'
[...]
> @@ -183,7 +187,7 @@
> def get_bytes(self, factory, annotated_compressed_bytes):
> rec, contents = \
> self._data._parse_record_unchecked(annotated_compressed_bytes)
> - content, delta = self._annotate_factory.parse_record(factory.key[0],
> + content, delta = self._annotate_factory.parse_record(factory.key[-1],
(This is in FTAnnotatedToFullText.)
Eep! :)
I think this particular change is ok. It's no worse than what was there before
(-1 isn't any more arbitrary than 0). It did take me a disproportionate amount
of time to convince myself that this change is probably ok though.
[...]
> + compression_parent = factory.parents[0]
> + basis_entry = self._basis_vf.get_record_stream(
> + [compression_parent], 'unordered', True).next()
> + if basis_entry.storage_kind == 'absent':
> + raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
> + basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
[...]
> + compression_parent = factory.parents[0]
> + # XXX: string splitting overhead.
> + basis_entry = self._basis_vf.get_record_stream(
> + [compression_parent], 'unordered', True).next()
> + if basis_entry.storage_kind == 'absent':
> + raise errors.RevisionNotPresent(compression_parent, self._basis_vf)
> + basis_lines = split_lines(basis_entry.get_bytes_as('fulltext'))
And elsewhere in repository.py you do:
> + stream = self.signatures.get_record_stream([(revision_id,)],
> + 'unordered', True)
> + record = stream.next()
> + if record.storage_kind == 'absent':
> + raise errors.NoSuchRevision(self, revision_id)
Some tests want to read just one record from a stream, too. Perhaps it's
worth adding a get_one_record or get_first_record helper. I'm happy to wait
until we have a bit more data to triangulate the refactoring before we worry
about it, just thought it was worth suggesting.
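For instance (just a sketch; the name and which error to raise are up for
grabs, given that the call sites above want RevisionNotPresent and
NoSuchRevision respectively):

    def get_first_record(versioned_files, keys):
        # Possible helper: pull the first record off a stream, treating
        # 'absent' as an error.
        record = versioned_files.get_record_stream(
            keys, 'unordered', True).next()
        if record.storage_kind == 'absent':
            raise errors.RevisionNotPresent(record.key, versioned_files)
        return record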
> - self.key = (version,)
> - self.parents = tuple((parent,) for parent in parents)
> + self.key = key
> + self.parents = parents
Given how much noise there's been about adding "(foo,)" in some places, I feel
it's only fair to highlight some code that has gone the other way, making us all
happier :)
> + def annotate(self, knit, key):
> + content = knit._get_content(key)
> + # adjust for the fact that serialised annotations are only key suffixes
> + # for this factory.
> + if type(key) == tuple:
> + prefix = key[:-1]
> + origins = content.annotate()
> + result = []
> + for origin, line in origins:
> + result.append((prefix + (origin,), line))
> + return result
> + else:
> + return content.annotate()
This smells a bit. Why would key ever be a non-tuple here? Aren't keys defined
to be tuples?
> +def make_file_factory(annotated, mapper):
> + """Create a factory for creating a file based KnitVersionedFiles.
> +
> + :param annotated: knit annotations are wanted.
> + :param mapper: The mapper from keys to paths.
> + """
> + def factory(transport):
> + index = _KndxIndex(transport, mapper, lambda:None, lambda:True, lambda:True)
> + access = _KnitKeyAccess(transport, mapper)
> + return KnitVersionedFiles(index, access, annotated=annotated)
> + return factory
This docstring should make it clear that this is a helper for tests.
make_pack_factory's docstring for instance makes it clear that it's a test
facility rather than something you'd want to use in normal use.
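Something along these lines, say:

    def make_file_factory(annotated, mapper):
        """Create a factory for creating a file based KnitVersionedFiles.

        This is only functional enough to run interface tests; it is not
        intended for normal use.

        :param annotated: knit annotations are wanted.
        :param mapper: The mapper from keys to paths.
        """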
> +def cleanup_pack_knit(versioned_files):
> + versioned_files.stream.close()
> + versioned_files.writer.end()
Docstring please. (And again, mention that it is a test helper).
Or just move make_file_factory, make_pack_factory and cleanup_pack_knit into a
separate module... (bzrlib.tests.knit_helpers maybe?)
> +class KnitVersionedFiles(VersionedFiles):
[...]
> + def add_lines(self, key, parents, lines, parent_texts=None,
> + left_matching_blocks=None, nostore_sha=None, random_id=False,
> + check_content=True):
> + """See VersionedFiles.add_lines()."""
> + self._index._check_write_ok()
> + self._check_add(key, lines, random_id, check_content)
> + if parents is None:
> + # For no-graph knits, have the public interface use None for
> + # parents.
> + parents = ()
I don't understand that comment. What does “have the public interface use
None” mean? If we're here, then None was already passed, so the future tense
doesn't make sense to me.
> + def _check_add(self, key, lines, random_id, check_content):
> + """check that version_id and lines are safe to add."""
> + if contains_whitespace(key[-1]):
> + raise InvalidRevisionId(key[-1], self.filename)
> + self.check_not_reserved_id(key[-1])
This would benefit from “version_id = key[-1]”.
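i.e.:

    def _check_add(self, key, lines, random_id, check_content):
        """check that version_id and lines are safe to add."""
        version_id = key[-1]
        if contains_whitespace(version_id):
            raise InvalidRevisionId(version_id, self.filename)
        self.check_not_reserved_id(version_id)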
> + # Technically this could be avoided if we are happy to allow duplicate
> + # id insertion when other things than bzr core insert texts, but it
> + # seems useful for folk using the knit api directly to have some safety
> + # blanket that we can disable.
> + ##if not random_id and self.has_version(key):
> + ## raise RevisionAlreadyPresent(key, self)
What's the status here? Are you going to delete the commented-out code (and the
comment), or reinstate the check?
> + def _check_header_version(self, rec, version_id):
> + """Checks the header version on original format knit records.
> +
> + These have the last component of the key embedded in the record.
> + """
Thanks for adding this docstring.
[...]
> - delta_parents = self._index.get_parent_map([parent])[parent]
> + # No exception here because we stop at first fulltext anyway, an
> + # absent parent indicates a corrupt knit anyway.
> + # TODO: This should be asking for compression parent, not graph
> + # parent.
> + parent = self._index.get_parent_map([parent])[parent][0]
No exception where?
> + def _get_components_positions(self, keys, noraise=False):
[...]
> +
> + :param noraise: If True do not raise an error on a missing component,
> + just ignore it.
> + """
“noraise” is an overly broad name for that. Something more specific like
“allow_missing” would be better.
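i.e.:

    def _get_components_positions(self, keys, allow_missing=False):
        """...

        :param allow_missing: If True, do not raise an error on a missing
            component; just ignore it.
        """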
> + def _get_content(self, key, parent_texts={}):
> + """Returns a content object that makes up the specified
> + version."""
> + cached_version = parent_texts.get(key, None)
> + if cached_version is not None:
> + # Ensure the cache dict is valid.
> + if not self.get_parent_map([key]):
> + raise RevisionNotPresent(key, self)
> + return cached_version
I'm not sure why the cache dict might be invalid... but this is strictly better
than the previous code (which did the same sort of thing without any comment),
so I'm not complaining :)
Hmm, there is one small change that might help, though: you've replaced
“has_version(foo)” calls with “get_parent_map([foo])”. I think the code would
be a little easier to read in various places if we had an equivalent of
has_version, maybe “has_graph_key(foo)”. I know you don't want to encourage
people to write inefficient code that uses one key at a time by not providing
APIs that work on single keys rather than collections of keys. But it seems
there's enough legitimate reason to check for the existence of a single key at
a time that this might be worth it?
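Just to sketch what I mean (name negotiable; it's only sugar over
get_parent_map):

    def has_graph_key(self, key):
        # True if key is present in this VersionedFiles' graph.
        return key in self.get_parent_map([key])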
> + def _get_content_maps(self, keys):
> + """Produce maps of text and KnitContents
> +
> + :return: (text_map, content_map) where text_map contains the texts for
> + the requested versions and content_map contains the KnitContents.
> + """
This is a pre-existing flaw, but that's not valid ReST. The second line of that
:return: block ought to be indented.
> + absent_keys = keys.difference(set(positions))
> + # There may be more absent keys : if we're missing the basis component
> + # and are trying to include the delta closure.
> + if include_delta_closure:
> + # key:True means key can be reconstructed
> + checked_keys = {}
It wasn't clear to me until I read this function carefully what dictionary that
comment was referring to. Please phrase it something like “Build up
checked_keys dict. key:True in this dict means the key can be reconstructed.”
This also tells the reader what the loop on the following 23 lines is for!
Maybe it should be called “reconstructable_keys” too?
I also think this 66-line function would be more readable if you extracted this
section into a separate function.
> + for key in keys:
> + # the delta chain
> + try:
> + chain = [key, positions[key][2]]
For those of us that don't have the layout of positions tuples memorised, a
comment in this function to remind the reader what [2] in a positions tuple is
would be helpful. Especially seeing as it's not the only sequence being indexed
in this function. Or perhaps it would be even better to immediately build a
next-key dict, dict((key, next_key) for (key, (method, index_memo, next_key,
parents)) in positions.iteritems()), before this loop. After all, you only want
that one element from items of the positions map when include_delta_closure ==
True.
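i.e. something like (untested):

    # positions maps key -> (method, index_memo, next_key, parents);
    # all the chain walk below needs is the next_key field.
    next_keys = dict((key, next_key)
        for (key, (method, index_memo, next_key, parents))
        in positions.iteritems())
    ...
    # and then extending the chain becomes simply:
    chain.append(next_keys[chain[-1]])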
> + try:
> + chain.append(positions[chain[-1]][2])
> + except KeyError:
This would be clearer as:
    try:
        position = positions[chain[-1]]
    except KeyError:
        ...
    else:
        chain.append(position[2])
But if you take my earlier suggestion to trim the positions dict down to
something more appropriate then this wouldn't matter.
[...]
> + if include_delta_closure:
> + # XXX: get_content_maps performs its own index queries; allow state
> + # to be passed in.
> + text_map, _ = self._get_content_maps(present_keys)
> + for key in present_keys:
> + yield FulltextContentFactory(key, parent_map[key], None,
> + ''.join(text_map[key]))
Huh. So get_record_stream(..., include_delta_closure=True) will return only
FulltextContentFactories (well, and AbsentContentFactories)? That seems odd to
me. Is that right?
> if 'fulltext' not in options:
> basis_parent = parents[0]
> - if not self.has_version(basis_parent):
> + # Note that pack backed knits don't need to buffer here
> + # because they buffer all writes to the transaction level,
> + # but we don't expose that differnet at the index level. If
“difference” rather than “differnet”.
> + if basis_parent not in self.get_parent_map([basis_parent]):
Here's another place where “has_graph_key” might be nice.
> - self.add_lines(record.key[0], parents,
> + self.add_lines(record.key, parents,
Again, it's nice to see some [0] disappearing. :)
> + def iter_lines_added_or_present_in_keys(self, keys, pb=None):
[...]
> + for key in keys:
> + key_records.append((key, build_details[key][0]))
Again, give the [0] a name to help the reader:
    for key in keys:
        index_memo = build_details[key][0]
        key_records.append((key, index_memo))
> + for key_idx, (key, data, sha_value) in \
> + enumerate(self._read_records_iter(key_records)):
This would be better as:
    records_iter = enumerate(self._read_records_iter(key_records))
    for key_idx, (key, data, sha_value) in records_iter:
> + if content._lines and content._lines[-1][1][-1] != '\n':
*Ow*. :)
We briefly discussed this on IRC, and you say this line can be deleted, so
that's good :)
Also, I think it would be good if AnnotatedKnitContent's docstring described the
structure of its _lines instance variable. That's not a requirement for this
patch, though.
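Something like this, say (going by the [-1][1][-1] expression above, so the
details are my guess):

    class AnnotatedKnitContent(KnitContent):
        # _lines is a list of (origin, text) tuples: origin is the
        # annotation for the line, and text is the line itself, normally
        # including its trailing '\n' -- hence the check above.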
> +class _KndxIndex(object):
> + """Manages knit index files
> +
> + The index is kept in memorya already kept in memory and read on startup, to enable
There's some sort of editor fart on this line.
> + def get_build_details(self, keys):
> + """Get the method, index_memo and compression parent for keys.
>
> Ghosts are omitted from the result.
>
> - :param version_ids: An iterable of version_ids.
> - :return: A dict of version_id:(index_memo, compression_parent,
> - parents, record_details).
> + :param keys: An iterable of keys.
> + :return: A dict of key:(access_memo, compression_parent, parents,
> + record_details).
> index_memo
> opaque structure to pass to read_records to extract the raw
> data
You only updated “index_memo” to “access_memo” in one place in this docstring.
Also, there's another get_build_details docstring that still says “index_memo”;
should it get updated too?
> + def get_parent_map(self, keys):
> + """Get a map of the parents of keys.
> +
> + :param keys: The keys to look up parents for.
> + :return: A mapping from keys to parents. Absent keys are absent from
> + the mapping.
> + """
> + # Parse what we need to up front, this potentially trades off I/O
> + # locality (.kndx and .knit in the same block group for the same file
> + # id) for less checking in inner loops.
> + prefixes = set()
> + prefixes.update(key[:-1] for key in keys)
Why not just “prefixes = set(key[:-1] for key in keys)”?
[...]
> + prefix = key[:-1]
> + try:
> + suffix_parents = self._kndx_cache[prefix][0][key[-1]][4]
> + except KeyError:
> + pass
Ow. Which of the 5 separate getitem operators on that line is that KeyError
meant to be guarding?
Also, what the heck is the structure of _kndx_cache? I tried to find it, but
the best I could find was:
    def _reset_cache(self):
        # Possibly this should be a LRU cache. A dictionary from key_prefix to
        # (cache_dict, history_vector) for parsed kndx files.
        self._kndx_cache = {}
...
Unfortunately, that doesn't tell me anything about the structure or meaning of
the cache_dict or history_vector. The KndxIndex docstring talks about the
_cache and _history instance variables, which I assume are probably related in
some way, but I don't know how. Please update the KndxIndex docstring.
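Going by the [0] and [4] subscripts above, I'd guess it's something like this,
which the docstring could spell out:

    # _kndx_cache: key prefix -> (cache_dict, history_vector), where
    #  * cache_dict maps the key suffix (the old version_id) to a parsed
    #    kndx row -- apparently with parents at index [4], so perhaps
    #    (version_id, options, pos, size, parents)?
    #  * history_vector is presumably the old _history: version_ids in
    #    the order they appear in the file.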
> +class _KnitKeyAccess(object):
[...]
> def get_raw_records(self, memos_for_retrieval):
[...]
> + # first pass, group into same-index request to minimise readv's issued.
> + request_lists = []
> + current_prefix = None
> + for (key, offset, length) in memos_for_retrieval:
> + if current_prefix == key[:-1]:
> + current_list.append((offset, length))
> + else:
> + if current_prefix is not None:
> + request_lists.append((current_prefix, current_list))
> + current_prefix = key[:-1]
> + current_list = [(offset, length)]
> + # handle the last entry
> + if current_prefix is not None:
> + request_lists.append((current_prefix, current_list))
> + for prefix, read_vector in request_lists:
> + path = self._mapper.map(prefix) + '.knit'
> + for pos, data in self._transport.readv(path, read_vector):
> + yield data
That's awfully similar to _DirectPackAccess.get_raw_records:
    # first pass, group into same-index requests
    request_lists = []
    current_index = None
    for (index, offset, length) in memos_for_retrieval:
        if current_index == index:
            current_list.append((offset, length))
        else:
            if current_index is not None:
                request_lists.append((current_index, current_list))
            current_index = index
            current_list = [(offset, length)]
    # handle the last entry
    if current_index is not None:
        request_lists.append((current_index, current_list))
    for index, offsets in request_lists:
        transport, path = self._indices[index]
        reader = pack.make_readv_reader(transport, path, offsets)
        for names, read_func in reader.iter_records():
            yield read_func(None)
Do you think it's worth the effort to reuse code here?
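If so, the grouping pass could be shared, e.g. (hypothetical helper; callers
would pass (group_key, offset, length) tuples, i.e. key[:-1] for
_KnitKeyAccess and index for _DirectPackAccess):

    def _group_memos(memos_for_retrieval):
        # Group consecutive memos with the same group key into
        # [(group_key, [(offset, length), ...]), ...], preserving order.
        request_lists = []
        current_key = None
        current_list = []
        for (group_key, offset, length) in memos_for_retrieval:
            if current_key == group_key and current_list:
                current_list.append((offset, length))
            else:
                if current_list:
                    request_lists.append((current_key, current_list))
                current_key = group_key
                current_list = [(offset, length)]
        if current_list:
            request_lists.append((current_key, current_list))
        return request_lists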
-Andrew.