[MERGE] Eliminate the use of VersionedFile.join when fetching data.
Andrew Bennetts
andrew at canonical.com
Fri May 2 05:57:47 BST 2008
Robert Collins wrote:
> This patch eliminates all use of join() in bzrlib's fetch routines.
>
> It does this by introducing a new streaming api - the previous api had a
> couple of flaws that made it unsuitable to use as a direct replacement
> for join (amongst them the fact that it internally used join() :)).
>
> This is the penultimate step in preparing for the introduction of a
> VersionedFiles (note the plural - a better name would be welcome) which
> will provide less artificial partitioning of the datastore. The final
> step is removing the methods deprecated before 1.5, to make the
> transition easier. While we don't _have_ to remove them I think it will
> make the creation of the new interface much clearer.
bb:tweak
Comments inline.
(I haven't looked at your updated patch yet; I'll send an update after checking
the interdiff.)
> === modified file 'bzrlib/fetch.py'
> + # Fetch all the texts.
> + to_weave.insert_record_stream(from_weave.get_record_stream(required_versions,
> + 'topological', False))
This ought to be wrapped to be < 80 columns.
> === modified file 'bzrlib/knit.py'
> + if build_details[0] == 'line-delta':
> + kind = 'delta'
> + else:
> + kind = 'ft'
> + if annotated:
> + annotated_kind = 'annotated-'
> + else:
> + annotated_kind = ''
> + self.storage_kind = 'knit-%s%s-gz' % (annotated_kind, kind)
I wonder a little if a tuple would be better than a str for storage_kind.
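To illustrate what I mean, here is a hypothetical sketch (not bzrlib code; the helper names are invented): with a tuple kind, consumers can inspect the components directly instead of parsing a string.

```python
# Hypothetical comparison of str vs tuple storage kinds.
def parse_str_kind(storage_kind):
    # String form: must be split and pattern-matched.
    parts = storage_kind.split('-')
    return ('annotated' in parts,
            'delta' if 'delta' in parts else 'ft')

def parse_tuple_kind(storage_kind):
    # Tuple form: components are directly inspectable.
    return ('annotated' in storage_kind,
            'delta' if 'delta' in storage_kind else 'ft')

print(parse_str_kind('knit-annotated-delta-gz'))        # (True, 'delta')
print(parse_tuple_kind(('knit', 'annotated', 'ft', 'gz')))  # (True, 'ft')
```

Not a blocker, just a thought for the future API.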
> + def get_record_stream(self, versions, ordering, include_delta_closure):
> + """Get a stream of records for versions.
> +
> + :param versions: The versions to include. Each version is a tuple
> + (version,).
I'm curious about why this is a sequence of one-element tuples, rather than just
strings.
> + :param ordering: Either 'unordered' or 'topological'. A topologically
> + sorted stream has compression parents strictly before their
> + children.
> + :param include_delta_closure: If True then the closure across any
> + compression parents will be included (in the opaque data).
> + :return: An iterator of ContentFactory objects, each of which is only
> + valid until the iterator is advanced.
> + """
You use the term “opaque data” in this docstring without really explaining what
you mean by it.
> + # Double index lookups here : need a unified api ?
> + parent_map = self.get_parent_map(versions)
Where? Do you mean an index lookup has already happened by the time
get_record_stream is called, and now get_record_stream is doing another one?
> + position_map = self._get_components_positions(present_versions)
> + # c = component_id, r = record_details, i_m = index_memo, n = next
Which code is this comment meant to be a legend for?
> + records = [(version, position_map[version][1]) for version in
> + present_versions]
> + record_map = {}
> + for version in absent_versions:
> + yield AbsentContentFactory((version,))
Why not yield the absent_versions as soon as you calculate them?
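i.e. something along these lines (a simplified sketch; the tagged tuples stand in for AbsentContentFactory and the real record factories):

```python
# Sketch: emit absent records as soon as they are known, before doing
# any of the position-map work for the present versions.
def record_stream_sketch(versions, present_index):
    absent_versions = [v for v in versions if v not in present_index]
    present_versions = [v for v in versions if v in present_index]
    for version in absent_versions:
        yield ('absent', (version,))      # stand-in for AbsentContentFactory
    for version in present_versions:
        yield ('present', (version,))     # stand-in for a real factory

stream = list(record_stream_sketch(['a', 'b', 'c'], {'a', 'c'}))
```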
> + def insert_record_stream(self, stream):
> + """Insert a record stream into this versioned file.
> +
> + :param stream: A stream of records to insert.
> + :return: None
> + :seealso VersionedFile.get_record_stream:
> + """
Using :seealso: here is good idea. Thanks!
> + if self.factory.annotated:
> + # self is annotated, we need annotated knits to use directly.
> + annotated = "annotated-"
> + convertibles = []
> + else:
> + # self is not annotated, but we can strip annotations cheaply.
> + annotated = ""
> + convertibles = set(["knit-annotated-delta-gz",
> + "knit-annotated-ft-gz"])
> + native_types = set()
> + native_types.add("knit-%sdelta-gz" % annotated)
> + native_types.add("knit-%sft-gz" % annotated)
> + knit_types = native_types.union(convertibles)
It might be good to extract this section into its own method.
insert_record_stream is quite long.
What exactly is the distinction between a “native” type and a “knit” type?
Native seems obvious enough: types that this knit vf uses itself (and can thus
insert directly from a stream (assuming basis parents are present for a delta)),
but “knit” types seem a bit fuzzier. They are types that are cheaply
translateable to native types, I guess? “knit_types” doesn't seem like a great
name for that concept.
> + adapters = {}
This line ought to be closer to the definition of get_adapter; it's really an
implementation detail of get_adapter.
> + # Buffered index entries that we can't add immediately because their
> + # basis parent is missing. We don't buffer all because generating
That sentence no verb. ;)
Maybe it should read “buffered_index_entries holds entries that...”?
(Short comments like “foo = {} # widgets to frob” often make perfectly good
sense, but an incomplete sentence in a comment that is otherwise a full
paragraph breaks my mental parser.)
> + # annotations may require access to some of the new records. However we
> + # can't generate annotations from new deltas until their basis parent
> + # is present anyway, so we get away with not needing an index that
> + # reports on the new keys.
I don't quite follow this last sentence. It's not immediately obvious to me why we
might have needed “an index that reports on the new keys”, or what that means.
> + # key = basis_parent, value = index entry to add
> + buffered_index_entries = {}
> + for record in stream:
> + # Raise an error when a record is missing.
> + if record.storage_kind == 'absent':
> + raise RevisionNotPresent([record.key[0]], self)
> + # adapt to non-tuple interface
> + parents = [parent[0] for parent in record.parents]
> + if record.storage_kind in knit_types:
Hmm, you're using “kind” and “type” interchangeably. It'd be nice if we could
be consistent.
Some brief comments for the various cases here would be helpful, e.g. “# This
record's type is the same as this knit's type, so we can use add_raw_records”,
“# This record is a fulltext, so we can use add_lines.”, “# This knit can't
trivially handle this record type, and the record is not a fulltext. So all we
can do is try to adapt this record to a fulltext.”
I suggest those because I spent a fair while trying to understand what the
significant differences were between the cases — i.e. why it is more complicated
than just:
    # pseudo-code
    adapter = get_best_adapter(record.storage_kind, native_types)
    compatible_record = adapter.adapt(record)
    self.insert_record(compatible_record)
Which is what I had been expecting.
> + if record.storage_kind not in native_types:
> + try:
> + adapter_key = (record.storage_kind, "knit-delta-gz")
You always use record.storage_kind as the first element of the key you pass to
get_adapter. You could simplify the callsites of get_adapter a little by making
get_adapter take (record, desired_kind) args rather than (adapter_key).
(If you wanted to be really sneaky, just have it take desired_kind; record will
be found automatically from the nested scope...)
Not a big deal, though.
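To illustrate the suggested signature (a sketch only; FakeRecord and the registry contents here are stand-ins for the real adapter machinery):

```python
class FakeRecord(object):
    def __init__(self, storage_kind):
        self.storage_kind = storage_kind

# Stand-in for the real adapter factories, keyed by (source, target) kind.
adapter_registry = {
    ('knit-annotated-ft-gz', 'knit-ft-gz'): lambda: 'strip-annotations',
}
adapters = {}  # memoised adapter instances; an implementation detail

def get_adapter(record, desired_kind):
    # The key's first element is derived from the record, so callers
    # only need to name the kind they want.
    adapter_key = (record.storage_kind, desired_kind)
    try:
        return adapters[adapter_key]
    except KeyError:
        adapter = adapter_registry[adapter_key]()
        adapters[adapter_key] = adapter
        return adapter

record = FakeRecord('knit-annotated-ft-gz')
adapter = get_adapter(record, 'knit-ft-gz')
```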
> + elif record.storage_kind == 'fulltext':
> + self.add_lines(record.key[0], parents,
> + split_lines(record.get_bytes_as('fulltext')))
> + else:
> + adapter_key = record.storage_kind, 'fulltext'
> + adapter = get_adapter(adapter_key)
> + lines = split_lines(adapter.get_bytes(
> + record, record.get_bytes_as(record.storage_kind)))
> + try:
> + self.add_lines(record.key[0], parents, lines)
> + except errors.RevisionAlreadyPresent:
> + pass
I don't really understand why the record-is-fulltext and
record-is-adaptable-to-fulltext cases are slightly different from each other.
> def check(self, progress_bar=None):
> """See VersionedFile.check()."""
> + # This doesn't actually test extraction of everything, but that will
s/but/as/, I think.
> + def _parse_record_unchecked(self, data):
A docstring to make the difference from _parse_record explicit would be good.
i.e. what checks are skipped?
> === modified file 'bzrlib/tests/test_versionedfile.py'
>
> +def get_diamond_vf(f, trailing_eol=True, left_only=False):
> + """Get a diamond graph to exercise deltas and merges.
> +
> + :param trailing_eol: If True end the last line with \n.
> + """
The docstring ought to describe the left_only param too.
> + def capture_stream(self, f, entries, on_seen, parents):
> + """Capture a stream for testing."""
What is the “f” param? The others are somewhat guessable, but one-letter
parameters are just too terse.
> + for factory in entries:
> + on_seen(factory.key)
> + self.assertValidStorageKind(factory.storage_kind)
> + self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> + self.assertEqual(parents[factory.key[0]], factory.parents)
> + self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> + str)
Hmm, this doesn't just capture a stream, it also does a bunch of assertions.
That's ok, but I feel the docstring ought to mention that. Probably the name
should reflect that too, otherwise reading the implementation of e.g.
test_get_record_stream_interface it's not obvious where “each item ... has to
provide a regular interface” (as promised by the docstring) is being verified.
> + def test_get_record_stream_interface(self):
> + """each item in a stream has to provide a regular interface."""
Just a nitpick: sentences ought to start with a capital letter.
> + f, parents = get_diamond_vf(self.get_file())
> + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> + 'unordered', False)
Why use this particular subset of versions? (ObXUnitPatterns: Mystery Guest)
> + def test_get_record_stream_interface_ordered(self):
> + """each item in a stream has to provide a regular interface."""
Ideally I think different test methods would have different docstrings...
> + f, parents = get_diamond_vf(self.get_file())
> + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> + 'topological', False)
> + seen = []
> + self.capture_stream(f, entries, seen.append, parents)
> + self.assertSubset([tuple(seen)],
> + (
> + (('base',), ('left',), ('right',), ('merged',)),
> + (('base',), ('right',), ('left',), ('merged',)),
> + ))
It took me ages to make sense of this assertion, then I realised it read
“assertSubset”, not “assertEqual”. D'oh. Not your fault though (unless I can
blame your nested tuples for sending me cross-eyed...).
> + def test_get_record_stream_interface_ordered_with_delta_closure(self):
> + """each item in a stream has to provide a regular interface."""
> + f, parents = get_diamond_vf(self.get_file())
> + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> + 'topological', True)
> + seen = []
> + for factory in entries:
> + seen.append(factory.key)
> + self.assertValidStorageKind(factory.storage_kind)
> + self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> + self.assertEqual(parents[factory.key[0]], factory.parents)
> + self.assertEqual(f.get_text(factory.key[0]),
> + factory.get_bytes_as('fulltext'))
> + self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> + str)
You appear to have a half-done refactoring here. Re-use capture_stream, maybe
by adding a parameter for a function to run on each factory. Is the extra
assertion this does vs. capture_stream significant?
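Something like this sketch, perhaps (FakeFactory is a stand-in for the real content factories, and I've elided the shared assertions):

```python
class FakeFactory(object):
    def __init__(self, key):
        self.key = key

def capture_stream(entries, on_seen, on_factory=None):
    # The shared assertions would stay in here; tests that need an extra
    # per-factory check (like the fulltext comparison) pass a callback.
    for factory in entries:
        on_seen(factory.key)
        if on_factory is not None:
            on_factory(factory)

seen, checked = [], []
capture_stream([FakeFactory(('base',)), FakeFactory(('merged',))],
               seen.append, checked.append)
```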
Also, you don't appear to have any tests anywhere that delta_closure=True gives
different results to False.
> + def test_get_record_stream_unknown_storage_kind_raises(self):
> + """Asking for a storage kind that the stream cannot supply raises."""
> + f, parents = get_diamond_vf(self.get_file())
> + entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> + 'unordered', False)
> + # We track the contents because we should be able to try, fail a
> + # particular kind and then ask for one that works and continue.
> + seen = set()
> + for factory in entries:
> + seen.add(factory.key)
> + self.assertValidStorageKind(factory.storage_kind)
> + self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> + self.assertEqual(parents[factory.key[0]], factory.parents)
> + # currently no stream emits mpdiff
> + self.assertRaises(errors.UnavailableRepresentation,
> + factory.get_bytes_as, 'mpdiff')
> + self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> + str)
> + self.assertEqual(set([('base',), ('left',), ('right',), ('merged',)]),
> + seen)
Definitely extend capture_stream to make this neater. The key assertion of this
test (the assertRaises) is buried by boilerplate.
> + def test_get_record_stream_missing_records_are_absent(self):
> + f, parents = get_diamond_vf(self.get_file())
> + entries = f.get_record_stream(['merged', 'left', 'right', 'or', 'base'],
Shouldn't it be “origin”, not “or”? Oh, no, that's intentional. Please use a
more obvious value, like “absent-key”.
Also, I don't understand why you need to use all of merged, left, right and base
here.
> + def test_filter_absent_records(self):
> + """Requested missing records can be filter trivially."""
s/filter/filtered/
> + f, parents = get_diamond_vf(self.get_file())
> + entries = f.get_record_stream(['merged', 'left', 'right', 'extra', 'base'],
> + 'unordered', False)
Again “extra” isn't the most intent-revealing value.
> + def test_insert_record_stream_fulltexts_noeol(self):
> + """Any file should accept a stream of fulltexts."""
The docstring ought to mention the noeols aspect of this test.
> + def test_insert_record_stream_annotated_knits_noeol(self):
> + """Any file should accept a stream from plain knits."""
Ditto.
Also, the docstring says “plain” but the test name says “annotated”.
> + def test_insert_record_stream_out_of_order(self):
> + """An out of order stream can either error or work."""
That docstring doesn't quite say anything useful :)
“Inserting an out of order stream will either raise RevisionNotPresent (without
corrupting the file) or work.”
(Also, be definite: say “will” rather than “can” in test descriptions.)
> +class TestContentFactoryAdaption(TestCaseWithMemoryTransport):
> +
> + def helpGetBytes(self, f, ft_adapter, delta_adapter):
> + """grab the interested adapted texts for tests."""
Capital letter at the start of a sentence. Also, what are “interested adapted
texts”?
> === modified file 'bzrlib/versionedfile.py'
[...]
> +class RecordingVersionedFileDecorator(object):
> + """A minimal versioned file that records calls made on it.
> +
> + Only enough methods have been added to support tests using it to date.
(In my ideal world, this sort of class would live under bzrlib.tests somewhere,
rather than in the production code. That's not the policy atm though, so don't
worry about it.)
-Andrew.