[MERGE] Eliminate the use of VersionedFile.join when fetching data.

Fri May 2 06:44:06 BST 2008

On Fri, 2008-05-02 at 14:57 +1000, Andrew Bennetts wrote:
> Robert Collins wrote:
> > This patch eliminates all use of join() in bzrlib's fetch routines.
> > 
> > It does this by introducing a new streaming api - the previous api had a
> > couple of flaws that made it unsuitable to use as a direct replacement
> > for join (amongst them the fact that it internally used join() :)).
> > 
> > This is the penultimate step in preparing for the introduction of a
> > VersionedFiles (note the plural - a better name would be welcome) which
> > will provide less artificial partitioning of the datastore. The final
> > step is removing the methods deprecated before 1.5, to make the
> > transition easier. While we don't _have_ to remove them I think it will
> > make the creation of the new interface much clearer.
> 
> bb:tweak
> 
> Comments inline.
> 
> (I haven't looked at your updated patch yet; I'll send an update after checking
> the interdiff.)

Thanks.

> > === modified file 'bzrlib/fetch.py'
> > +        # Fetch all the texts.
> > +        to_weave.insert_record_stream(from_weave.get_record_stream(required_versions,
> > +            'topological', False))
> 
> This ought to be wrapped to be < 80 columns.

Will do.

> > === modified file 'bzrlib/knit.py'
> > +        if build_details[0] == 'line-delta':
> > +            kind = 'delta'
> > +        else:
> > +            kind = 'ft'
> > +        if annotated:
> > +            annotated_kind = 'annotated-'
> > +        else:
> > +            annotated_kind = ''
> > +        self.storage_kind = 'knit-%s%s-gz' % (annotated_kind, kind)
> 
> I wonder a little if a tuple would be better than a str for storage_kind.

Perhaps, but it would be a little less opaque and I think that would be
unhelpful.

> > +    def get_record_stream(self, versions, ordering, include_delta_closure):
> > +        """Get a stream of records for versions.
> > +
> > +        :param versions: The versions to include. Each version is a tuple
> > +            (version,).
> 
> I'm curious about why this is a sequence of one element tuples, rather than just
> of strings.

Compatability with VersionedFiles which is coming up, which uses tuple
keys everywhere. VersionedFile only has one element in its keyspace,
which is why it used strings.

> > +        :param ordering: Either 'unordered' or 'topological'. A topologically
> > +            sorted stream has compression parents strictly before their
> > +            children.
> > +        :param include_delta_closure: If True then the closure across any
> > +            compression parents will be included (in the opaque data).
> > +        :return: An iterator of ContentFactory objects, each of which is only
> > +            valid until the iterator is advanced.
> > +        """
> 
> You use the term “opaque data” in this docstring without really explaining what
> you mean by it.

Hmm. What I mean is that whatever magic is needed is not shown in the
public api.

> > +        # Double index lookups here : need a unified api ?
> > +        parent_map = self.get_parent_map(versions)
> 
> Where?  Do you mean an index lookup has already happened by the time
> get_record_stream is called, and now get_record_stream is doing another one?

No. get_parent_map and the following lines both end up querying the
index.

> > +        position_map = self._get_components_positions(present_versions)
> > +        # c = component_id, r = record_details, i_m = index_memo, n = next
> 
> Which code this comment meant to be a legend for?

I don't know - you've stripped too much contents as far as I can tell.

> > +        records = [(version, position_map[version][1]) for version in
> > +            present_versions]
> > +        record_map = {}
> > +        for version in absent_versions:
> > +            yield AbsentContentFactory((version,))
> 
> Why not yield the absent_versions as soon as you calculate them?

Do you mean just move this block higher up? we could, I don't think it
makes any difference.

> > +        if self.factory.annotated:
> > +            # self is annotated, we need annotated knits to use directly.
> > +            annotated = "annotated-"
> > +            convertibles = []
> > +        else:
> > +            # self is not annotated, but we can strip annotations cheaply.
> > +            annotated = ""
> > +            convertibles = set(["knit-annotated-delta-gz",
> > +                "knit-annotated-ft-gz"])
> > +        native_types = set()
> > +        native_types.add("knit-%sdelta-gz" % annotated)
> > +        native_types.add("knit-%sft-gz" % annotated)
> > +        knit_types = native_types.union(convertibles)
> 
> It might be good to extract this section into its own method.
> insert_record_stream is quite long.
> 
> What exactly is the distinction between a “native” type and a “knit” type?
> Native seems obvious enough: types that this knit vf uses itself (and can thus
> insert directly from a stream (assuming basis parents are present for a delta)),
> but “knit” types seem a bit fuzzier.  They are types that are cheaply
> translateable to native types, I guess?  “knit_types” doesn't seem like a great
> name for that concept.

they are the entire set of knit types. when the encoding matches that of
self then its a native type, when it doesn't then as long as we can
translate the record with no additional data its able to be optimised.
Improved name suggestions are welcome.

> > +        # Buffered index entries that we can't add immediately because their
> > +        # basis parent is missing. We don't buffer all because generating
> 
> That sentence no verb.  ;)
> 
> Maybe it should read “buffered_index_entries holds entries that...”?
> 
> (Short comments like “foo = {}  # widgets to frob” often make perfectly good
> sense, but an incomplete sentence in a comment that is otherwise a full
> paragraph breaks my mental parser.)

"buffer all index entries" is the only change needed there. I don't see
an incomplete sentence.

> > +        # annotations may require access to some of the new records. However we
> > +        # can't generate annotations from new deltas until their basis parent
> > +        # is present anyway, so we get away with not needing an index that
> > +        # reports on the new keys.
> 
> I don't quite this last sentence.  It's not immediately obvious to me why we
> might have needed “an index that reports on the new keys”, or what that means.

Because we may need to generate fulltexts using inserted data, and that
means it needs to show up in the index for the current knit.

> > +        # key = basis_parent, value = index entry to add
> > +        buffered_index_entries = {}
> > +        for record in stream:
> > +            # Raise an error when a record is missing.
> > +            if record.storage_kind == 'absent':
> > +                raise RevisionNotPresent([record.key[0]], self)
> > +            # adapt to non-tuple interface
> > +            parents = [parent[0] for parent in record.parents]
> > +            if record.storage_kind in knit_types:
> 
> Hmm, you're using “kind” and “type” interchangeably.  It'd be nice if we could
> be consistent.
> 
> Some brief comments for the various cases here would be helpful, e.g. “# This
> record's type is the same as this knit's type, so we can use add_raw_records”,
> “# This record is a fulltext, so we can use add_lines.”, “# This knit can't
> trivially handle this record type, and the record is not a fulltext.  So all we
> can do is try to adapt this record to a fulltext.”
> 
> I suggest those because I spent a fair while trying to understand what the
> significant differences were between the cases — i.e. why it is more complicated
> than just:

Thanks, I'll add them in.

>     # psuedo-code
>     adapter = get_best_adapter(record.storage_kind, native_types)
>     compatible_record = adapter.adapt(record)
>     self.insert_record(compatible_record)
> 
> Which is what I had been expecting.

In a .pack repository this will work fine (it will throw on an unordered
stream where we need fulltexts during insertion, but packs are not
annotated so don't need that). But this code is currently used for knits
and packs. In VersionedFiles I will likely have a good home to split it
out into.

> > +            elif record.storage_kind == 'fulltext':
> > +                self.add_lines(record.key[0], parents,
> > +                    split_lines(record.get_bytes_as('fulltext')))
> > +            else:
> > +                adapter_key = record.storage_kind, 'fulltext'
> > +                adapter = get_adapter(adapter_key)
> > +                lines = split_lines(adapter.get_bytes(
> > +                    record, record.get_bytes_as(record.storage_kind)))
> > +                try:
> > +                    self.add_lines(record.key[0], parents, lines)
> > +                except errors.RevisionAlreadyPresent:
> > +                    pass
> 
> I'm don't really understand why the record-is-full-text and
> record-is-adaptable-to-fulltext cases are slightly different to each other.

Going through the adapter is slightly different. The try:except: block
probably needs to be present in both sides, which suggests room for
improving testing, and I have been thinking that get_bytes_as should be
'get_bytes' with no parameters. I'm not sure how to avoid abstraction
violations for network streams at that point though :(.

> > +    def capture_stream(self, f, entries, on_seen, parents):
> > +        """Capture a stream for testing."""
> 
> What is the “f” param?  The others are somewhat guessable, but one-letter
> parameters are just too terse.

yah, copy n paste code propogating - its a versioned file.

> > +        for factory in entries:
> > +            on_seen(factory.key)
> > +            self.assertValidStorageKind(factory.storage_kind)
> > +            self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> > +            self.assertEqual(parents[factory.key[0]], factory.parents)
> > +            self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> > +                str)
> 
> Hmm, this doesn't just capture a stream, it also does a bunch of assertions.
> That's ok, but I feel the docstring ought to mention that.  Probably the name
> should reflect that too, otherwise reading the implementation of e.g.
> test_get_record_stream_interface it's not obvious where “each item ... has to
> provide a regular interface” (as promised by the docstring) is being verified.

good point.

> > +    def test_get_record_stream_interface(self):
> > +        """each item in a stream has to provide a regular interface."""
> 
> Just a nitpick: sentences ought to start with a captial letter.

ITYM Capital :).

> > +        f, parents = get_diamond_vf(self.get_file())
> > +        entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> > +            'unordered', False)
> 
> Why use this particular subset of versions?  (ObXUnitPatterns: Mystery Guest)

I will comment.

> > +    def test_get_record_stream_interface_ordered(self):
> > +        """each item in a stream has to provide a regular interface."""
> 
> Ideally I think different test methods would have different docstrings...

These test methods I'd rather not docstring at all; the name is IMNSHO
sufficient, but we have a coding standard to docstring them all. My
ingenuity ran out.

> > +    def test_get_record_stream_interface_ordered_with_delta_closure(self):
> > +        """each item in a stream has to provide a regular interface."""
> > +        f, parents = get_diamond_vf(self.get_file())
> > +        entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> > +            'topological', True)
> > +        seen = []
> > +        for factory in entries:
> > +            seen.append(factory.key)
> > +            self.assertValidStorageKind(factory.storage_kind)
> > +            self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> > +            self.assertEqual(parents[factory.key[0]], factory.parents)
> > +            self.assertEqual(f.get_text(factory.key[0]),
> > +                factory.get_bytes_as('fulltext'))
> > +            self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> > +                str)
> 
> You appear to have a half-done refactoring here.  Re-use capture_stream, maybe
> by adding a parameter to for an function to run on the factory.  Is the extra
> assertion this does vs. capture_stream significant?

Very. There is no guarantee that get_bytes_as('fulltext') will work
without passing True to the get_record_stream call.

> Also, you don't appear to have any tests anywhere that delta_closure=True gives
> different results to False.

Thats because its allowable for it to not give different results at this
point. When someone (you :)) adds serialised streams, I would expect a
parameterised version of this test that gets the versioned file to test
from a smart server, and in that case adding a specific test for it that
delta_closure=False causes fulltexts to be inaccessible.

> > +    def test_get_record_stream_unknown_storage_kind_raises(self):
> > +        """Asking for a storage kind that the stream cannot supply raises."""
> > +        f, parents = get_diamond_vf(self.get_file())
> > +        entries = f.get_record_stream(['merged', 'left', 'right', 'base'],
> > +            'unordered', False)
> > +        # We track the contents because we should be able to try, fail a
> > +        # particular kind and then ask for one that works and continue.
> > +        seen = set()
> > +        for factory in entries:
> > +            seen.add(factory.key)
> > +            self.assertValidStorageKind(factory.storage_kind)
> > +            self.assertEqual(f.get_sha1s([factory.key[0]])[0], factory.sha1)
> > +            self.assertEqual(parents[factory.key[0]], factory.parents)
> > +            # currently no stream emits mpdiff
> > +            self.assertRaises(errors.UnavailableRepresentation,
> > +                factory.get_bytes_as, 'mpdiff')
> > +            self.assertIsInstance(factory.get_bytes_as(factory.storage_kind),
> > +                str)
> > +        self.assertEqual(set([('base',), ('left',), ('right',), ('merged',)]),
> > +            seen)
> 
> Definitely extend capture_stream to make this neater.  The key assertion of this
> test (the assertRaises) is buried by boilerplate.

Review-Meta: When suggesting something that is likely 'obvious' please
sketch out 'how' to avoid round trips. 'Write better prose' << 'Your
paragraph will be more clear if you remove the double negatives and
start with the current closing sentence.'

> > +    def test_get_record_stream_missing_records_are_absent(self):
> > +        f, parents = get_diamond_vf(self.get_file())
> > +        entries = f.get_record_stream(['merged', 'left', 'right', 'or', 'base'],
> 
> Shouldn't it be “origin”, not “or”?  Oh, no, that's intentional.  Please use a
> more obvious value, like “absent-key”.
> 
> Also, I don't understand why you need to use all of merged, left, right and base
> here.

To guard against stupid code. Such as only returning the absent key.

> > +    def test_insert_record_stream_fulltexts_noeol(self):
> > +        """Any file should accept a stream of fulltexts."""
> 
> The docstring ought to mention the noeols aspect of this test.

Yes :(. I reiterate my objection to redundant docstrings in tests. 

> > +    def test_insert_record_stream_annotated_knits_noeol(self):
> > +        """Any file should accept a stream from plain knits."""
> 
> Ditto.
> 
> Also, the docstring says “plain” but the test name says “annotated”.

Thrice!

> > +class TestContentFactoryAdaption(TestCaseWithMemoryTransport):
> > +
> 
> > +    def helpGetBytes(self, f, ft_adapter, delta_adapter):
> > +        """grab the interested adapted texts for tests."""
> 
> Capital letter at the start of a sentence.  Also, what are “interested adapted
> texts”?

interesting ;).

> > === modified file 'bzrlib/versionedfile.py'
> [...]
> > +class RecordingVersionedFileDecorator(object):
> > +    """A minimal versioned file that records calls made on it.
> > +    
> > +    Only enough methods have been added to support tests using it to date.
> 
> (In my ideal world, this sort of class would live under bzrlib.tests somewhere,
> rather than in the production code.  That's not the policy atm though, so don't
> worry about it.)

I think it would be a mistake to put code like this in bzrlib.tests.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080502/65679b4e/attachment.pgp