[MERGE, 165304] Allow insert_data_stream to insert differently annotated stream.

Fri Nov 30 07:50:41 GMT 2007

At Martin's request, I'm working on satisfying the review comments for this
branch.  I've attached a diff of my work against Robert's merge directive, and
pushed the equivalent branch to
http://people.ubuntu.com/~andrew/bzr/knit.datastreamjoin

John Arbash Meinel wrote:
> John Arbash Meinel has voted tweak.
> Status is now: Conditionally approved
> Comment:
> +    def get_ancestry(self, versions, topo_sorted):
[...]
>
> ^- you are building up a dictionary just to get a set. I think it would be 
> better do just do:
> ancestry = set()
> while pending:
>   version = pending.pop()
>   ...
>   pending.update([p for p in parents if not in ancestry])
>   ancestry.add(version)
> return list(ancestry)
>
> The only reason to build up a dict is if you wanted to do topo_sorted=True.

I agree.  I've done this.

> ...
>
> +            # Strictly speaking this should checkin in the backing knit, 
> but
> +            # until we have a test to discriminate, this will do.
>
> "check in the backing knit"

Fixed.

>
> However, _StreamIndex doesn't have access to the backing knit...

That's true at the moment.  But I don't think that invalidates the comment; if
we decide to do this, then we refactor the code as necessarily to make it
possible.  So I think the comment is fine as is.

> ...
>
> You are testing "add_lines_with_ghosts"
> +    def test_newline_empty_lines(self):
> +        # ensure that ["\n"] round trips ok.
> +        knit = self.make_test_knit()
> +        knit.add_lines('a', [], ["\n"])
> +        knit.add_lines_with_ghosts('b', [], ["\n"])
> +        self.assertEqual(["\n"], knit.get_lines('a'))
> +        self.assertEqual(["\n"], knit.get_lines('b'))
> +        self.assertEqual(['fulltext'], knit._index.get_options('a'))
> +        self.assertEqual(['fulltext'], knit._index.get_options('b'))
> +        knit.add_lines('c', ['a'], ["\n"])
> +        knit.add_lines_with_ghosts('d', ['b'], ["\n"])
> +        self.assertEqual(["\n"], knit.get_lines('c'))
> +        self.assertEqual(["\n"], knit.get_lines('d'))
> +        self.assertEqual(['line-delta'], knit._index.get_options('c'))
> +        self.assertEqual(['line-delta'], knit._index.get_options('d'))
>
> but none of those parents are actually ghosts...

I believe the intention here was to make sure that add_lines_with_ghosts round
trips ["\n"] the same way as add_lines.  Obviously the line-delta records can't
have ghost parents, but Robert explicitly tests the line-delta case for both
add_lines and add_lines_with_ghosts.  I suppose the 'b' version could be given a
ghost parent, but I think that's tangential to the purpose of this test.

It would be nice if the purpose was unambiguous though...

> ...
> +    def test_get_ancestry(self):
> +        knit = self.make_knit_with_4_versions_2_dags()
> +        self.assertIndexAncestry(knit, ['a'], ['a'], ['a'])
> +        self.assertIndexAncestry(knit, ['b'], ['b'], ['b'])
> +        self.assertIndexAncestry(knit, ['c'], ['c'], ['c'])
> +        self.assertIndexAncestry(knit, ['c'], ['a', 'b', 'c'],
> +            {'a':[], 'b':[], 'c':[]}.keys())
> +        self.assertIndexAncestry(knit, ['c', 'd'], ['a', 'b', 'c', 'd'],
> +            {'a':[], 'b':[], 'c':[], 'd':[]}.keys())
>
> ^- It doesn't really seem right to use a dict to get a list of keys. I know 
> we are somewhat concerned with the sort order, but why not just compare the 
> set of the return value.

Yeah, I can't see any reason not to use set(['a', 'b', 'c']) here instead of
{...}.keys().  Changed.

> you have:
> +    def assertGetPosition(self, knit, versions, version, result):
> +        index = self.get_index(knit, knit.get_data_stream(versions))
> +        self.assertEqual(result, index.get_position(version))
> +
> +    def assertGetPosition(self, knit, versions, version, result):
> +        index = self.get_index(knit, knit.get_data_stream(versions))
> +        if result[0] is None:
> +            result = (index, result[1], result[2])
> +        self.assertEqual(result, index.get_position(version))
> +
>
> That is the function defined 2 times, right?

Right.  I've deleted the first definition, which is clearly unused :)

> ...
> +    def test_get_parents_with_ghosts(self):
> +        knit = self.make_knit_with_4_versions_2_dags()
> +        self.assertGetParentsWithGhosts(knit, ['a'], 'a', [])
> +        self.assertGetParentsWithGhosts(knit, ['c'], 'c', ['b', 'a'])
> +
>
> ^- this doesn't check anything with a node that actually has ghosts ('d', 
> for example)

I've added:

    self.assertGetParentsWithGhosts(knit, ['d'], 'd', ['e', 'f'])

>
> +        try:
> +            start, end = self._by_version[version_id][1]
> +            return self, start, end
> +        except KeyError:
> +            # Signal to the access object to handle this from the backing 
> knit.
> +            return ('thunk:%s' % version_id, None, None)
>
> ^- this assumes that we will never have a user named "thunk:" While I think 
> that is likely, it will break.
> You really need a 'here:%s' % version_id
> versus a 'nothere:%s' % version_id
>
> Otherwise you are assuming your version_ids will never have "thunk:" to 
> start with.
>
> On the flip side, you could do:
> '%s:thunk:' % version_id
>
> because we *have* asserted that revision-ids ending in ':' are reserved. 
> (we might even check for this in a couple places.) 'null:', and 'current:' 
> are the obvious ones.
>
> A small correction is that you are using the fact that the object is a 
> string (not self.stream_index) as the distinguishing factor. So why use 
> 'thunk:' in the first place? (I'm not a huge fan of obj is obj2 as an if 
> statement, but it is what you are using.)

I agree.  I don't think we need to try squeeze this information into the
version_id; we can simply add a new field to the memo to explicitly flag if
thunking is needed, and avoid munging strings at all.  The structure of the memo
is internal to a particular Index/Access pair after all, so there's no reason
not to vary the structure if it would be helpful.

So now the memo items for _StreamAccess/Index are: (thunk_flag, index, pos,
end).

>
> +            if (index[:6] != 'thunk:' or
> +                self.orig_factory.__class__ != KnitPlainFactory):
> ^- index.startswith('thunk:')
> is usually the preferred way to do that (you don't have to bother counting 
> the length of 'thunk:' or forget to update it if you change it).

This is now irrelevant :)

> +        :param memos_for_retrieval: An iterable containing the (index, 
> pos,
> +            length) memo for retrieving the bytes. The .knit method 
> ignores
> +            the index as there is always only a single file.
>
> ^- This should be updated, as you don't ignore the index. You could 
> document the 'thunk:' versus obj is self._stream_index here.

Updated.

> ...
>
> This seems overly restrictive:
> +    def test_asking_for_thunk_stream_is_not_plain_errors(self):
> +        knit = self.make_test_knit(name='annotated', annotate=True)
> +        knit.add_lines("A", [], ["Foo\n"])
> +        index, access = self.get_index_access(knit,
> +            knit.get_data_stream([]))
> +        self.assertRaises(errors.KnitCorrupt,
> +            list, access.get_raw_records([("thunk:A", None, None)]))
>
> You could just strip the annotations off, couldn't you?
>
> That said, if you don't need the functionality now, a simple comment that 
> it could be done would suffice.
>
> It seems like you should actually be testing that trying to create a data 
> stream on an annotated source fails at the beginning. Not at the thunk 
> layer. Oh... shouldn't the data_stream have the same annotation data as the 
>  knit you are thunking into? (either both are true, or both are false)
>
> I don't really see why you need to thunk for .join() Since it shouldn't be 
> copying nodes it already has.

I don't feel confident answering this part on Robert's behalf.  It does sound
like something we can comfortably address post-merge though.

Btw, I've also addressed the review comments from my review.  So I think with my
patch this is good to merge.

-Andrew.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: knit.datastreamjoin-update.diff
Type: text/x-diff
Size: 9878 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20071130/01b5193f/attachment-0001.bin