[MERGE] Support delta_closure=True with NetworkRecordStream to transmit deltas over the wire when full text extraction is required on the far end.

Andrew Bennetts andrew.bennetts at canonical.com
Fri Feb 20 05:40:49 GMT 2009


I have a bunch of tweaks and no fundamental objections, so I guess this is:
bb:tweak

I am a bit nervous about committing a new encoding without explicit tests
for that encoding.

Robert Collins wrote:
> This patch is conceptually simple: allow serialisation of
> get_record_stream to bytes and back.
> 
> Complicating this a little is the fact that some clients of
> get_record_stream want to use it to insert_record_stream, and some want
> to get the texts out.
> 
> So this tightens up the behaviour a little - its not really documented
> clearly yet outside of the tests, so I'm expecting to roundtrip this
> review with other review feedback to improve on that.
> 
> The tightened up behaviour is that 
>  - record.get_bytes_as('chunked') or 'fulltext' returns what it does
> today: the content the user requested.

(Although today you sometimes need an adapter; e.g. knit-ft-gz records for
some reason can't figure out how to give 'fulltext' on their own.)

>  - record.get_bytes_as(record.storage_kind) now always returns a network
> suitable bytestring.

Hooray!
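
As a sketch of what that guarantee buys a consumer (get_record_stream and
get_bytes_as are the real APIs; 'send' is a hypothetical stand-in for the
smart protocol write):

    def forward_records(vf, keys, send):
        """Forward native records over the wire without re-encoding."""
        for record in vf.get_record_stream(keys, 'unordered', False):
            # Guaranteed to be a network-suitable bytestring now:
            payload = record.get_bytes_as(record.storage_kind)
            send(record.storage_kind, record.key, payload)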

>  - its now explicit (in that we do it :)) that you cannot filter records
>    out of a networkstream when you are transmitting the
>    record.get_bytes_as(record.storage_kind) bytes over the network,
>    because a stream generator may group many records together for
>    efficiency/logical coherency on the wire. A trivial example would be
>    group compress, where one bytestring encodes many texts. The use made
>    of this approach in this patch is to encode the record_map that 
>    get_content_maps needs on the wire. This works great, but because
>    we are encoding many different texts into the stream adding bytes for
>    each call to record.get_bytes_as doesn't fit that well.
> 
> It may be a good idea to map this up in get_record_stream - by having a
> stream-of-streams approach, but I think its ok as-is.

I think it's ok too.

> The good thing about this is that infinite-buffering isn't needed on the
> server - the server buffers one raw_record_map, which is exactly what
> we'd buffer on the client. It also means we don't have two
> implementations of 'how much to buffer' and 'how to schedule IO' -
> improvements we make on the client in terms of scheduling text
> reconstruction should immediately benefit servers. Note that the encoded
> record map *is* incrementally parsable, so if we get to the point of
> having a streaming window and discarding things, we can hook that up
> server side without a wire format change.
> 
> This is defining a wire protocol, so possibly we want explicit tests
> about the encoding, though for now I'm comfortable with 'do not
> change' :P.

Hmm.  Are we comfortable with this as the final encoding?

The lack of explicit tests for the encoding definitely makes me nervous.
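
For instance, once the layout in the hunk below is settled, a test could
pin the exact bytes.  A sketch only -- the field values are made up, and
the real test would compare against record.get_bytes_as(record.storage_kind):

    def test_network_header_layout(self):
        # Pin the \n / \x00 / \t layout so the wire format cannot
        # drift silently.
        key = ('file-id', 'rev-id')
        parents = (('file-id', 'p1'), ('file-id', 'p2'))
        built = '%s\n%s\n%s\n%s%s' % (
            'knit-ft-gz',
            '\x00'.join(key),
            '\t'.join('\x00'.join(k) for k in parents),
            ' ',  # the noeol flag: 'N' or ' '
            'raw-record-bytes')
        self.assertEqual(
            'knit-ft-gz\n'
            'file-id\x00rev-id\n'
            'file-id\x00p1\tfile-id\x00p2\n'
            ' raw-record-bytes',
            built)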

> === modified file 'bzrlib/knit.py'
[...]
> +    def _create_network_bytes(self):
> +        """Create a fully serialised network version for transmission."""
> +        # storage_kind, key, parents, Noeol, raw_record
> +        key_bytes = '\x00'.join(self.key)
> +        if self.parents is None:
> +            parent_bytes = 'None:'
> +        else:
> +            parent_bytes = '\t'.join('\x00'.join(key) for key in self.parents)
> +        if self._build_details[1]:
> +            noeol = 'N'
> +        else:
> +            noeol = ' '
> +        network_bytes = "%s\n%s\n%s\n%s%s" % (self.storage_kind, key_bytes,
> +            parent_bytes, noeol, self._raw_record)
> +        self._network_bytes = network_bytes

So \n delimited fields, keys are \0 delimited internally, and the parent
keys are delimited by \t?

That seems nice and compact, although it might be simpler to just throw the
whole struct into bencode.  I don't mind much, although it'd be nice to have
fewer ad hoc serialisers and deserialisers...
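
To illustrate (a rough sketch, not a request for this exact code; bencode
has no None or tuple types, so both need explicit handling):

    from bzrlib.util import bencode

    def encode_header(storage_kind, key, parents, noeol):
        if parents is None:
            parent_list = 'None'  # sentinel: bencode cannot encode None
        else:
            parent_list = [list(p) for p in parents]
        return bencode.bencode(
            [storage_kind, list(key), parent_list, int(noeol)])

    def decode_header(header_bytes):
        storage_kind, key, parent_list, noeol = bencode.bdecode(header_bytes)
        if parent_list == 'None':
            parents = None
        else:
            parents = tuple(tuple(p) for p in parent_list)
        return storage_kind, tuple(key), parents, bool(noeol)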

> +class LazyKnitContentFactory(ContentFactory):
[...]
> +        self.storage_kind = "knit-delta-closure"
> +        if not first:
> +            self.storage_kind = self.storage_kind + "-ref"

Feel free to use the += operator ;)

> +def knit_delta_closure_to_records(storage_kind, bytes, line_end):
> +    """Convert a network record to a iterator over stream records.
> +
> +    :param storage_kind: The storage kind of the record.
> +        Must be 'knit-delta-closure'.
> +    :param bytes: The bytes of the record on the network.
> +    """
> +    generator = _NetworkContentMapGenerator(bytes, line_end)
> +    return generator.get_record_stream()

Is it worth an explicit assertion here that storage_kind is what you expect?
I guess it'll blow up anyway if it's not, so perhaps it doesn't matter.
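
(If you do add one, something this small would do -- illustrative only:)

    if storage_kind != 'knit-delta-closure':
        raise AssertionError(
            'expected knit-delta-closure, got %r' % (storage_kind,))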

> +def knit_network_to_record(storage_kind, bytes, line_end):
> +    """Convert a network record to a record object.
> +
> +    :param storage_kind: The storage kind of the record.
> +    :param bytes: The bytes of the record on the network.
> +    """
> +    start = line_end
> +    line_end = bytes.find('\n', start)
> +    key = tuple(bytes[start:line_end].split('\x00'))
> +    start = line_end + 1
> +    line_end = bytes.find('\n', start)
> +    parent_line = bytes[start:line_end]

The obvious way to get N lines from a string would be bytes.split('\n', N).
ISTR you talking about how you were writing some code that was
careful in how it was slicing strings to avoid unnecessary copies... was
that this code?  I can see how using str.split would end up copying the
raw_record segment of bytes more times than necessary.  If this does matter,
add a comment so someone doesn't accidentally "fix" this code to use
str.split.  :)
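
For anyone reading along, the pattern looks like this (toy sketch; the
point is that only the small header fields are ever copied):

    payload = 'knit-ft-gz\nfile-id\x00rev-id\n' + 'x' * 1000000
    end1 = payload.find('\n')
    end2 = payload.find('\n', end1 + 1)
    storage_kind = payload[:end1]       # small copy
    key_line = payload[end1 + 1:end2]   # small copy
    tail_start = end2 + 1               # just an offset; no copy yet
    # By contrast, payload.split('\n', 2) would copy the ~1MB tail
    # immediately as its third element.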

> +    if parent_line == 'None:':
> +        parents = None
> +    else:
> +        parents = tuple(
> +            [tuple(segment.split('\x00')) for segment in parent_line.split('\t')
> +             if segment])
> +    start = line_end + 1
> +    noeol = bytes[start] == 'N'
> +    if 'ft' in storage_kind:
> +        method = 'fulltext'
> +    else:
> +        method = 'line-delta'
> +    build_details = (method, noeol)
> +    start = start + 1

You don't like +=, do you?  :)

> +    raw_record = bytes[start:]

If you really want to avoid copies, you could use buffer(bytes, start)...
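
(In Python 2, that is -- something like this, with the copy deferred to
the point of use:)

    data = 'header\n' + 'x' * 1000000
    start = data.find('\n') + 1
    tail_view = buffer(data, start)         # zero-copy view of the tail
    assert str(tail_view) == data[start:]   # str() copies only here
    # Caveat: the view keeps all of `data` alive as long as it exists.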

> +    def _get_record_map_unparsed(self, keys, allow_missing=False):
[...]
> -        # anyway. Also, can the build chains change as part of a pack
> +        # anyway. Also, can the build chains change as art of a pack

I think you want to revert that typo.

> +class _ContentMapGenerator(object):
> +    """Generate texts or expose raw deltas for a set of texts."""

It's funny that this class has no __init__ even though it has methods that
assume there are various instance attributes.  Is it an abstract base class?

> +    def _get_content(self, key):
> +        """Get the content object for key."""
> +        if key in self.nonlocal_keys:
> +            record = self.get_record_stream().next()
> +            # Create a content object on the fly
> +            lines = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
> +            return PlainKnitContent(lines, record.key)

Just grab the first record from the stream?  That looks a bit funny!  It's
not clear to me how that PlainKnitContent will be the right thing for the
requested key.
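
I'd have expected something more like (hypothetical; adjust to however the
generator actually orders its records):

    for record in self.get_record_stream():
        if record.key == key:
            lines = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
            return PlainKnitContent(lines, key)
    raise errors.RevisionNotPresent(key, self)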

[...]
> +    def _get_one_work(self, requested_key):
[...]
> +        # To simply things, parse everything at once - code that wants one text

s/simply/simplify/

> +    def _wire_bytes(self):
> +        """Get the bytes to put on the wire for 'key'.
[...]

Yet another mini serialiser :/

Maybe the thing that bugs me most is that it's difficult to parse strings
really efficiently in Python...

> +class _NetworkContentMapGenerator(_ContentMapGenerator):
[...]
> +        # Get access to record parsing facilities
> +        self.vf = KnitVersionedFiles(None, None)
> +        start = line_end
> +        # Annotated or not
> +        line_end = bytes.find('\n', start)
> +        line = bytes[start:line_end]
[...58 lines...]
> +            self._raw_record_map[key] = (record_bytes, (method, noeol), next)

Gosh it would be nice not to need ~60 lines of fiddly code to do that.  Oh well.

> === modified file 'bzrlib/tests/test_versionedfile.py'
[...]
> +    def assertRecordHasContent(self, record, bytes):
> +        """Assert that record has the bytes bytes."""

(Incidentally, thinking of the recent mailing list thread, I sometimes write
that sort of docstring as “...has the given bytes.”, which I think
communicates the same thing without confusing word-doubling or funny
`markup`.)

> +        self.assertEqual(bytes, record.get_bytes_as('fulltext'))
> +        self.assertEqual(bytes, ''.join(record.get_bytes_as('fulltext')))

That second assertion is a bit weird.  Did you mean get_bytes_as('chunked')?
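
i.e. presumably:

    self.assertEqual(bytes, ''.join(record.get_bytes_as('chunked')))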

> +    def test_get_record_stream_native_formats_are_wire_ready_ft_delta(self):
[...31 lines of test method...]
> +    def test_get_record_stream_native_formats_are_wire_ready_delta(self):

Ouch.  test_get_record_stream_native_formats_are_wire_ready_delta is almost
exactly the same as
test_get_record_stream_native_formats_are_wire_ready_ft_delta, and so is
test_get_record_stream_wire_ready_delta_closure_included.  I think
refactoring away some more duplication would help reduce the
eyes-glazing-over factor!
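
e.g. a shared helper along these lines (names hypothetical) would let each
of those tests shrink to a few lines:

    def assertStreamIsWireReady(self, files, keys, expected_kind):
        # Every record must expose its native form as a plain
        # bytestring, ready to put on the wire.
        for record in files.get_record_stream(keys, 'unordered', False):
            self.assertEqual(expected_kind, record.storage_kind)
            wire_bytes = record.get_bytes_as(record.storage_kind)
            self.assertTrue(isinstance(wire_bytes, str))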

> === modified file 'bzrlib/versionedfile.py'
[...]
> +from bzrlib.util import bencode

I don't think you need this import; you don't appear to use bencode anywhere
else in this diff.

-Andrew.