[MERGE] Support delta_closure=True with NetworkRecordStream to transmit deltas over the wire when full text extraction is required on the far end.

Robert Collins robert.collins at canonical.com
Fri Feb 20 05:59:39 GMT 2009


On Fri, 2009-02-20 at 16:40 +1100, Andrew Bennetts wrote:
> I have a bunch of tweaks and no fundamental objections, so I guess this is:
> bb:tweak
> 
> I am a bit nervous about committing a new encoding without explicit tests
> for that encoding.

I am too, but I could go either way.

> (Although today you sometimes need an adapter; e.g. knit-ft-gz records for
> some reason can't figure out how to give 'fulltext' on their own.)

I think that is something we could tweak.

> >  - record.get_bytes_as(record.storage_kind) now always returns a network
> > suitable bytestring.
> 
> Hooray!

> > This is defining a wire protocol, so possibly we want explicit tests
> > about the encoding, though for now I'm comfortable with 'do not
> > change' :P.
> 
> Hmm.  Are we comfortable with this as the final encoding?

Likely not. I wanted some stuff from your loom which I didn't have, so I
just invented-as-I-went, in particular using a binary length would
probably make a lot of sense.

> The lack of explicit tests for the encoding definitely makes me nervous.

It would make me more nervous if the encode and decode routines were not
separate (which makes it hard to skew accidentally).

> > === modified file 'bzrlib/knit.py'
> [...]
> > +    def _create_network_bytes(self):
> > +        """Create a fully serialised network version for transmission."""
> > +        # storage_kind, key, parents, Noeol, raw_record
> > +        key_bytes = '\x00'.join(self.key)
> > +        if self.parents is None:
> > +            parent_bytes = 'None:'
> > +        else:
> > +            parent_bytes = '\t'.join('\x00'.join(key) for key in self.parents)
> > +        if self._build_details[1]:
> > +            noeol = 'N'
> > +        else:
> > +            noeol = ' '
> > +        network_bytes = "%s\n%s\n%s\n%s%s" % (self.storage_kind, key_bytes,
> > +            parent_bytes, noeol, self._raw_record)
> > +        self._network_bytes = network_bytes
> 
> So \n delimited fields, keys are \0 delimited internally, and the parent
> keys are delimited by \t?

Yes.

> That seems nice and compact, although it might be simpler to just throw the
> whole struct into bencode.  I don't mind much, although it'd be nice to have
> less ad hoc serialisers and deserialisers...

The one above is used in btree index and graph index.

> > +def knit_delta_closure_to_records(storage_kind, bytes, line_end):
> > +    """Convert a network record to a iterator over stream records.
> > +
> > +    :param storage_kind: The storage kind of the record.
> > +        Must be 'knit-delta-closure'.
> > +    :param bytes: The bytes of the record on the network.
> > +    """
> > +    generator = _NetworkContentMapGenerator(bytes, line_end)
> > +    return generator.get_record_stream()
> 
> Is it worth an explicit assertion here that storage_kind is what you expect?
> I guess it'll probably blow up if it's not, so it probably doesn't matter.

That was my thought.

> > +def knit_network_to_record(storage_kind, bytes, line_end):
> > +    """Convert a network record to a record object.
> > +
> > +    :param storage_kind: The storage kind of the record.
> > +    :param bytes: The bytes of the record on the network.
> > +    """
> > +    start = line_end
> > +    line_end = bytes.find('\n', start)
> > +    key = tuple(bytes[start:line_end].split('\x00'))
> > +    start = line_end + 1
> > +    line_end = bytes.find('\n', start)
> > +    parent_line = bytes[start:line_end]
> 
> The obvious way to get N lines from a string would be bytes.split('\n', N).
> ISTR to recall you talking about how you were writing some code that was
> careful in how it was slicing strings to avoid unnecessary copies... was
> that this code?  I can see how using str.split would end up copying the
> raw_record segment of bytes more times than necessary.  If this does matter,
> add a comment so someone doesn't accidentally "fix" this code to use
> str.split.  :)

Ok.

> > +    if parent_line == 'None:':
> > +        parents = None
> > +    else:
> > +        parents = tuple(
> > +            [tuple(segment.split('\x00')) for segment in parent_line.split('\t')
> > +             if segment])
> > +    start = line_end + 1
> > +    noeol = bytes[start] == 'N'
> > +    if 'ft' in storage_kind:
> > +        method = 'fulltext'
> > +    else:
> > +        method = 'line-delta'
> > +    build_details = (method, noeol)
> > +    start = start + 1
> 
> You don't like +=, do you?  :)

It doesn't work everywhere, so using it is harder than not using it.

> > +    raw_record = bytes[start:]
> 
> If you really want to avoid copies, you could use buffer(bytes, start)...

I can look into that; I think its ok to copy each thing we want out,
which should be a lot less than copying everything we haven't parsed -
but yes, definitely worth improving.

> > +class _ContentMapGenerator(object):
> > +    """Generate texts or expose raw deltas for a set of texts."""
> 
> It's funny that this class has no __init__ even though it has methods that
> assume there are various instance attributes.  Is it an abstract base class?

yes.

> > +    def _get_content(self, key):
> > +        """Get the content object for key."""
> > +        if key in self.nonlocal_keys:
> > +            record = self.get_record_stream().next()
> > +            # Create a content object on the fly
> > +            lines = osutils.chunks_to_lines(record.get_bytes_as('chunked'))
> > +            return PlainKnitContent(lines, record.key)
> 
> Just grab the first record from the stream?  That looks a bit funny!  It's
> not clear to me how that PlainKnitContent will be the right thing for the
> requested key.

_get_content is only ever used for one text at a time. I had a comment
but its been elided through refactoring, readded.

> > +    def test_get_record_stream_native_formats_are_wire_ready_ft_delta(self):
> [...31 lines of test method...]
> > +    def test_get_record_stream_native_formats_are_wire_ready_delta(self):
> 
> Ouch.  test_get_record_stream_native_formats_are_wire_ready_delta is almost
> exactly the same as
> test_get_record_stream_native_formats_are_wire_ready_ft_delta, and so is
> test_get_record_stream_wire_ready_delta_closure_included.  I think
> refactoring away some more duplication would help reduce the
> eyes-glazing-over factor!

Certainly the leadin is similar but not quite the same. Uhm. Given that
your branch is blocked on this, lets look at doing some cleanup next
week once the excitement is past.

Thanks, and landing with most of tweaks tweaked.

-Rob
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20090220/18df078f/attachment.pgp 


More information about the bazaar mailing list