[MERGE] Container format basic implementation

Aaron Bentley aaron.bentley at utoronto.ca
Fri Jun 15 07:52:26 BST 2007


Andrew Bennetts wrote:
> This bundle implements the container format described in
> doc/developers/container-format.txt.
> 
> It's not yet used by anything (although Aaron has a branch where he's
> experimenting with it), but as described in container-format.txt I'm intending
> to use it in the smart server protocol quite soon.

If this is an experiment, you'd better keep your torches and pitchforks
ready, 'cause it's alive and it'll be terrorizing the countryside any
day now.

> I'm not sure if we want to merge this before it's used by something else in the
> codebase, but it is ready to be reviewed.  The implementation, and especially
> the tests, are quite thoroughly documented so hopefully it is easy to
> understand the code.

I'm in favour of merging things we're certain we'll use.  It's hard
enough to review large changes-- why make them larger?

That said, actual use does bring up issues that may not be apparent in
design.  Here are some classes from my bundle work:

import bz2

from bzrlib import pack
from bzrlib.bundle import serializer

BASE64_LINE_BYTES = 57  # assumed: 57 raw bytes -> one 76-char base64 line


class BundleWriter(object):

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._container = pack.ContainerWriter(self._write_encoded)
        self._compressor = bz2.BZ2Compressor()
        self._base64_buffer = ''

    def begin(self):
        self._fileobj.write(serializer._get_bundle_header('1.0alpha'))
        self._fileobj.write('#\n')
        self._container.begin()

    def _write_encoded(self, bytes):
        # Compress, then emit base64 in whole BASE64_LINE_BYTES chunks,
        # buffering the remainder for the next call.
        self._base64_buffer += self._compressor.compress(bytes)
        if len(self._base64_buffer) >= BASE64_LINE_BYTES:
            to_leave = len(self._base64_buffer) % BASE64_LINE_BYTES
            if to_leave == 0:
                # Avoid slicing with -0: [:-to_leave] would emit nothing
                # and [-to_leave:] would keep everything.
                to_write, self._base64_buffer = self._base64_buffer, ''
            else:
                to_write = self._base64_buffer[:-to_leave]
                self._base64_buffer = self._base64_buffer[-to_leave:]
            self._fileobj.write(to_write.encode('base-64'))

    def end(self):
        self._container.end()
        tail = self._base64_buffer + self._compressor.flush()
        self._fileobj.write(tail.encode('base-64'))

    def add_multiparent_record(self, mp_bytes, parents, repo_kind,
                               revision_id, file_id):
        self._add_record(mp_bytes, parents, repo_kind, revision_id, file_id)

    def add_fulltext_record(self, bytes, parents, repo_kind, revision_id,
                            file_id):
        self._add_record(bytes, parents, repo_kind, revision_id, file_id)

    @staticmethod
    def encode_parents(parents):
        return ' '.join(parents) + '\n'

    @staticmethod
    def encode_name(name_kind, revision_id, file_id=None):
        assert name_kind in ('revision', 'file', 'inventory', 'testament')
        if name_kind in ('revision', 'inventory', 'testament'):
            assert file_id is None
        else:
            assert file_id is not None
        if file_id is not None:
            file_tail = '/' + file_id
        else:
            file_tail = ''
        return name_kind + ':' + revision_id + file_tail

    def _add_record(self, bytes, parents, repo_kind, revision_id, file_id):
        name = self.encode_name(repo_kind, revision_id, file_id)
        parents = self.encode_parents(parents)
        bytes = parents + bytes
        self._container.add_bytes_record(bytes, [name])

If this looks creepily familiar, it should.  This has a lot in common
with ContainerWriter-- it begins, ends, has records added, etc.

At present, fulltext records and multiparent records are identical, but
I'll be adding a kind indicator to records soon.

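The shape I have in mind is roughly this (hypothetical -- how the kind
gets encoded on disk is undecided, an extra line ahead of the parent
list being one option):

    def add_multiparent_record(self, mp_bytes, parents, repo_kind,
                               revision_id, file_id):
        # Hypothetical: thread a kind indicator through to _add_record,
        # which would write it ahead of the parent list.
        self._add_record(mp_bytes, 'multiparent', parents, repo_kind,
                         revision_id, file_id)
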
I'm somewhat disappointed here, because all this format gives me is a
sequence of records.  It would be much nicer if records had
user-defined metadata, so that I could put e.g. the parent listings
there instead of prepending them to the record bodies, as _add_record
does above.

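For instance, something along these lines (entirely hypothetical --
add_bytes_record takes no metadata argument today):

    def _add_record(self, bytes, parents, repo_kind, revision_id, file_id):
        name = self.encode_name(repo_kind, revision_id, file_id)
        # Hypothetical 'metadata' argument: the container would carry
        # the parents itself, instead of me prepending them to the body.
        self._container.add_bytes_record(
            bytes, [name], metadata={'parents': ' '.join(parents)})
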
> === added file bzrlib/pack.py // file-id:container.py-20070607160755-tr8zc26q18

^^^ Is "pack file" meant to become the term for these files?


> +    def add_bytes_record(self, bytes, names):

^^^ Should we be checking whether these are bytes or unicode strings?

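A guard at the top of add_bytes_record along these lines would catch
the mistake early (a sketch, assuming Python 2 string semantics):

    if isinstance(bytes, unicode):
        raise TypeError('records require byte strings, not unicode')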

> +class BaseReader(object):
> +
> +    def __init__(self, reader_func):
> +        """Constructor.
> +
> +        :param reader_func: a callable that takes one optional argument,
> +            ``size``, and returns at most that many bytes.  When the callable
> +            returns less than the requested number of bytes, then the end of the
> +            file/stream has been reached.
> +        """
> +        self.reader_func = reader_func

I think that reader_func is a poor choice, because it requires being
able to read a specific number of bytes, and many of our streaming
output functions don't provide a way to retrieve a specific number of
bytes.  That forces callers to buffer, etc.

bz2.BZ2Decompressor.decompress is an example of this.  Here's how I
managed to make this setup do streaming reads of base64-encoded bzipped
containers:

import bz2

from bzrlib import iterablefile, pack


class BundleReader(object):

    def __init__(self, fileobj):
        # Skip the header lines that BundleWriter.begin() emitted.
        line = fileobj.readline()
        if line != '\n':
            fileobj.readline()
        self._container = pack.ContainerReader(
            iterablefile.IterableFile(self.iter_decode(fileobj)).read)

    @staticmethod
    def iter_decode(fileobj):
        # Decode one base64 line at a time and feed it straight to the
        # decompressor, so the whole bundle never sits in memory at once.
        decompressor = bz2.BZ2Decompressor()
        for line in fileobj:
            yield decompressor.decompress(line.decode('base-64'))


I'm not sure whether you've seen IterableFile before, but it's a wrapper
that provides a file-like interface for any iterable of strings.

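For example, readline works across chunk boundaries:

    from bzrlib import iterablefile
    f = iterablefile.IterableFile(['Th', 'is ', 'is ', 'a ', 'te', 'st.\n'])
    f.readline()   # -> 'This is a test.\n'
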
If I could pass ContainerReader an iterable of strings, then
ContainerReader could use the IterableFile wrapper internally, and it
wouldn't have to implement _read_line, because IterableFile already
does that, and efficiently too.

Alternatively, if ContainerReader demanded a file-like object, you
wouldn't need to implement _read_line either.  Seek would also be a
great thing to support, for skipping undesired records.  This option
really makes the most sense to me.

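On the reading side I'm imagining roughly this (hypothetical --
ContainerReader takes a reader_func today, and has no way to skip
records):

    class ContainerReader(object):

        def __init__(self, fileobj):
            # Hypothetical: a real file object, so readline is free.
            self._file = fileobj

        def _skip_bytes(self, length):
            # A seekable file lets undesired record bodies be skipped
            # without reading them (whence=1 means a relative seek).
            self._file.seek(length, 1)
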
> +class ContainerReader(BaseReader):
> +    """A class for reading Bazaar's container format."""
> +
> +    def iter_records(self):
> +    def iter_record_objects(self):

It seems a shame that there's no way to iterate through records without
invoking a callable for the record bodies.

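To get the bytes today you have to invoke the callable yourself, which
a small wrapper makes plain (a sketch, assuming iter_records yields
(names, callable) pairs as the read() docstring below describes):

    def iter_records_with_bytes(reader):
        # Fine for small records; defeats streaming for large ones.
        for names, read_bytes in reader.iter_records():
            yield names, read_bytes(None)
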
> +class BytesRecordReader(BaseReader):
> +
> +    def read(self):
> +        """Read this record.
> +
> +        You can either validate or read a record, you can't do both.
> +
> +        :returns: A tuple of (names, callable).  The callable can be called
> +            repeatedly to obtain the bytes for the record, with a max_length
> +            argument.  If max_length is None, returns all the bytes.  Because
> +            records can be arbitrarily large, using None is not recommended
> +            unless you have reason to believe the content will fit in memory.
> +        """
> +        # Read the content length.
> +        length_line = self._read_line()
> +        try:
> +            length = int(length_line)
> +        except ValueError:
> +            raise errors.InvalidRecordError(
> +                "%r is not a valid length." % (length_line,))
> +        
> +        # Read the list of names.
> +        names = []
> +        while True:
> +            name = self._read_line()
> +            if name == '':
> +                break
> +            _check_name(name)
> +            names.append(name)

AIUI, names and length are common to all record types, so this
functionality seems to make more sense as part of the caller, IMHO.

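Concretely, I'm imagining the caller (or a shared base class) owning
something like this, leaving each record reader to handle only its body
(a hypothetical helper; error handling elided):

    def _read_header(self):
        # Parse the length line and the name list, which every record
        # type shares.
        length = int(self._read_line())
        names = []
        while True:
            name = self._read_line()
            if name == '':
                break
            _check_name(name)
            names.append(name)
        return length, names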


