[MERGE] file-object based container format

Thu Jun 28 22:52:59 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Hi all,
> 
> I've nearly got the new merge directive and bundle formats ready to
> merge.  But they depend on the container format.
> 
> The previous container format submission contained a really inefficient
> implementation of file.readline that was built on file.read.  Andrew and
> I agree that the ultimate best API would be a push-based one.  But
> Andrew hasn't been able to work on that, so we've agreed to compromise
> by having containers take file-like objects instead of file.read functions.
> 
> Aaron

+0 overall, with these general comments:

I like where this is going, but I think quite a bit of the api is clumsy.

ContainerReader uses a 'file' object, but your ContainerWriter uses
'file.write'. I think they should be made the same. Either they both
take callables, or they both take file-like objects.

This came up several times, so I tried not to comment on it too much.
Your current 'default' api for ContainerReader is to iterate over
(names, callable) tuples. But you've already gone through all the effort
to build up custom "RecordReader" objects. Why not just iterate over
those record reader objects. It seems like a much nicer api.

for record in reader.iter_records():
  print record.names
  print record.read_bytes()
  print record.type  # ??

You call the file bzrlib/pack.py, but all the code talks about
Container. So shouldn't it be 'bzrlib/container.py' ? And "Bazaar
container format 1" ?

I kind of like using a distinct name, rather than calling them 'pack'
files. Especially since they won't follow the format of git pack files.

Arguably, we should be following the sewing analogies and calling them
"Baskets" or something like that. I'm not pushing for *that* though. :)
(Weaves => Knits => Baskets ?)

...

+FORMAT_ONE = "Bazaar pack format 1"
+
+
+_whitespace_re = re.compile('[\t\n\x0b\x0c\r ]')

^- I don't think that '\x0b' and '\x0c' are actually whitespace
characters. I guess it is arguable, as u'\x0b' is "Vertical Tabulation"
and u'\x0c' is "Form Feed".

Arguably, though, they really should just be considered control
characters, and we should be filtering out \x00-\x20. \x20 is a blank
space, so we could certainly include it in the illegal characters.

At least, that is what unicode.org says:
http://www.unicode.org/charts/PDF/U0000.pdf

+
+
+def _check_name(name):
+    """Do some basic checking of 'name'.
+
+    At the moment, this just checks that there are no whitespace
characters in a
+    name.
+
+    :raises InvalidRecordError: if name is not valid.
+    :seealso: _check_name_encoding
+    """
+    if _whitespace_re.search(name) is not None:
+        raise errors.InvalidRecordError("%r is not a valid name." %
(name,))
+
+
+def _check_name_encoding(name):
+    """Check that 'name' is valid UTF-8.
+
+    This is separate from _check_name because UTF-8 decoding is relatively
+    expensive, and we usually want to avoid it.
+
+    :raises InvalidRecordError: if name is not valid UTF-8.
+    """
+    try:
+        name.decode('utf-8')
+    except UnicodeDecodeError, e:
+        raise errors.InvalidRecordError(str(e))
+
+
+class ContainerWriter(object):
+    """A class for writing containers."""
+
+    def __init__(self, write_func):
+        """Constructor.
+
+        :param write_func: a callable that will be called when this
+            ContainerWriter needs to write some bytes.
+        """
+        self.write_func = write_func

^- Why are you taking a 'file' so you can do 'file.read()' but turning
around and directly asking for 'file.write'?

+
+    def begin(self):
+        """Begin writing a container."""
+        self.write_func(FORMAT_ONE + "\n")
+
+    def end(self):
+        """Finish writing a container."""
+        self.write_func("E")
+
+    def add_bytes_record(self, bytes, names):
+        """Add a Bytes record with the given names."""
+        # Kind marker
+        self.write_func("B")
+        # Length
+        self.write_func(str(len(bytes)) + "\n")
+        # Names
+        for name in names:
+            # Make sure we're writing valid names.  Note that we will
leave a
+            # half-written record if a name is bad!
+            _check_name(name)
+            self.write_func(name + "\n")
+        # End of headers
+        self.write_func("\n")
+        # Finally, the contents.
+        self.write_func(bytes)

^- This looks pretty inefficient, since you have 4+ calls to
'write_func()' for every call. And file.write() (iirc) is a syscall,
even if it is buffered. I would rather see:

out = ['B']
out.append(str(len(bytes)) + '\n')
for name in names:
...

self.write_func(''.join(out))
self.write_func(bytes)

I also find it a little odd that you terminate the name list with
'\n\n', rather than something more obvious. But it isn't a big deal.

+
+
+class BaseReader(object):
+
+    def __init__(self, source_file):
+        """Constructor.
+
+        :param reader_func: a callable that takes one optional argument,
+            ``size``, and returns at most that many bytes.  When the
callable
+            returns less than the requested number of bytes, then the
end of the
+            file/stream has been reached.
+        """
+        self._source = source_file
+
+    def reader_func(self, length=None):
+        return self._source.read(length)

^- Why is self.reader_func() abstracted out from self._source.read()?
It seems like just an extra abstraction for minimal gain. Unless it is
useful for testing. (Or maybe because the api was using it in the past?)

+
+    def _read_line(self):
+        line = self._source.readline()
+        if not line.endswith('\n'):
+            raise errors.UnexpectedEndOfContainerError()
+        return line.rstrip('\n')

^- EOF doesn't seem like quite the right error here, but I suppose you
know that you will always end with a newline? (So if you don't you
somehow reached EOF). To *me* EOF is when 'readline()' returns the empty
string.

+
+
+class ContainerReader(BaseReader):
+    """A class for reading Bazaar's container format."""
+
+    def iter_records(self):
+        """Iterate over the container, yielding each record as it is read.
+
+        Each yielded record will be a 2-tuple of (names, callable),
where names
+        is a ``list`` and bytes is a function that takes one argument,
+        ``max_length``.
+
+        You **must not** call the callable after advancing the
interator to the
+        next record.  That is, this code is invalid::
+
+            record_iter = container.iter_records()
+            names1, callable1 = record_iter.next()
+            names2, callable2 = record_iter.next()
+            bytes1 = callable1(None)
+
+        As it will give incorrect results and invalidate the state of the
+        ContainerReader.
+
+        :raises ContainerError: if any sort of containter corruption is
+            detected, e.g. UnknownContainerFormatError is the format of the
+            container is unrecognised.
+        :seealso: ContainerReader.read
+        """

^- I think the :seealso: is incorrect here, but I'm not sure.

+        self._read_format()
+        return self._iter_records()
+

^v- I don't quite understand why you have one interface which returns
the names + a callable, and another which returns "nice" objects, when
the "nice" objects are being used to generate the simple ones. Meaning,
if you were trying to do names+callable to be fast (not generating
objects) this fails, since it is defined in terms of the "slow" one.

+    def iter_record_objects(self):
+        """Iterate over the container, yielding each record as it is read.
+
+        Each yielded record will be an object with ``read`` and
``validate``
+        methods.  Like with iter_records, it is not safe to use a
record object
+        after advancing the iterator to yield next record.
+
+        :raises ContainerError: if any sort of containter corruption is
+            detected, e.g. UnknownContainerFormatError is the format of the
+            container is unrecognised.
+        :seealso: iter_records
+        """
+        self._read_format()
+        return self._iter_record_objects()
+
+    def _iter_records(self):
+        for record in self._iter_record_objects():
+            yield record.read()
+
+    def _iter_record_objects(self):
+        while True:
+            record_kind = self.reader_func(1)
+            if record_kind == 'B':
+                # Bytes record.
+                reader = BytesRecordReader(self._source)
+                yield reader
+            elif record_kind == 'E':
+                # End marker.  There are no more records.
+                return
+            elif record_kind == '':
+                # End of stream encountered, but no End Marker record
seen, so
+                # this container is incomplete.
+                raise errors.UnexpectedEndOfContainerError()
+            else:
+                # Unknown record type.
+                raise errors.UnknownRecordTypeError(record_kind)

^- Whenever I see an "if/elseif/elseif/else" statement, I get the
feeling that we should be using a registry of some sort, so it can be
easily extended. So you would do:

record_kind = self.reader_func(1)
try:
    next_action = self._actions[record_kind]
except KeyError:
    raise errors.UnknownRecordTypeError(record_kind)
return_obj = next_action(self)
if return_obj is None:
  return
yield return_obj

And then you would have

self._actions = {
  '': lambda self: raise errors.UnexpectedEndOfContainerError(),
  'B': lambda self: return BytesRecordReader(self._source),
  'E': lambda self: return None,
}

There are alternative ways of doing it, like having them all be member
functions so you don't have to pass 'self'. Also, I'm not specifically
advocating using 'lambda' I just used it because it was easier to show
my example.

+
+    def _read_format(self):
+        format = self._read_line()
+        if format != FORMAT_ONE:
+            raise errors.UnknownContainerFormatError(format)
+
+    def validate(self):
+        """Validate this container and its records.
+
+        Validating consumes the data stream just like iter_records and
+        iter_record_objects, so you cannot call it after
+        iter_records/iter_record_objects.
+
+        :raises ContainerError: if something is invalid.
+        """
+        all_names = set()
+        for record_names, read_bytes in self.iter_records():
+            read_bytes(None)
+            for name in record_names:
+                _check_name_encoding(name)
+                # Check that the name is unique.  Note that Python will
refuse
+                # to decode non-shortest forms of UTF-8 encoding, so
there is no
+                # risk that the same unicode string has been encoded two
+                # different ways.

"Refuse to decode non-shortest forms of UTF-8" ??
Meaning that you can encode a utf-8 string with a longer number of bytes
than minimal?

What about non-normalized Unicode? Don't worry about it?

+                if name in all_names:
+                    raise errors.DuplicateRecordNameError(name)
+                all_names.add(name)
+        excess_bytes = self.reader_func(1)
+        if excess_bytes != '':
+            raise errors.ContainerHasExcessDataError(excess_bytes)
+

^- It makes more sense to me to use self.iter_record_objects() and have
each object validate itself. That way if a given record type has more
structure than just a string of bytes, it has the opportunity to
validate that structure.
With this code, the only thing that can be globally validated is the names.

+
+class BytesRecordReader(BaseReader):
+
+    def read(self):
+        """Read this record.
+
+        You can either validate or read a record, you can't do both.
+
+        :returns: A tuple of (names, callable).  The callable can be called
+            repeatedly to obtain the bytes for the record, with a
max_length
+            argument.  If max_length is None, returns all the bytes.
Because
+            records can be arbitrarily large, using None is not recommended
+            unless you have reason to believe the content will fit in
memory.
+        """

^- I would also comment that it is not illegal to request more bytes
than there are remaining, and that it will be restricted to the number
of bytes left.

I also feel like passing around a callable isn't nearly as nice as just
returning the "BytesRecordReader" object. So you do:

for reader in container.iter_objects():
  names = reader.names # I would expect this part to be read already
		       # But we could have a 'reader.read()' if you want
  bytes = reader.read(1024*1024)
  while bytes:
    # Do something with these bytes
    bytes = reader.read(1024*1024)

+        # Read the content length.
+        length_line = self._read_line()
+        try:
+            length = int(length_line)
+        except ValueError:
+            raise errors.InvalidRecordError(
+                "%r is not a valid length." % (length_line,))
+
+        # Read the list of names.
+        names = []
+        while True:
+            name = self._read_line()
+            if name == '':
+                break
+            _check_name(name)
+            names.append(name)
+
+        self._remaining_length = length
+        return names, self._content_reader
+
+    def _content_reader(self, max_length):
+        if max_length is None:
+            length_to_read = self._remaining_length
+        else:
+            length_to_read = min(max_length, self._remaining_length)
+        self._remaining_length -= length_to_read
+        bytes = self.reader_func(length_to_read)
+        if len(bytes) != length_to_read:
+            raise errors.UnexpectedEndOfContainerError()
+        return bytes
+
+    def validate(self):
+        """Validate this record.
+
+        You can either validate or read, you can't do both.
+
+        :raises ContainerError: if this record is invalid.
+        """
+        names, read_bytes = self.read()
+        for name in names:
+            _check_name_encoding(name)
+        read_bytes(None)
+

^- Why not have validate return the names and the bytes?

Also, you mention quite strongly that you shouldn't use
'read_bytes(None)' because it creates an un-bounded size string. And it
still creates it, even if it isn't used, so this should be turned into a
limited memory loop.

bytes = read_bytes(XXX)
while bytes:
  bytes = read_bytes(XXX)

I think you have the same "bug" in the global 'validate()' command.

v-- As I said earlier, I think all of these should be passing in the
'output' object itself, rather than 'output.write'

You also should test non-ascii names.

+class TestContainerWriter(tests.TestCase):
+
+    def test_construct(self):
+        """Test constructing a ContainerWriter.
+
+        This uses None as the output stream to show that the
constructor doesn't
+        try to use the output stream.
+        """
+        writer = pack.ContainerWriter(None)
+
+    def test_begin(self):
+        """The begin() method writes the container format marker line."""
+        output = StringIO()
+        writer = pack.ContainerWriter(output.write)
+        writer.begin()
+        self.assertEqual('Bazaar pack format 1\n', output.getvalue())
+

...

+class TestContainerReader(tests.TestCase):
+
+    def get_reader_for(self, bytes):
+        stream = StringIO(bytes)
+        reader = pack.ContainerReader(stream)
+        return reader
+
+    def test_construct(self):
+        """Test constructing a ContainerReader.
+
+        This uses None as the output stream to show that the
constructor doesn't
+        try to use the input stream.
+        """
+        reader = pack.ContainerReader(None)
+
+    def test_empty_container(self):
+        """Read an empty container."""
+        reader = self.get_reader_for("Bazaar pack format 1\nE")
+        self.assertEqual([], list(reader.iter_records()))
+
+    def test_unknown_format(self):
+        """Unrecognised container formats raise
UnknownContainerFormatError."""
+        reader = self.get_reader_for("unknown format\n")
+        self.assertRaises(
+            errors.UnknownContainerFormatError, reader.iter_records)
+

^- Are you asserting that reader.iter_records *must* parse the header
before it returns the generator? The alternative would be:

self.assertListRaises(errors.....,
		      reader.iter_records())

Though I'm not sure if 'assertListRaises' is in the base TestCase class.

v- hmm... wouldn't it be better to use "list(iterator)", .next is okay,
I guess.

+    def test_unexpected_end_of_container(self):
+        """Containers that don't end with an End Marker record should cause
+        UnexpectedEndOfContainerError to be raised.
+        """
+        reader = self.get_reader_for("Bazaar pack format 1\n")
+        iterator = reader.iter_records()
+        self.assertRaises(
+            errors.UnexpectedEndOfContainerError, iterator.next)
+
+    def test_unknown_record_type(self):
+        """Unknown record types cause UnknownRecordTypeError to be
raised."""
+        reader = self.get_reader_for("Bazaar pack format 1\nX")
+        iterator = reader.iter_records()
+        self.assertRaises(
+            errors.UnknownRecordTypeError, iterator.next)

+
+    def test_container_with_one_unnamed_record(self):
+        """Read a container with one Bytes record.
+
+        Parsing Bytes records is more thoroughly exercised by
+        TestBytesRecordReader.  This test is here to ensure that
+        ContainerReader's integration with BytesRecordReader is working.
+        """
+        reader = self.get_reader_for("Bazaar pack format 1\nB5\n\naaaaaE")
+        expected_records = [([], 'aaaaa')]
+        self.assertEqual(
+            expected_records,
+            [(names, read_bytes(None))
+             for (names, read_bytes) in reader.iter_records()])

doesn't:

  self.assertEqual(expected_records,
		   [(r.names, r.read_bytes())
                    for r in reader.iter_records()])

Seem like a better api? Maybe it is just me.

+
+    def test_validate_empty_container(self):
+        """validate does not raise an error for a container with no
records."""
+        reader = self.get_reader_for("Bazaar pack format 1\nE")
+        # No exception raised
+        reader.validate()
+

...

+class ContainerError(BzrError):
+    """Base class of container errors."""
+
+
+class UnknownContainerFormatError(ContainerError):
+
+    _fmt = "Unrecognised container format: %(container_format)r"
+
+    def __init__(self, container_format):
+        self.container_format = container_format
+
+

v- Shouldn't this at least take the Container object to display. It
might be nice for it to display Container._source.name if it is
available. So you know what file you were reading when you failed.

This is especially true if you have a repository based on containers,
since it may fail because of a corrupted file, and you have no idea
where the corruption is. It *could* be done by catching the error, and
raising a different one. But in that case, this should be a
internal_error = True.

+class UnexpectedEndOfContainerError(ContainerError):
+
+    _fmt = "Unexpected end of container stream"
+
+    internal_error = False
+
+
+class UnknownRecordTypeError(ContainerError):
+
+    _fmt = "Unknown record type: %(record_type)r"
+
+    def __init__(self, record_type):
+        self.record_type = record_type
+
+
+class InvalidRecordError(ContainerError):
+
+    _fmt = "Invalid record: %(reason)s"
+
+    def __init__(self, reason):
+        self.reason = reason
+
+
+class ContainerHasExcessDataError(ContainerError):
+
+    _fmt = "Container has data after end marker: %(excess)r"
+
+    def __init__(self, excess):
+        self.excess = excess
+
+
+class DuplicateRecordNameError(ContainerError):
+
+    _fmt = "Container has multiple records with the same name:
\"%(name)s\""
+
+    def __init__(self, name):
+        self.name = name
+
+

^- The invalid stream could contain control codes, which seems like it
could at least corrupt your terminal settings. Wouldn't %(name)r be
better than (\"%(name)s\") ?

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhC26JdeBCYSNAAMRAqJaAKCZCMGcRXQ1UInkNwMImTZpDwj1HACgv5Sl
yby91wwj0fvImmywOtQfMDo=
=YR6X
-----END PGP SIGNATURE-----