[MERGE] file-object based container format

Fri Jun 29 04:20:20 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is a partial reply, because Andrew said he'll also be replying.

John Arbash Meinel wrote:
> Aaron Bentley wrote:
> ContainerReader uses a 'file' object, but your ContainerWriter uses
> 'file.write'. I think they should be made the same. Either they both
> take callables, or they both take file-like objects.

There is a practical reason why ContainerReader takes a file.  There is
no practical reason why ContainerWriter should.

There ought to be a "principle of least privilege" in API design-- APIs
should only demand what they need to work efficiently.  ContainerWriter
doesn't need a file-like object.  ContainerReader does.

Adding a requirement that ContainerWriter be passed a file-like object
would impose more on ContainerWriter clients.  IterableFile can be used
to convert most any input stream into a file-like object.  There's no
equivalent for ContainerWriter (save StringIO, and that's not
memory-efficient).

> This came up several times, so I tried not to comment on it too much.
> Your current 'default' api for ContainerReader is to iterate over
> (names, callable) tuples. But you've already gone through all the effort
> to build up custom "RecordReader" objects. Why not just iterate over
> those record reader objects. It seems like a much nicer api.
> 
> for record in reader.iter_records():
>   print record.names
>   print record.read_bytes()
>   print record.type  # ??

I'll leave this one to Andrew.

> You call the file bzrlib/pack.py, but all the code talks about
> Container. So shouldn't it be 'bzrlib/container.py' ? And "Bazaar
> container format 1" ?

Well, "Bazaar pack format 1" is the documented value in
doc/developers/container-format.txt

I agree it would be nice to use a single name.  I kinda prefer 'pack',
though.

> I kind of like using a distinct name, rather than calling them 'pack'
> files. Especially since they won't follow the format of git pack files.

I think "container" is even more generic than "pack".

> ^- I don't think that '\x0b' and '\x0c' are actually whitespace
> characters. I guess it is arguable, as u'\x0b' is "Vertical Tabulation"
> and u'\x0c' is "Form Feed".

I have changed this to use osutils.contains_whitespace, to be consistent
with how illegal revision identifiers are defined.

> +    def __init__(self, write_func):
> +        """Constructor.
> +
> +        :param write_func: a callable that will be called when this
> +            ContainerWriter needs to write some bytes.
> +        """
> +        self.write_func = write_func
> 
> ^- Why are you taking a 'file' so you can do 'file.read()' but turning
> around and directly asking for 'file.write'?

I'm taking a file so I can do file.read AND file.readline.  For output,
I only need file.write.

> +    def add_bytes_record(self, bytes, names):
> +        """Add a Bytes record with the given names."""
> +        # Kind marker
> +        self.write_func("B")
> +        # Length
> +        self.write_func(str(len(bytes)) + "\n")
> +        # Names
> +        for name in names:
> +            # Make sure we're writing valid names.  Note that we will
> leave a
> +            # half-written record if a name is bad!
> +            _check_name(name)
> +            self.write_func(name + "\n")
> +        # End of headers
> +        self.write_func("\n")
> +        # Finally, the contents.
> +        self.write_func(bytes)
> 
> ^- This looks pretty inefficient, since you have 4+ calls to
> 'write_func()' for every call. And file.write() (iirc) is a syscall,
> even if it is buffered. I would rather see:
> 
> out = ['B']
> out.append(str(len(bytes)) + '\n')
> for name in names:
> ...
> 
> self.write_func(''.join(out))
> self.write_func(bytes)

Done.

> I also find it a little odd that you terminate the name list with
> '\n\n', rather than something more obvious. But it isn't a big deal.

Don't smtp and http both work that way?

> +    def reader_func(self, length=None):
> +        return self._source.read(length)
> 
> 
> ^- Why is self.reader_func() abstracted out from self._source.read()?
> It seems like just an extra abstraction for minimal gain. Unless it is
> useful for testing. (Or maybe because the api was using it in the past?)

It was because I wanted to make minimal changes to Andrew's work.  I
have eliminated it now.

> +    def _read_line(self):
> +        line = self._source.readline()
> +        if not line.endswith('\n'):
> +            raise errors.UnexpectedEndOfContainerError()
> +        return line.rstrip('\n')
> 
> ^- EOF doesn't seem like quite the right error here

It's not EOF, it's EndOfContainer, which is a container-level rather
than file-level issue.

> but I suppose you
> know that you will always end with a newline? (So if you don't you
> somehow reached EOF). To *me* EOF is when 'readline()' returns the empty
> string.

Yes, it's an UnexpectedEndOfContainer, because bytes records are \n
terminated.

> +        :raises ContainerError: if any sort of containter corruption is
> +            detected, e.g. UnknownContainerFormatError is the format of the
> +            container is unrecognised.
> +        :seealso: ContainerReader.read
> +        """
> 
> 
> ^- I think the :seealso: is incorrect here, but I'm not sure.

It seems to be an acceptable alias for :see:.

> ^v- I don't quite understand why you have one interface which returns
> the names + a callable, and another which returns "nice" objects, when
> the "nice" objects are being used to generate the simple ones. Meaning,
> if you were trying to do names+callable to be fast (not generating
> objects) this fails, since it is defined in terms of the "slow" one.

I think the "nice" interface is not as nice as the tuple one, myself.
Given my druthers, I'd zap the iter_record_objects interface.

> +    def _iter_record_objects(self):
> +        while True:
> +            record_kind = self.reader_func(1)
> +            if record_kind == 'B':
> +                # Bytes record.
> +                reader = BytesRecordReader(self._source)
> +                yield reader
> +            elif record_kind == 'E':
> +                # End marker.  There are no more records.
> +                return
> +            elif record_kind == '':
> +                # End of stream encountered, but no End Marker record
> seen, so
> +                # this container is incomplete.
> +                raise errors.UnexpectedEndOfContainerError()
> +            else:
> +                # Unknown record type.
> +                raise errors.UnknownRecordTypeError(record_kind)
> 
> 
> ^- Whenever I see an "if/elseif/elseif/else" statement, I get the
> feeling that we should be using a registry of some sort, so it can be
> easily extended.

This format cannot be easily extended.  Extending it would make it
impossible for unextended readers to use it.  I think this is a YAGNI.

It has been made very clear to me that they don't want a lot of record
types.  They didn't even want a different record type for multi-parent
diffs.  So new record types should be few and far between.

> So you would do:
> 
> record_kind = self.reader_func(1)
> try:
>     next_action = self._actions[record_kind]
> except KeyError:
>     raise errors.UnknownRecordTypeError(record_kind)
> return_obj = next_action(self)
> if return_obj is None:
>   return
> yield return_obj

^^^ This is arguably less clear than the existing code.

> "Refuse to decode non-shortest forms of UTF-8" ??
> Meaning that you can encode a utf-8 string with a longer number of bytes
> than minimal?

Yes.  Unicode specifies a single canonical utf-8 form for each
codepoint, by saying that the shortest possible encoding must be used.
Absent this restriction, there would be multiple ways of encoding the
same codepoint in UTF-8, and in fact, some UTF-8 decoders will accept
these non-minimal forms.

> What about non-normalized Unicode? Don't worry about it?

I think that's a higher-level issue.

> 
> +                if name in all_names:
> +                    raise errors.DuplicateRecordNameError(name)
> +                all_names.add(name)
> +        excess_bytes = self.reader_func(1)
> +        if excess_bytes != '':
> +            raise errors.ContainerHasExcessDataError(excess_bytes)
> +
> 
> ^- It makes more sense to me to use self.iter_record_objects() and have
> each object validate itself. That way if a given record type has more
> structure than just a string of bytes, it has the opportunity to
> validate that structure.
> With this code, the only thing that can be globally validated is the names.

I don't think we want to pay the price of validation here.  If we're
just going to skip the record, we shouldn't have to validate it.

> I also feel like passing around a callable isn't nearly as nice as just
> returning the "BytesRecordReader" object. So you do:
> 
> for reader in container.iter_objects():
>   names = reader.names # I would expect this part to be read already
> 		       # But we could have a 'reader.read()' if you want
>   bytes = reader.read(1024*1024)
>   while bytes:
>     # Do something with these bytes
>     bytes = reader.read(1024*1024)

I'll leave this one to Andrew.

> +    def validate(self):
> +        """Validate this record.
> +
> +        You can either validate or read, you can't do both.
> +
> +        :raises ContainerError: if this record is invalid.
> +        """
> +        names, read_bytes = self.read()
> +        for name in names:
> +            _check_name_encoding(name)
> +        read_bytes(None)
> +
> 
> ^- Why not have validate return the names and the bytes?

Well, that would be read, wouldn't it?

> 
> Also, you mention quite strongly that you shouldn't use
> 'read_bytes(None)' because it creates an un-bounded size string. And it
> still creates it, even if it isn't used, so this should be turned into a
> limited memory loop.

Okay.

> You also should test non-ascii names.

Will do.

> +    def test_unknown_format(self):
> +        """Unrecognised container formats raise
> UnknownContainerFormatError."""
> +        reader = self.get_reader_for("unknown format\n")
> +        self.assertRaises(
> +            errors.UnknownContainerFormatError, reader.iter_records)
> +
> 
> ^- Are you asserting that reader.iter_records *must* parse the header
> before it returns the generator?

Don't you think it's nicer to get before you start iterating?

> v- hmm... wouldn't it be better to use "list(iterator)", .next is okay,
> I guess.

iterator.next() is simpler, in my view.

> +        reader = self.get_reader_for("Bazaar pack format 1\nB5\n\naaaaaE")
> +        expected_records = [([], 'aaaaa')]
> +        self.assertEqual(
> +            expected_records,
> +            [(names, read_bytes(None))
> +             for (names, read_bytes) in reader.iter_records()])
> 
> doesn't:
> 
>   self.assertEqual(expected_records,
> 		   [(r.names, r.read_bytes())
>                     for r in reader.iter_records()])
> 
> Seem like a better api? Maybe it is just me.

Well, by reading the names unconditionally, you lose the possibility of
skipping records...

> +class ContainerError(BzrError):
> +    """Base class of container errors."""
> +
> +
> +class UnknownContainerFormatError(ContainerError):
> +
> +    _fmt = "Unrecognised container format: %(container_format)r"
> +
> +    def __init__(self, container_format):
> +        self.container_format = container_format
> +
> +
> 
> v- Shouldn't this at least take the Container object to display. It
> might be nice for it to display Container._source.name if it is
> available. So you know what file you were reading when you failed.

Sure.

> +class DuplicateRecordNameError(ContainerError):
> +
> +    _fmt = "Container has multiple records with the same name:
> \"%(name)s\""
> +
> +    def __init__(self, name):
> +        self.name = name
> +
> +
> 
> ^- The invalid stream could contain control codes, which seems like it
> could at least corrupt your terminal settings. Wouldn't %(name)r be
> better than (\"%(name)s\") ?

Makes sense.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhHp00F+nu1YWqI0RAgsGAJ9Xw8A721iM22wu7vFbXb/bammQ6ACfSdVR
t/my9zLe1OnYdDO5yb52HFY=
=OnqY
-----END PGP SIGNATURE-----