[MERGE] file-object based container format

John Arbash Meinel john at arbash-meinel.com
Fri Jun 29 04:51:59 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> This is a partial reply, because Andrew said he'll also be replying.
> 
> John Arbash Meinel wrote:
>> Aaron Bentley wrote:
>> ContainerReader uses a 'file' object, but your ContainerWriter uses
>> 'file.write'. I think they should be made the same. Either they both
>> take callables, or they both take file-like objects.
> 
> There is a practical reason why ContainerReader takes a file.  There is
> no practical reason why ContainerWriter should.
> 
> There ought to be a "principle of least privilege" in API design-- APIs
> should only demand what they need to work efficiently.  ContainerWriter
> doesn't need a file-like object.  ContainerReader does.

There is also a principle of consistency. Having two classes that act
virtually the same but take different kinds of objects confuses users of the
API, who are forever looking up "Well, did *this* one take a file or a
function?"
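
For what it's worth, the two styles are trivially interconvertible, which is
why I weight consistency over least privilege here. A sketch (CallableAsFile
is a hypothetical helper of mine, not anything in bzrlib):

```python
class CallableAsFile(object):
    """Wrap a write callable so it quacks like a file object.

    This is all it takes to satisfy a writer that demands a
    file-like object when the client only has a callable.
    """
    def __init__(self, write_func):
        self.write = write_func

# A client holding only a callable (here, list.append) adapts in one line.
chunks = []
f = CallableAsFile(chunks.append)
f.write('B5\n')
f.write('aaaaa')
assert ''.join(chunks) == 'B5\naaaaa'
```

Going the other way is even easier: a client holding a real file just passes
file.write. So neither choice costs callers much; the cost is in remembering
which class wants which.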

> 
> Adding a requirement that ContainerWriter be passed a file-like object
> would impose more on ContainerWriter clients.  IterableFile can be used
> to convert most any input stream into a file-like object.  There's no
> equivalent for ContainerWriter (save StringIO, and that's not
> memory-efficient).

Actually, it is.

>>> x = ' '*(100*1000*1000)
  UID   PID  PPID CPU PRI NI      VSZ    RSS WCHAN  STAT  TT       TIME COMMAND
  501  2300  2168   0  31  0   133916 100688 -      S+    p1    0:00.47 python

>>> import cStringIO
>>> y = cStringIO.StringIO(x)
  UID   PID  PPID CPU PRI NI      VSZ    RSS WCHAN  STAT  TT       TIME COMMAND
  501  2300  2168   0  31  0   134004 100868 -      S+    p1    0:00.56 python

Memory consumption went up by a couple hundred bytes, not 100MB.
>>> y = x + ' '
  UID   PID  PPID CPU PRI NI      VSZ    RSS WCHAN  STAT  TT       TIME COMMAND
  501  2300  2168   0  31  0   231664 198528 -      S+    p1    0:01.04 python

So cStringIO.StringIO() just wraps a file pointer around the existing string
buffer, and is perfectly memory-efficient.

But honestly, my objection is just that two classes that seem to be doing
something very similar end up with very different APIs. And that is confusing
(IMO).


> 
>> This came up several times, so I tried not to comment on it too much.
>> Your current 'default' api for ContainerReader is to iterate over
>> (names, callable) tuples. But you've already gone through all the effort
>> to build up custom "RecordReader" objects. Why not just iterate over
>> those record reader objects. It seems like a much nicer api.
> 
>> for record in reader.iter_records():
>>   print record.names
>>   print record.read_bytes()
>>   print record.type  # ??
> 
> I'll leave this one to Andrew.
> 
>> You call the file bzrlib/pack.py, but all the code talks about
>> Container. So shouldn't it be 'bzrlib/container.py' ? And "Bazaar
>> container format 1" ?
> 
> Well, "Bazaar pack format 1" is the documented value in
> doc/developers/container-format.txt
> 
> I agree it would be nice to use a single name.  I kinda prefer 'pack',
> though.
> 
>> I kind of like using a distinct name, rather than calling them 'pack'
>> files. Especially since they won't follow the format of git pack files.
> 
> I think "container" is even more generic than "pack".

I agree that it is more generic, but we haven't used the name in our code. And
you *are* calling it ContainerReader, not PackReader. If you want to call it a
pack file, then we should have a PackReader (and then having the module be
pack.py makes sense).


> 
>> ^- I don't think that '\x0b' and '\x0c' are actually whitespace
>> characters. I guess it is arguable, as u'\x0b' is "Vertical Tabulation"
>> and u'\x0c' is "Form Feed".
> 
> I have changed this to use osutils.contains_whitespace, to be consistent
> with how illegal revision identifiers are defined.

True, but that still lets you put a '\0' in the middle of a name and have it
be considered valid. I don't know what your goals are for which names are
considered valid, but I would probably be stricter than just checking for
whitespace. Then again, the only strict requirement from the pack layout is
that a name doesn't contain '\n'. It might even be worthwhile to loosen the
whitespace rule here, so you could use the name "revision_id foo" or "foo
revision_id". Otherwise you have to introduce arbitrary other characters when
you are trying to combine things with a revision id. And since the revision id
itself is "guaranteed" not to contain whitespace, a space makes a really good
delimiter.
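
To make the suggestion concrete, here is a sketch of the stricter check I have
in mind: allow a space as a field delimiter, but reject '\n', '\0' and friends
outright. The function name and the exact character set are mine, not the
patch's:

```python
# Characters that should never appear in a record name (my guess at a
# sensible set; only '\n' is strictly required by the pack layout).
_INVALID_NAME_CHARS = set('\n\r\t\x0b\x0c\x00')

def check_strict_name(name):
    """Raise ValueError unless name is safe to write in a record header."""
    if not name:
        raise ValueError('empty name')
    if _INVALID_NAME_CHARS.intersection(name):
        raise ValueError('invalid character in name: %r' % (name,))

check_strict_name('revision_id foo')   # space as a delimiter: fine
try:
    check_strict_name('bad\x00name')   # embedded NUL: rejected
except ValueError:
    pass
```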

> 
>> +    def __init__(self, write_func):
>> +        """Constructor.
>> +
>> +        :param write_func: a callable that will be called when this
>> +            ContainerWriter needs to write some bytes.
>> +        """
>> +        self.write_func = write_func
> 
>> ^- Why are you taking a 'file' so you can do 'file.read()' but turning
>> around and directly asking for 'file.write'?
> 
> I'm taking a file so I can do file.read AND file.readline.  For output,
> I only need file.write.

At the very least, we need to document what members we need (read + readline).

> 
>> +    def add_bytes_record(self, bytes, names):
>> +        """Add a Bytes record with the given names."""
>> +        # Kind marker
>> +        self.write_func("B")
>> +        # Length
>> +        self.write_func(str(len(bytes)) + "\n")
>> +        # Names
>> +        for name in names:
>> +            # Make sure we're writing valid names.  Note that we will leave a
>> +            # half-written record if a name is bad!
>> +            _check_name(name)
>> +            self.write_func(name + "\n")
>> +        # End of headers
>> +        self.write_func("\n")
>> +        # Finally, the contents.
>> +        self.write_func(bytes)
> 
>> ^- This looks pretty inefficient, since you have 4+ calls to
>> 'write_func()' for every call. And file.write() (iirc) is a syscall,
>> even if it is buffered. I would rather see:
> 
>> out = ['B']
>> out.append(str(len(bytes)) + '\n')
>> for name in names:
>> ...
> 
>> self.write_func(''.join(out))
>> self.write_func(bytes)
> 
> Done.
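
Filled in, the batched version looks something like this (a runnable
approximation; this ContainerWriter and _check_name stand in for the ones in
bzrlib/pack.py):

```python
def _check_name(name):
    # Stand-in for bzrlib's check: only '\n' is strictly forbidden.
    if '\n' in name:
        raise ValueError('invalid name: %r' % (name,))

class ContainerWriter(object):
    def __init__(self, write_func):
        self.write_func = write_func

    def add_bytes_record(self, bytes, names):
        """Add a Bytes record, batching the header into one write."""
        out = ['B', str(len(bytes)), '\n']
        for name in names:
            _check_name(name)          # a bad name now raises *before*
            out.append(name + '\n')    # anything has been written
        out.append('\n')
        self.write_func(''.join(out))  # one call for the whole header
        self.write_func(bytes)         # one call for the body

chunks = []
writer = ContainerWriter(chunks.append)
writer.add_bytes_record('aaaaa', ['name1'])
assert ''.join(chunks) == 'B5\nname1\n\naaaaa'
assert len(chunks) == 2   # two writes instead of four-plus
```

A side benefit: because the names are validated while the header is still
being assembled, a bad name no longer leaves a half-written record behind.
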
> 
>> I also find it a little odd that you terminate the name list with
>> '\n\n', rather than something more obvious. But it isn't a big deal.
> 
> Don't smtp and http both work that way?

Well, IIRC they use \r\n\r\n. At least I'm pretty sure HTTP does.

I also don't think it makes for the most efficient parsing code, which is one
of the goals here. I'm not strictly opposed to it; I just think an explicit
end-of-header marker is a bit nicer.
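
To illustrate: with a '\n\n' terminator, the reader has to call readline()
until it sees a bare '\n', scanning every header byte along the way. A minimal
sketch of that loop (read_names is my own name for it):

```python
from io import StringIO

def read_names(f):
    """Read newline-separated names until the blank line ending the header."""
    names = []
    while True:
        line = f.readline()
        if not line.endswith('\n'):
            raise ValueError('unexpected end of container')
        if line == '\n':        # the bare newline: end of headers
            return names
        names.append(line[:-1])

f = StringIO('name1\nname2\n\naaaaa')
assert read_names(f) == ['name1', 'name2']
assert f.read() == 'aaaaa'   # body bytes follow immediately
```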

> 
>> +    def reader_func(self, length=None):
>> +        return self._source.read(length)
> 
> 
>> ^- Why is self.reader_func() abstracted out from self._source.read()?
>> It seems like just an extra abstraction for minimal gain. Unless it is
>> useful for testing. (Or maybe because the api was using it in the past?)
> 
> It was because I wanted to make minimal changes to Andrew's work.  I
> have eliminated it now.
> 
>> +    def _read_line(self):
>> +        line = self._source.readline()
>> +        if not line.endswith('\n'):
>> +            raise errors.UnexpectedEndOfContainerError()
>> +        return line.rstrip('\n')
> 
>> ^- EOF doesn't seem like quite the right error here
> 
> It's not EOF, it's EndOfContainer, which is a container-level rather
> than file-level issue.

But are you actually at the end of the container? I guess you are, as long as
readline() has the strict guarantee that it never returns a line without a
trailing '\n' unless it has reached end-of-file.

I wonder about '\r', though.
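
That contract is worth pinning down, since the error distinction hinges on it:
a file-like readline() returns a string without a trailing '\n' only when it
has hit end-of-file. Checking against an in-memory file:

```python
from io import StringIO

f = StringIO('complete line\npartial')
assert f.readline() == 'complete line\n'   # normal case: ends with '\n'
assert f.readline() == 'partial'           # EOF mid-line: no trailing '\n'
assert f.readline() == ''                  # at EOF proper: empty string
```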

> 
>> but I suppose you
>> know that you will always end with a newline? (So if you don't you
>> somehow reached EOF). To *me* EOF is when 'readline()' returns the empty
>> string.
> 
> Yes, it's an UnexpectedEndOfContainer, because bytes records are \n
> terminated.

The headers are, but the bytes themselves are not. But this location is reading
the header.

> 
>> +        :raises ContainerError: if any sort of container corruption is
>> +            detected, e.g. UnknownContainerFormatError if the format of the
>> +            container is unrecognised.
>> +        :seealso: ContainerReader.read
>> +        """
> 
> 
>> ^- I think the :seealso: is incorrect here, but I'm not sure.
> 
> It seems to be an acceptable alias for :see:.

I'm saying that "ContainerReader.read" is an invalid reference, since that
function doesn't exist.

I don't know what function you were thinking of; perhaps RecordReader.read(),
but there is no such base class. So you can only really reference
BytesRecordReader.read().

> 
>> ^v- I don't quite understand why you have one interface which returns
>> the names + a callable, and another which returns "nice" objects, when
>> the "nice" objects are being used to generate the simple ones. Meaning,
>> if you were trying to do names+callable to be fast (not generating
>> objects) this fails, since it is defined in terms of the "slow" one.
> 
> I think the "nice" interface is not as nice as the tuple one, myself.
> Given my druthers, I'd zap the iter_record_objects interface.

Obviously a point of disagreement, then. I find working with tuples very ugly
unless you actually have a reason to.

And so far in this code, you don't. You actually create objects just to throw
them away. Well, only partially, since what you return is a member of the
object (foo.read_bytes?).

> 
>> +    def _iter_record_objects(self):
>> +        while True:
>> +            record_kind = self.reader_func(1)
>> +            if record_kind == 'B':
>> +                # Bytes record.
>> +                reader = BytesRecordReader(self._source)
>> +                yield reader
>> +            elif record_kind == 'E':
>> +                # End marker.  There are no more records.
>> +                return
>> +            elif record_kind == '':
>> +                # End of stream encountered, but no End Marker record seen,
>> +                # so this container is incomplete.
>> +                raise errors.UnexpectedEndOfContainerError()
>> +            else:
>> +                # Unknown record type.
>> +                raise errors.UnknownRecordTypeError(record_kind)
> 
> 
>> ^- Whenever I see an "if/elseif/elseif/else" statement, I get the
>> feeling that we should be using a registry of some sort, so it can be
>> easily extended.
> 
> This format cannot be easily extended.  Extending it would make it
> impossible for unextended readers to use it.  I think this is a YAGNI.
> 
> It has been made very clear to me that they don't want a lot of record
> types.  They didn't even want a different record type for multi-parent
> diffs.  So new record types should be few and far between.
> 

As you wish. I still feel like an if/elif/else chain is usually an indication
of something that should be done differently. But it isn't a deal breaker.


>> So you would do:
> 
>> record_kind = self.reader_func(1)
>> try:
>>     next_action = self._actions[record_kind]
>> except KeyError:
>>     raise errors.UnknownRecordTypeError(record_kind)
>> return_obj = next_action(self)
>> if return_obj is None:
>>   return
>> yield return_obj
> 
> 
> ^^^ This is arguably less clear than the existing code.

And arguably more "pythonic" since we don't have switch statements. Not a big deal.

> 
>> "Refuse to decode non-shortest forms of UTF-8" ??
>> Meaning that you can encode a utf-8 string with a longer number of bytes
>> than minimal?
> 
> Yes.  Unicode specifies a single canonical utf-8 form for each
> codepoint, by saying that the shortest possible encoding must be used.
> Absent this restriction, there would be multiple ways of encoding the
> same codepoint in UTF-8, and in fact, some UTF-8 decoders will accept
> these non-minimal forms.
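
Right. For the record, an overlong form is easy to demonstrate: '/' (U+002F)
is one byte in canonical UTF-8, but the byte pair '\xc0\xaf' encodes the same
codepoint in two bytes, and a strict decoder (Python's included) refuses it:

```python
# U+002F ('/') canonically encodes as the single byte 0x2f.
assert b'/'.decode('utf-8') == '/'

# The overlong two-byte form 0xc0 0xaf names the same codepoint, but a
# strict UTF-8 decoder rejects non-shortest forms outright.
try:
    b'\xc0\xaf'.decode('utf-8')
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False
assert not overlong_accepted
```
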
> 
>> What about non-normalized Unicode? Don't worry about it?
> 
> I think that's a higher-level issue.
> 
>> +                if name in all_names:
>> +                    raise errors.DuplicateRecordNameError(name)
>> +                all_names.add(name)
>> +        excess_bytes = self.reader_func(1)
>> +        if excess_bytes != '':
>> +            raise errors.ContainerHasExcessDataError(excess_bytes)
>> +
> 
>> ^- It makes more sense to me to use self.iter_record_objects() and have
>> each object validate itself. That way if a given record type has more
>> structure than just a string of bytes, it has the opportunity to
>> validate that structure.
>> With this code, the only thing that can be globally validated is the names.
> 
> I don't think we want to pay the price of validation here.  If we're
> just going to skip the record, we shouldn't have to validate it.

The point is that you *are* calling it Container.validate(); shouldn't that
*validate* everything? Or are you just validating the surface level, so that
you need another function for "Container.really_validate_everything()"?

> 
>> I also feel like passing around a callable isn't nearly as nice as just
>> returning the "BytesRecordReader" object. So you do:
> 
>> for reader in container.iter_objects():
>>   names = reader.names # I would expect this part to be read already
>> 		       # But we could have a 'reader.read()' if you want
>>   bytes = reader.read(1024*1024)
>>   while bytes:
>>     # Do something with these bytes
>>     bytes = reader.read(1024*1024)
> 
> I'll leave this one to Andrew.
> 
>> +    def validate(self):
>> +        """Validate this record.
>> +
>> +        You can either validate or read, you can't do both.
>> +
>> +        :raises ContainerError: if this record is invalid.
>> +        """
>> +        names, read_bytes = self.read()
>> +        for name in names:
>> +            _check_name_encoding(name)
>> +        read_bytes(None)
>> +
> 
>> ^- Why not have validate return the names and the bytes?
> 
> Well, that would be read, wouldn't it?

"You can either validate or read, but not both". Why? That is an arbitrary
restriction. Because you disallow seeking, I may want to make sure that the
next hunk is *valid* before I process it. But then I have to manually iterate
through the whole file (because Container.validate() does very little per-hunk
validation, only whole-container checks) and then come back afterwards and
read things.


> 
>> Also, you mention quite strongly that you shouldn't use
>> 'read_bytes(None)' because it creates an un-bounded size string. And it
>> still creates it, even if it isn't used, so this should be turned into a
>> limited memory loop.
> 
> Okay.
> 
>> You also should test non-ascii names.
> 
> Will do.
> 
>> +    def test_unknown_format(self):
>> +        """Unrecognised container formats raise UnknownContainerFormatError."""
>> +        reader = self.get_reader_for("unknown format\n")
>> +        self.assertRaises(
>> +            errors.UnknownContainerFormatError, reader.iter_records)
>> +
> 
>> ^- Are you asserting that reader.iter_records *must* parse the header
>> before it returns the generator?
> 
> Don't you think it's nicer to get before you start iterating?

In general, yes. I just didn't think it should be specifically enforced.

> 
>> v- hmm... wouldn't it be better to use "list(iterator)", .next is okay,
>> I guess.
> 
> iterator.next() is simpler, in my view.
> 
>> +        reader = self.get_reader_for("Bazaar pack format 1\nB5\n\naaaaaE")
>> +        expected_records = [([], 'aaaaa')]
>> +        self.assertEqual(
>> +            expected_records,
>> +            [(names, read_bytes(None))
>> +             for (names, read_bytes) in reader.iter_records()])
> 
>> doesn't:
> 
>>   self.assertEqual(expected_records,
>> 		   [(r.names, r.read_bytes())
>>                     for r in reader.iter_records()])
> 
>> Seem like a better api? Maybe it is just me.
> 
> Well, by reading the names unconditionally, you lose the possibility of
> skipping records...

Then how do you know which records to skip? Alternatively, you could just have
a '.read()' which initializes self, with .names invalid until then. Or .names
could be a lazy property.

Also, if I was reading it correctly, I think you have to parse the header
completely to figure out how far to jump for the next entry, because the name
list isn't included in the byte-field length. Specifically:

+    def test_record_with_one_name(self):
+        """Reading a Bytes record with one name returns a list of just that
+        name.
+        """
+
+        reader = self.get_reader_for("5\nname1\n\naaaaa")
+        names, get_bytes = reader.read()
+        self.assertEqual(['name1'], names)
+        self.assertEqual('aaaaa', get_bytes(None))
+

So you already have to "read the names unconditionally" in order to skip records.

This is one of the reasons I don't like '\n\n' as the header delimiter: you
*have* to read and parse every character looking for the '\n\n'. And the way
the code is structured, it actually calls readline() until it gets a line
which is just '\n'.

It seems nicer to have a length-prefixed set of names (I know there will be
100 bytes of names, and then 500 bytes of data).
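
Concretely, the layout I am imagining is something like
"<name_len>\n<names><byte_len>\n<bytes>" (my own strawman, not the patch's
format). Skipping a record then needs no per-character scan, and works even
without seeking:

```python
from io import BytesIO

def skip_record(f):
    """Skip one record by consuming its length-prefixed sections."""
    name_len = int(f.readline())
    f.read(name_len)               # discard the whole name block at once
    byte_len = int(f.readline())
    f.read(byte_len)               # discard the body the same way

f = BytesIO(b'12\nname1\nname2\n5\naaaaa'   # record 1: two names, 5 bytes
            b'6\nname3\n3\nbbb')            # record 2: one name, 3 bytes
skip_record(f)
name_len = int(f.readline())
assert f.read(name_len) == b'name3\n'
assert f.read(int(f.readline())) == b'bbb'
```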



I'm sorry if I'm a bit snippy. I probably shouldn't do reviews late at night.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGhIHfJdeBCYSNAAMRAqYyAKCdAqO35tS3ZpDFq8CuTO8wLgYmcwCg2QtA
jiYoloPxlEirTQ43e1Yd4Pc=
=0eCg
-----END PGP SIGNATURE-----


