[MERGE] file-object based container format
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jun 29 04:20:20 BST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
This is a partial reply, because Andrew said he'll also be replying.
John Arbash Meinel wrote:
> Aaron Bentley wrote:
> ContainerReader uses a 'file' object, but your ContainerWriter uses
> 'file.write'. I think they should be made the same. Either they both
> take callables, or they both take file-like objects.
There is a practical reason why ContainerReader takes a file. There is
no practical reason why ContainerWriter should.
There ought to be a "principle of least privilege" in API design-- APIs
should only demand what they need to work efficiently. ContainerWriter
doesn't need a file-like object. ContainerReader does.
Adding a requirement that ContainerWriter be passed a file-like object
would impose more on ContainerWriter clients. IterableFile can be used
to convert most any input stream into a file-like object. There's no
equivalent for ContainerWriter (save StringIO, and that's not
memory-efficient).
> This came up several times, so I tried not to comment on it too much.
> Your current 'default' api for ContainerReader is to iterate over
> (names, callable) tuples. But you've already gone through all the effort
> to build up custom "RecordReader" objects. Why not just iterate over
> those record reader objects. It seems like a much nicer api.
>
> for record in reader.iter_records():
> print record.names
> print record.read_bytes()
> print record.type # ??
I'll leave this one to Andrew.
> You call the file bzrlib/pack.py, but all the code talks about
> Container. So shouldn't it be 'bzrlib/container.py' ? And "Bazaar
> container format 1" ?
Well, "Bazaar pack format 1" is the documented value in
doc/developers/container-format.txt
I agree it would be nice to use a single name. I kinda prefer 'pack',
though.
> I kind of like using a distinct name, rather than calling them 'pack'
> files. Especially since they won't follow the format of git pack files.
I think "container" is even more generic than "pack".
> ^- I don't think that '\x0b' and '\x0c' are actually whitespace
> characters. I guess it is arguable, as u'\x0b' is "Vertical Tabulation"
> and u'\x0c' is "Form Feed".
I have changed this to use osutils.contains_whitespace, to be consistent
with how illegal revision identifiers are defined.
> + def __init__(self, write_func):
> + """Constructor.
> +
> + :param write_func: a callable that will be called when this
> + ContainerWriter needs to write some bytes.
> + """
> + self.write_func = write_func
>
> ^- Why are you taking a 'file' so you can do 'file.read()' but turning
> around and directly asking for 'file.write'?
I'm taking a file so I can do file.read AND file.readline. For output,
I only need file.write.
> + def add_bytes_record(self, bytes, names):
> + """Add a Bytes record with the given names."""
> + # Kind marker
> + self.write_func("B")
> + # Length
> + self.write_func(str(len(bytes)) + "\n")
> + # Names
> + for name in names:
> + # Make sure we're writing valid names. Note that we will
> leave a
> + # half-written record if a name is bad!
> + _check_name(name)
> + self.write_func(name + "\n")
> + # End of headers
> + self.write_func("\n")
> + # Finally, the contents.
> + self.write_func(bytes)
>
> ^- This looks pretty inefficient, since you have 4+ calls to
> 'write_func()' for every call. And file.write() (iirc) is a syscall,
> even if it is buffered. I would rather see:
>
> out = ['B']
> out.append(str(len(bytes)) + '\n')
> for name in names:
> ...
>
> self.write_func(''.join(out))
> self.write_func(bytes)
Done.
> I also find it a little odd that you terminate the name list with
> '\n\n', rather than something more obvious. But it isn't a big deal.
Don't smtp and http both work that way?
> + def reader_func(self, length=None):
> + return self._source.read(length)
>
>
> ^- Why is self.reader_func() abstracted out from self._source.read()?
> It seems like just an extra abstraction for minimal gain. Unless it is
> useful for testing. (Or maybe because the api was using it in the past?)
It was because I wanted to make minimal changes to Andrew's work. I
have eliminated it now.
> + def _read_line(self):
> + line = self._source.readline()
> + if not line.endswith('\n'):
> + raise errors.UnexpectedEndOfContainerError()
> + return line.rstrip('\n')
>
> ^- EOF doesn't seem like quite the right error here
It's not EOF, it's EndOfContainer, which is a container-level rather
than file-level issue.
> but I suppose you
> know that you will always end with a newline? (So if you don't you
> somehow reached EOF). To *me* EOF is when 'readline()' returns the empty
> string.
Yes, it's an UnexpectedEndOfContainer, because bytes records are \n
terminated.
> + :raises ContainerError: if any sort of containter corruption is
> + detected, e.g. UnknownContainerFormatError is the format of the
> + container is unrecognised.
> + :seealso: ContainerReader.read
> + """
>
>
> ^- I think the :seealso: is incorrect here, but I'm not sure.
It seems to be an acceptable alias for :see:.
> ^v- I don't quite understand why you have one interface which returns
> the names + a callable, and another which returns "nice" objects, when
> the "nice" objects are being used to generate the simple ones. Meaning,
> if you were trying to do names+callable to be fast (not generating
> objects) this fails, since it is defined in terms of the "slow" one.
I think the "nice" interface is not as nice as the tuple one, myself.
Given my druthers, I'd zap the iter_record_objects interface.
> + def _iter_record_objects(self):
> + while True:
> + record_kind = self.reader_func(1)
> + if record_kind == 'B':
> + # Bytes record.
> + reader = BytesRecordReader(self._source)
> + yield reader
> + elif record_kind == 'E':
> + # End marker. There are no more records.
> + return
> + elif record_kind == '':
> + # End of stream encountered, but no End Marker record
> seen, so
> + # this container is incomplete.
> + raise errors.UnexpectedEndOfContainerError()
> + else:
> + # Unknown record type.
> + raise errors.UnknownRecordTypeError(record_kind)
>
>
> ^- Whenever I see an "if/elseif/elseif/else" statement, I get the
> feeling that we should be using a registry of some sort, so it can be
> easily extended.
This format cannot be easily extended. Extending it would make it
impossible for unextended readers to use it. I think this is a YAGNI.
It has been made very clear to me that they don't want a lot of record
types. They didn't even want a different record type for multi-parent
diffs. So new record types should be few and far between.
> So you would do:
>
> record_kind = self.reader_func(1)
> try:
> next_action = self._actions[record_kind]
> except KeyError:
> raise errors.UnknownRecordTypeError(record_kind)
> return_obj = next_action(self)
> if return_obj is None:
> return
> yield return_obj
^^^ This is arguably less clear than the existing code.
> "Refuse to decode non-shortest forms of UTF-8" ??
> Meaning that you can encode a utf-8 string with a longer number of bytes
> than minimal?
Yes. Unicode specifies a single canonical utf-8 form for each
codepoint, by saying that the shortest possible encoding must be used.
Absent this restriction, there would be multiple ways of encoding the
same codepoint in UTF-8, and in fact, some UTF-8 decoders will accept
these non-minimal forms.
> What about non-normalized Unicode? Don't worry about it?
I think that's a higher-level issue.
>
> + if name in all_names:
> + raise errors.DuplicateRecordNameError(name)
> + all_names.add(name)
> + excess_bytes = self.reader_func(1)
> + if excess_bytes != '':
> + raise errors.ContainerHasExcessDataError(excess_bytes)
> +
>
> ^- It makes more sense to me to use self.iter_record_objects() and have
> each object validate itself. That way if a given record type has more
> structure than just a string of bytes, it has the opportunity to
> validate that structure.
> With this code, the only thing that can be globally validated is the names.
I don't think we want to pay the price of validation here. If we're
just going to skip the record, we shouldn't have to validate it.
> I also feel like passing around a callable isn't nearly as nice as just
> returning the "BytesRecordReader" object. So you do:
>
> for reader in container.iter_objects():
> names = reader.names # I would expect this part to be read already
> # But we could have a 'reader.read()' if you want
> bytes = reader.read(1024*1024)
> while bytes:
> # Do something with these bytes
> bytes = reader.read(1024*1024)
I'll leave this one to Andrew.
> + def validate(self):
> + """Validate this record.
> +
> + You can either validate or read, you can't do both.
> +
> + :raises ContainerError: if this record is invalid.
> + """
> + names, read_bytes = self.read()
> + for name in names:
> + _check_name_encoding(name)
> + read_bytes(None)
> +
>
> ^- Why not have validate return the names and the bytes?
Well, that would be read, wouldn't it?
>
> Also, you mention quite strongly that you shouldn't use
> 'read_bytes(None)' because it creates an un-bounded size string. And it
> still creates it, even if it isn't used, so this should be turned into a
> limited memory loop.
Okay.
> You also should test non-ascii names.
Will do.
> + def test_unknown_format(self):
> + """Unrecognised container formats raise
> UnknownContainerFormatError."""
> + reader = self.get_reader_for("unknown format\n")
> + self.assertRaises(
> + errors.UnknownContainerFormatError, reader.iter_records)
> +
>
> ^- Are you asserting that reader.iter_records *must* parse the header
> before it returns the generator?
Don't you think it's nicer to get before you start iterating?
> v- hmm... wouldn't it be better to use "list(iterator)", .next is okay,
> I guess.
iterator.next() is simpler, in my view.
> + reader = self.get_reader_for("Bazaar pack format 1\nB5\n\naaaaaE")
> + expected_records = [([], 'aaaaa')]
> + self.assertEqual(
> + expected_records,
> + [(names, read_bytes(None))
> + for (names, read_bytes) in reader.iter_records()])
>
> doesn't:
>
> self.assertEqual(expected_records,
> [(r.names, r.read_bytes())
> for r in reader.iter_records()])
>
> Seem like a better api? Maybe it is just me.
Well, by reading the names unconditionally, you lose the possibility of
skipping records...
> +class ContainerError(BzrError):
> + """Base class of container errors."""
> +
> +
> +class UnknownContainerFormatError(ContainerError):
> +
> + _fmt = "Unrecognised container format: %(container_format)r"
> +
> + def __init__(self, container_format):
> + self.container_format = container_format
> +
> +
>
> v- Shouldn't this at least take the Container object to display. It
> might be nice for it to display Container._source.name if it is
> available. So you know what file you were reading when you failed.
Sure.
> +class DuplicateRecordNameError(ContainerError):
> +
> + _fmt = "Container has multiple records with the same name:
> \"%(name)s\""
> +
> + def __init__(self, name):
> + self.name = name
> +
> +
>
> ^- The invalid stream could contain control codes, which seems like it
> could at least corrupt your terminal settings. Wouldn't %(name)r be
> better than (\"%(name)s\") ?
Makes sense.
Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGhHp00F+nu1YWqI0RAgsGAJ9Xw8A721iM22wu7vFbXb/bammQ6ACfSdVR
t/my9zLe1OnYdDO5yb52HFY=
=OnqY
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list