[MERGE] file-object based container format
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jun 29 06:17:39 BST 2007
John Arbash Meinel wrote:
>> Adding a requirement that ContainerWriter be passed a file-like object
>> would impose more on ContainerWriter clients. IterableFile can be used
>> to convert most any input stream into a file-like object. There's no
>> equivalent for ContainerWriter (save StringIO, and that's not
>> memory-efficient).
>
> Actually, it is.
>
>>>> x = ' '*(100*1000*1000)
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 501 2300 2168 0 31 0 133916 100688 - S+ p1 0:00.47 python
>
>>>> import cStringIO
>>>> y = cStringIO.StringIO(x)
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 501 2300 2168 0 31 0 134004 100868 - S+ p1 0:00.56 python
>
> Memory consumption went up by a couple hundred bytes, not 100MB.
>>>> y = x + ' '
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 501 2300 2168 0 31 0 231664 198528 - S+ p1 0:01.04 python
>
> So actually, cStringIO.StringIO() just creates a file pointer around the string
> buffer. And is perfectly memory efficient.
What I mean by "not memory efficient" is that if you write 1G of data
into a StringIO, it takes up 1G of memory. When you write 1G of data to
an actual file, it takes up a negligible amount of memory. You can run
out of memory easily by creating a container in a StringIO.
I don't want to force clients that don't have file-like objects to
provide a StringIO, because the process of writing that container may
exhaust all available memory.
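A minimal sketch of the alternative (the names and framing here are hypothetical, not bzrlib's actual API): a writer that accepts a bare write callable can stream straight to a real file, a socket, or anything else, while a client forced to supply a file-like object ends up buffering the whole container in a StringIO.

```python
import io

def write_records(write, records):
    # 'write' is any callable accepting bytes: a real file's .write
    # method, a socket's sendall, or a test collector. Nothing forces
    # the caller to materialise a file-like object first.
    for name, body in records:
        write(b"%d %s\n" % (len(body), name))  # hypothetical framing
        write(body)

# If the only sink available is StringIO-style, every byte written
# stays resident in memory until the buffer is discarded:
buf = io.BytesIO()
write_records(buf.write, [(b"rec-1", b"x" * 1000)])
assert len(buf.getvalue()) > 1000  # the full payload is held in memory
```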
>>> I kind of like using a distinct name, rather than calling them 'pack'
>>> files. Especially since they won't follow the format of git pack files.
>> I think "container" is even more generic than "pack".
>
> I agree that it is more generic, but we haven't used the name in our code. And
> you *are* calling it ContainerReader not PackReader. If you want to call it a
> Pack file, then we should have a PackReader (and having the module be pack.py
> makes sense)
Well, Andrew can speak to why he's got two different names in play.
>>> ^- I don't think that '\x0b' and '\x0c' are actually whitespace
>>> characters. I guess it is arguable, as u'\x0b' is "Vertical Tabulation"
>>> and u'\x0c' is "Form Feed".
>> I have changed this to use osutils.contains_whitespace, to be consistent
>> with how illegal revision identifiers are defined.
>
> True, but that still lets you put a '\0' in the middle of a name, and have it
> be considered valid. I don't know what your goals are for what names are
> considered valid, but I would probably be more strict than whether or not it
> contains whitespace.
I think it would be counterproductive to make the character set smaller
than the permitted characters in file-ids and revision-ids, because we
would then have to escape them.
> So you could use the name "revision_id foo" or "foo
> revision_id". Otherwise you have to introduce arbitrary other characters if you
> are trying to combine things with a revision id. While the revision id itself
> is "guaranteed" not to contain whitespace. So it makes a really good delimiter.
I like that idea.
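The rule in code (a sketch only; `contains_whitespace` here mimics the behaviour of bzrlib's `osutils.contains_whitespace`, it is not its implementation): since a valid name may not contain whitespace, a single space can join a revision id with a suffix and be split back unambiguously.

```python
import re

# \s matches ' ', \t, \n, \r, \f and \v -- so \x0b and \x0c are covered.
_WHITESPACE = re.compile(r'\s')

def contains_whitespace(name):
    # Stand-in for osutils.contains_whitespace (assumed behaviour).
    return _WHITESPACE.search(name) is not None

def join_name(revision_id, kind):
    # A whitespace-free revision id makes ' ' a safe delimiter.
    assert not contains_whitespace(revision_id)
    return "%s %s" % (revision_id, kind)

name = join_name("rev-123", "inventory")
revision_id, kind = name.split(" ", 1)
assert (revision_id, kind) == ("rev-123", "inventory")
```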
>> I'm taking a file so I can do file.read AND file.readline. For output,
>> I only need file.write.
>
> At the very least, we need to document what members we need (read + readline).
I don't understand why we need to document that "at the very least".
Most interfaces that take file-like objects don't provide that info.
>>> I also find it a little odd that you terminate the name list with
>>> '\n\n', rather than something more obvious. But it isn't a big deal.
> I also don't think it makes the most efficient parsing code, which is what we
> are trying to do here. I'm not strictly opposed to it, I just think having a
> clear end of header is a bit nicer.
Actually, file.readline tends to be very well optimized.
>> It's not EOF, it's EndOfContainer, which is a container-level rather
>> than file-level issue.
>
> But are you actually at the end of the container? I guess you are as long as
> readline() has the strict definition that it will never return unless you get
> to an '\n'.
>
> I wonder about '\r', though.
I don't think you can get that.
>>> ^- I think the :seealso: is incorrect here, but I'm not sure.
>> It seems to be an acceptable alias for :see:.
>
> I'm saying that "ContainerReader.read" is an invalid reference. Since the
> function doesn't exist.
Okay, I'll leave that to Andrew.
> I find working in tuples to be very
> ugly unless you actually have a reason to.
>
> And so far in this code, you don't. You actually create objects to throw them
> away. Well, only partially because you return a member (foo.read_bytes?)
At Andrew's suggestion, I've revved the code so that the
BytesRecordReader is reused, not thrown away.
>>> With this code, the only thing that can be globally validated is the names.
>> I don't think we want to pay the price of validation here. If we're
>> just going to skip the record, we shouldn't have to validate it.
>
> The point is that you *are* calling Container.validate() shouldn't that
> *validate* everything? Or are you just validating the surface level, so you
> need another function for "Container.really_validate_everything()".
Sorry. I thought it was a different part of the code.
>>> +
>>> ^- Why not have validate return the names and the bytes?
>> Well, that would be read, wouldn't it?
>
> "You can either validate or read, but not both". Why? It is an arbitrary
> restriction. Because you disallow seeking, I may want to make sure that the
> next hunk is *valid* before I process it. But then I have to manually iterate
> through the whole file (because Container.validate() doesn't do a lot of
> per-hunk validation, only whole container). And then come back afterwards and
> read things.
Well, perhaps we should ensure that read also does validation. It's
already most of the way there.
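The "read should also validate" idea, sketched with an invented framing (a length line followed by the body; not the real container format): checking the declared length while reading gives callers per-record validation for free.

```python
import io

def read_record(f):
    header = f.readline()
    length = int(header)           # declared body length
    body = f.read(length)
    if len(body) != length:        # validate while reading
        raise ValueError("record truncated: wanted %d bytes, got %d"
                         % (length, len(body)))
    return body

f = io.BytesIO(b"5\nhello")
assert read_record(f) == b"hello"
```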
>>> ^- Are you asserting that reader.iter_records *must* parse the header
>>> before it returns the generator?
>> Don't you think it's nicer to get before you start iterating?
>
> In general, yes. I just didn't think it should be specifically enforced.
Well, it's good to know where to catch the exception, don't you think?
>>> self.assertEqual(expected_records,
>>> [(r.names, r.read_bytes())
>>> for r in reader.iter_records()])
>>> Seem like a better api? Maybe it is just me.
>> Well, by reading the names unconditionally, you lose the possibility of
>> skipping records...
>
> Then how do you know which records to skip?
From another index, or from other data. In Bundle Format 4, all of the
body records are anonymous, but potentially skippable. It's the header
records that contain the names.
> Alternatively, you just have a
> '.read()' which initializes self and .names isn't valid. Or .names is a lazy
> property.
I don't like the notion of properties altering state like the position
of a file.
>
> Also, if I was reading it correctly, I think you have to parse the header
> completely to figure out how far to jump for the next entry. Because the name
> list isn't in the byte field length.
Yeah, I was misunderstanding that.
> This is one of the reasons I don't like '\n\n' as the header delimiter. Because
> then you *have* to read and parse every character looking for '\n\n'.
I'm not sure what you mean by "read and parse every character"; we can
do it a line at a time, so it's not a great burden.
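The line-at-a-time approach in code (layout simplified to the shape under discussion: a name per line, terminated by a blank line, then the body): `readline()` consumes the header without the caller scanning character by character.

```python
import io

def read_names(f):
    names = []
    while True:
        line = f.readline()
        if line == b"\n":          # blank line ends the name list
            return names
        names.append(line.rstrip(b"\n"))

f = io.BytesIO(b"name-one\nname-two\n\nbody bytes")
assert read_names(f) == [b"name-one", b"name-two"]
assert f.read() == b"body bytes"  # file position is left at the body
```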
> It seems nicer to have a length prefixed set of names (I know there will be 100
> bytes of names, and then 500 bytes of data).
Yeah, that might be nice. Or even just including the names in the
length count.
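A sketch of the length-prefixed alternative John suggests (the exact framing is invented for illustration): with the name-block length on the wire, a reader can skip a whole entry by parsing only two integers.

```python
import io

def write_entry(out, names, body):
    name_block = b"\n".join(names)
    # "<name-block length> <body length>\n" then names then body.
    out.write(b"%d %d\n" % (len(name_block), len(body)))
    out.write(name_block)
    out.write(body)

def skip_entry(f):
    # Only the two lengths need parsing; names and body are seeked past.
    name_len, body_len = map(int, f.readline().split())
    f.seek(name_len + body_len, io.SEEK_CUR)

buf = io.BytesIO()
write_entry(buf, [b"a", b"b"], b"payload")
write_entry(buf, [b"c"], b"next")
buf.seek(0)
skip_entry(buf)                    # jump over the first entry entirely
assert buf.readline().split() == [b"1", b"4"]
```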
Aaron