[MERGE] file-object based container format
Aaron Bentley
aaron.bentley at utoronto.ca
Fri Jun 29 06:17:39 BST 2007
John Arbash Meinel wrote:
>> Adding a requirement that ContainerWriter be passed a file-like object
>> would impose more on ContainerWriter clients. IterableFile can be used
>> to convert most any input stream into a file-like object. There's no
>> equivalent for ContainerWriter (save StringIO, and that's not
>> memory-efficient).
>
> Actually, it is.
>
>>>> x = ' '*(100*1000*1000)
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 501 2300 2168 0 31 0 133916 100688 - S+ p1 0:00.47 python
>
>>>> import cStringIO
>>>> y = cStringIO.StringIO(x)
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 501 2300 2168 0 31 0 134004 100868 - S+ p1 0:00.56 python
>
> Memory consumption went up by a couple hundred bytes, not 100MB.
>>>> y = x + ' '
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 501 2300 2168 0 31 0 231664 198528 - S+ p1 0:01.04 python
>
> So actually, cStringIO.StringIO() just creates a file pointer around the string
> buffer. And is perfectly memory efficient.
What I mean by "not memory efficient" is that if you write 1G of data
into a StringIO, it takes up 1G of memory. When you write 1G of data to
an actual file, it takes up a negligible amount of memory. You can run
out of memory easily by creating a container in a StringIO.
I don't want to force clients that don't have file-like objects to
provide a StringIO, because the process of writing that container may
exhaust all available memory.
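A minimal sketch of the alternative (the names and framing here are hypothetical, not bzrlib's actual API): a writer that accepts a bare write callable can stream straight to a real file, a socket, or anything else, while a client forced to supply a file-like object ends up buffering the whole container in a StringIO.

```python
import io

def write_records(write, records):
    # 'write' is any callable accepting bytes: a real file's .write
    # method, a socket's sendall, or a test collector. Nothing forces
    # the caller to materialise a file-like object first.
    for name, body in records:
        write(b"%d %s\n" % (len(body), name))  # hypothetical framing
        write(body)

# If the only sink available is StringIO-style, every byte written
# stays resident in memory until the buffer is discarded:
buf = io.BytesIO()
write_records(buf.write, [(b"rec-1", b"x" * 1000)])
assert len(buf.getvalue()) > 1000  # the full payload is held in memory
```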
>>> I kind of like using a distinct name, rather than calling them 'pack'
>>> files. Especially since they won't follow the format of git pack files.
>> I think "container" is even more generic than "pack".
>
> I agree that it is more generic, but we haven't used the name in our code. And
> you *are* calling it ContainerReader not PackReader. If you want to call it a
> Pack file, then we should have a PackReader (and having the module be pack.py
> makes sense)
Well, Andrew can speak to why he's got two different names in play.
>>> ^- I don't think that '\x0b' and '\x0c' are actually whitespace
>>> characters. I guess it is arguable, as u'\x0b' is "Vertical Tabulation"
>>> and u'\x0c' is "Form Feed".
>> I have changed this to use osutils.contains_whitespace, to be consistent
>> with how illegal revision identifiers are defined.
>
> True, but that still lets you put a '\0' in the middle of a name, and have it
> be considered valid. I don't know what your goals are for what names are
> considered valid, but I would probably be more strict than whether or not it
> contains whitespace.
I think it would be counterproductive to make the character set smaller
than the permitted characters in file-ids and revision-ids, because we
would then have to escape them.
> So you could use the name "revision_id foo" or "foo
> revision_id". Otherwise you have to introduce arbitrary other characters if you
> are trying to combine things with a revision id. While the revision id itself
> is "guaranteed" not to contain whitespace. So it makes a really good delimiter.
I like that idea.
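The rule in code (a sketch only; `contains_whitespace` here mimics the behaviour of bzrlib's `osutils.contains_whitespace`, it is not its implementation): since a valid name may not contain whitespace, a single space can join a revision id with a suffix and be split back unambiguously.

```python
import re

# \s matches ' ', \t, \n, \r, \f and \v -- so \x0b and \x0c are covered.
_WHITESPACE = re.compile(r'\s')

def contains_whitespace(name):
    # Stand-in for osutils.contains_whitespace (assumed behaviour).
    return _WHITESPACE.search(name) is not None

def join_name(revision_id, kind):
    # A whitespace-free revision id makes ' ' a safe delimiter.
    assert not contains_whitespace(revision_id)
    return "%s %s" % (revision_id, kind)

name = join_name("rev-123", "inventory")
revision_id, kind = name.split(" ", 1)
assert (revision_id, kind) == ("rev-123", "inventory")
```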
>> I'm taking a file so I can do file.read AND file.readline. For output,
>> I only need file.write.
>
> At the very least, we need to document what members we need (read + readline).
I don't understand why we need to document that "at the very least".
Most interfaces that take file-like objects don't provide that info.
>>> I also find it a little odd that you terminate the name list with
>>> '\n\n', rather than something more obvious. But it isn't a big deal.
> I also don't think it makes the most efficient parsing code, which is what we
> are trying to do here. I'm not strictly opposed to it, I just think having a
> clear end of header is a bit nicer.
Actually, file.readline tends to be very well optimized.
>> It's not EOF, it's EndOfContainer, which is a container-level rather
>> than file-level issue.
>
> But are you actually at the end of the container? I guess you are as long as
> readline() has the strict definition that it will never return unless you get
> to an '\n'.
>
> I wonder about '\r', though.
I don't think you can get that.
>>> ^- I think the :seealso: is incorrect here, but I'm not sure.
>> It seems to be an acceptable alias for :see:.
>
> I'm saying that "ContainerReader.read" is an invalid reference. Since the
> function doesn't exist.
Okay, I'll leave that to Andrew.
> I find working in tuples to be very
> ugly unless you actually have a reason to.
>
> And so far in this code, you don't. You actually create objects to throw them
> away. Well, only partially because you return a member (foo.read_bytes?)
At Andrew's suggestion, I've revved the code so that the
BytesRecordReader is reused, not thrown away.
>>> With this code, the only thing that can be globally validated is the names.
>> I don't think we want to pay the price of validation here. If we're
>> just going to skip the record, we shouldn't have to validate it.
>
> The point is that you *are* calling Container.validate() shouldn't that
> *validate* everything? Or are you just validating the surface level, so you
> need another function for "Container.really_validate_everything()".
Sorry. I thought it was a different part of the code.
>>> +
>>> ^- Why not have validate return the names and the bytes?
>> Well, that would be read, wouldn't it?
>
> "You can either validate or read, but not both". Why? It is an arbitrary
> restriction. Because you disallow seeking, I may want to make sure that the
> next hunk is *valid* before I process it. But then I have to manually iterate
> through the whole file (because Container.validate() doesn't do a lot of
> per-hunk validation, only whole container). And then come back afterwards and
> read things.
Well, perhaps we should ensure that read also does validation. It's
already most of the way there.
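The "read should also validate" idea, sketched with an invented framing (a length line followed by the body; not the real container format): checking the declared length while reading gives callers per-record validation for free.

```python
import io

def read_record(f):
    header = f.readline()
    length = int(header)           # declared body length
    body = f.read(length)
    if len(body) != length:        # validate while reading
        raise ValueError("record truncated: wanted %d bytes, got %d"
                         % (length, len(body)))
    return body

f = io.BytesIO(b"5\nhello")
assert read_record(f) == b"hello"
```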
>>> ^- Are you asserting that reader.iter_records *must* parse the header
>>> before it returns the generator?
>> Don't you think it's nicer to get before you start iterating?
>
> In general, yes. I just didn't think it should be specifically enforced.
Well, it's good to know where to catch the exception, don't you think?
>>> self.assertEqual(expected_records,
>>> [(r.names, r.read_bytes())
>>> for r in reader.iter_records()])
>>> Seem like a better api? Maybe it is just me.
>> Well, by reading the names unconditionally, you lose the possibility of
>> skipping records...
>
> Then how do you know which records to skip?
From another index, or from other data. In Bundle Format 4, all of the
body records are anonymous, but potentially skippable. It's the header
records that contain the names.
> Alternatively, you just have a
> '.read()' which initializes self and .names isn't valid. Or .names is a lazy
> property.
I don't like the notion of properties altering state like the position
of a file.
>
> Also, if I was reading it correctly, I think you have to parse the header
> completely to figure out how far to jump for the next entry. Because the name
> list isn't in the byte field length.
Yeah, I was misunderstanding that.
> This is one of the reasons I don't like '\n\n' as the header delimiter. Because
> then you *have* to read and parse every character looking for '\n\n'.
I'm not sure what you mean by "read and parse every character"; we can
do it a line at a time, so it's not a great burden.
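The line-at-a-time approach in code (layout simplified to the shape under discussion: a name per line, terminated by a blank line, then the body): `readline()` consumes the header without the caller scanning character by character.

```python
import io

def read_names(f):
    names = []
    while True:
        line = f.readline()
        if line == b"\n":          # blank line ends the name list
            return names
        names.append(line.rstrip(b"\n"))

f = io.BytesIO(b"name-one\nname-two\n\nbody bytes")
assert read_names(f) == [b"name-one", b"name-two"]
assert f.read() == b"body bytes"  # file position is left at the body
```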
> It seems nicer to have a length prefixed set of names (I know there will be 100
> bytes of names, and then 500 bytes of data).
Yeah, that might be nice. Or even just including the names in the
length count.
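A sketch of the length-prefixed alternative John suggests (the exact framing is invented for illustration): with the name-block length on the wire, a reader can skip a whole entry by parsing only two integers.

```python
import io

def write_entry(out, names, body):
    name_block = b"\n".join(names)
    # "<name-block length> <body length>\n" then names then body.
    out.write(b"%d %d\n" % (len(name_block), len(body)))
    out.write(name_block)
    out.write(body)

def skip_entry(f):
    # Only the two lengths need parsing; names and body are seeked past.
    name_len, body_len = map(int, f.readline().split())
    f.seek(name_len + body_len, io.SEEK_CUR)

buf = io.BytesIO()
write_entry(buf, [b"a", b"b"], b"payload")
write_entry(buf, [b"c"], b"next")
buf.seek(0)
skip_entry(buf)                    # jump over the first entry entirely
assert buf.readline().split() == [b"1", b"4"]
```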
Aaron