[MERGE] Implement and use Repository.iter_files_bytes

Aaron Bentley aaron.bentley at utoronto.ca
Thu Aug 16 18:39:25 BST 2007


John Arbash Meinel wrote:
> Aaron Bentley wrote:
> I may be misunderstanding what you are doing, but it certainly sounds
> like this ends up reading all texts *for the whole tree* into memory.

On knits, it just reads a file, returns the iterable, reads the next
file.  That shouldn't read everything into memory unless the client
retains a reference to the iterable.

On packs, we want to cut down on round trips, but we don't want to read
everything into memory.  That means reads must be streaming, which I
believe is the motivation for
http://bundlebuggy.aaronbentley.com/request/%3C1187160978.19940.157.camel@localhost.localdomain%3E

This interface is flexible enough to allow us to read *parts* of a file
at a time, so it's actually more memory-friendly than, say,
get_file_lines.
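
To make the shape of that concrete, here is a rough sketch.  It is not the
code in the patch: open_file, the (key, identifier) pairs and chunk_size
are made-up stand-ins, and the "repository" is just a dict.  The point is
only that each file comes back as a lazy iterable of chunks, so nothing
stays in memory unless the caller holds on to it.

```python
import io


def iter_files_bytes_sketch(open_file, desired_files, chunk_size=4):
    """Yield (identifier, chunk_iterator) pairs, one file at a time.

    open_file, the (key, identifier) pairs and chunk_size are
    hypothetical stand-ins, not the real repository API.
    """
    def chunks(key):
        # Nothing is opened or read until the caller iterates.
        f = open_file(key)
        try:
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                yield data
        finally:
            f.close()

    for key, identifier in desired_files:
        yield identifier, chunks(key)


# A toy in-memory "repository" for illustration only.
store = {"a": b"hello world", "b": b"bzr"}


def open_from_store(key):
    return io.BytesIO(store[key])


result = {}
wanted = [("a", "id-a"), ("b", "id-b")]
for identifier, byte_iter in iter_files_bytes_sketch(open_from_store, wanted):
    # Each file's chunks are consumed and released before the next file.
    result[identifier] = b"".join(byte_iter)
```

A client that only needs to write the bytes to disk could write each chunk
as it arrives instead of joining them.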

> Especially for http which sends a single request, and buffers the whole
> thing before returning. (sftp and local are a bit better about it, and
> it is one of the things I would like to have fixed for http, because it
> affects all downloads because we buffer all of inventory.knit before we
> start doing any processing of it.)

We will have to fix http so that we can get streaming reads out of it.

> Also, I thought you mentioned having it return texts in a potentially
> different order, does this mean that you have to watch out to make sure
> you create directories at the right time (before the files you get back)?

No, TreeTransform (TT) doesn't care about the order of operations.
Anything goes, as long as the transform is valid when you call
TT.apply().

It won't even affect performance, because performance merely demands
that parent directories be in limbo before their child files are added.
Since we're deferring file texts until all directories are added, we'll
achieve that.
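
Roughly, the ordering looks like the sketch below.  This is not
TreeTransform code; build_order and the (kind, path) tuples are invented
for illustration.  All it shows is that directories go in first, parents
before children, and file texts are deferred to the end, where their
order no longer matters.

```python
def build_order(entries):
    """Return operations with all directories first, file texts deferred.

    entries is a list of hypothetical (kind, path) tuples.
    """
    dirs = [("mkdir", path) for kind, path in entries if kind == "directory"]
    files = [("write", path) for kind, path in entries if kind == "file"]
    # Sorting by depth puts every parent directory before its children.
    dirs.sort(key=lambda op: op[1].count("/"))
    # File texts may arrive in any order; they all come after the mkdirs.
    return dirs + files


entries = [
    ("file", "a/b/c.txt"),
    ("directory", "a/b"),
    ("file", "top.txt"),
    ("directory", "a"),
]
ops = build_order(entries)
```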

> It might be better to have a few calls to this. For example, you could
> buffer it per directory, or some other smaller amount.

That would not achieve what we want, because each call would be a
round trip.  The grouping in the calls will not reflect repository
storage order.  So for example, we would read [(1, 5), (10, 15), (20,
25)] for the first call, then [(6, 9), (16, 19)] for the second call,
instead of reading [(1, 25)] for a single call.
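
To spell out the arithmetic, here is a small sketch that merges inclusive
(start, end) ranges when they touch or overlap.  The coalesce helper is
made up for this example and is not the readv implementation; it only
shows why one big request collapses to a single contiguous read while
the per-call groups do not.

```python
def coalesce(ranges):
    """Merge inclusive (start, end) ranges that touch or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Adjacent to or overlapping the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


first_call = [(1, 5), (10, 15), (20, 25)]
second_call = [(6, 9), (16, 19)]

separate = coalesce(first_call) + coalesce(second_call)
combined = coalesce(first_call + second_call)

print(len(separate))  # 5 separate reads across two round trips
print(combined)       # [(1, 25)]
```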

Streaming reads are absolutely essential.  But if we assume streaming
reads, then issuing multiple calls is a major net loss.

> Maybe you were assuming that readv was going to be optimally small in
> all cases, and didn't realize that the HTTP code is not.

It's not actually me.  Robert asked for this in
"Repository.get_file_texts API and planning for it", and we discussed it
further on IRC.

> Then again, we
> aren't often building a working tree out of an HTTP repository. Though I
> guess if we make it work well, that may become more common.

It sounds very much like we have to fix HTTP.

Aaron