data insertion and reads. with packs

Thu Aug 2 15:00:13 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> I mentioned earlier that we have a disconnect between the streaming of
> containers and the transport interface.

In our current formats we do quite a bit of buffering in memory before we write
out, to avoid this. (Which is why we have Knit._add_raw_records and
_add_versions, which let us add a group of records in one go.)

I'm happy enough if we want to add a TransportFile abstraction, as long as we
make it clearly something you have to finalize (.close()).
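
A minimal sketch of what "something you have to finalize" could look like.
TransportFile and write_records here are hypothetical illustrations, not
existing bzrlib code, and the local-path constructor is an assumption made
only to keep the example self-contained:

```python
class TransportFile:
    """Hypothetical write-only handle that must be finalized with close()."""

    def __init__(self, path):
        # Assumption: a plain local path stands in for a transport location.
        self._file = open(path, "wb")
        self._closed = False

    def write(self, data):
        if self._closed:
            raise ValueError("write to a finalized TransportFile")
        self._file.write(data)

    def close(self):
        """Finalize: flush any buffered bytes and release the handle."""
        if not self._closed:
            self._file.close()
            self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Finalize even if the caller's block raised.
        self.close()
        return False


def write_records(path, records):
    """Write every record through one open handle, then finalize it."""
    with TransportFile(path) as out:
        for record in records:
            out.write(record)
```

Supporting the with statement is one way to make the finalize step hard to
forget; an explicit close() call works just as well.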

> 
> Where this starts to matter is in mapped knits. Basically we currently
> support reading from a knit data that has been added to it, with no
> expectation of a 'finalisation' or other step. But without a good
> incremental-write facility on transport we can't really offer that well
> - to read data from a container being written we will damage the file
> pointer during reads, or we get into requiring os level features like
> 'dup2', which is less portable. The current 'append' interface while it
> should work is not up to the performance constraints we have - pulling
> 15000 objects into a pack will result in 15K append calls - which is
> roughly 45K/60K syscalls, some of which are not so cheap on some
> platforms.

No matter what, 45-60k syscalls are going to be a lot more expensive than 15k+2
(open, append 15k times, close).

Even if we do 64k memory buffering, we still cut the syscalls dramatically by
leaving a file open.
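
To illustrate the effect of buffering on the number of writes, here is a rough
sketch that counts write() calls on an in-memory stream as a stand-in for
syscalls. CountingRaw and count_writes are hypothetical names, and the 100-byte
record size in the usage note is an arbitrary assumption:

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts write() calls (a syscall stand-in)."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += b
        return len(b)


def count_writes(records, buffer_size=0):
    """Return (raw write calls, bytes written) for the given buffer size.

    buffer_size=0 means every record goes straight to the raw stream.
    """
    raw = CountingRaw()
    out = io.BufferedWriter(raw, buffer_size=buffer_size) if buffer_size else raw
    for record in records:
        out.write(record)
    out.flush()
    return raw.calls, bytes(raw.data)
```

Calling count_writes([b"x" * 100] * 15000) issues one raw write per record,
while count_writes([b"x" * 100] * 15000, buffer_size=64 * 1024) issues only a
few dozen for the same bytes.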

There is a question, though. If you start doing this, is it meant to be a RW
file object, with possible buffering? Or is it a file that you write to, close,
and then reopen for reading?

> 
> Spiv and I were chatting and one possibility came up, which is that we
> probably *don't need* to read data we are inserting arbitrarily. So if
> we structure our operations thusly:
> 
>  * insert
>  * finish the insertion - output indices, etc
>  * validation if needed (e.g. for fetch, check sha1s of texts now (if it
> wasn't possible during insertion))
>  * commit the new data (commit_write_group)
> 
> Then we avoid this issue completely, and it will probably help by
> getting us to think more carefully about arbitrary re-reading of data.
> 
> Thoughts?
> -Rob
> 

I certainly think we should avoid re-reading the data if possible. We've
handled it once; we should try not to handle it a second time.
From what I've seen with Knits, I honestly don't think it would be terribly
expensive to extract the texts and check the sha1 as we push the texts into
storage. We can sha1 sum the entire extracted texts of bzr.dev in 150ms using
our current knits. We only spend 77ms applying 3.3k deltas to their extracted
full texts. The bulk of the time is spent reading the gzip hunks and
decompressing them back into a list of lines. And part of that is because the
data is annotated, so we have to split every line to remove the annotation.
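
A minimal sketch of checking the sha1 while pushing texts into storage, so the
data is never read a second time. The insert_texts function, the dict used as
storage, and the (key, expected_sha1, lines) tuple layout are all assumptions
for illustration, not the Knit API:

```python
import hashlib


def sha1_of_lines(lines):
    """Return the hex sha1 of a text given as a list of byte lines."""
    hasher = hashlib.sha1()
    for line in lines:
        hasher.update(line)
    return hasher.hexdigest()


def insert_texts(texts, store):
    """Verify each extracted text against its expected sha1, then store it.

    texts: iterable of (key, expected_sha1, lines) tuples (assumed layout).
    store: any dict-like object standing in for the real storage.
    """
    for key, expected_sha1, lines in texts:
        lines = list(lines)
        actual = sha1_of_lines(lines)
        if actual != expected_sha1:
            # Fail before the bad text reaches storage.
            raise ValueError("sha1 mismatch for %r" % (key,))
        store[key] = lines
```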

So overall, I'm fine with introducing a File abstraction, as long as we are
clear about how it is used. (I'm just concerned that we will end up leaving a
file open for longer than we really need it, not finalizing it, ending up with
complications if someone is reading and writing to the same object, etc.)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGseNtJdeBCYSNAAMRAuV0AJ9bHAo1GlxPyY3BuSmmdpDGXsKx0ACgk7ud
ESJ5fDfud70qlVxl+5uYYzg=
=BYQ4
-----END PGP SIGNATURE-----