data insertion and reads. with packs
Robert Collins
robertc at robertcollins.net
Thu Aug 2 23:26:03 BST 2007
On Thu, 2007-08-02 at 09:00 -0500, John Arbash Meinel wrote:
> Robert Collins wrote:
> > I mentioned earlier that we have a disconnect between the streaming of
> > containers and the transport interface.
>
> In our current formats we do quite a bit of buffering in memory before we write
> out, to avoid this. (Which is why we have Knit._add_raw_records and
> _add_versions, which lets us add a group of records in one go).
Right. I'm currently using an adapter to transport.append, and
performance has dropped by about 50% on an initial pull of revno 2K. It
used to take 30 seconds; it's now just over 1 minute.
> I'm happy enough if we want to add a TransportFile abstraction, as long as we
> make it clearly something you have to finalize (.close()).
Definitely.
> >
> > Where this starts to matter is in mapped knits. Basically we currently
> > support reading from a knit data that has been added to it, with no
> > expectation of a 'finalisation' or other step. But without a good
> > incremental-write facility on transport we can't really offer that well
> > - to read data from a container being written we will damage the file
> > pointer during reads, or we get into requiring os level features like
> > 'dup2', which is less portable. The current 'append' interface while it
> > should work is not up to the performance constraints we have - pulling
> > 15000 objects into a pack will result in 15K append calls - which is
> > roughly 45K/60K syscalls, some of which are not so cheap on some
> > platforms.
>
> No matter what, 45-60k syscalls are going to be a lot more expensive than 15k+2
> (open, append 15k times, close).
Yup.
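A rough sketch of that arithmetic, in case it helps. The function names,
the 3-calls-per-append figure (open, write, close) and the 1000-byte
record size are assumptions for illustration only, not measurements of
any real transport:

```python
def syscalls_reopening(records, calls_per_append=3):
    """Each append reopens the file: e.g. open + write + close per record."""
    return records * calls_per_append


def syscalls_open_once(records):
    """Open once, one write per record, close once."""
    return 1 + records + 1


def syscalls_buffered(record_sizes, buffer_size=64 * 1024):
    """Open once, write only when the in-memory buffer fills, close once."""
    flushes = 0
    pending = 0
    for size in record_sizes:
        pending += size
        if pending >= buffer_size:
            flushes += 1
            pending = 0
    if pending:
        flushes += 1  # final partial buffer is written before close
    return 1 + flushes + 1


if __name__ == "__main__":
    n = 15000
    print(syscalls_reopening(n, 3), syscalls_reopening(n, 4))
    print(syscalls_open_once(n))
    print(syscalls_buffered([1000] * n))
```

With 15000 records this gives 45000 to 60000 calls when reopening,
15002 with the file left open, and a few hundred once writes are
buffered.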
> Even if we do 64k memory buffering, we still cut the syscalls dramatically by
> leaving a file open.
>
> There is a question, though. If you start doing this, is it meant to be a RW
> file object, with possible buffering. Or is it a file that you write to, close,
> and then you open another one for reading?
I think the cleanest interface is that it's a write-once, seekless file.
And calling 'readv' on the same path will cause some form of
synchronisation - whether that is that the file you are writing to gets
'flush()' called on it, or an internal buffer is used for part of the
readv answers - well that can be hidden by the interface. I don't care
about being able to do 'get' on the file while it's open, but readv is
quite important to allow the current semantics during insertion.
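Something like the following is what I have in mind. This is only a
minimal sketch against a local file: the class name WriteOnceFile, its
methods and the buffer size are hypothetical, and it takes the simplest
of the two options above (flush before reading) rather than answering
from the internal buffer. Reads go through a separate handle so the
writer's file pointer is never moved:

```python
import os
import tempfile


class WriteOnceFile:
    """Hypothetical write-once, seekless file that still supports readv."""

    def __init__(self, path, buffer_size=64 * 1024):
        self._path = path
        self._out = open(path, "wb")
        self._buffer_size = buffer_size
        self._pending = []      # byte strings not yet written out
        self._pending_len = 0
        self._flushed = 0       # bytes already written out
        self.closed = False

    def write(self, data):
        """Append data; return the offset at which it will live."""
        if self.closed:
            raise ValueError("write to finalized file")
        start = self._flushed + self._pending_len
        self._pending.append(data)
        self._pending_len += len(data)
        if self._pending_len >= self._buffer_size:
            self._flush()
        return start

    def _flush(self):
        if self._pending:
            self._out.write(b"".join(self._pending))
            self._out.flush()
            self._flushed += self._pending_len
            self._pending = []
            self._pending_len = 0

    def readv(self, offsets):
        """Return [(offset, data)] for each (offset, length) requested."""
        if not self.closed:
            self._flush()  # synchronise: make buffered bytes readable
        result = []
        with open(self._path, "rb") as src:  # separate read handle
            for offset, length in offsets:
                src.seek(offset)
                result.append((offset, src.read(length)))
        return result

    def close(self):
        """Finalize the file; no further writes are allowed."""
        if not self.closed:
            self._flush()
            self._out.close()
            self.closed = True


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "example.pack")
    f = WriteOnceFile(path)
    first = f.write(b"hello")
    second = f.write(b"world")
    print(f.readv([(second, 5), (first, 5)]))
    f.close()
```

The caller never seeks and never sees whether readv was answered from
disk or from memory, so the buffering strategy can change later without
touching the insertion code.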
> I certainly think we should avoid re-reading the data if possible. We've
> handled it 1 time, we should try not to handle it a second time.
> From what I've seen with Knits, I honestly don't think it would be terribly
> expensive to extract the texts and check the sha1 as we push the texts into
> storage. We can sha1 sum the entire extracted texts of bzr.dev in 150ms using
> our current knits. We only spend 77ms applying 3.3k deltas to their extracted
> full texts. The bulk of the time is spent reading the gzip hunks and
> decompressing them back into a list of lines. And part of that is because the
> data is annotated, so we have to split every line to remove the annotation.
The problem is memory: if I have a pile of deltas arriving I would be
building all fulltexts in parallel until I know that none are the basis
for any other compressed text. So some ordering information and build
information is needed. It's probably doable, just not trivial.
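To make the "ordering and build information" point concrete, here is a
sketch under strong assumptions: the whole list of records is known up
front, every basis arrives before the texts built on it, and a delta is
a list of (start, end, new_lines) hunks. None of that is the real knit
format, and the function names are made up. With that information a
fulltext can be dropped as soon as nothing left in the stream builds on
it:

```python
import hashlib
from collections import Counter


def apply_delta(basis_lines, delta):
    """Apply (start, end, new_lines) hunks, last hunk first."""
    lines = list(basis_lines)
    for start, end, new_lines in sorted(delta, key=lambda h: h[0],
                                        reverse=True):
        lines[start:end] = new_lines
    return lines


def sha1_while_inserting(records):
    """records: list of (key, basis_key_or_None, payload).

    Returns ({key: sha1 hex digest}, peak number of cached fulltexts).
    """
    # Build information: how many later deltas still need each basis.
    pending = Counter(basis for _, basis, _ in records if basis is not None)
    cache = {}
    sha1s = {}
    peak = 0
    for key, basis, payload in records:
        if basis is None:
            lines = list(payload)          # payload is already a fulltext
        else:
            lines = apply_delta(cache[basis], payload)
            pending[basis] -= 1
            if pending[basis] == 0:
                del cache[basis]           # nothing else builds on it
        sha1s[key] = hashlib.sha1(b"".join(lines)).hexdigest()
        if pending[key] > 0:
            cache[key] = lines             # a later delta needs this text
        peak = max(peak, len(cache))
    return sha1s, peak


if __name__ == "__main__":
    records = [
        ("a", None, [b"1\n", b"2\n"]),
        ("b", "a", [(1, 2, [b"two\n"])]),
        ("c", "b", [(0, 0, [b"zero\n"])]),
    ]
    print(sha1_while_inserting(records))
```

Without the up-front counts, every fulltext would have to stay cached
until the end of the stream, which is exactly the memory problem.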
> So overall, I'm fine with introducing a File abstraction. As long as we are
> clear with how it is used. (I'm just concerned that we will end up leaving a
> file open for longer than we really need it, not finalize it, end up with
> complications if someone is reading and writing to the same object, etc).
Yah, I think we'll need tests for the key use case :).
Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.