[RFC] Reworking 'commit' internals to work in texts rather than 'lines'

Tue Apr 28 00:57:57 BST 2009

On Mon, 2009-04-27 at 12:52 -0500, John Arbash Meinel wrote:
> The main change is to avoid "KnitVersionedFile.add_lines()" and instead
> go for something along the lines of "KVF.add_text()" which takes a
> single string.

KVF.insert_record_stream please.

> This means that the "record_iter_changes" code can do "file.read()"
> rather than "file.readlines()", etc. It also means that we can compress
> the text without iterating 5M lines (and the associated 5M*3
> function calls to compress the bytes and compute crc, etc.)

We already batch up the content to one big string, as you know :) So
we're not avoiding 5M*3 function calls for compression and CRC; that
batching was part of the commit optimisation pass done when packs were
introduced. We were blocked there by the serialisation needs of knits
(primarily that they need a line count).

> I also think this will be a big win for 'dev6' repositories. Because in
> that case there *is* no delta generated at commit time. So getting the
> fulltext makes it all that much faster to shove it down into the
> repository, and get on with your life. (The main downside is that we
> won't even get cross-file deltas because we end up with 1-group per
> file, rather than 1-group per commit.)

We already have 1-group per file, so there is no change here. It is
definitely better for dev6.

It should be as 'simple' as:
 - change to file.read()
 - change to insert_record_stream() or a successor function
 - change insert_record_stream or the successor function as needed so
   that it returns the data needed by commit.

Note that the core of GroupCompressVersionedFiles *already* has this
structure, as 'add_lines' is implemented in terms of
_insert_record_stream.
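
To make the shape concrete, here is a rough sketch only. ToyTextStore,
commit_file and the (key, sha1, size) result tuple are made up for
illustration; none of this is the bzrlib API, and the real record stream
carries content factories rather than plain tuples. It just shows
file.read() feeding a stream-style insert that hands back what commit
needs, with add_lines kept as a thin wrapper:

```python
import hashlib
import io


class ToyTextStore:
    """Hypothetical in-memory stand-in for a versioned-file store."""

    def __init__(self):
        self._texts = {}

    def insert_record_stream(self, records):
        """Store (key, text) records and return a list of (key, sha1, size).

        The return value is an assumption about "the data needed by
        commit"; the real requirement may differ.
        """
        results = []
        for key, text in records:
            self._texts[key] = text
            results.append((key, hashlib.sha1(text).hexdigest(), len(text)))
        return results

    def add_lines(self, key, lines):
        """Line-based entry point, implemented via the stream-based one."""
        return self.insert_record_stream([(key, b"".join(lines))])[0]


def commit_file(store, key, fileobj):
    """Read the whole file as one string and insert it as one record."""
    text = fileobj.read()
    return store.insert_record_stream([(key, text)])[0]


store = ToyTextStore()
from_text = commit_file(store, ("file-id", "rev-1"), io.BytesIO(b"a\nb\n"))
from_lines = store.add_lines(("file-id", "rev-2"), [b"a\n", b"b\n"])
```

Both paths end up storing the same bytes, so from_text and from_lines
carry the same sha1 and size.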

I'd personally leave knits largely alone, as it is much harder to get
this right for them, and they are something we're moving away from
rapidly.

The only complex thing to think about is insertion over the network:
should the api be a generator, or should it return a list? I think it
should return a list, so that it fits the half-duplex needs of the rpc
layer.
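
A small illustration of the difference, under the assumption that
"half-duplex" here means the caller sends everything and then wants the
complete result back in one go. The function names and the dict used as
a store are hypothetical, not bzrlib code:

```python
def insert_lazily(store, records):
    """Generator form: nothing is stored until the caller iterates."""
    for key, text in records:
        store[key] = text
        yield key, len(text)


def insert_eagerly(store, records):
    """List form: every record is stored before the call returns."""
    return list(insert_lazily(store, records))


records = [("a", b"one\n"), ("b", b"two two\n")]

lazy_store = {}
pending = insert_lazily(lazy_store, records)
stored_before_iteration = len(lazy_store)  # still empty at this point

eager_store = {}
results = insert_eagerly(eager_store, records)
```

With the generator, the insert is only as complete as the caller's
iteration; with the list, the whole insert has happened by the time the
call returns, which is easier to map onto a single request and reply.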

-Rob