[MERGE][RFC] Compact origin information for *.knit (Draft)

John Arbash Meinel john at arbash-meinel.com
Thu Dec 7 16:18:20 GMT 2006


Dmitry Vasiliev wrote:
> 
> The attached bundle is a draft implementation of my proposal described
> in https://lists.ubuntu.com/archives/bazaar-ng/2006q4/019626.html
> 
> Brief overview of the changes:
> 
> - Added tests for Knit(Annotated|Plain)Factory.
> - Some optimizations/fixes for Knit(Annotated|Plain)Factory methods.
> Some changes are the same as proposed by John in
> https://lists.ubuntu.com/archives/bazaar-ng/2006q4/020144.html
> (optimized parse_line_delta() and fixed parse_line_delta_iter()).
> - Added support for compact origin information in *.knit files (details
> described in the proposal).
> - Added new repository format.
> 
> Open questions:
> 
> - After upgrade to the new format I've got the following results:
> 
> $ find .bzr/repository/knits/ -name '*.knit' -exec cat '{}' \; | wc -c
> 17468574
> $ find .bzr.backup/repository/knits/ -name '*.knit' -exec cat '{}' \; |
> wc -c
> 18178308
> 
> The changes are not so big, and moreover some new *.knit files are
> even bigger than the old ones (because gzip can't compress them
> better). So currently I'm not even sure about the proposed idea.
> Maybe I just need to keep only the tests/optimizations/fixes? What
> do you think?

I was expecting not to see a large post-compression change. Since the
most common annotation revision id is the version at the start of the
hunk, gzip already has a string it can use to compress all of the
lines. What I'm more curious about is changing the annotation-line
parsing to something like:

prev = None
prev_utf8 = None
for line in lines:
  origin_utf8, text = line.split(' ', 1)
  if origin_utf8 == prev_utf8:
    # Same origin as the previous line: reuse the decoded object.
    origin = prev
  else:
    prev_utf8 = origin_utf8
    origin = prev = decode_utf8(origin_utf8)

I'd like to see it performance tested, as I don't know whether the
function call + dictionary lookup would be faster or slower than an
if/else and comparison.
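For what it's worth, here is a rough Python 3 micro-benchmark sketch of the two reader-side strategies. `decode_cache` is my stand-in assumption for the dictionary behind bzrlib's decode_utf8, not the real implementation, and the sample lines are made up:

```python
import timeit

# 10,000 lines all annotated with the same origin, mimicking the common
# case where one revision dominates a hunk.
lines = [b'rev-id-1 some text\n'] * 10000

def with_cache():
    # Dictionary-backed decode, as decode_utf8 is assumed to work.
    decode_cache = {}
    out = []
    for line in lines:
        origin_utf8, text = line.split(b' ', 1)
        try:
            origin = decode_cache[origin_utf8]
        except KeyError:
            origin = decode_cache.setdefault(
                origin_utf8, origin_utf8.decode('utf-8'))
        out.append((origin, text))
    return out

def with_prev():
    # The if/else comparison against the previous line's origin.
    prev_utf8 = None
    prev = None
    out = []
    for line in lines:
        origin_utf8, text = line.split(b' ', 1)
        if origin_utf8 == prev_utf8:
            origin = prev
        else:
            prev_utf8 = origin_utf8
            origin = prev = origin_utf8.decode('utf-8')
        out.append((origin, text))
    return out

print('cache:', timeit.timeit(with_cache, number=50))
print('prev: ', timeit.timeit(with_prev, number=50))
```

Both produce identical output, so only the timings should differ.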

Also, something similar for the *creation* of annotated knit lines. So
in lower_fulltext and lower_line_delta we could do:

prev = None
prev_utf8 = None
result = []

for origin, text in content._lines:
  if origin != prev:
    prev = origin
    prev_utf8 = encode_utf8(origin) + ' '
  result.append(prev_utf8 + text)
return result

For stuff like the first commit to a kernel-sized tree, you are
talking about 7.5M lines being added. So small benefits might add up.
(And in this pathological case, every line has the same annotation,
since there is only 1 revision.)

The fact that it is replacing a list comprehension, though, means that
it may not be any faster.

return ['%s %s' % (encode_utf8(o), t) for o, t in content._lines]

encode_utf8 is just a dictionary lookup, so it should theoretically be
rather fast.
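A runnable sketch of the writer side, under the assumption that encode_utf8 is a dict-backed memoizer; `_encode_cache` and `content_lines` are made up for illustration and the real bzrlib function may differ:

```python
# Hypothetical dict-backed encode_utf8, assumed to mirror bzrlib's
# cache-based codec; an assumption, not the real implementation.
_encode_cache = {}

def encode_utf8(origin):
    try:
        return _encode_cache[origin]
    except KeyError:
        return _encode_cache.setdefault(origin, origin.encode('utf-8'))

# Made-up stand-in for content._lines: (origin, text) pairs.
content_lines = [
    ('rev-1', b'line one\n'),
    ('rev-1', b'line two\n'),
    ('rev-2', b'line three\n'),
]

def lower_lines_loop(lines):
    # Loop that only re-encodes when the origin changes.
    prev = None
    prev_utf8 = None
    result = []
    for origin, text in lines:
        if origin != prev:
            prev = origin
            prev_utf8 = encode_utf8(origin) + b' '
        result.append(prev_utf8 + text)
    return result

def lower_lines_comprehension(lines):
    # The list comprehension it would replace.
    return [b'%s %s' % (encode_utf8(o), t) for o, t in lines]

assert lower_lines_loop(content_lines) == lower_lines_comprehension(content_lines)
```

The two forms are equivalent; the question is only whether skipping the per-line encode_utf8 call beats the comprehension's lighter loop overhead.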

I would hope that the reader is currently returning the same string
object for each line. And a quick check seems to say that it does. (At
least annotate_iter() returns the same id(origin) for each line with
the same origin.)

Anyway, if the lines are all the same string, that means that the
dictionary lookup should be extra fast, because the string's hash has
already been computed and saved.
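A tiny illustration of that point (CPython behaviour, not bzrlib code; the names here are invented):

```python
# If the reader hands back the same string object for every line with
# the same origin, a later dict lookup reuses the string's cached hash
# and can short-circuit the key comparison on object identity.
origin = 'rev-id-%d' % 1  # built at runtime so it is a fresh object
annotations = [(origin, 'a\n'), (origin, 'b\n'), (origin, 'c\n')]

# All lines share one object, as annotate_iter() appears to do.
assert all(o is origin for o, _ in annotations)

cache = {origin: origin.encode('utf-8')}
# Lookup with the very same object: hash is already computed and saved.
assert cache[annotations[1][0]] == b'rev-id-1'
```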


So I like having more tests, and I'm not opposed to allowing blank
spaces in the annotations, but I don't think it gains us a lot. Unless
you can show that the extraction performance for a large file is much
faster.

So I would say +0.5. I don't think it is worth introducing a new
repository format for a small gain, but the rest of your stuff looks
pretty good.

John
=:->


