[BUG] Different results when passing 'parent_texts'

Fri Sep 7 01:12:29 BST 2007


I'm getting a really weird result when trying to do a conversion.

Basically I have a loop over all revisions and I'm copying the file texts over
directly into the target knits.

I started looking into caching the results, because it seems to give a large
speed improvement.

My loop basically looks like:

lines_cache = {}

...

    parent_texts = {}
    for parent_id in file_heads:
        key = (file_id, parent_id)
        if key in lines_cache:
            parent_texts[parent_id] = lines_cache[key]

    # extract text and insert it.
    lines = rt.get_file_lines(file_id)
    try:
        text_sha1, text_len, added_lines = weave.add_lines(revision_id,
                                      file_heads,
                                      lines,
                                      parent_texts=parent_texts)
    except errors.bzr_errors.RevisionNotPresent, e:
        import pdb; pdb.set_trace()
        raise
    key = (file_id, revision_id)
    lines_cache[key] = added_lines

The problem is that I get different results if I disable the cache.
Specifically, as near as I can tell, if I pass parent_texts, then it realizes
that the file contents are identical, and adds a line-delta with 0 lines changed.
If I don't pass 'parent_texts', then it seems to create a line delta with *all*
lines changed.

I seem to encounter this on texts that have 'no-eol' set (GIF files), and on
fulltext caches (the line annotations differ).

When I check it, I can see that added_lines._lines includes a trailing newline,
while "lines" does not (and also weave._get_content(parent_id) does not).

I haven't figured out why the annotations are different yet, but the no-eol
thing needs to be fixed.

Note: this doesn't actually cause corruption, because

a) KnitVF.add_lines() checks to see if there is a final newline, if there isn't
it sets the 'no-eol' flag, and then adds one.

b) it then diffs the final set of lines (with the trailing '\n') against the
parent texts

c) So if the file didn't have a final newline and the last line doesn't change,
then without the cache it still sees the last line as changed, so it adds a
delta (effectively adding a newline, which it then marks as not existing in the
index).

Now, if you cache the returned values, your KnitContent object has a final
newline (regardless of whether the parent had one).
When you pass it back in, it actually changes (c) so that it doesn't see a
difference (and still marks no eol in the index).
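
The effect in (a) to (c) can be reproduced outside of bzrlib. This is only a
minimal sketch using difflib from the standard library, not the real knit code;
normalize() and changed_ranges() are hypothetical helpers that mimic the "add a
final newline, then diff" behaviour described above.

```python
import difflib


def normalize(lines):
    """Mimic step (a): return (lines ending in a newline, no_eol flag)."""
    if lines and not lines[-1].endswith('\n'):
        return lines[:-1] + [lines[-1] + '\n'], True
    return list(lines), False


def changed_ranges(parent_lines, new_lines):
    """Return the non-equal opcodes of a line-based diff."""
    matcher = difflib.SequenceMatcher(None, parent_lines, new_lines)
    return [op for op in matcher.get_opcodes() if op[0] != 'equal']


# A file with no final newline, added twice with identical content.
pristine = ['first\n', 'second\n', 'last']
munged, no_eol = normalize(pristine)

# Without the cache: the parent is read back without its final newline,
# so the last line looks changed.
uncached_delta = changed_ranges(pristine, munged)

# With the cache: the parent is the content that was handed back earlier,
# which already carries the extra newline, so nothing looks changed.
cached_delta = changed_ranges(munged, munged)
```

Under these assumptions the uncached diff reports one replaced line (the last
one) and the cached diff reports nothing, which matches the difference between
the two runs.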

The only thing that can happen is that we get entries we don't need (adding the
last line of the file as a delta, when we don't need to).

One possible fix is to just change the add_lines() code, so that after getting
parent content objects, it ensures that all of them have a trailing newline (as
it does for the lines being added).
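
A rough sketch of that first option, assuming the parents are available as
plain lists of lines. ensure_trailing_newline() and normalized_parents() are
hypothetical names, and the dict is only a stand-in for the parent content
objects that add_lines() actually looks up.

```python
def ensure_trailing_newline(lines):
    """Return a copy of lines whose last line ends with a newline."""
    if lines and not lines[-1].endswith('\n'):
        return lines[:-1] + [lines[-1] + '\n']
    return list(lines)


def normalized_parents(parent_texts):
    """Apply the same normalization to every parent text.

    parent_texts is a plain dict of parent_id -> list of lines here; the
    input is left untouched and a new dict is returned.
    """
    return dict((parent_id, ensure_trailing_newline(lines))
                for parent_id, lines in parent_texts.items())


parents = {'rev-1': ['a\n', 'b'], 'rev-2': ['a\n', 'b\n']}
fixed = normalized_parents(parents)
```

With both sides normalized the same way, the diff no longer depends on whether
the parent came from the cache or from the knit.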

Alternatively, for consistency, it could remove the trailing newline from the
returned KnitContent object, which has the nice property that the sha1 sum
matches correctly. It has the downside that it will always re-add the last line
for any files that are missing a final newline.

Probably the best fix would be to change the algorithm so that, rather than
munging the lines that were passed in, it does all the diffs, etc. on the
pristine lines. Then we just update factory.lower_line_delta/lower_fulltext to
ensure that we always have a final newline in the output.
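
To make that last idea concrete, here is a small sketch under stated
assumptions: line_delta() and serialize_lines() are hypothetical stand-ins (the
latter for the lowering step), and difflib replaces the real knit diff code.
The point is only that the diff sees the pristine lines and the newline is
added when the output is written.

```python
import difflib


def line_delta(parent_lines, new_lines):
    """Diff the pristine lines; return (start, end, replacement) hunks."""
    matcher = difflib.SequenceMatcher(None, parent_lines, new_lines)
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != 'equal']


def serialize_lines(lines):
    """Stand-in for the lowering step: guarantee a final newline on output."""
    if lines and not lines[-1].endswith('\n'):
        return lines[:-1] + [lines[-1] + '\n']
    return list(lines)


parent = ['first\n', 'last']      # no final newline
unchanged = ['first\n', 'last']   # same content, same missing newline
edited = ['first\n', 'changed']   # last line really did change

no_delta = line_delta(parent, unchanged)
one_hunk = line_delta(parent, edited)

# The newline is only added when the hunks are written out.
stored = [(start, end, serialize_lines(repl))
          for start, end, repl in one_hunk]
```

Here an unchanged file produces an empty delta whether or not the parent came
from a cache, and only a real change to the last line produces a hunk.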

Thoughts?

John
=:->