[PATCH] KnitIndex tests/fixes/optimizations (was: KnitIndex optimizations)

Tue Nov 28 15:56:36 GMT 2006

Dmitry Vasiliev wrote:
> 

v- We actually have gotten away from adding "Modified by" lines at the
top. Part of the point of having a VCS with annotation support is that
these sorts of things are automatically tracked. And if all of the lines
introduced by someone are removed, that can be shown as well.

So we should probably go through the code and remove all of them. But it
doesn't really matter right now.

>  # Modified by Johan Rydberg <jrydberg at gnu.org>
>  # Modified by Robert Collins <robert.collins at canonical.com>
>  # Modified by Aaron Bentley <aaron.bentley at utoronto.ca>
> +# Modified by Dmitry Vasiliev <dima at hlabs.spb.ru>
>  #
>  # This program is free software; you can redistribute it and/or modify
>  # it under the terms of the GNU General Public License as published by
> @@ -74,6 +75,7 @@
>  import warnings
>  

v- It is better to do this as:

from bzrlib import (
    ...
    ui,
    )

And then later on just use 'ui.ui_factory'. It helps avoids bugs like
this, where it wasn't imported in this file, but still works because
someone happened to import it earlier.

>  import bzrlib
> +import bzrlib.ui
>  from bzrlib import (
>      cache_utf8,
>      errors,
> @@ -596,12 +598,7 @@
>  
>      def _check_versions_present(self, version_ids):
>          """Check that all specified versions are present."""
> -        version_ids = set(version_ids)
> -        for r in list(version_ids):
> -            if self._index.has_version(r):
> -                version_ids.remove(r)
> -        if version_ids:
> -            raise RevisionNotPresent(list(version_ids)[0], self.filename)
> +        self._index.check_versions_present(version_ids)
>  
>      def _add_lines_with_ghosts(self, version_id, parents, lines, parent_texts):
>          """See VersionedFile.add_lines_with_ghosts()."""
> @@ -882,7 +879,6 @@
>              versions = [versions]
>          if not versions:
>              return []
> -        self._check_versions_present(versions)
>          return self._index.get_ancestry(versions)

^- Why are you removing the safety checks? I didn't see a comment about
how it helps/hurts.

> -    def _parse_parents(self, compressed_parents):
> -        """convert a list of string parent values into version ids.
> -
> -        ints are looked up in the index.
> -        .FOO values are ghosts and converted in to FOO.
> -
> -        NOTE: the function is retained here for clarity, and for possible
> -              use in partial index reads. However bulk processing now has
> -              it inlined in __init__ for inner-loop optimisation.
> -        """
> -        result = []
> -        for value in compressed_parents:
> -            if value[-1] == '.':
> -                # uncompressed reference
> -                result.append(value[1:])

^- Good to see this removed, since it is actually buggy. It should be
"if value[0] == '.':" not value[-1].

v- I think the version with parenthesis is actually easier to read. Is
there a reason you wanted to remove it?

>      def has_version(self, version_id):
>          """True if the version is in the index."""
> -        return (version_id in self._cache)
> +        return version_id in self._cache
>  
>      def get_position(self, version_id):
>          """Return data position and size of specified version."""
> -        return (self._cache[version_id][2], \
> -                self._cache[version_id][3])
> +        entry = self._cache[version_id]
> +        return entry[2], entry[3]

^- Why not return

# Return entries 2 and 3
return self._cache[version_id][2:4]

Or possibly:

# Return entries 2 and 3
return tuple(self._cache[version_id][2:4])

I don't think tuple() is strictly necessary, though.

...

v- I don't know that it matters, but I'm curious if this would be faster:

missing = [v for v in version_ids if v not in self._cache]
if missing:
  raise ...

I realize it doesn't shortcut like the for loop. But honestly the number
of entries is generally small, and it should be very rare that the
version_id is not present.

>      def check_versions_present(self, version_ids):
>          """Check that all specified versions are present."""
> -        version_ids = set(version_ids)
> -        for version_id in list(version_ids):
> -            if version_id in self._cache:
> -                version_ids.remove(version_id)
> -        if version_ids:
> -            raise RevisionNotPresent(list(version_ids)[0], self.filename)
> +        cache = self._cache
> +        for version_id in version_ids:
> +            if version_id not in cache:
> +                raise RevisionNotPresent(version_id, self._filename)
>  
>  

...

v- Why are you using a MockTransport rather than something like
MemoryTransport?

You can do what you want with a simple:

t = get_transport('memory:///')
t.put_bytes(filename, ''.join(lines))
t.get(filename) / t.get_bytes(filename)

Then you have a Transport that implements the full API, and you don't
have to worry if Knit changes to use get_bytes instead of get() or some
other small change. It also means we are using a Transport which
conforms to the correct API, rather than something which might, but it
untested. You could still write a logging decorator that uses
__getattr__() to log and pass all requests to self._real_transport.

> +class MockTransport(object):
> +
> +    def __init__(self, file_lines=None):
> +        self.file_lines = file_lines
> +        self.calls = []
> +
> +    def get(self, filename):
> +        if self.file_lines is None:
> +            raise NoSuchFile(filename)
> +        else:
> +            return StringIO("\n".join(self.file_lines))
> +
> +    def __getattr__(self, name):
> +        def queue_call(*args, **kwargs):
> +            self.calls.append((name, args, kwargs))
> +        return queue_call

-- If you keep the class, PEP8 says to have 2 blank spaces between
classes. So you need another space here.

>

...

> +    def test_read_ignore_corrupted_lines(self):
> +        transport = MockTransport([
> +            _KnitIndex.HEADER,
> +            "corrupted",
> +            "corrupted options 0 1 .b .c ",
> +            "version options 0 1 :"
> +            ])
> +        index = _KnitIndex(transport, "filename", "r")
> +        self.assertEqual(1, index.num_versions())
> +        self.assertTrue(index.has_version(u"version"))

^- You probably also want to add

	self.assertFalse(index.has_version(u'corrupted'))

v- You aren't testing that the parent references are also updated. So
you may want to make the lines:

  "parent options 0 1 :",
  "version options1 1 2 :",
  "version options2 3 4 .other :",
  "version options3 5 6 0 :",

Because honestly, Robert actually only intended for it to override the
parents entry. It turns out beneficial to allow the others, but
overriding the parent list is the *important* side effect.

> +    def test_read_duplicate_entries(self):
> +        transport = MockTransport([
> +            _KnitIndex.HEADER,
> +            "version options1 0 1 :",
> +            "version options2 1 2 :",
> +            "version options3 3 4 :"
> +            ])
> +        index = _KnitIndex(transport, "filename", "r")
> +        self.assertEqual(1, index.num_versions())
> +        self.assertEqual(0, index.lookup(u"version"))
> +        self.assertEqual((3, 4), index.get_position(u"version"))
> +        self.assertEqual(["options3"], index.get_options(u"version"))
> +
> +    def test_read_compressed_parents(self):
> +        transport = MockTransport([
> +            _KnitIndex.HEADER,
> +            "a option 0 1 :",
> +            "b option 0 1 0 :",
> +            "c option 0 1 1 0 :",
> +            ])
> +        index = _KnitIndex(transport, "filename", "r")
> +        self.assertEqual([u"a"], index.get_parents(u"b"))
> +        self.assertEqual([u"b", u"a"], index.get_parents(u"c"))
> +

v- For all of these tests, I'm not sure that 'append_bytes()' is as
important as what the final contents are. It may be done by append() or
may be done through append_bytes().

What could be useful is to make sure that only a single append() is
invoked, even when we are adding several versions. But I think if you
use a MemoryTransport instead of MockTransport, you can check the bytes
on "disk". Which I think is what you are trying to check by looking at
the transport.calls entry.

> +    def test_write_utf8_version_id(self):
> +        transport = MockTransport([
> +            _KnitIndex.HEADER
> +            ])
> +        index = _KnitIndex(transport, "filename", "r")
> +        index.add_version(u"version-\N{CYRILLIC CAPITAL LETTER A}",
> +            ["option"], 0, 1, [])
> +        self.assertEqual(("append_bytes", ("filename",
> +            u"\nversion-\N{CYRILLIC CAPITAL LETTER A}"
> +                u" option 0 1  :".encode("utf-8")),
> +            {}),
> +            transport.calls.pop(0))

v- Robert has convinced me that it is better to add a new function than
to use local functions like this. So you want to add:

  def assertAncestryEqual(expected, index, revision_id):
     actual = index.get_ancestry(revision_id)
     self.assertEqual(expected, actual)

  ...

     self.assertAncestryEqual([], index, [])
     self.assertAncestryEqual([u'a'], index, [u'a'])
  ...

> +    def test_get_ancestry(self):
> +        transport = MockTransport([
> +            _KnitIndex.HEADER,
> +            "a option 0 1 :",
> +            "b option 0 1 0 .e :",
> +            "c option 0 1 1 0 :",
> +            "d option 0 1 2 .f :"
> +            ])
> +        index = _KnitIndex(transport, "filename", "r")
> +
> +        eq = self.assertEqual
> +        get = index.get_ancestry
> +
> +        eq([], get([]))
> +        eq([u"a"], get([u"a"]))
> +        eq([u"a", u"b"], get([u"b"]))
> +        eq([u"a", u"b", u"c"], get([u"c"]))
> +        eq([u"a", u"b", u"c", u"d"], get([u"d"]))
> +        eq([u"a", u"b"], get([u"a", u"b"]))
> +        eq([u"a", u"b", u"c"], get([u"a", u"c"]))
> +
> +        self.assertRaises(RevisionNotPresent, get, [u"e"])
> +

...

v- You add the exact same entry 2 times. Shouldn't you add it with new
options at least?

> +    def test_num_versions(self):
> +        transport = MockTransport([
> +            _KnitIndex.HEADER
> +            ])
> +        index = _KnitIndex(transport, "filename", "r")
> +
> +        self.assertEqual(0, index.num_versions())
> +        self.assertEqual(0, len(index))
> +
> +        index.add_version(u"a", ["option"], 0, 1, [])
> +        self.assertEqual(1, index.num_versions())
> +        self.assertEqual(1, len(index))
> +
> +        index.add_version(u"a", ["option"], 0, 1, [])
> +        self.assertEqual(1, index.num_versions())
> +        self.assertEqual(1, len(index))
> +
> +        index.add_version(u"b", ["option"], 0, 1, [])
> +        self.assertEqual(2, index.num_versions())
> +        self.assertEqual(2, len(index))
> +

...

I didn't fine-tooth-comb all of the tests. But overall it looks pretty
good. So +0.5 as is, but with a bit of cleanup, it will be very nice to
merge it.

John
=:->

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 254 bytes
Desc: OpenPGP digital signature
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20061128/a33417ee/attachment.pgp