[PATCH] KnitIndex tests/fixes/optimizations (was: KnitIndex optimizations)

Thu Nov 30 15:13:36 GMT 2006

John Arbash Meinel wrote:
> Dmitry Vasiliev wrote:
> 
> v- We actually have gotten away from adding "Modified by" lines at the
> top. Part of the point of having a VCS with annotation support is that
> these sorts of things are automatically tracked. And if all of the lines
> introduced by someone are removed, that can be shown as well.
> 
> So we should probably go through the code and remove all of them. But it
> doesn't really matter right now.

Ok, I'll then remove all those lines at the top of the knit.py.

> v- It is better to do this as:
> 
> from bzrlib import (
>     ...
>     ui,
>     )
> 
> And then later on just use 'ui.ui_factory'. It helps avoids bugs like
> this, where it wasn't imported in this file, but still works because
> someone happened to import it earlier.

I agree, it's better solution.

>> -        self._check_versions_present(versions)
>>          return self._index.get_ancestry(versions)
> 
> ^- Why are you removing the safety checks? I didn't see a comment about
> how it helps/hurts.

The check is now inlined into index's 'get_ancestry' and 
'get_ancestry_with_ghosts' methods.

>> -    def _parse_parents(self, compressed_parents):
[skip]
> ^- Good to see this removed, since it is actually buggy. It should be
> "if value[0] == '.':" not value[-1].

Yes, I tried to fix it but then I realized that it's just YAGNI and needs to be 
removed.

> v- I think the version with parenthesis is actually easier to read. Is
> there a reason you wanted to remove it?
> 
>> -        return (version_id in self._cache)
>> +        return version_id in self._cache

I've just removed them because they are unnecessary.

>> +        entry = self._cache[version_id]
>> +        return entry[2], entry[3]
> 
> ^- Why not return
> 
> # Return entries 2 and 3
> return self._cache[version_id][2:4]
> 
> Or possibly:
> 
> # Return entries 2 and 3
> return tuple(self._cache[version_id][2:4])
> 
> I don't think tuple() is strictly necessary, though.

I thought about this but the 'tuple(...)' version is slower than the 'entry[2], 
entry[3]' version and I don't want to change type of the returned value for 
now. I think it's better to first analyze the whole index's interface based on 
the current use cases.

> v- I don't know that it matters, but I'm curious if this would be faster:
> 
> missing = [v for v in version_ids if v not in self._cache]
> if missing:
>   raise ...
>
>> +        cache = self._cache
>> +        for version_id in version_ids:
>> +            if version_id not in cache:
>> +                raise RevisionNotPresent(version_id, self._filename)
> 
> I realize it doesn't shortcut like the for loop. But honestly the number
> of entries is generally small, and it should be very rare that the
> version_id is not present.

I'd prefer the 'for loop' version because it's much more readable and anyway 
differences in speed won't be so big.

> v- Why are you using a MockTransport rather than something like
> MemoryTransport?
> 
> You can do what you want with a simple:
> 
> t = get_transport('memory:///')
> t.put_bytes(filename, ''.join(lines))
> t.get(filename) / t.get_bytes(filename)
> 
> Then you have a Transport that implements the full API, and you don't
> have to worry if Knit changes to use get_bytes instead of get() or some
> other small change. It also means we are using a Transport which
> conforms to the correct API, rather than something which might, but it
> untested. You could still write a logging decorator that uses
> __getattr__() to log and pass all requests to self._real_transport.

I'd prefer to use little as possible test framework with simple classes in unit 
tests because unit tests in most cases should only test how the object interact 
with the outer world. And I see some advantage for a case when you just change 
get() in favor of get_bytes() - you'll got a broken tests and will need to 
double check your changes. I think use MemoryTransport is more preferable for 
high level functional tests.

> -- If you keep the class, PEP8 says to have 2 blank spaces between
> classes. So you need another space here.

Yes, you're right.

>> +    def test_read_ignore_corrupted_lines(self):
>> +        transport = MockTransport([
>> +            _KnitIndex.HEADER,
>> +            "corrupted",
>> +            "corrupted options 0 1 .b .c ",
>> +            "version options 0 1 :"
>> +            ])
>> +        index = _KnitIndex(transport, "filename", "r")
>> +        self.assertEqual(1, index.num_versions())
>> +        self.assertTrue(index.has_version(u"version"))
> 
> ^- You probably also want to add
> 
> 	self.assertFalse(index.has_version(u'corrupted'))

I think 'index.num_versions() == 1' should be enough since num_versions() and 
has_version() are tested in a separate tests.

> v- You aren't testing that the parent references are also updated.

Good idea, I'll add such a test.

> v- For all of these tests, I'm not sure that 'append_bytes()' is as
> important as what the final contents are. It may be done by append() or
> may be done through append_bytes().
> 
> What could be useful is to make sure that only a single append() is
> invoked, even when we are adding several versions.

It's already done in test_add_versions().

> But I think if you
> use a MemoryTransport instead of MockTransport, you can check the bytes
> on "disk". Which I think is what you are trying to check by looking at
> the transport.calls entry.

With MockTransport we've got much finer control over what is going on. And I 
think it can be important for a unit test.

> v- Robert has convinced me that it is better to add a new function than
> to use local functions like this. So you want to add:
> 
>   def assertAncestryEqual(expected, index, revision_id):
>      actual = index.get_ancestry(revision_id)
>      self.assertEqual(expected, actual)

I only made shorter references to methods to make the lines shorter so I don't 
see any problems with this. However it can be useful to extract tests from 
test_get_ancestry() and test_get_ancestry_with_ghosts() into a one separate 
method.

> v- You add the exact same entry 2 times. Shouldn't you add it with new
> options at least?

Ok, maybe it'll made the test more robust.

> I didn't fine-tooth-comb all of the tests. But overall it looks pretty
> good. So +0.5 as is, but with a bit of cleanup, it will be very nice to
> merge it.

I'll prepare the bundle with the cleanups.

-- 
Dmitry Vasiliev (dima at hlabs.spb.ru)
     http://hlabs.spb.ru