[RFC] allow fulltext to be more dynamic

Fri Nov 24 16:53:24 GMT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dmitry Vasiliev wrote:
> John Arbash Meinel wrote:
>> Anyway, we really want to address our large-tree performance in other
>> ways (breaking up the inventory by dir, using recursive 'last-changed'
>> fields to avoid inspection of subdirs, etc). But this is a fairly simple
>> fix, that is completely compatible with current versions of bzr.
> 
> +1
> 
>> +        self._max_delta_chain = 200
>> +
> 
> Isn't it better to always define such a constants at the class level?
> 

Well, at this point I'm not sure if it should be a constant. I could
define it at the class level, but then you are supposed to access them
as class-level members. So do something like:

class KnitVersionedFile(...

  _max_delta_chain = 200

...

  while count < KnitVersionedFile._max_delta_chain:
    ...

At least that is what I've understood from Martin and Robert. They don't
really like having class level definitions which can be overridden per
object and accessed as 'self._max_delta_chain'.

And this is something that I think could be different for inventories
versus standard file texts.

>> +        count = 0
>> +        delta_size = 0
>> +        fulltext_size = None
>> +        delta_parents = first_parents
>> +        while count < self._max_delta_chain:
> 
> "for count in xrange(self._max_delta_chain):" seems more pythonic to me.

Agreed. I was copying the old code, but a 'for' loop fits better here.

> 
>> +            parent = delta_parents[0]
>> +            method = self._index.get_method(parent)
>> +            pos, size = self._index.get_position(parent)
>> +            if method == 'fulltext':
>> +                fulltext_size = size
>> +                break
>> +            delta_size += size
>> +            delta_parents = self._index.get_parents(parent)
>> +            count = count + 1
> 
> I'd prefer to localize often used variables inside loops (self._index in
> this case).

I would only do it if it is a performance critical area. And I'm not
sure if this is or not.

Otherwise the mental overhead of 'is is this index the other index that
I'm thinking of....' starts to become too much.

Plus, if we really wanted to performance tune this, it might be better
to ask for the list, rather than indirecting through a function call.

> 
>> +        if method == 'line-delta' or fulltext_size < delta_size:
>> +            return False
>> +        return True
> 
> Again I'd prefer just
> return not (method == 'line-delta' or fulltext_size < delta_size)
> 

What do people think of this? I don't think I prefer doing:

  return (complex boolean logic)

If it is simple, then:
  return (x == 1)

Seems okay. And actually with Henri's suggestion, rewriting it to be:

for count in xrange(self._max_chain_length):
  ...
else:
  return False #Couldn't find a full text, so we must create a new one

# Create a delta if it will still be smaller than a fulltext
return fulltext_size > delta_size

Also, do people think it would be worth anything to use a multiplier? So
you would get:

return ((fulltext_size * self._size_multiplier) > delta_size)

My one thought is that texts generally grow in size. And you really want
to select a delta based on the current fulltext size, not on the
original fulltext size.

But the code at this point may only have a delta (there are 2 code
paths, one starts with a delta, and the other starts with a fulltext).
And you really don't want to have to create the fulltext from a delta,
to determine that you want to use the delta ...

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFZyOEJdeBCYSNAAMRAjKPAJsF1v4qoEWkrXoNqpPW71iUflrEhgCgy38B
GGRenifFOx2sMcTbZspU/uc=
=rAlN
-----END PGP SIGNATURE-----