[MERGE] Faster 'build_tree'

Thu Jul 26 17:45:18 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> Aaron Bentley wrote:
>>> Also, if we have_get_text_map, does _get_contents_map need to return a
>>> text map?  I think we can remove that.
>> I believe there are other users of _get_contents_map, but if not, then yes, we
>> can get rid of it. I'll look around.
> 
> I mean that we can make _get_contents_map return only a contents map.  I
> didn't mean completely eliminating the method.

Sure, my reply was poor. I meant that I thought there were other users that
used the 'text_map' portion of _get_contents_map.

> 
>> I'm actually thinking about removing _get_text_map() entirely, since it is
>> focused on extracting multiple entries. When all we really want is a fast way
>> to extract a *single* entry. (get_lines()).
> 
> I never figured the overhead would be that big.

About 100ms out of 1s for get_file_lines() calls. So bigger than one might expect.

> 
>>>      def get_file_text(self, file_id):
>>>          file_id = osutils.safe_file_id(file_id)
>>> -        return ''.join(self.get_file_lines(file_id))
>>> +        ie = self._inventory[file_id]
>>> +        weave = self._get_weave(file_id)
>>> +        return weave.get_text(ie.revision)
>>> ^^^ Was this just a drive-by?  (You don't seem to use it anywhere.)
> 
>> Yep. Part of the work was to get "get_file_lines()" and "get_file_text()" and
>> "get_file()" properly used everywhere. And it made more sense to have
>> RT.get_file_text() go down to the lower api, rather than having it go through
>> its own function. I was at least considering having a faster form for
>> extracting the actual text (rather than lines). Though at the moment it isn't
>> strictly necessary.
> 
> If it was essentially doing this already, it seems questionable to add
> extra code.  Saving a function call isn't a very big win for these
> operations.
> 
> Aaron

Well, compare:

lines = rt.get_file(file_id).readlines()

to

lines = rt.get_file_lines(file_id)

The former goes through:

StringIO(''.join(Knit.get_lines(revision_id))).readlines()

The latter can just be:

Knit.get_lines(revision_id)

So propagating the callers needs down the stack means that the low-level
implementation can do what in needs to to return it. Rather than having all the
higher-level apis massaging the data repeatedly.

I don't know any current users of get_file_text() offhand. I'm fine with not
messing with the function at this point.
I believe that it can ultimately be better for us to work in file blocks,
rather than as strictly lines, except for in places where we actually care
about lines (like in annotations).
At the very least, 10 deltas can be defined in terms of 10 blocks, rather than
100+ lines.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGqM+eJdeBCYSNAAMRArO4AJ9A7D5HpbOJti4goq+w0KBXlJPMeQCeJrFw
WRNhNqH+G8F83ihVpNr7fEs=
=vYTC
-----END PGP SIGNATURE-----