[MERGE] Faster 'build_tree'

John Arbash Meinel john at arbash-meinel.com
Thu Jul 26 17:03:34 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> I was poking around our knit extraction code, because it is one of our current
>> bottlenecks. And I found a few areas that could use some cleanup.
> 
> I'm not sure that's the best use of your time, but okay, I'll review it.

Well, it isn't specifically 'build_tree'; it is any time we extract a plain
text from a Knit, which includes diff/commit/build_tree, etc. Also, you were
saying that "get_file()" was high on your profile for merge times.

It's just that 'build_tree' is an obvious user, since it has to extract every
text in the tree.

Also, since I do "bzr cbranch" a *lot*, it affects me often. (Every new
branch has its own working tree, and because I use shared repositories, the
bulk of the time is often spent creating the new working tree.)


> 
> Anyhow, I vote <resubmit>.  It needs a bit more cleanup, and possibly
> refactoring.
> 
> === modified file 'bzrlib/knit.py'
> --- bzrlib/knit.py	2007-07-20 12:56:33 +0000
> +++ bzrlib/knit.py	2007-07-23 23:28:06 +0000
> @@ -226,11 +226,6 @@
>          lines = iter(lines)
>          next = lines.next
> 
> -        cache = {}
> -        def cache_and_return(line):
> -            origin, text = line.split(' ', 1)
> -            return cache.setdefault(origin, origin), text
> 
> ^^^ Dead code elimination?
> 

Yep

> -
>          # walk through the lines parsing.
>          for header in lines:
>              start, end, count = [int(n) for n in header.split(',')]
> @@ -238,6 +233,27 @@
>              result.append((start, end, count, contents))
>          return result
> 
> +    def parse_line_delta_no_annotations(self, lines):
> +        """Convert the delta lines into real lines, but ignore annotations.
> +
> +        line delta is in the form of:
> +        intstart intend intcount
> +        1..count lines:
> +        revid(utf8) newline\n
> +        internal representation is
> +        (start, end, count, [1..count lines])
> +        """
> +        result = []
> +        lines = iter(lines)
> +        next = lines.next
> +
> +        # walk through the lines parsing.
> +        for header in lines:
> +            start, end, count = [int(n) for n in header.split(',')]
> +            contents = [next().split(' ', 1)[1] for i in xrange(count)]
> +            result.append((start, end, count, contents))
> +        return result
> +
>      def get_fulltext_content(self, lines):
>          """Extract just the content lines from a fulltext."""
>          return (line.split(' ', 1)[1] for line in lines)
> @@ -307,6 +323,18 @@
>      def parse_line_delta(self, lines, version_id):
>          return list(self.parse_line_delta_iter(lines, version_id))
> 
> +    def parse_line_delta_no_annotations(self, lines):
> +        cur = 0
> +        num_lines = len(lines)
> +        result = []
> +        while cur < num_lines:
> +            header = lines[cur]
> +            cur += 1
> +            start, end, c = [int(n) for n in header.split(',')]
> +            result.append((start, end, c, lines[cur:cur+c]))
> +            cur += c
> +        return result
> 
> ^^^ It would aid comprehension and comparison if both of these were
> written in the same style.  KnitAnnotateFactory.parse_line_delta is
> index-based, KnitAnnotateFactory.parse_line_delta_no_annotations
> iterator-based, KnitPlainFactory.parse_line_delta_iter is index-based,
> and KnitPlainFactory.parse_line_delta_no_annotations is index-based.
> 

Yeah, actually that stems from the fact that the surrounding code also had that
dichotomy. I'm happy to clean it up.
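
For what it's worth, here is a rough sketch (written as a standalone function
purely for illustration, not the actual method on KnitAnnotateFactory) of what
an index-based version of the annotated-factory parser might look like, in the
same style as the plain factory's version:

```python
def parse_line_delta_no_annotations(lines):
    """Parse delta hunks by index, dropping the 'origin' annotations.

    Each hunk is a 'start,end,count' header followed by count lines of
    the form 'revid text'; only the text part is kept.  Illustrative
    sketch only.
    """
    result = []
    cur = 0
    num_lines = len(lines)
    while cur < num_lines:
        start, end, count = [int(n) for n in lines[cur].split(',')]
        cur += 1
        # keep only the text after the annotation revid
        contents = [line.split(' ', 1)[1]
                    for line in lines[cur:cur + count]]
        result.append((start, end, count, contents))
        cur += count
    return result
```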


> You might also see a performance improvement from avoiding iteration.
> 
> +    def _get_text_map(self, version_ids):
> +        """Produce maps of text and KnitContents
> 
> ^^^ This doesn't produce a map of KnitContents.

Yeah, just a Copy & Paste bug.


> 
> Also, if we have _get_text_map, does _get_contents_map need to return a
> text map?  I think we can remove that.

I believe there are other users of _get_contents_map, but if not, then yes, we
can get rid of it. I'll look around.

> 
> Finally, if we do that, I think we may be able to make a parametrizable
> method that either does get_text_map or get_contents_map.
> 
> +        no_newlines = []
> +        raw_text_map = {}
> +        text_map = {}
> +        for version_id in version_ids:
> +            components = []
> +            cursor = version_id
> +            while cursor is not None:
> +                method, data, digest, next = record_map[cursor]
> +                components.append((cursor, method, data, digest))
> +                if cursor in raw_text_map:
> +                    break
> +                cursor = next
> 
> ^^^ Ah, this brings back memories.  Perhaps a comment explaining this
> creates a set of instructions for building the target version?  Or since
> it's common, maybe we can factor it out?

In later versions I actually moved this sort of thing to the _KnitIndex because
it has more direct information about what we need.

I should probably split out what I've done in Pyrex versus what I've done in
Python, and just submit a cleanup patch that works in Python only.

I'm generally not very happy with how our current code uses the Knit*Factory
helper classes, because the interface is a bit scattered. I've been leaving
things alone for compatibility purposes. (Technically those classes are not
public.)
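
As a sketch of what factoring out that chain walk might look like (names
invented here; the record_map shape is the one used in the quoted
_get_text_map code):

```python
def build_component_chain(version_id, record_map, cached):
    """Walk parent pointers until we hit a fulltext or a cached text.

    record_map maps version_id -> (method, data, digest, next_id), where
    next_id is None for a fulltext.  Returns the components that must be
    (re)applied, in child-to-parent order.  Illustrative sketch only.
    """
    components = []
    cursor = version_id
    while cursor is not None:
        method, data, digest, next_id = record_map[cursor]
        components.append((cursor, method, data, digest))
        if cursor in cached:
            # everything below this point is already built
            break
        cursor = next_id
    return components
```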

> 
> +            raw_text = None
> +            for component_id, method, data, digest in reversed(components):
> +                if component_id in raw_text_map:
> +                    raw_text = raw_text_map[component_id]
> +                else:
> +                    if method == 'fulltext':
> +                        assert raw_text is None
> +                        raw_text = list(self.factory.get_fulltext_content(data))
> +                    elif method == 'line-delta':
> +                        assert raw_text is not None
> +                        delta = self.factory.parse_line_delta_no_annotations(data)
> +                        raw_text = self._apply_delta(raw_text, delta)
> +                    raw_text_map[component_id] = raw_text
> +
> +            if 'no-eol' in self._index.get_options(version_id):
> +                text = raw_text[:] # Don't change the cached text
> +                assert len(raw_text) > 0, ('We have no-eol on a text that'
> +                    'has no actual lines for version_id %s in %s'
> +                    % (version_id, self))
> +                text[-1] = raw_text[-1].rstrip('\n')
> +            else:
> +                text = raw_text
> +            text_map[version_id] = text
> +
> +            # digest here is the digest from the last applied component.
> +            if sha_strings(text) != digest:
> +                raise KnitCorrupt(self.filename,
> +                                  'sha-1 does not match %s' % version_id)
> +        return text_map
> 
> ^^^ A bunch of these lines exceed 79 chars.
> 

Yeah, and that isn't helped by having helper functions that aren't quite what
you want. Nested calls like this:

    raw_text = list(self.factory.get_fulltext_content(data))

tend to make things a little long.

I'm actually thinking about removing _get_text_map() entirely, since it is
focused on extracting multiple entries, when all we really want is a fast way
to extract a *single* entry (get_lines()).

There are times when multiple entries are good (like when grabbing 100
Revisions), but we honestly request a single content out of knits just as often.
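
To make the single-entry path concrete: applying one parsed
(start, end, count, lines) delta to a base text is conceptually just a list
splice. This is only a sketch of what the quoted code's self._apply_delta
call does, assuming hunks are sorted by start and non-overlapping:

```python
def apply_line_delta(base_lines, delta):
    """Apply (start, end, count, lines) hunks to a list of base lines.

    Hunks are applied back-to-front so that earlier hunk offsets stay
    valid after each splice.  Illustrative sketch only.
    """
    lines = list(base_lines)  # don't mutate the (possibly cached) base
    for start, end, count, new_lines in reversed(delta):
        lines[start:end] = new_lines
    return lines
```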

> === modified file 'bzrlib/revisiontree.py'
> --- bzrlib/revisiontree.py	2007-07-17 20:04:13 +0000
> +++ bzrlib/revisiontree.py	2007-07-24 00:13:12 +0000
> @@ -79,7 +79,9 @@
> 
>      def get_file_text(self, file_id):
>          file_id = osutils.safe_file_id(file_id)
> -        return ''.join(self.get_file_lines(file_id))
> +        ie = self._inventory[file_id]
> +        weave = self._get_weave(file_id)
> +        return weave.get_text(ie.revision)
> 
> ^^^ Was this just a drive-by?  (You don't seem to use it anywhere.)
> 
> Aaron

Yep. Part of the work was to get "get_file_lines()", "get_file_text()", and
"get_file()" properly used everywhere. And it made more sense to have
RT.get_file_text() go down to the lower API rather than through its own
function. I was at least considering having a faster form for extracting the
actual text (rather than lines), though at the moment it isn't strictly
necessary.

Thanks for the review. I'll try to put together a "cleaner" and more obvious
patch.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGqMXWJdeBCYSNAAMRAsn1AJ9hQYBBxdRQbGb9QbdjO90Evz/BqACeP55V
1uOa1IrDmCH8pwZIFYtiPow=
=AV1d
-----END PGP SIGNATURE-----


