[MERGE] performance enhancements for log -v

Aaron Bentley aaron.bentley at utoronto.ca
Sun Jun 25 20:46:01 BST 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> (Same thing with your fixes to 'log'. I think it does what I want, but I
> want some time to play with both to get a feel for if the information is
> understandable in the new forms, and which is actually more understandable.)

Well, since the original changes passed muster, and the update was
small, and I was very confident they did what you wanted, I went ahead
and merged them.

>> The main change was to retrieve the inventories in batches, but
>> after I had that, I kept working on eliminating list construction, and
>> wound up knocking out a lot of the private methods.  The way a group of
>> texts is generated now is:
> 
> Are you retrieving inventories, or the complete set of involved texts in
> batches?

I'm retrieving the XML of the inventories in batches.  The batch size
is the same as the revision batch size we use for logs, because both
are driven by iter_revisions in show_log.
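
For illustration, the batching looks roughly like this (the helper and
the repository call are sketches, not actual bzrlib names; batch size 9
mirrors the 'num = 9' in the iter_revisions hunk quoted below):

    def fetch_inventory_texts(repository, revision_ids, size=9):
        # 'get_inventory_xmls' is a hypothetical batched accessor; the
        # point is one request per batch, not one per revision.
        texts = []
        for start in range(0, len(revision_ids), size):
            batch = revision_ids[start:start + size]
            texts.extend(repository.get_inventory_xmls(batch))
        return texts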

>> 1. A dict called component_data is produced that describes how to
>> build all component versions.  This component_data doesn't have the
>> actual texts, but it knows where they're located in the knit
> 
> 
> I assume this is just in a per-knit sense, right?

Yes.  All the 'texts' are just different versions of the same versioned
file (they're either revisions, signatures, inventories, or files with a
given file-id).  This API applies to single VersionedFiles.

> So you tell the knit
> 'give me these revisions', and the knit builds up a map of what it needs
> to get, where it is, and then goes and starts building up your request,
> caching any intermediates as it goes.

At this stage, we gather all the information needed to generate the
requested versions.  It's hard to really call it a cache.

>> 3. Each requested revision is built.  Intermediate KnitContent objects
>> are held in a dict, and reused for building subsequent revisions, where
>> possible.
>>
>> There's also a dict of *final* KnitContent objects, because texts
>> lacking final eols are handled very late in the process.  KnitComponents
>> without final eols cannot be reused to build later revisions.
> 
> This seems like something that should be fixed. Not necessarily in this
> pass, but any time we have the statement "X (which should be used)
> cannot be used" seems like something that needs fixing.

I agree that it would be nice to fix it.  It's actually not much of a
burden, because we can use the same object as an intermediate
KnitContent and as a final KnitContent, unless it lacks a final eol.

It's due to the way knit deltas are calculated, so we need a format
bump to solve the root problem.
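
A toy illustration of the constraint (not knit code; 'no_final_eol'
stands in for the per-version annotation):

    def final_lines(intermediate_lines, no_final_eol):
        # Delta calculation always works on lines ending in '\n'.  When
        # the stored text has no final eol, the final form must strip
        # it, and the stripped copy cannot be a basis for later deltas.
        if no_final_eol:
            last = intermediate_lines[-1].rstrip('\n')
            return intermediate_lines[:-1] + [last]
        # Otherwise one object can be both intermediate and final.
        return intermediate_lines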

>> Additionally, since texts are generated for SHA1 verification, these are
>> kept in a dict and returned.  Most operations will use this map.
>>
> 
> This sounds a lot like we are caching all of the contents that we
> extract from a given knit. Does this clash with your "don't cache unless
> we need to", and when is this text map cleaned out?

It's not really a cache-- it's not retained as part of the
KnitVersionedFile.  It is a local variable in _get_content_maps that is
returned to the caller, which usually turns the map into a list and
returns that.

So the map is destroyed by _get_content_maps' caller, but the elements
of either final_content or text_map are typically passed back to the
caller's caller.  In the case of inventories, the texts are actually
thrown away in Repository.revision_trees.
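
In outline (a simplification of _get_content_maps, with build_version
standing in for the real record-extraction logic):

    def _get_content_maps(self, version_ids):
        # Both dicts are plain locals; nothing is stashed on the
        # KnitVersionedFile, so they die with the caller's references.
        text_map = {}        # version_id -> list of text lines
        final_contents = {}  # version_id -> final KnitContent
        for version_id in version_ids:
            content = self.build_version(version_id)  # hypothetical helper
            final_contents[version_id] = content
            text_map[version_id] = content.text()
        return text_map, final_contents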

> It would seem to me that this map should only exist for the length of
> the actual top-level call (it shouldn't be a state member of the knit,
> it should be passed around between the functions that use it, and then
> garbage collected when you return the texts). At least from what you
> seemed to see with kernel-sized trees taking up too much ram.

Right, it's like that.

> 
> ^- This doesn't seem to be very readable. I realize the old one wasn't
> either, but maybe we could do:
> 
> map = self._get_record_map([version_id])
> record_info = map[version_id]
> return record_info[2] # sha is the 2nd index in the info map

Oh, we could take it farther, even.
map = self._get_record_map([version_id])
method, content, digest, next = map[version_id]
return digest


> -    def _get_component_versions(self, version_id):
> -        basis = self.basis_knit
> -        needed_versions = []
> -        basis_versions = []
> -        cursor = version_id
> -
> -        while 1:
> -            picked_knit = self
> -            if basis and basis._index.has_version(cursor):
> -                picked_knit = basis
> -                basis_versions.append(cursor)
> -            method = picked_knit._index.get_method(cursor)
> -            needed_versions.append((method, cursor))
> -            if method == 'fulltext':
> -                break
> -            cursor = picked_knit.get_parents(cursor)[0]
> -        return needed_versions, basis_versions
> -
> -    def _get_component_positions(self, version_id):
> -        needed_versions, basis_versions = \
> -            self._get_component_versions(version_id)
> -        assert len(basis_versions) == 0
> -        positions = []
> -        for method, comp_id in needed_versions:
> -            data_pos, data_size = self._index.get_position(comp_id)
> -            positions.append((method, comp_id, data_pos, data_size))
> -        return positions
> -
> -    def _get_components(self, version_id):
> -        """Return a list of (version_id, method, data) tuples that
> -        makes up version specified by version_id of the knit.
> -
> -        The components should be applied in the order of the returned
> -        list.
> -
> -        The basis knit will be used to the largest extent possible
> -        since it is assumed that accesses to it is faster.
> +    def _get_components_positions(self, version_ids):
> +        """Produce a map of position data for the components of versions.
> +
> +        This data is intended to be used for retrieving the knit records.
> +
> +        A dict of version_id to (method, data_pos, data_size, next) is
> +        returned.
> +        method is the way referenced data should be applied.
> +        data_pos is the position of the data in the knit.
> +        data_size is the size of the data in the knit.
> +        next is the build-parent of the version, or None for fulltexts.
>          """
> -        #profile notes:
> -        # 4168 calls in 14912, 2289 internal
> -        # 4168 in 9711 to read_records
> -        # 52554 in 1250 to get_parents
> -        # 170166 in 865 to list.append
> -
> -        # needed_revisions holds a list of (method, version_id) of
> -        # versions that is needed to be fetched to construct the final
> -        # version of the file.
> -        #
> -        # basis_revisions is a list of versions that needs to be
> -        # fetched but exists in the basis knit.
> -
> -        needed_versions, basis_versions = \
> -            self._get_component_versions(version_id)
> -
> -        components = {}
> -        if basis_versions:
> -            assert False, "I am broken"
> -            basis = self.basis_knit
> -            records = []
> -            for comp_id in basis_versions:
> -                data_pos, data_size = basis._index.get_data_position(comp_id)
> -                records.append((comp_id, data_pos, data_size))
> -            components.update(basis._data.read_records(records))
> -
> -        records = []
> -        for comp_id in [vid for method, vid in needed_versions
> -                        if vid not in basis_versions]:
> -            data_pos, data_size = self._index.get_position(comp_id)
> -            records.append((comp_id, data_pos, data_size))
> -        components.update(self._data.read_records(records))
> -
> -        # get_data_records returns a mapping with the version id as
> -        # index and the value as data.  The order the components need
> -        # to be applied is held by needed_versions (reversed).
> -        out = []
> -        for method, comp_id in reversed(needed_versions):
> -            out.append((comp_id, method, components[comp_id]))
> -
> -        return out
> -
> +        component_data = {}
> +        for version_id in version_ids:
> +            cursor = version_id
> +
> +            while cursor is not None and cursor not in component_data:
> +                picked_knit = self
> +                method = picked_knit._index.get_method(cursor)
> +                if method == 'fulltext':
> +                    next = None
> +                else:
> +                    next = picked_knit.get_parents(cursor)[0]
> +                data_pos, data_size = self._index.get_position(cursor)
> +                component_data[cursor] = (method, data_pos, data_size, next)
> +                cursor = next
> +        return component_data
> +
> 
> 
> Do I understand correctly that you basically folded up the 2 requests
> into a single function that now takes an array of versions, rather than
> just one?

Right.  Ultimately, I needed a map like the one produced by
_get_components_positions, so it made sense to build it in one step
instead of two, and using an array of versions meant we could eliminate
duplicate components at this stage.
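
For example, for a delta built on one fulltext (the offsets here are
invented), the map comes out like:

    component_data = knit._get_components_positions(['rev-2'])
    # {'rev-2': ('line-delta', 1024, 211, 'rev-1'),
    #  'rev-1': ('fulltext', 0, 973, None)}
    # Passing ['rev-1', 'rev-2'] yields the same two entries: the walk
    # stops as soon as a cursor is already present in the map.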

> You seem to still be carrying around the 'basis-knit' cruft,
> even though you never reference the basis knit that I can tell.
> (Otherwise why are you using 'picked knit').

Yes.  I thought that completely removing basis was a somewhat different
focus, so I should do that as a separate merge request.  I wouldn't want
performance enhancements to be delayed by discussions of the utility of
'basis', for example.

> In general, I think I like this function, and I like that it takes an
> array. I'm hoping we don't need to deprecate the old functions, just
> because they were marked private. But if we do, couldn't they just thunk
> into this?

The old functions had unnecessary complications, but it could be done if
necessary.  To implement _get_component_versions, we'd have to call
_get_components_positions and then strip out the position data, for example.

But private interfaces are fair game for mutilation, and I made sure
that they had no callers before I deleted them.

>      def _get_content(self, version_id, parent_texts={}):
>          """Returns a content object that makes up the specified
>          version."""
> @@ -605,26 +555,7 @@
>          if self.basis_knit and version_id in self.basis_knit:
>              return self.basis_knit._get_content(version_id)

...
> +        return self._get_content_maps([version_id])[1][version_id]
> 
> I must not understand the map that you are returning. Because I would
> have sworn that this should have been:
> return self._get_content_maps([version_id])[version_id][1]

_get_content_maps returns (text_map, final_contents).

So this is equivalent to:
    text_map, final_contents = self._get_content_maps([version_id])
    return final_contents[version_id]

> If I'm right, then I think you should apply the same fix for
> content = self._get_content_map([version_id])[version_id]
> return content[1]
> Or something to make it clearer what _get_content_map is returning.

I've already documented the return type.  What else should I do?

> It might also be better to have a convenience function for returning the
> content for a single entry

That's _get_content, the method we're looking at here.
_get_content_maps has no other clients that want a single entry.

>, since we seem to want to use that in several
> places

Nope.  Just _get_content.

>, and I think you've shown it is fairly easy to get it wrong. (or
> at least hard to read when you have get_foo([a])[b][c])

I can do the two-step version, if you prefer.

> +        records = [(c, p, s) for c, (m, p, s, n) in position_map.iteritems()]
> +        record_map = {}
> +        for component_id, content, digest in\
> +            self._data.read_records_iter(records):
> +            method, position, size, next = position_map[component_id]
> +            record_map[component_id] = method, content, digest, next
> +
> +        return record_map
> 
> Either use longer variable names, or at least put a comment right above
> it, as to what the different variables mean:

I only use short variable names in tightly limited scopes like lambdas,
list comprehensions and generator expressions.

> # c: content, p: position, s: size/length, m:???, n:???
> Or is s sha? (Hence the problem with single letter variables).

Well, it's pretty consistently 'digest' throughout the module.

> I'm okay with short ones inside the list comprehension, since it makes
> it nice and succinct, but we need to clarify what the objects are.

Alright, I'll stick a comment in there.  But I do think anyone trying to
understand this function will need to understand
_get_components_positions, which describes its return value in its docstring.
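
Something like this, say (the wording of the comment is mine):

    # Unpack position_map values documented in _get_components_positions:
    # c = component_id, m = method, p = data position, s = data size,
    # n = next (the build-parent, or None for fulltexts).
    records = [(c, p, s) for c, (m, p, s, n) in position_map.iteritems()]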

> ^-- Grammar bug 'dicts are take', 'dicts take'?

Thanks.  'Dicts take' was intended.

> ^- Looking at the above lines, it would seem that knit deltas are
> actually built up assuming that every line has a final newline, and then
> they strip the last newline. Is that correct?

That's correct.  The relationship of knit texts to each other is
calculated with a final eol included, but the digest is calculated with
respect to the final text, i.e. with no trailing newline.
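
A toy example of the two views of the same text (using the sha module
of this era; no_final_eol again stands in for the stored annotation):

    import sha

    def final_digest(intermediate_lines, no_final_eol):
        # Deltas see ['hello\n']; the digest sees 'hello' when the
        # version is marked as having no final eol.
        text = ''.join(intermediate_lines)
        if no_final_eol:
            text = text[:-1]
        return sha.new(text).hexdigest()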

> Since the knits are actually 'line-deltas' rather than being actual raw
> deltas, maybe this is reasonable. I don't really like keeping 2 copies
> of records, but as long as it is the same reference for everything other
> than entries without eol, it's probably okay.

I think it is counterintuitive.  I spent quite a while trying to figure
out why a few inventories were giving me hash failures.  It would be
much nicer if we didn't need this intermediate/final distinction.

>      def iter_lines_added_or_present_in_versions(self, version_ids=None):
>          """See VersionedFile.iter_lines_added_or_present_in_versions()."""
> @@ -1444,13 +1407,15 @@
>          # 4168   calls to readv              in 1411
>          # 4168   calls to parse_record       in 2880
> 
> -        needed_records = []
> +        needed_records = set()
>          for version_id, pos, size in records:
>              if version_id not in self._records:
> -                needed_records.append((version_id, pos, size))
> +                needed_records.add((version_id, pos, size))
> +
> +        # turn our set into a list, sorted by file position
> +        needed_records = sorted(needed_records, key=operator.itemgetter(1))
> 
> ^- This is where you figure out what you are going to need to 'readv'
> right? Why are you using a set() instead of a list?

So that duplications in the input are eliminated.

> I do believe
> list.append() is faster than 'set.add()'. 

They're on the same order of magnitude.  For 10M operations, I get

list.append
===========
real    0m6.663s
user    0m5.896s
sys     0m0.424s

set.add
=======
real    0m9.874s
user    0m7.760s
sys     0m1.020s

But that's ten million operations!  These things are both pretty fast.
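
Loops of roughly this shape, run under the shell's time command, give
numbers like those above (this is the shape of the test, not the exact
script):

    # Ten million appends vs. adds, timed with 'time python ...'.
    def bench_append(n=10000000):
        l = []
        for i in xrange(n):
            l.append(i)

    def bench_add(n=10000000):
        s = set()
        for i in xrange(n):
            s.add(i)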

> Though I guess you may be
> adding the same version_id more than once (depending on what is in
> 'records' and '_records').

Right.

> I am also wondering if it would be better to add them as:
> needed_records.add((pos, size, version_id)) and then just let them sort
> naturally.

I was being a bit minimal here, I guess.  That ordering is also used in
read_records_iter_raw, and keeping them the same seems like a good idea.

> (I wouldn't mind sorting on pos, size rather than just pos. Though we
> shouldn't have them overlapping).

Indeed, we shouldn't!

>      def iter_revisions():
>          revision_ids = [r for r, n, d in view_revisions]
> +        zeros = set(r for r, n, d in view_revisions if d == 0)
>          num = 9
> +        cur_deltas = {}
> 
> ^- again, either avoid single letters, or define them in a comment; it
> doesn't really matter when there is only one [r for r in revisions],
> but it is more confusing when there are multiple variables

Okay.

> +                deltas = branch.repository.get_revision_deltas(delta_revisions)
> +                cur_deltas = dict(zip((d.revision_id for d in
> +                                       delta_revisions), deltas))
> 
> ^- here 'd.revision_id' seems the wrong variable name, since
> 'delta_revisions' is actually a list of 'Revision' objects. Better to
> use r.revision_id for r in delta_revisions. Or even better 'for rev in
> delta_revisions'. I was trying to figure out how you were iterating over
> the dictionary.

Okay.
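
Concretely, something like:

    deltas = branch.repository.get_revision_deltas(delta_revisions)
    cur_deltas = dict(zip((rev.revision_id for rev in delta_revisions),
                          deltas))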

> +                yield revision, cur_deltas.get(revision.revision_id)

> ^- Can you put a comment that cur_deltas.get() is returning None in the
> case that we are not Verbose?

I can, but isn't it obvious that I'm using dict.get because I want to
get None if it's not there?

> Either that, or change it to an if/else
> block and use 'yield revision, None' in the second block, but I think I just
> prefer the comment.

I definitely prefer the comment to that.
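
i.e. something like:

    # cur_deltas.get() returns None when we are not verbose, since no
    # delta was computed for this revision in that case.
    yield revision, cur_deltas.get(revision.revision_id)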

> +    @needs_read_lock
> +    def get_revision_deltas(self, revisions):
> +        """Produce a generator of revision deltas.
> +
> +        Note that the input is a sequence of REVISIONS, not revision_ids.
> +        Trees will be held in memory until the generator exits.
> +        Each delta is relative to the revision's lefthand predecessor.
> 
> ^-- Can you make a comment that :param revisions: A list of Revision
> objects. I would have otherwise thought that they were revision ids.

It's the second line of the docstring, in caps, even.  Considering no
one is actually using epydoc for bzr, I think that's more useful.

> I'm also concerned that 'get_revision_delta' takes a revision_id, and
> 'get_revision_deltas()' takes Revision objects.
> Maybe 'get_deltas_for_revisions()' is better, to make it clear that the
> api difference is more than just taking a list versus a single item.

Makes sense.
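
So, roughly:

    @needs_read_lock
    def get_deltas_for_revisions(self, revisions):
        """Produce a generator of revision deltas.

        :param revisions: a sequence of Revision objects, NOT revision_ids.

        Trees will be held in memory until the generator exits.
        Each delta is relative to the revision's lefthand predecessor.
        """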

> ^- Ultimately, we should think about changing our data storage, so that
> rather than deserializing the entire inventory, just to run
> 'compare_trees' on the result, we should be able to come up with some
> sort of Delta object, that can be run directly on the stored
> information.

We could take this in an evil hackish direction and retrieve the
inventory deltas instead of the inventories...

> I'm not sure how the api would work, and it may involve a
> storage format change.

To be sane, it should.  We just need a one-line-per-entry format that we
can sanely parse without its surrounding context.  That also makes
scalar annotate-based-merge a lot easier.

> But I think from an algorithmic level, it cuts
> out *lots* of operations of extracting all the old inventories, just to
> build them up again.
> And since our data storage right now is actually just a delta against
> the leftmost ancestor, it seems like it wouldn't require a data format
> change.

Not a knit format change, but I'd really prefer to avoid doing this on
our current inventory XML format.

> This doesn't block what you've done, but I see it as a *massive*
> improvement for kernel-sized trees, since it changes O(num_files) to
> O(num_changes). And for all I know, the compare_trees is actually
> O(num_files^2), I think it is at least O(n log(n)). Or at least O(2n),
> since it has to iterate over both the previous and post inventories,
> rather than just looking at the delta that we've already computed.

On the bzr mainline, out of 111.29 seconds in show_log, compare_trees
takes 59.39 seconds.  This isn't shown by my benchmark, because my
benchmark doesn't change any files.

Didn't you already get it to O(2n) scaling?

> So, I think you've done some great work, and with a couple cleanups you
> have a +1. And I think as a group we should think about how we can get
> rid of reading the entire inventory, to build up a RevisionTree, just to
> call compare_trees() and generate a delta against something we already
> have a delta for.

Cool.

> However, I think introducing 'get_revision_deltas()' (whatever it ends
> up being called), gives us a location to do the optimization (even if it
> requires new interfaces on VersionedFile).
> So I'm very +1 on your changes.

Okay, I'll clean up and submit.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFEnuf40F+nu1YWqI0RAmNSAJ9b7C88MLQ6I4LKyfIbdvtoLIRLBgCcC3LO
j6Z8W4oAurlhIedGyMoUrQk=
=4zns
-----END PGP SIGNATURE-----