[Merge] lp:~jameinel/bzr/1.16-better_heads into lp:bzr

John Arbash Meinel john at arbash-meinel.com
Mon Jun 15 18:05:05 BST 2009


(Sending to the mailing list, as most of this is a general discussion,
and not just response to review.)

...

> You have one test failing bzrlib/tests/test_knit.py
> TestStacking.test_annotate (seems trivial but worth checking) and
> some assertions here and there.

I believe I only have "assert" in pyx code, which isn't strictly
forbidden by our test suite :). I can probably remove a lot of that.


> 
> Overall, I think it's a significant improvement, going in the
> right direction and based on ideas we want to generalize, so
> let's land it sooner rather than later.
> 
>>>>>> "jam" == John A Meinel <john at arbash-meinel.com> writes:
> 
>     jam> John A Meinel has proposed merging lp:~jameinel/bzr/1.16-better_heads into lp:bzr.
>     jam> Requested reviews:
>     jam>     bzr-core (bzr-core)
> 
> <snip/>
> 
>    jam> 2) Linear Dominators...
> 
> <snip/>
> 
>     jam> There is another benefit to dominators, in that they
>     jam> also give the "same" result for "heads()" that the nodes
>     jam> themselves give. (if the dominator is a head, then the
>     jam> descendant is also a head, and if the dominator is not a
>     jam> head then the child is not either)
> 
> Worth adding in the sources as comment.

Sure.
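
Something along these lines, perhaps (a hypothetical sketch of the
shortcut that comment would describe; heads_via_dominators and
get_linear_dominator are made-up names, not code in the branch):

def heads_via_dominators(graph, keys):
  # A linear dominator gives the "same" heads() answer as the nodes it
  # dominates: if the dominator is a head, the descendant is also a
  # head, and if the dominator is not a head, the child is not either.
  # So heads() can run on the (often far fewer) dominator keys and the
  # answer mapped back.
  dominators = {}
  for key in keys:
    dominators.setdefault(graph.get_linear_dominator(key), []).append(key)
  result = set()
  for dom in graph.heads(dominators.keys()):
    # Keys sharing a dominator lie on one linear chain; a real
    # implementation would keep only the latest of them here.
    result.update(dominators[dom])
  return result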


> 
>     jam> In testing, being able to skip linear chains was a
>     jam> fairly large win. It dropped the 86% number down to
>     jam> around 57% of the number of keys accessed.
> 
> Certainly worth bringing in so that we can search for other uses
> if only as a reference implementation.
> 
> <snip/>
> 
>     jam> 3) Pyrex implementation
> 
> <snip/>
>           
>     jam> 0m20.764s  bzr.dev
>     jam> 0m14.836s  KnownGraph python
>     jam> 0m11.122s  KnownGraph pyrex
> 
> A bit deceiving :-/ But good enough to bring in. My main concern
> is that it may be a bit early to bring the pyrex version, on the
> other hand, it will be silly to not take advantage of it since
> it's available. 

I'm not really sure how it is deceiving. It is true that NEWS probably
has more lines to resolve than some others. But it is also our largest
file with the greatest number of changes, which makes it a primary
candidate to optimize.


Note that these are also "bzr annotate --show-ids" runs, so we don't
need to load the whole revision graph here. (time 'bzr annotate bzr'
shows 2.6s => 0.6s with --show-ids.)

> 
>     jam> Then again, once the graph is loaded into memory, both
>     jam> are rather trivially fast to compute. For bzr.dev's 25k
>     jam> revisions, it takes 89.3ms to build both GDFO and Linear
>     jam> Dominators for all revs. (277ms for python.)  For OOo
>     jam> and its 263k revisions, it takes 922ms (0.9s) to build
>     jam> both. In both cases, that is 3.5ms per revision. (Both
>     jam> algorithms are very close to O(n) because they are
>     jam> fairly good at walking each node a single time.)
> 
> But do you still need to use a heapq for the gdfo ?

I don't, and I wrote an algorithm that gets away from it, but found
that it was actually slower for both bzr.dev and OOo (900ms vs 770ms
for OOo, IIRC).

As far as I can tell, if you get rid of the heapq, then you need
something like:

for node in nodes:
  if node.gdfo != -1:
    continue
  skip = False
  for parent in node.parents:
    if parent.gdfo == -1:
      # Come back to this node later, when we walk the children of
      # this parent node
      skip = True
      break
  if skip:
    continue
  # All parents are filled in: number this node, then walk its children
  if node.parents:
    node.gdfo = max(p.gdfo for p in node.parents) + 1
  else:
    node.gdfo = 1  # a tail
  todo = list(node.children)
  while todo:
    next_node = todo.pop()
    for parent in next_node.parents:  # same check as before
      ...

This is, essentially, a two-ish pass system, where you walk all nodes
in the dict (in random order), and then for every node you find, you
walk all of its children whose parents are all filled in.

My best guess is that the 'random shot' effect is the slowdown: the
initial "for node in nodes" finds that 90% of the nodes don't have
their parents filled in yet, but it has to *look* at those parents to
find that out.

The gdfo code is, roughly:

import heapq

# Pass one: seed the queue with the tails (nodes with no parents)
todo = [(1, node) for node in nodes if not node.parents]
for gdfo, node in todo:
  node.gdfo = gdfo
heapq.heapify(todo)

# Pass two: walk the queue in gdfo order
while todo:
  _, next_node = heapq.heappop(todo)
  # next_node is already numbered
  for child in next_node.children:
    if child.gdfo != -1:
      continue  # already numbered via another parent
    skip = False
    for parent in child.parents:
      if parent.gdfo == -1:
        # Come back to this child when its last parent is numbered
        skip = True
        break
    if skip:
      continue
    # GDFO: one more than the deepest parent
    child.gdfo = max(p.gdfo for p in child.parents) + 1
    heapq.heappush(todo, (child.gdfo, child))

In theory, this does roughly the same work. I certainly would say they
are both O(n). It is also two-pass, though the first pass is fairly
trivial. The main win is that by the time you are checking whether a
node's parents are filled in, 75% of the time they *are*.

So, I stuck with the heapq implementation.

Oh, and when possible, I use "heapreplace()" rather than
heappop/heappush, which means that for 200k linear revisions (OOo) the
queue is effectively updated as:

next_node = todo[0]   # peek, instead of heappop()
...
todo[0] = child       # heapreplace(), instead of heappush()

(No pop() or append() necessary.)
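
In actual heapq terms, that inner step looks something like this (a
sketch, assuming the queue holds (gdfo, node) tuples as above):

import heapq

gdfo, next_node = todo[0]                     # peek at the smallest entry
# ... number the single child, as before ...
heapq.heapreplace(todo, (child.gdfo, child))  # pop + push in one sift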


> 
>     jam> So while loading the whole graph is expensive, it would
>     jam> add <1s to compute both of these from scratch whenever
>     jam> we need to.
> 
> The holy grail is then to maintain these algorithms correct with
> a partially loaded graph ;-)
> 
>     Vincent

So... if you can get me "gdfo" (and optionally "linear_dominator") it is
fairly trivial to keep the algorithms correct.

I suppose I would change a few things, if I knew I was reading from disk.

1) Read info about all nodes with the same GDFO (and possibly 'near'
GDFOs), which allows batch requests to the info layer; see the sketch
after this list. At the moment KnownGraph knows that info is stored in
a dict, so it doesn't try to batch anything. (Actually, the pyrex code
goes even further and references the other nodes directly, but it
would be easy to rewrite that to go via keys() again.)

2) Interesting implementation note: I generally care more about the
GDFO of the *parents* (linear dominators) than I do about the current
node, because I need to know where to put the parent in the todo
queue. This may affect our design. (I was originally thinking we could
use a field in an index to hold 'gdfo', but that requires reading the
index line for the parent key. An alternative would be to store that
as part of the reference from the child...)
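
For (1), a rough sketch of what such a batching read could look like
(hypothetical: the 'pending' mapping of gdfo to wanted keys and
backend.get_info() are made-up names, not an existing bzrlib API):

def load_next_gdfo_band(pending, backend, band_width=1):
  # Fetch every queued key whose gdfo falls in the current band, so
  # one round trip to the info layer serves many nodes at once.
  lowest = min(pending)
  batch = set()
  for gdfo in range(lowest, lowest + band_width):
    batch.update(pending.pop(gdfo, set()))
  return backend.get_info(batch)  # -> {key: (parent_keys, gdfo)}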


I think there is a lot of code that could be tuned in
annotate.reannotate() if we really wanted to make this better (without
caching).

1) Work with fulltexts and a list of line annotations, rather than
   (annotation, line) tuples.  At a minimum, this saves O(lines) tuples
   from being generated.  (Sketched after this list.)

2) Create a version of PatienceDiff that
   a) Works directly on plain text, even if it outputs what lines were
      affected (avoids splitting up into lines, avoids
      [t for a, t in lines])

   b) Can cache the lookup buffer between diff calls.  PatienceDiff (and
      the new diff-delta code) both compute a hash table mapping
      possible bytes to possible matches.  Saving that index *between*
      diffs would be very useful for annotate.  (Also sketched after
      this list.)

   c) Possibly re-think our hash() function. We went with
      PyObject_Hash() because for PyStrings this can be cached. And for
      the Knit extractor, it will try to share lines between content. So
      when you go to diff, you already have computed most of the hashes
      for most of the lines. This no longer holds true for GC code, and
      PyObject_Hash was measurably slower than the DJB hash that was
      used at one point. (And I also have the Murmur hash sitting around
      in my PyBloom plugin.)
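
For (1), the change in representation is roughly this (a sketch;
'rev-1' and 'rev-2' are placeholder revision ids):

rev1, rev2 = 'rev-1', 'rev-2'

# Today: one (annotation, line) tuple per line.
annotated = [(rev1, 'line one\n'), (rev1, 'line two\n'),
             (rev2, 'line three\n')]

# Proposed: one fulltext plus a parallel list of per-line annotations.
fulltext = 'line one\nline two\nline three\n'
annotations = [rev1, rev1, rev2]
# Lines and annotations pair up by index, so no per-line tuples are
# built, and the fulltext can be handed straight to a diff.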
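
And for (2b), a rough sketch of keeping the basis side's match index
alive between diffs (CachedLineIndex is a made-up name; the real
PatienceDiff code does not expose anything like this today):

class CachedLineIndex(object):

  def __init__(self, basis_lines):
    # Build the line -> positions table once for this basis text.
    self.basis_lines = basis_lines
    self.positions = {}
    for pos, line in enumerate(basis_lines):
      self.positions.setdefault(line, []).append(pos)

  def candidate_matches(self, new_lines):
    # Reuse the prebuilt table for every diff against this basis.
    for new_pos, line in enumerate(new_lines):
      for basis_pos in self.positions.get(line, ()):
        yield basis_pos, new_pos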

John
=:->
