[RFC] Change to use 'merge_sort' for per-file-log
John Arbash Meinel
john at arbash-meinel.com
Fri Sep 19 15:03:29 BST 2008
So, we had a lot of discussion about this bit of tuning, but what I was hoping
for was some discussion about "bzr log FILE" changing its display. Can anyone
comment on that?
Robert Collins wrote:
> On Thu, 2008-09-18 at 22:49 -0500, John Arbash Meinel wrote:
>> Also, I have one more addendum, which works around issues with
>> GraphIndex.get_parent_map(lots_of_keys). Basically, by requesting keys
>> in 1000-key "chunks", I avoid the "buffer_all" overhead, but also
>> avoid some of the extra bisection overhead (especially for smaller
>> indexes).
> I find it amusing that we're spinning on this :) - I did note that the
> heuristic size was likely problematic when we brought this in.
So, the specific issue is that the heuristic expects the requested keys to be
present, and in this case only 2,000 of the 60,000 keys will be there.
Further, the heuristic is based more around a "random spread" of keys, but in
this case, it is very much a "clustered set" of keys.
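For reference, the chunked lookup described above can be sketched roughly like
this (a minimal illustration, not the actual patch; `chunked_parent_map` and
the `get_parent_map` callable are stand-ins for the real GraphIndex API):

```python
CHUNK_SIZE = 1000  # the heuristic "magic number" under discussion


def chunked_parent_map(get_parent_map, keys, chunk_size=CHUNK_SIZE):
    """Look up parents for `keys` in fixed-size chunks.

    `get_parent_map` stands in for GraphIndex.get_parent_map(); as
    with the real API, keys absent from the index simply do not
    appear in the result.
    """
    keys = list(keys)
    result = {}
    for start in range(0, len(keys), chunk_size):
        # Each request stays small enough to avoid the buffer_all
        # path, while still amortizing some bisection overhead.
        result.update(get_parent_map(keys[start:start + chunk_size]))
    return result
```

With rare, clustered keys most chunks return few or no hits, which is exactly
the case the current heuristic does not anticipate.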
> Perhaps the heuristic just needs changing ?
We could introduce hinting. I know in advance that the keys are going to be
"rare" and "clustered". I don't know how many, or which, keys exist, but it
will generally be at least an order of magnitude difference, if not 2 orders
or more. A revision_id search, by contrast, will tend to place requests we
expect to be fulfilled, *and* we will expect the keys to be sparsely
distributed.
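One purely hypothetical shape the hinting could take (the `hint` parameter
and `hinted_parent_map` are invented for illustration, not an existing API):
the caller states whether it expects the keys to be mostly present or
rare/clustered, and the batching strategy is chosen accordingly.

```python
def hinted_parent_map(get_parent_map, keys, hint="dense"):
    """Hypothetical hinted lookup.

    `get_parent_map` is a stand-in for the real index call; `hint`
    is an invented parameter describing the caller's expectation.
    """
    keys = list(keys)
    if hint == "sparse":
        chunk = 1000            # rare/clustered keys: small batches
    else:
        chunk = len(keys) or 1  # mostly-present keys: one big request
    result = {}
    for i in range(0, len(keys), chunk):
        result.update(get_parent_map(keys[i:i + chunk]))
    return result
```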
I also think part of the issue is that performance here is probably
"U"-shaped (especially in this case). Asking for keys one at a time hits one
bottleneck; asking for them all at once hits another. So you're stuck with a
"magic number" to land in the middle.
The other problem is that I think this is more of a GraphIndex issue, but
GraphIndex logic is about 4 layers removed from the "log FILE" logic. So it
takes a lot of information passing to get from here to there.
> I think its ok to do this change, as long as we come back to this and
> make it more tunable in the near term - perhaps just by doing the
> generation in-tune with the log iteration batching?
So we need to walk in merge-sorted order (notice that when in 'forward' order,
I actually reverse it and then reverse it back). I could probably walk in
reversed order, but we don't do a simple 'list.reverse()' so it is much
trickier to get right.
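As a toy illustration of that point (hypothetical names; the entries are
(revision_id, depth) pairs as a merge_sort-style walk would emit them,
newest first):

```python
def forward_display(merge_sorted_entries):
    """Reverse a finished newest-first merge-sorted log for 'forward'
    display.

    Reversing the *completed* list is simple; walking the graph
    itself oldest-first while keeping merged revisions grouped under
    their merge point is the part that is much trickier to get right.
    """
    return list(reversed(merge_sorted_entries))


# Newest first; depth 1 marks a merged (nested) revision.
entries = [("rev-3", 0), ("rev-2.1.1", 1), ("rev-2", 0), ("rev-1", 0)]
```

Calling `forward_display(entries)` yields the oldest-first order while the
merged rev-2.1.1 stays adjacent to its merge point.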
So does this mean you are okay with changing the output of "bzr log FILE" as
well as batching up the requests?