brisbane:CHKInventory._extract_all_inventory
John Arbash Meinel
john at arbash-meinel.com
Tue Mar 17 03:15:37 GMT 2009
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Ian Clatworthy wrote:
> John Arbash Meinel wrote:
>> I did some timing tests, and I found the current
>> "CHKInventory.iter_entries()" to be a lot slower than it needs to be.
>
> Yes, I've been running into the same thing.
>
>> This isn't something I want to focus on right now, so if someone else
>> wants to pick it up, and finish polishing it, I would be happy to turn
>> it over. (I'm going back to work on lazy group extraction and a
>> byte-stream format for groupcompress blocks.)
>
> I'll happily do it. I almost started on a similar thing last night
> while profiling check.
>
> Ian C.
>
So I was chatting with Robert on IRC about it, and I think I have an
understanding of why things are being slower than they should be. I have
a rough cut of how to fix it, though it will take some time to actually
implement it.
Consider this layout, (roughly the LP tree). You have 6535
files+symlinks, and 521 directories (avg 13.5 items per dir, 32 in
root). The chk_map has 1 internal node, fully 255-wide, and an average
of 28 nodes per leaf. The p_id map is at most 3-deep, fully wide at the
root, with an average of 111 items per leaf.
When you come to the root entry, you compute the hash() of its file_id,
and then do a lookup. This seems to adequately find a single internal
node, pointing to a single leaf node. Which you then iterate and find
the 32 children of the root directory. (so far so good).
You then search in id_to_entry for those 32 items.
It goes to the internal node, and finds the 32 children pages that
matched the 32 search keys. For each of those children, it then computes
approx 32 'length_filters'. It then iterates over all of the _items
entries, seeing if each key matches any of the 32 filters.
So if you have C children of a directory, and L items in a leaf node,
you end up loading C leaf nodes, and doing ~C comparisons against L
items. Or C^2*L comparisons to find the C children.
Here is my rough sketch of how to fix it.
1) When finding the C leaf nodes that match the search key prefixes,
track which key actually matched, and pass that information out of
'_iter_nodes'. We would then pass only those exact keys down to
LeafNode.iteritems(). This would drop the performance down to C leaf
nodes, doing 1 comparison * L items on each leaf. (C*L rather than C^2*L).
2) In the LeafNode.iteritems() we should check if the key_filter is
exactly 'key_width', in which case we change from comparing a prefix
against every key, and switch instead to a simple dict lookup. This
should drop us to something like 1 hash comparison for C children. (Down
to C from C*L down from C^2*L.)
It also prevents us from getting false positives under certain
conditions. Consider the case of arbitrary file-ids. One with 'foo' and
the other with 'foo-bar'. Doing LeafNode.iteritems([('foo',)]) should
not yield 'foo-bar'.
The reason we *do* prefix lookups is 2-fold
a) InternalNodes only partially decode the hash of their keys. So if
'foo' hashes to 'EF', and 'bar' hashes to 'EG', we need to search the
'E' sub-page for both keys.
b) In parent_id_basename_to_file_id, when we look up with a partial key
(eg, ('parent-a',)) we want to get all of the children that have
('parent-a', *)). We *don't* want to get all the children of
('parent-ab', *).
So Ian-
before you implement _extract_all_inventory fully, either look into
fixing iteritems() as I listed, or wait for me to do so. (I probably
won't get to it in the next few days, because I'm working on getting
groupcompress blocks to stream properly.) If you would rather I do it,
just let me know. I think I have a good handle on it, it is just going
to take me a while to get around to it.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkm/FdkACgkQJdeBCYSNAAOOYQCeLP6hDqKNatlXex2tL5K6li1K
+VcAnAiKcPsEeumfGJ7J1OL3Undopo09
=WBmL
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list