Optimizing tree building

Mon Jun 18 18:34:06 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>>> I have pyrex code for _load_data, which
>>>
>>> a) needs review:
>>> http://bundlebuggy.aaronbentley.com/request/%3C46422E07.2090800@arbash-meinel.com%3E
> 
> I've been waiting until pyrex inclusion policy was all settled.  Is it
> all settled yet?

I think Robert and I have settled our differences. At least for now I
think we are going with:

1) Have a foo_py.py file next to foo_c.pyx file. And the two of them
should have the exact same functionality. Some of what I've done had the
python functionality mixed with a different file, and the other way I
tried had the .pyx in a completely different directory.

2) I felt like most people agreed not to version the .c files, so that
is how I would do it. I still don't feel like it was definitively
answered, but that was my 'feeling' about the responses.

> 
>>> Also, our indicies aren't really written for speed of parsing. So there are
>>> probably things we could do. Also, because of dictionary compression, it is a
>>> little bit tricky to read them backwards. If we started them with their line
>>> number, that would help. So that we would have:
>>>
>>> 1 revision-xxyyzz
>>> 2 revision-zzaabb 1
>>> ...
>>>
>>> Otherwise you have to already know a parent to know when you've actually read
>>> the right parent record.
> 
> It seems like we could almost cache that data, were it not that some
> indices get rewritten frequently (who, me? :-)

Well, they get appended to frequently, I don't know about rewritten.

> 
>>>> But then how would we know which index held the information for a given
>>>> revision?  Maybe we put a bloom filter in the filenames?
>>> We looked into that, too. Bloom filters work okay when you use 1-byte per key
>>> (1% mark is around 9.6bits, 10% around 4.?? bits). The problem is that with
>>> case sensitivity you don't have 8-bits per byte that you can use. Also, you
>>> have to be careful about filename collision when the filters start to fill up.
>>> (A full filter ends up as 0xFFFFFF, which obviously collides).
> 
> You can reduce error rates by looking for direct ancestors of the
> revisions you want, so 4 bits could work.  This does give a slight
> chance of false negatives due to ghosts.

Except when you have a "collection of bundles" repository, you end up
with some files which *don't* contain their immediate ancestors, because
they are in a different file. (1.bundle has revisions 1-10, 2.bundle has
revisions 11-20, so when you search for revision 11, if you checked its
ancestors you wouldn't find them, because they *are* elsewhere).

There is always the tiered approach. Which lets you filter out 90% of
the files, and then you read more. Even if that is only 50% it is still
a small win.

> 
>>> Anyway, you end up able to store maybe 100 entries in a reasonable filename
>>> with a 10% error rate. So it is better than nothing, but 100 isn't very much
>>> (considering when you are dealing with 10000 revisions, or even 1M revisions).
> 
> Well, it's better than four-files-per-revision.  But no one would be
> crazy enough to do that, would they? :-)

I'm not sure of anywhere that has four-files-per-revision over time.
Right now we *modify* four files, but we don't generate 4 more files
each time. Because a lot of things are going into files that already exist.

> 
>>> You *could* put a decent sized bloom filter at the start of the file (1k is
>>> pretty good, 10k is getting decent all around).
> 
> I suppose you could also do an index of indices.  That's basically what
> I was using the filenames for.

Yeah. That is a similar "tiered" approach.

> 
>>> But then again, what would you gain if you stored a trie of the revision ids in
>>> the first 10k. It wouldn't be nearly as compact as the bloom filter, but you
>>> would be a long way in knowing *exactly* what revision ids are present.
> 
> Still haven't really gotten my head around tries.

I'm not sure that I understand them 100%, but if you look at the basics,
you do something like:

sort the list of entries
find the longest common prefix, and make that a node
split of the children and do the same thing recursively

So if you have:

 joe at foo.com-1
 joe at foo.com-2
 jerry at foo.com-3
 jerry at foo.com-4
 larry at bar.edu-5
 larry at bar.edu-6

You would end up with something like:

 TOP NODE
  |
  + - j
  |   |
  |   + - oe at foo.com-
  |   |      |
  |   |      + - 1 - TERMINAL NODE
  |   |      |
  |   |      + - 2 - TERMINAL NODE
  |   |
  |   + - erry at foo.com-
  |          |
  |          + - 3 - TERMINAL NODE
  |          |
  |          + - 4 - TERMINAL NODE
  |
  + - larry at bar.edu-
        |
        + - 5 - TERMINAL NODE
        |
        + - 6 - TERMINAL NODE

As far as how that is explicitly serialized, and how you write the
pointers to the sub-nodes, etc, is left as an exercise for the reader.
We've discussed a few different ways (in a hypothetical bundle format,
you might just reference them by their "name", and then expect that the
reader has an index of all names in the file, and can look up the offset
themselves)

But the idea is that you may lay these things out as pages, and you
could read a few pages at a time (to trade bandwidth for latency), and
then you might get the TOP, j, and oe at foo.com- nodes in the first read,
which would get you a long way towards finding some of the revisions you
care about.

The other thing is that you want keys with common prefixes, if you
invert the above, you get very poor layout because you have:

TOP NODE
  |
  + - 1-joe at foo.com
  |
  + - 2-joe at foo.com
  |
  + - 3-jerry at foo.com
  |
  + - 4-jerry at foo.com
...

> 
>>> I do think we will have an index. I thought we had discussed having it
>>> write-once (so each blob gets its own index, and you don't append to existing
>>> blobs, every so often you combine blobs into a new blob, and write out a new
>>> index).
>>>
>>> That has the nice property that you can optimize the index in any way that you
>>> want (bisect, bloom filter, whatever).
>>>
>>> It means you may need to read more total indices, but after the combining step,
>>> there are fewer again.
> 
> Right, and I guess we're already scaling badly at that point, so the
> indices hardly make a difference.
> 
> Aaron

There is also the issue of "discoverability". To this date, we haven't
needed a '.listing' file, because we have everything accessible by known
paths. And we have done it in a way that all files are append-only or
atomic replace. (The only current atomic replace file I'm aware of is
.bzr/branch/(revision_history|last_revision_info)).

Having an index of indicies seems like something that could quickly
scale very poorly. I know one thing we had hoped for is a lock-free
writers. So you can insert new data into the repository without locking.
Or at least without locking for a long time. I think Robert made a
comment about spin-locking on the master index. So you try to write,
then read, and if it isn't what you wrote, you update it with whatever
info, and write again, until it finally stabilizes. If you have 1000
people committing at the same time, it gets a little bit ugly, but
theoretically it does eventually converge.

Especially if the "update with info" goes out and re-scans the whole
filesystem, so that 1000 new inserts get noticed by all 1000 clients
after the first update fails. (Though it probably means you have 999
extra writes of the file because all of them saw it out of date, and
then update it again with the same info).

Oh, and one more thing to think about. If we have 1M keys in a file you
need a bloom filter of approximately 1MB to have a ~2% false positive
rate. Which in some ways seems pretty big. However, if you look at
bzr.dev *right now*, inventory.kndx is 1.3MB for 14,000 revisions.

And if you just look at a file which list only revision ids, you have
about 60bytes per revid, or 60MB for 1M revisions.

Bloom filters generally scale as O(N), their nice property is that their
constant is really small (1byte per N). I don't know how to do it as
O(log(N)) which is what you would really want for a master index.

Also note that because blooms are O(N) splitting up 1M revisions into 10
files with 100k revisions still gives you 1MB of bloom data.

Which is something that a "tree" structure (trie, something else) helps
with. Because the top node gives you part of the tree, and then you read
further. We are doing a bit of a tree with bloom filters, in that the
top node is a bloom filter, which helps you figure out what sub-node to
read. But the top node is of size O(N), and I don't think you can change
that with Blooms.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGdsIOJdeBCYSNAAMRAukaAJ0S9FZiS7fkNTPNkbuBU9RXpqXaFACgun8S
rhSQZkZLcl5iBih2KH0JIGo=
=kfYQ
-----END PGP SIGNATURE-----