Optimizing tree building

Tue Jun 19 13:21:42 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> On 6/19/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> I think Robert and I have settled our differences. At least for now I
>> think we are going with:
>>
>> 1) Have a foo_py.py file next to foo_c.pyx file. And the two of them
>> should have the exact same functionality. Some of what I've done had the
>> python functionality mixed with a different file, and the other way I
>> tried had the .pyx in a completely different directory.
> 
> +1 and let me check if this is also correct - all code that uses these
> modules should do
> 
>  try:
>    import foo_c as foo
>  except ImportError:
>    import foo_py as foo
> 
> I wonder if we will ever have trouble where there is, say, an out of
> date foo_c installed system wide and we should just be using the
> foo_py version...
> 
> It might be good to add something about these points to HACKING.

We can do so. It shouldn't be a problem, because we will be doing "from bzrlib
import foo_c as foo" so they have to have an "out of date foo_c" in the same
bzrlib that we are already importing from.

> 
>> 2) I felt like most people agreed not to version the .c files, so that
>> is how I would do it. I still don't feel like it was definitively
>> answered, but that was my 'feeling' about the responses.
> 
> I think that was the consensus and I agree.
> 
>> > Right, and I guess we're already scaling badly at that point, so the
>> > indices hardly make a difference.
>> >
>> > Aaron
>>
>> There is also the issue of "discoverability". To this date, we haven't
>> needed a '.listing' file, because we have everything accessible by known
>> paths. And we have done it in a way that all files are append-only or
>> atomic replace. (The only current atomic replace file I'm aware of is
>> .bzr/branch/(revision_history|last_revision_info)).
>>
>> Having an index of indicies seems like something that could quickly
>> scale very poorly. I know one thing we had hoped for is a lock-free
>> writers. So you can insert new data into the repository without locking.
>> Or at least without locking for a long time. I think Robert made a
>> comment about spin-locking on the master index. So you try to write,
>> then read, and if it isn't what you wrote, you update it with whatever
>> info, and write again, until it finally stabilizes. If you have 1000
>> people committing at the same time, it gets a little bit ugly, but
>> theoretically it does eventually converge.
> 
> It's actually: write it, read it back, then read the list of files in
> the directory and make sure it's up to date.  This should converge and
> be lock free, though it may be a bit more complex on Windows where we
> can't rename the file if it's being written or read, so it needs some
> careful thought about that.
> 
> Every filesystem we can write to lets us list directories so it's
> reasonable for writers to update this.
> 

I think it is something worth exploring in the small before we try to do so in
a repository format.

>> Oh, and one more thing to think about. If we have 1M keys in a file you
>> need a bloom filter of approximately 1MB to have a ~2% false positive
>> rate. Which in some ways seems pretty big. However, if you look at
>> bzr.dev *right now*, inventory.kndx is 1.3MB for 14,000 revisions.
>>
>> And if you just look at a file which list only revision ids, you have
>> about 60bytes per revid, or 60MB for 1M revisions.
>>
>> Bloom filters generally scale as O(N), their nice property is that their
>> constant is really small (1byte per N). I don't know how to do it as
>> O(log(N)) which is what you would really want for a master index.
>>
>> Also note that because blooms are O(N) splitting up 1M revisions into 10
>> files with 100k revisions still gives you 1MB of bloom data.
>>
>> Which is something that a "tree" structure (trie, something else) helps
>> with. Because the top node gives you part of the tree, and then you read
>> further. We are doing a bit of a tree with bloom filters, in that the
>> top node is a bloom filter, which helps you figure out what sub-node to
>> read. But the top node is of size O(N), and I don't think you can change
>> that with Blooms.
> 
> Right, and (I think) we're always going to have to read the whole
> filter - there is not good locality within it, as we might hope to
> have with a sorted index.
> 

Well, if you already know the size of the bloom filter (*that* is easy to
encode in the filename), and you have your revision ids, you know what offsets
in the target file to read. So over HTTP you can issue a nice small range
request (with 5-6 single byte ranges looking for 5-6 bits).

If you have several revision ids to check for, you can pack them together in a
single request. Now, those bytes are most likely to be evenly spread across the
bloom filter (it is hash based). Which is good for HTTP, but bad for sftp, and
not much of a win for local (but probably a small win).

However, with a 'trie' you can read the first page as a whole page and have a
decent amount of extra information.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGd8pWJdeBCYSNAAMRAhDzAJ4z+OC/bzH+4gr1gnzN7uhin+x/6wCg0Fyk
BeGWpm95ZxmZ2tavoGKfuf4=
=ai1W
-----END PGP SIGNATURE-----