[MERGE] repository docs (cleaner still) and GraphIndex

John Arbash Meinel john at arbash-meinel.com
Fri Jul 13 22:48:38 BST 2007


Robert Collins wrote:
> On Sat, 2007-07-14 at 01:09 +1000, Robert Collins wrote:
>> This merge adds the repository docs writeup, with the index stuff I had
>> in it cleaned up and the generic parts factored out to a new developer
>> text file. 
>>
>> I've also added a GraphIndex - a graph aware index which is 20% smaller
>> than our knit indices today (at least for me on my repositories).
>>
>> I plan to work on hooking this up to knits now, but I think the
>> interface is sufficient for the time being, and I think it's pretty clean
>> and amenable to layering on top of/parallel implementations.
>>
>> These things are quite closely related - the index API pressure comes
>> from repositories' needs, which is why I put them in the same branch.
> 
> This update changes the API to be more convenient for using in a bulk
> manner - it makes missing keys not raise, which is analogous to SQL.
> 
> -Rob
> 

If I'm reading correctly, this is not an append-only index, right? (Just
looking at stuff like:

nodes = sorted(self._nodes.items(), reverse=True)

which means that adding a new node could cause it to be inserted anywhere in
the file.)
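
For example (a toy illustration, not from the patch):

>>> sorted(['b', 'a', 'c'], reverse=True)
['c', 'b', 'a']
>>> sorted(['b', 'a', 'c', 'bb'], reverse=True)
['c', 'bb', 'b', 'a']

The new key 'bb' lands in the middle of the output rather than at the end, so
bytes already written for earlier nodes would have to move.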

I don't quite understand why you need multiple reference lists. Is it just so
you can have different graphs within the same index? (Revision parents versus
file-delta parents, etc.?)


Why reverse? At the very least, the built-in Python 'bisect.bisect_*' functions
expect things to be in forward-sorted order. Not that we can't re-implement
them, but we should at least have a good reason for wanting to.
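
A toy illustration of the invariant (again, not from the patch):

>>> import bisect
>>> a = ['apple', 'banana', 'cherry']
>>> bisect.bisect_left(a, 'banana')
1
>>> r = list(reversed(a))
>>> bisect.bisect_left(r, 'banana')
0

On the reverse-sorted list the answer points at 'cherry', which is meaningless.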

+        nodes = sorted(self._nodes.items(),reverse=True)
                                            ^- space here

+        # we only need to pre-pass if we have reference lists at all.
+        if self.reference_lists:
+            non_ref_bytes = prefix_length
+            total_references = 0
+            # TODO use simple multiplication for the constants in this loop.
+            # TODO: factor out the node length calculations so this loop
+            #       and the next have less (no!) duplicate code.
+            for key, (absent, references, value) in nodes:
+                # key is literal, value is literal, there are 3 null's, 1 NL
+                non_ref_bytes += len(key) + len(value) + 3 + 1
+                # one byte for absent if set.
+                if absent:
+                    non_ref_bytes += 1

v- I'm pretty sure you already checked this
+                if self.reference_lists:
+                    # (ref_lists -1) tabs
+                    non_ref_bytes += self.reference_lists - 1
+                    # (ref-1 cr's per ref_list)
+                    for ref_list in references:
+                        # how many references across the whole file?
+                        total_references += len(ref_list)
+                        # accrue reference separators
+                        if ref_list:
+                            non_ref_bytes += len(ref_list) - 1
+            # how many digits are needed to represent the total byte count?
+            digits = 1
+            possible_total_bytes = non_ref_bytes + total_references*digits
+            while 10 ** digits < possible_total_bytes:
+                digits += 1
+                possible_total_bytes = non_ref_bytes + total_references*digits
+            # resolve key addresses.
+            key_addresses = {}
+            current_offset = prefix_length
+            for key, (absent, references, value) in nodes:
+                key_addresses[key] = current_offset
+                # key is literal, value is literal, there are 3 null's, 1 NL
+                current_offset += len(key) + len(value) + 3 + 1
+                # one byte for absent if set.
+                if absent:
+                    current_offset+= 1
                                   ^- space

+                if self.reference_lists:
+                    # (ref_lists -1) tabs
+                    current_offset += self.reference_lists - 1
+                    # (ref-1 cr's per ref_list)
+                    for ref_list in references:
+                        # accrue reference bytes
+                        current_offset += digits * len(ref_list)
+                        # accrue reference separators
+                        if ref_list:
+                            # accrue reference separators
+                            current_offset += len(ref_list) - 1


One way to collapse the two loops would be to keep a list as you go through
the first pass, tracking for each key the cumulative count of 'other' bytes
and the running number of references. Something like:


key_offset_info = []

non_ref_bytes = prefix_length
total_references = 0

for key, (absent, references, value) in nodes:
    # record the running totals *before* this key's own bytes
    key_offset_info.append((key, non_ref_bytes, total_references))
    # key is literal, value is literal, there are 3 nulls, 1 NL
    non_ref_bytes += len(key) + len(value) + 3 + 1
    # one byte for absent if set
    if absent:
        non_ref_bytes += 1
    # (ref_lists - 1) tabs
    non_ref_bytes += self.reference_lists - 1
    # (refs - 1 separators per ref_list)
    for ref_list in references:
        # how many references across the whole file?
        total_references += len(ref_list)
        # accrue reference separators
        if ref_list:
            non_ref_bytes += len(ref_list) - 1

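The 'digits' computation between the two passes stays exactly as in your
patch, just fed with the final totals:

digits = 1
possible_total_bytes = non_ref_bytes + total_references * digits
while 10 ** digits < possible_total_bytes:
    digits += 1
    possible_total_bytes = non_ref_bytes + total_references * digits
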
Then your second loop is just:

key_addresses = {}
for key, non_ref_bytes, total_references in key_offset_info:
    key_addresses[key] = non_ref_bytes + total_references * digits


Why is your API returning a file-like object (a StringIO) when you are doing
all the work and storing the whole thing as a string in memory anyway?

Also, I just realized that it would be reasonable to assert that your offsets
are correct as you build up the final output. At the very least, you should
check that the total length of the final string is what you expected.

For example, it is possible to get the number of digits wrong (an off-by-one
or some similar error), so that one of your references takes 3 bytes while all
the others take 2, and suddenly every offset is invalid. A simple guard is to
make sure the final text has the predicted length.
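
Something like this at the end of finish() would do (just a sketch -- I'm
assuming the serialised rows get accumulated in a 'lines' list, and reusing
your 'possible_total_bytes' as the predicted size):

result = ''.join(lines)   # 'lines': assumed accumulator of serialised rows
if len(result) != possible_total_bytes:
    raise AssertionError('expected %d bytes, built %d'
                         % (possible_total_bytes, len(result)))
return StringIO(result)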

Also, why pad with '0' rather than ' '? int(' 1234') and int('1234 ') both
work, and it doesn't look like you are writing the numbers in octal. Space
padding would also make the file easier for humans to read. If you have a
reason, it isn't a big deal, though.
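
Both forms round-trip through int() just fine:

>>> '%08d' % 38, '%8d' % 38
('00000038', '      38')
>>> int('00000038'), int('      38')
(38, 38)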



Are you trying to optimize any of this code yet, or just trying to get it to
follow the basic format that you want?

For example:

key, absent, references, value = line[:-1].split('\0')

That has to create an extra string (one character shorter than the whole line)
and then split it. I *think* it is better to split first and trim afterwards.
Since the row only has three NULs, value is the last field and ends up with
the trailing newline attached, so:

key, absent, references, value = line.split('\0')
value = value[:-1]

Both variants create an extra string, but this way the extra copy is only of
the (short) value field rather than of the entire line.
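
With a row from your own test:

>>> line = 'key\x00\x0038\t38\x00data\n'
>>> key, absent, references, value = line.split('\x00')
>>> value = value[:-1]
>>> key, absent, references, value
('key', '', '38\t38', 'data')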

+        for key, absent, references, value in self.keys_by_offset.values():

I think you want 'itervalues()' so you don't have to build up the full list,
unless you are mutating keys_by_offset somehow.


This looks a little weird, though I realize now why it is correct:
+    def test_multiple_reference_lists_are_tab_delimited(self):
+        builder = GraphIndexBuilder(reference_lists=2)
+        builder.add_node('reference', ([], []), 'data')
+        builder.add_node('key', (['reference'], ['reference']), 'data')
+        stream = builder.finish()
+        contents = stream.read()
+        self.assertEqual("Bazaar Graph Index 1\nnode_ref_lists=2\n"
+            "reference\0\0\t\0data\n"
+            "key\0\x0038\t38\0data\n"
+            "\n", contents)
+

Basically, '\038' would be treated as the octal escape '\03' followed by '8',
rather than NUL followed by '38', which is why you use the hex escape '\x00'.
It might be more obvious if you used '\x00' everywhere, at least on this line,
but I don't really care.
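
A quick interpreter check makes the difference visible:

>>> '\038'
'\x038'
>>> '\x0038'
'\x0038'
>>> len('\038'), len('\x0038')
(2, 3)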


Your docs talk about creating an ``IndexBuilder`` and calling ``insert(key,
value)``, but your GraphIndexBuilder uses 'add_node()', not 'insert()'.

I do see that your IndexBuilder API has been defined as returning a stream;
I'm just not sure what your motivation for that is.


I appreciate that we want to be able to read only part of an index, and for
local operations it makes sense to bisect into it. For remote indices, though,
that doesn't make as much sense. I suppose you could bump the bisect page size
up to 64kB or so, which would be an approximately reasonable way to bisect
through a remote index.

Anyway, probably an overall conditional +1.

John
=:->


