[MERGE] repository docs (cleaner still) and GraphIndex
Robert Collins
robertc at robertcollins.net
Fri Jul 13 23:11:38 BST 2007
On Fri, 2007-07-13 at 16:48 -0500, John Arbash Meinel wrote:
> If I was reading correctly, this is not an append-only index, right? (Just
> looking at stuff like:
Right. I thought I made that explicit in the docs?
> nodes = sorted(self._nodes.items(), reverse=True)
>
> Which means that adding a new node could insert itself anywhere in the file.
>
> I don't quite understand why you need multiple reference lists. Is it just so
> you can have different graphs within the same index? (Revision parent, versus
> file-delta parent, etc?)
Exactly - that's precisely the case I've used it for. I figure that's
extremely useful for data based on Aaron's mpdiff or other similar
compressors.
> Why reverse? At the very least, the built-in python 'bisect.bisect_*' functions
> expect things to be in forward-sorted order. Not that we can't re-implement
> them, but we should at least have a good reason for wanting to.
Well, it's experimental at the moment, and there's a slight thinko in
that I thought our revids were DATE-PERSON-.... We typically iterate
newest-to-oldest, so having the newest revids at the front should be a
goal. I suspect reversed(topo_sorted()) is what's needed.
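For reference, the stdlib bisect functions do need ascending order; a tiny sketch (find_key is a hypothetical helper, not bzr code) of a lookup over a forward-sorted key list:

```python
import bisect

def find_key(sorted_keys, key):
    # sorted_keys must be in ascending order for bisect to work.
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return None

keys = sorted(['rev-a', 'rev-b', 'rev-d'])
assert find_key(keys, 'rev-b') == 1
assert find_key(keys, 'rev-c') is None
```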
> One way to collapse the two loops would be to just keep a list as you go along
> the first time, which tracks the cumulative number of references, and the
> current number of 'other' bytes. Something like:
> Then your second loop is just:
>
> key_addresses = {}
> for key, non_ref_bytes, total_references in key_offset_info:
> key_addresses[key] = non_ref_bytes + total_references*digits
That's a good approach, I'll try slotting it in - thanks.
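Something like this, I imagine (a sketch only - the separator byte counts and field layout here are illustrative, not the actual serialisation):

```python
def compute_key_addresses(nodes, digits):
    """Single pass over the nodes, tracking cumulative non-reference
    bytes and cumulative reference counts; then each key's byte offset
    falls out directly.

    nodes: iterable of (key, value, refs) where refs is a list of
    reference lists; digits: fixed width of each serialized offset.
    """
    key_offset_info = []
    non_ref_bytes = 0
    total_references = 0
    for key, value, refs in nodes:
        key_offset_info.append((key, non_ref_bytes, total_references))
        # key + separator + value + newline (layout is illustrative).
        non_ref_bytes += len(key) + len(value) + 2
        for ref_list in refs:
            # one list separator plus one separator per reference
            non_ref_bytes += 1 + len(ref_list)
            total_references += len(ref_list)
    key_addresses = {}
    for key, nrb, trefs in key_offset_info:
        key_addresses[key] = nrb + trefs * digits
    return key_addresses
```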
> Why is your api returning a File-like object (StringIO()) when you are doing
> all the work and storing it as a string in memory?
It seemed like a good idea at the time.
> Also, I just realized that it would be reasonable to assert that your offsets
> are correct as you build up your final output. At the very least, you should
> check that the total length of the final string is as you expected.
>
> For example, it is possible for you to get the number of digits incorrect
> (based on off-by-one or some weird error). And then 1 of your references is 3
> bytes, while all the others are 2 bytes. But suddenly, all offsets become
> invalid. So a simple check is to just make sure the final text is the right length.
That's a good idea too. Actually it won't catch some errors (I found a
bug after submitting ;)) but some checking is better than none.
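The cheap version of that check would be something like (illustrative, names hypothetical):

```python
def check_index_length(output, expected_length):
    # If any offset was written with the wrong number of digits, the
    # total serialized length won't match the prediction either.
    if len(output) != expected_length:
        raise AssertionError('index is %d bytes, expected %d'
                             % (len(output), expected_length))
```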
> Also, why pad with '0' rather than ' '? int(' 1234') and int('1234 ') both
> work, and it doesn't look like you are writing numbers in octal. It would also
> make the file easier to read for humans. If you have a reason, it isn't a big
> deal, though.
No particular reason. I think it's clearer that it's padded though, so
I'd rather stay with '0' padding to prevent 'why don't you shrink it'
questions :).
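For what it's worth, '0'-padded decimal round-trips through int() just fine - int() with the default base is always decimal, so no octal surprises:

```python
digits = 5
offset = 42
padded = '%0*d' % (digits, offset)
assert padded == '00042'
# int() with the default base parses this as decimal, not octal:
assert int(padded) == 42
```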
> Are you trying to optimize any of this code yet? Or just trying to get it to
> follow the basic format that you want?
I want the API to be rock solid before optimising. Actually, using
bisect is not ideal here - I'd like to implement a b*tree and/or a
trie-based version of the GraphIndex API. As long as the API is solid
this should be a simple drop-in change with no leakage at all.
> For example:
>
> key, absent, references, value = line[:-1].split('\0')
>
> That has to create an extra String (with 1 less character) and then split it. I
> *think* it is better to actually do:
> key, absent, references, value, trailing = line.split('\0')
>
> While they are both arguably creating an extra string, 'trailing' will be a
> single character long.
Cute :).
> Unless maybe it is just that 'value' would end up with an extra character. Even
> so, I think 'value = value[:-1]' would be better, because it has to copy fewer
> total characters.
line ends with \n. So value has a trailing \n.
> + for key, absent, references, value in self.keys_by_offset.values():
>
> I think you want 'iter_values()' so you don't have to build up the full list.
> Unless you are mutating keys_by_offset somehow.
Heck no - brain fart as it happens :).
> Your docs talk about creating an ``IndexBuilder`` and calling ``insert(key,
> value)``. But your GraphIndexBuilder is using 'add_node()' not 'insert'.
>
> I do see that your IndexBuilder api has been defined as returning a stream, I'm
> just not sure what your motivation for that is.
Thanks for catching that - the difference between speculation and code.
What do you think reads better?
The reason for using a stream is scaling - I figure some indices will be
so big they start file-sorting on disk during creation, so getting the
final thing will be seek(0); return filehandle.
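i.e. something along these lines eventually (hypothetical sketch - TemporaryFile standing in for whatever the real spill-to-disk mechanism ends up being):

```python
import tempfile

def build_index_stream(serialised_rows):
    # Spool rows to disk rather than holding the whole index in memory,
    # then rewind so the caller gets a readable stream from the start.
    out = tempfile.TemporaryFile(mode='w+')
    for row in serialised_rows:
        out.write(row + '\n')
    out.seek(0)
    return out

stream = build_index_stream(['a\0x', 'b\0y'])
assert stream.read() == 'a\0x\nb\0y\n'
```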
> I appreciate that we want to be able to read only part of an index, and for
> local operations it makes sense to bisect into it. The problem is that for
> remote indexes this doesn't make as much sense. I suppose you could just bump
> up the bisect page size to 64kB or so, and that would be an approximately
> reasonable way to bisect through a remote index.
I was figuring on 64K or even 128K page sizes, once paging is
implemented. Note that with a toposorted, newest-at-front, scheme,
accessing the first page will be enough for most merges.
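Paging would look roughly like this (a sketch; PAGE_SIZE and read_page are hypothetical names, not the planned API):

```python
import io

PAGE_SIZE = 64 * 1024

def read_page(f, page_number, page_size=PAGE_SIZE):
    # One fixed-size read per probe, so a remote reader bisects by
    # pages rather than by individual index lines.
    f.seek(page_number * page_size)
    return f.read(page_size)

index = io.BytesIO(b'x' * (3 * PAGE_SIZE))
assert len(read_page(index, 1)) == PAGE_SIZE
```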
Thanks for the review, I'll fix up the stuff and send it in, ok?
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.