[MERGE] Packs. Kthxbye.

Robert Collins robertc at robertcollins.net
Fri Oct 19 01:42:38 BST 2007


On Thu, 2007-10-18 at 18:56 -0500, John Arbash Meinel wrote:


> More comments:
> 
> +        request_groups = {}
> +        for index, key, value, references in nodes:
> +            if index not in request_groups:
> +                request_groups[index] = []
> +            request_groups[index].append((key, value, references))
> 
> ^- I used to think the idiom
> request_groups.setdefault(index, []).append((key, value, references))
> would be faster.
> 
> I did some timing, though, and at least for integer keys, doing
> 
> import random
> 
> unique_vals = 100
> total_count = 100 * 1000
> data = [random.randint(0, unique_vals) for x in xrange(total_count)]
> 
> # Method 1
> d = {}
> for x in data:
>   if x not in d:
>     d[x] = []
>   d[x].append(x)
> 
> # Method 2
> d = {}
> for x in data:
>   d.setdefault(x, []).append(x)
> 
> I get the table:
> 
> unique_vals	Method 1	Method 2
> 100		 73ms		107ms
> 1000		102ms		142ms
> 10000		154ms		188ms
> 
> So it seems a hash lookup for integers is faster than allocating a temporary list.

index here is an object; I don't know if it's faster or not. That's
possibly something to look for in profiling if this function ever shows
up.
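
(For completeness, something like this reproduces John's comparison - a
throwaway sketch, not code in the branch; the wall-clock loop and the
function names are mine:)

import random
import time

unique_vals = 100
total_count = 100 * 1000
data = [random.randint(0, unique_vals) for x in xrange(total_count)]

def group_with_membership_test(values):
    # Method 1: explicit 'not in' check before appending.
    d = {}
    for x in values:
        if x not in d:
            d[x] = []
        d[x].append(x)
    return d

def group_with_setdefault(values):
    # Method 2: setdefault allocates a throwaway list on every call.
    d = {}
    for x in values:
        d.setdefault(x, []).append(x)
    return d

for func in (group_with_membership_test, group_with_setdefault):
    start = time.time()
    for i in xrange(10):
        func(data)
    print '%s: %.0fms per run' % (func.__name__, (time.time() - start) * 100)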


> +    def _copy_nodes_graph(self, nodes, index_map, writer, write_index,
> 
> ^- I find it a little strange that this function does double duty as a
> generator if you supply output_lines=True, versus just repacking the data if
> you don't.

Right; it's a bit ugly, but better than duplicated code IMO.
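
(Roughly the shape is this - a sketch, not the merged code; when
output_lines is False nothing is yielded, but the caller still has to
iterate the generator to drive the copy:)

def copy_nodes_sketch(records, writer, output_lines=False):
    # Sketch only: the yield makes the body a generator, so the copy
    # happens as a side effect of iteration; the yielded lines are an
    # optional extra for callers that want to filter on them.
    for record in records:
        writer.write(record)
        if output_lines:
            for line in record.splitlines(True):
                yield line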

> I understand why you do, it just seems a little odd.
> 
> Also the statement:
> +        else:
> +            # eat the iterator to cause it to execute.
> +            list(inv_lines)
> +            text_filter = None
> 
> ^- This still builds up a list with every entry in all of the inventories.
> Which seems like a fairly large memory consumption.
> Maybe
> # Iterate the lines to execute the copy
> for l in inv_lines:
>   pass
> 
> Or at least
> [None for l in inv_lines]
> Which still builds up a huge list, but at least it re-uses a single object.

So the autopack operation is currently nice and fast IMO; possibly a
better thing to do would be to not set output_lines when no revision_id
limit is in place.
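
(And if we do keep draining it at this call site, a small helper -
hypothetical, not in the branch - keeps the intent clear without
holding any lines:)

def consume(iterator):
    # Run an iterator to completion for its side effects, discarding
    # each item rather than accumulating them in a list.
    for _ in iterator:
        pass

# i.e. consume(inv_lines) instead of list(inv_lines).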


> Am I correct in saying that you don't currently do any intelligent ordering of
> the output? You put revisions at the front, then inventories and then texts.
> And the Indexes are sorted by key (so you can bisect). But you don't do
> anything like try to group related revisions close together, right?

That's right. https://bugs.edge.launchpad.net/bzr/+bug/154129

> +        self.allocate(new_pack)
> +        if 'fetch' in debug.debug_flags:
> +            # XXX: size might be interesting?
> +            mutter('%s: create_pack: pack renamed into place: %s%s->%s%s t+%6.3fs',
> +                time.ctime(), self._upload_transport.base, new_pack.random_name,
> +                new_pack.pack_transport, new_pack.name,
> +                time.time() - new_pack.start_time)
> +        if 'fetch' in debug.debug_flags:
> +            # XXX: size might be interesting?
> +            mutter('%s: create_pack: finished: %s%s t+%6.3fs',
> +                time.ctime(), self._upload_transport.base, new_pack.random_name,
> +                time.time() - new_pack.start_time)
> 
> I'm not sure what the differences here are.

Heh, I missed moving one of those to NewPack.finish(); done. And the
finished() one is now obsolete with the 3-way merge of names; it used to
actually write to disk during that method.


> In _obsolete_packs you have:
> +            # TODO: Probably needs to know all possible indexes for this pack
> +            # - or maybe list the directory and move all indexes matching this
> +            # name whether we recognize it or not?
> +            for suffix in ('.iix', '.six', '.tix', '.rix'):
> +                self._index_transport.rename(pack.name + suffix,
> +                    '../obsolete_packs/' + pack.name + suffix)
> 
> 
> You could do:
> 
> for suffix, num in NewPack.indices.itervalues():

I agree, this facility is relatively new and underused.

> Or maybe we should be adding a list on packs themselves. So instead you would do:

> +        for pack in packs:
> +            for suffix in pack.index_suffixes:
> +            ... rename indices
> +            pack.pack_transport.rename(pack.file_name(),
> +                '../obsolete_packs/' + pack.file_name())
> 
> It does seem to me that we should be renaming the indexes before we rename the
> pack file itself. Especially if we have a preferred order to read the indexes in.

I disagree: the 5 files are an atomic group, and any missing file makes
the whole lot unusable, so moving the pack first was quite deliberate.
That said, Martin is currently questioning this part of the
implementation anyhow.
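
(For reference, the direction John suggests would look something like
the sketch below, keeping the pack-file-first ordering; 'index_suffixes'
is a hypothetical per-pack attribute, not something in the branch:)

def obsolete_packs_sketch(index_transport, packs):
    # Sketch: move the pack file first (the five files are an atomic
    # group, so a missing pack makes its indexes unusable anyway), then
    # move each index using a per-pack suffix list instead of a
    # hard-coded tuple.
    for pack in packs:
        pack.pack_transport.rename(pack.file_name(),
            '../obsolete_packs/' + pack.file_name())
        for suffix in pack.index_suffixes:
            index_transport.rename(pack.name + suffix,
                '../obsolete_packs/' + pack.name + suffix)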

> ...
> 
> +class GraphKnitRevisionStore(KnitRevisionStore):
> +    """An object to adapt access from RevisionStore's to use GraphKnits.
> +
> +    This should not live through to production: by production time we should
> +    have fully integrated the new indexing and have new data for the
> +    repository classes; also we may choose not to do a Knit1 compatible
> +    new repository, just a Knit3 one. If neither of these happen, this
> +    should definately be cleaned up before merging.
> 
> ^-          definitely
> Obviously this is living in production.
> I would personally prefer to just have the Knit3 form. It will take longer to
> upgrade (since we are rebuilding every inventory, AIUI), but we should really
> be moving forward.

So, this comment is obsolete; it has been cleaned up. We're not
rebuilding inventories to copy data to packs per se.

> 
> ...
> +    def _refresh_data(self):
> +        if self._write_lock_count == 1 or self.control_files._lock_count==1:
> 
> ^- Extra spaces around the '=='

done


> During fetch() you have:
> +        if pack is not None:
> +            self.target._packs._save_pack_names()
> +            # Trigger an autopack. This may duplicate effort as we've just done
> +            # a pack creation, but for now it is simpler to think about as
> +            # 'upload data, then repack if needed'.
> +            self.target._packs.autopack()
> +            return pack.get_revision_count()
> 
> Do we want to auto-repack a remote repository? For bzr+ssh it makes sense. But
> for sftp/ftp/http it means we have to download all of that content and
> re-upload it all. I really think it would make more sense for it not to do this.

Remote repositories are the ones that matter the most.

The dominating scaling factor for packs is the number of packs - they
increase latency linearly, so autopacking matters in proportion to
latency. Remember our JBOF repository on disk - fast locally, terrible
over the network. That's what packs degrade to if you don't autopack.
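
(Back of the envelope, with made-up numbers:)

# Rough illustration of why pack count dominates over a network
# transport: if answering a query means touching the index of every
# live pack, and each touch costs at least one round trip, the added
# latency grows linearly with the pack count.
round_trip = 0.1  # seconds; e.g. a distant sftp server
for pack_count in (1, 10, 50):
    print '%2d packs: >= %.1fs of latency per index-scanning query' % (
        pack_count, pack_count * round_trip)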

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.