[RFC] Pack-specific smart server verbs: check_references and autopack

John Arbash Meinel john at arbash-meinel.com
Thu Jun 19 16:09:57 BST 2008



Andrew Bennetts wrote:
| This patch isn't quite ready for merging, but I would like to let
| people know about it.  It adds some infrastructure and smart server
| verbs to optimise some pack repository operations, rather than
| treating all repositories more-or-less independently of their on-disk
| format as we do now.
|
| The reason for this is that when pushing, two of the major causes of
| slowdowns are:
|
|   * Packer._check_references.  After uploading a pack of new revisions
|     the packer checks that all compression parents not present in that
|     pack exist elsewhere in the repository, in order to make sure all
|     the file texts will be reconstructable.  With VFS operations that
|     tends to be many, many readvs of all the .tix files.  This happens
|     on every push, and is often a large fraction of the total push
|     time, e.g. 25% depending on the exact details of the push.

I'm curious if Robert's new bisecting might help this. It seems like we
know all the keys we want to check, we just need to do a lookup across
all of them.
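
Roughly what I have in mind is a single batched lookup. Just a sketch,
with made-up names (required_keys, text_index), assuming the combined
text index keeps an iter_entries()-style lookup that a bisecting
implementation could sit underneath:

~    def _missing_text_keys(text_index, required_keys):
~        """Return the subset of required_keys not present in text_index."""
~        present = set()
~        for node in text_index.iter_entries(required_keys):
~            # node[1] is the key tuple for each entry that was found
~            present.add(node[1])
~        return set(required_keys) - present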

|
|   * RepositoryPackCollection.autopack.  If certain thresholds are
|     reached after adding a pack, an autopack will be triggered to
|     combine several packs into a single pack.  At the moment this
|     involves pulling down all that data and then reuploading it.  It
|     only happens about one in every ten pushes (and with varying
|     amounts of work to do), but it bites hard when it happens.


This is something I certainly think we should have. As you say, it could
be hidden by a smarter verb above it. But honestly, when this bites on a
big project, it bites *hard*.

|
| A single “stream some revisions into the remote repo” verb will
| probably deal with these intrinsically, but that hasn't been written
| yet.  So as a cheap interim measure I thought I'd try writing some
| pack-specific HPSS code to perform those operations in a single round
| trip each.
|
| The main part is adding a new InterRepository, InterPacktoRemotePack.
| It turned out to be fairly easy to hook it all up.
|
| The Packer._check_references part seems to work well, but I haven't
| yet thought carefully about whether it's always going to be better
| than the status quo, or if it might sometimes be much worse.  I think
| it's probably a worthwhile change, but feedback on this idea is
| welcome (even if it's just to say “yes, that is totally fine, please
| do that”).

Well, it depends on what data you have to send for this. Ideally you
would just send "there is a pack, go check it", and in fact I would
recommend doing exactly that if we can. Can we upload the text index
(since we will need to do that anyway) and just send the location of
the files to check? The relevant code looks like:

~    def _external_compression_parents_of_texts(self):
~        keys = set()
~        refs = set()
~        for node in self.text_index.iter_all_entries():
~            # node[1] is the text key; node[3][1] is the tuple of
~            # compression-parent references for that entry.
~            keys.add(node[1])
~            refs.update(node[3][1])
~        # Anything referenced but not present in this pack must exist
~        # elsewhere in the repository.
~        return refs - keys

I suppose it would mean changing some of the _write_index calls to write
to 'upload' and then rename them into 'indices'.
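
At the transport level I imagine that looks roughly like the following.
(A sketch only: index_name and index_bytes are placeholders, and I'm
assuming the same rename-out-of-'upload/' dance we already do for the
.pack file itself.)

~        # Write the index into upload/ first...
~        self.upload_transport.put_bytes(index_name, index_bytes)
~        # ...and only rename it into indices/ once the references check out.
~        self.upload_transport.rename(index_name, '../indices/' + index_name)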

So the big issue with sending a large request versus readv is
bandwidth: at least on my machine, upload is 10x slower than download.
Latency certainly comes into play as well.

So if you could upload the index that is going to be checked, and have
the full list checked on the server, I think that would be a complete
success. (We are uploading data we have to upload anyway, and all the
info is read on the other side.)
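
The server-side half of the check could then be as small as something
like this (handwaving the verb plumbing; external_refs would be the
refs-minus-keys set computed from the uploaded index, and the attribute
spellings are from memory, so treat them as assumptions):

~    def _check_refs_on_server(repository, external_refs):
~        """Return the external refs that no pack in the repository satisfies."""
~        # The pack collection keeps a combined index covering every pack's .tix.
~        text_index = repository._pack_collection.text_index.combined_index
~        present = set(node[1] for node in text_index.iter_entries(external_refs))
~        return sorted(set(external_refs) - present)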

The current ordering is:
~        self._check_references() # check references
~        if not self._use_pack(new_pack):
~            # If we didn't actually add anything, abort
~            new_pack.abort()
~            return None
~        # upload the indices and rename the pack file.
~        self.pb.update("Finishing pack", 5)
~        new_pack.finish()

It seems to me that we could do the "_check_references()" as part of
'.finish()', and write the indices to the 'upload' directory like we do
for the .pack.
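
In other words, something shaped roughly like this (pseudo-code, only
to show the ordering I mean, with the index writes going into 'upload/'
before anything is renamed into place):

~        if not self._use_pack(new_pack):
~            # If we didn't actually add anything, abort
~            new_pack.abort()
~            return None
~        self.pb.update("Finishing pack", 5)
~        # finish() would write the indices into upload/, run
~        # _check_references() against them, and only then rename the
~        # indices and the .pack into their final locations.
~        new_pack.finish()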

|
| The autopack verb I've added I'm sure is worthwhile, but it doesn't
| quite work right yet and has no tests.  It seems to do the right thing
| on the server, but then leave the client in a state where it has the
| wrong pack-names cached, causing a traceback after most of the push is
| done.  Probably that's not too hard to fix, but I wonder if maybe I
| could hook it up in a better way.  Thoughts?
|
| -Andrew.
|
|

So... for autopack, we just need the call to return whatever "side
effects" happened. For example, it could return the new pack-names
file, or the logical data in it, or just a flag saying whether anything
changed, so that the client knows to re-sync its local cache.
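
On the client side that might look something like this (the verb name,
the response shape, and the reset() cache-clearing call are all
assumptions, just to illustrate the shape of it):

~        response = client.call('PackRepository.autopack', path)
~        if response == ('changed',):
~            # The server combined some packs; drop our cached pack-names
~            # so the next access re-reads them from the transport.
~            self._pack_collection.reset()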

John
=:->


