[merge] Repository.tarball hpss command

John Arbash Meinel john at arbash-meinel.com
Tue Apr 17 18:49:19 BST 2007


Martin Pool wrote:
> This adds a new Repository.tarball rpc, which returns in the response
> body a tarball containing the contents of the repository.  The goal is
> to reduce the number of roundtrips needed for an initial repository
> download.
> 
> This diff is against the hpss branch.
> 
> On both the sender and the receiver this is not taken from or written
> directly into the real repository but rather a temporary directory is
> used.

This probably comes out as a net loss for local branching, since you end
up doing "copy to temp, create tarball, bzip2 the tarball, copy over the
wire, unpack into temp, copy from temp to target" versus simply copying
to the target one file at a time. It is a net gain for getting something
from Launchpad.net, though, so I'm not going to say we shouldn't do it.


> 
> This is used by copy_content_into, which is used when making a new
> branch.
> 
> At the moment there is no scope for detecting data that's already
> present on the client.  You can only get the whole repository.  In the
> next release we should look at faster downloads of just the missing
> data.
> 
> For a server 32ms away, this cuts the time to make a new branch of bzr
> from about 6m to about 4m.  For higher latency I expect it would be
> larger but I have not done a proper measurement yet.
> 
> There is still one test failure in this branch,
> 
> bzrlib.tests.bzrdir_implementations.test_bzrdir.TestBzrDir.test_sprout_bzrdir_repository_branch_only_source_under_shared(BzrDirMetaFormat1)
> 
> which is caused by this change:
> 
> -                result_repo.fetch(source_repository, revision_id=revision_id)
> +                # would rather do 
> +                # source_repository.copy_content_into(result_repo, revision_id=revision_id)
> +                # so we can override the copy method
> +                if new_result_repo:
> +                    # can only do this when the destination is empty
> +                    source_repository.copy_content_into(result_repo, revision_id=revision_id)
> +                else:
> +                    result_repo.fetch(source_repository, revision_id=revision_id)
> 
> It looks like the easiest way to fix it is to make fetch use that
> tarball operation as well.
> 
> I have a couple of questions someone might be able to help with:
> 
> Is there still a useful distinction between Repository.copy_content_into
> and fetch?
> 
> It seems strange that RemoteBranch overrides Branch.sprout, but
> RemoteBzrDir doesn't need to override sprout.

My understanding of 'fetch' versus 'copy_content_into' is that
'copy_content_into' was meant to make the target identical to the source
(sort of like clone), including overwriting whatever was already there
(so it sets things like 'no-working-trees', 'shared-storage', etc).

'fetch()', by contrast, is just meant to fetch revision information and
store it. The two also have their directions turned around:
"source.copy_content_into(target)" versus "target.fetch(source)".

I think that means 'fetch()' treats 'target' as local and 'source' as
remote; I'm not sure whether 'copy_content_into' treats the source as
local or not.
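
Just to make the direction concrete, a rough sketch (repository names
made up):

    # copy_content_into: the source drives the copy and makes the destination
    # look like itself (clone-like), including settings such as
    # make-working-trees and shared storage.
    source_repo.copy_content_into(dest_repo, revision_id=revision_id)

    # fetch: the destination drives the copy and only pulls revision data in;
    # it leaves the destination's own configuration alone.
    dest_repo.fetch(source_repo, revision_id=revision_id)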


...



v- I'm pretty sure you want "%(error_code)s", not "error_cde".
This is one of the reasons you should have an entry in "test_errors.py",
so that you are sure the formatting is correct.

> +    _fmt = "Smart Server encountered error %(error_cde)s: %(error_msg)s."
> +
> +    def __init__(self, error_code, error_msg=''):
> +        self.error_code = error_code
> +        self.error_msg = error_msg
> +
> 
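
Something along these lines in test_errors.py would catch that (just a
sketch; I'm guessing at the exception's class name, substitute whatever
the patch actually calls it):

    def test_smart_server_error(self):
        # Instantiating the error and rendering it once is enough to catch
        # a typo like %(error_cde)s in the format string.
        error = errors.SmartServerError('AttributeError', 'fooled you')
        self.assertEqualDiff(
            'Smart Server encountered error AttributeError: fooled you.',
            str(error))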

...

>      def copy_content_into(self, destination, revision_id=None):
> -        self._ensure_real()
> -        return self._real_repository.copy_content_into(
> -            destination, revision_id=revision_id)
> +        # get a tarball of the remote repository, and copy from that into the
> +        # destination
> +        from bzrlib import osutils
> +        import tarfile
> +        import tempfile

v- You really should be using a cStringIO here. But even better would be
having the ability to write it to a temporary file rather than storing
the whole thing in memory.

I think you are just running into the problem because Remote* all deals
in byte strings, so you don't have the option.

But imagine branching a Moz repository, where you end up with a 3GB
tarball in memory.

Also, I'm not sure how efficient these objects are once copies come into
play. (I can say that both cStringIO and StringIO don't seem to make a
copy of the string as long as you don't modify it; at least when I did
cStringIO.StringIO(a_36MB_string), my memory consumption didn't go up.)
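In other words, roughly:

    import cStringIO
    big = 'x' * (36 * 1024 * 1024)
    f = cStringIO.StringIO(big)   # read-only buffer; memory stayed flat here
    # it's only when you copy data out, e.g. f.read(), that you pay for
    # another copy of the string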

Anyway, to me it means that "_get_tarball()" should be returning a file
object (possibly just the socket) rather than returning a byte-string.
But I think you have a bug open about having a Remote function that can
return a file object.


> +        from StringIO import StringIO
> +        # TODO: Maybe a progress bar while streaming the tarball?
> +        note("Copying repository content as tarball...")
> +        tar_file = StringIO(self._get_tarball('bz2'))
> +        tar = tarfile.open('repository', fileobj=tar_file,
> +            mode='r:bz2')
> +        tmpdir = tempfile.mkdtemp()
> +        try:
> +            tar.extractall(tmpdir)
> +            tmp_bzrdir = BzrDir.open(tmpdir)
> +            tmp_repo = tmp_bzrdir.open_repository()
> +            tmp_repo.copy_content_into(destination, revision_id)
> +        finally:
> +            osutils.rmtree(tmpdir)
> +        # TODO: if the server doesn't support this operation, maybe do it the
> +        # slow way using the _real_repository?

^- Also, if we are just looking for speed, "tar xjf -" is a lot faster
than "tar.extractall". At least in my testing, tarfile is pretty slow.
(I was looking at using compressed .tar.bz2 files to create a memoized
test object, but it was faster to do Branch.initialize() and a commit
than to extract from a tarball.)
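
To sketch what I mean (helper name made up, and assuming the remote call
can hand back a file-like object instead of one big string; shelling out
to tar obviously isn't portable, it just shows where the time goes):

    import subprocess

    def _extract_tarball_fast(tar_fileobj, target_dir):
        # Stream the still-compressed tarball straight into an external
        # 'tar xjf -' instead of buffering it and using tarfile.extractall.
        proc = subprocess.Popen(['tar', 'xjf', '-', '-C', target_dir],
                                stdin=subprocess.PIPE)
        while True:
            chunk = tar_fileobj.read(64 * 1024)
            if not chunk:
                break
            proc.stdin.write(chunk)
        proc.stdin.close()
        if proc.wait() != 0:
            raise IOError('tar exited with status %d' % proc.returncode)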

>  
>      def set_make_working_trees(self, new_value):
>          raise NotImplementedError(self.set_make_working_trees)
> @@ -907,9 +940,8 @@
>          # format, because RemoteBranches can't be created at arbitrary URLs.
>          # XXX: if to_bzrdir is a RemoteBranch, this should perhaps do
>          # to_bzrdir.create_branch...
> -        self._ensure_real()
>          result = branch.BranchFormat.get_default_format().initialize(to_bzrdir)
> -        self._real_branch.copy_content_into(result, revision_id=revision_id)
> +        self.copy_content_into(result, revision_id=revision_id)
>          result.set_parent(self.bzrdir.root_transport.base)
>          return result
>  

...

v- We would save quite a bit of time if we just wrote a function to take
the existing repository and copy data into it. It also would give us an
opportunity to strip out unreferenced data (which is what we do now when
requesting a specific revision).

> +    def do_repository_request(self, repository, compression):
> +        from bzrlib import osutils
> +        repo_transport = repository.control_files._transport
> +        tmp_dirname, tmp_repo = self._copy_to_tempdir(repository)
> +        try:
> +            controldir_name = tmp_dirname + '/.bzr'
> +            return self._tarfile_response(controldir_name, compression)
> +        finally:
> +            osutils.rmtree(tmp_dirname)
> +
> +    def _copy_to_tempdir(self, from_repo):
> +        tmp_dirname = tempfile.mkdtemp(prefix='tmpbzrclone')
> +        tmp_bzrdir = from_repo.bzrdir._format.initialize(tmp_dirname)
> +        tmp_repo = from_repo._format.initialize(tmp_bzrdir)
> +        from_repo.copy_content_into(tmp_repo)
> +        return tmp_dirname, tmp_repo
> +
> +    def _tarfile_response(self, tmp_dirname, compression):
> +        temp = tempfile.NamedTemporaryFile()
> +        try:
> +            self._tarball_of_dir(tmp_dirname, compression, temp.name)
> +            # all finished; write the tempfile out to the network
> +            temp.seek(0)
> +            return SmartServerResponse(('ok',), temp.read())
> +            # FIXME: Don't read the whole thing into memory here; rather stream it
> +            # out from the file onto the network. mbp 20070411
> +        finally:
> +            temp.close()
> +
> +    def _tarball_of_dir(self, dirname, compression, tarfile_name):
> +        tarball = tarfile.open(tarfile_name, mode='w:' + compression)
> +        try:
> +            # The tarball module only accepts ascii names, and (i guess)
> +            # packs them with their 8bit names.  We know all the files
> +            # within the repository have ASCII names so the should be safe
> +            # to pack in.
> +            dirname = dirname.encode(sys.getfilesystemencoding())
> +            # python's tarball module includes the whole path by default so
> +            # override it
> +            assert dirname.endswith('.bzr')
> +            tarball.add(dirname, '.bzr') # recursive by default
> +        finally:
> +            tarball.close()

In the 'export' code, I use 'utf8' encoding. I think that is much more
reasonable than using "mbcs" encoding on Windows.

I think we can just define that all tarballs have their filenames encoded
in utf8.

...

v- This is probably easier to understand if it were in base64, but it's
not a big deal. You could even do it as:

tarball_content = """
QlpoOTFBWSZTWdGkj3wAAWF/k8aQACBIB//A9+8cIX/v33AACEAYABAECEACNzJqsgJJFPTSnk1A
3qh6mTQAAAANPUHkagkSTEkaA09QaNAAAGgAAAcwCYCZGAEYmJhMJghpiaYBUkKammSHqNMZQ0NA
BkNAeo0AGneAevnlwQoGzEzNVzaYxp/1UkxXzA1CQX0BJMZZLcPBrluJir5SQyijWHYZ6ZUtVqql
YDdB2QoCwa9GyWwGYDMAOQYhkpLt/OKFnnlT8E0PmO8+ZNSo2WWqeCzGB5fBXZ3IvV7uNJVE7DYn
Wj6qwBk5DJDIrQ5OQHHIjkS9KqwG3mc3t+F1+iujb89ufyBNIKCgeZBWrl5cXxbMGoMsc9JuUkg5
YsiVcaZJurc6KLi6yKOkgCUOlIlOpOoXyrTJjK8ZgbklReDdwGmFgtdkVsAIslSVCd4AtACSLbyh
LHryfb14PKegrVDba+U8OL6KQtzdM5HLjAc8/p6n0lgaWU8skgO7xupPTkyuwheSckejFLK5T4ZO
o0Gda9viaIhpD1Qn7JqqlKAJqCQplPKp2nqBWAfwBGaOwVrz3y1T+UZZNismXHsb2Jq18T+VaD9k
4P8DqE3g70qVJLurpnDI6VS5oqDDPVbtVjMxMxMg4rzQVipn2Bv1fVNK0iq3Gl0hhnnHKm/egynW
Q7QH/F3JFOFCQ0aSPfA=
""".decode(base64)

*I* find it nicer, but it is only a minor aesthetic thing.
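
Regenerating the canned value is then a one-liner, e.g. (file name made up):

    # paste the output straight back into the test
    print open('canned-repo.tar.bz2', 'rb').read().encode('base64')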

> +
> +
> +class TestRepositoryTarball(TestRemoteRepository):
> +
> +    # This is a canned tarball reponse we can validate against
> +    tarball_content = (
> +        "BZh91AY&SY\xd1\xa4\x8f|\x00\x01a\x7f\x93\xc6\x90\x00 H\x07\xff\xc0"
> +        "\xf7\xef\x1c!\x7f\xef\xdfp\x00\x08@\x18\x00\x10\x04\x08@\x0272j\xb2"
> +        "\x02I\x14\xf4\xd2\x9eM@\xde\xa8z\x994\x00\x00\x00\r=A\xe4j\t\x12LI"
> +        "\x1a\x03OPh\xd0\x00\x00h\x00\x00\x070\t\x80\x99\x18\x01\x18\x98\x98"
> +        "L&\x08i\x89\xa6\x01RB\x9a\x9ad\x87\xa8\xd3\x19CC@\x06C at z\x8d\x00\x1aw"
> +        "\x80z\xf9\xe5\xc1\n\x06\xccL\xcdW6\x98\xc6\x9f\xf5RLW\xcc\rBA}\x01$"
> +        "\xc6Y-\xc3\xc1\xae[\x89\x8a\xbeRC(\xa3Xv\x19\xe9\x95-V\xaa\xa5`7A"
> +        "\xd9\n\x02\xc1\xafF\xc9l\x06`3\x009\x06!\x92\x92\xed\xfc\xe2\x85\x9e"
> +        "yS\xf0M\x0f\x98\xef>d\xd4\xa8\xd9e\xaax,\xc6\x07\x97\xc1]\x9d\xc8"
> +        "\xbd^\xee4\x95D\xec6'Z>\xaa\xc0\x199\x0c\x90\xc8\xad\x0eN at q\xc8\x8e"
> +        "D\xbd*\xac\x06\xdeg7\xb7\xe1u\xfa+\xa3o\xcfn\x7f M \xa0\xa0y\x90V"
> +        "\xae^\\_\x16\xcc\x1a\x83,s\xd2nRH9b\xc8\x95q\xa6I\xba\xb7:(\xb8"
> +        "\xba\xc8\xa3\xa4\x80%\x0e\x94\x89N\xa4\xea\x17\xca\xb4\xc9\x8c"
> +        "\xaf\x19\x81\xb9%E\xe0\xdd\xc0i\x85\x82\xd7dV\xc0\x08\xb2T\x95"
> +        "\t\xde\x00\xb4\x00\x92-\xbc\xa1,z\xf2}\xbdx<\xa7\xa0\xadP\xdb"
> +        "k\xe5<8\xbe\x8aB\xdc\xdd3\x91\xcb\x8c\x07<\xfe\x9e\xa7\xd2X\x1aYO,"
> +        "\x92\x03\xbb\xc6\xeaONL\xae\xc2\x17\x92rG\xa3\x14\xb2\xb9O\x86N\xa3"
> +        "A\x9dk\xdb\xe2h\x88i\x0fT'\xec\x9a\xaa\x94\xa0\t\xa8$)\x94\xf2\xa9"
> +        "\xdaz\x81X\x07\xf0\x04f\x8e\xc1Z\xf3\xdf-S\xf9FY6+&\\{\x1b\xd8\x9a"
> +        "\xb5\xf1?\x95h?d\xe0\xff\x03\xa8M\xe0\xefJ\x95$\xbb\xab\xa6p\xc8"
> +        "\xe9T\xb9\xa2\xa0\xc3=V\xedV313\x13 \xe2\xbc\xd0V*g\xd8\x1b\xf5}S"
> +        "J\xd2*\xb7\x1a]!\x86y\xc7*o\xde\x83)\xd6C\xb4\x07\xfc]\xc9\x14\xe1"
> +        "BCF\x92=\xf0"
> +        )
> +


Otherwise it seems good.

John
=:->


