[merge] Repository.tarball hpss command

Mon Apr 23 07:39:05 BST 2007

On 4/18/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
> > This diff is against the hpss branch.
> >
> > On both the sender and the receiver this is not taken from or written
> > directly into the real repository but rather a temporary directory is
> > used.
>
> This probably has a net loss for local branching. Since you have "copy
> to temp, create tarball, bzip tarball, copy over the wire, unpack into
> temp, copy from temp to target".

(Sadly I somehow lost my previously drafted reply...)

For strictly local operations of course this won't be active.  If you
are using bzr:// between two machines on the same lan then there may
be a slowdown.  I will measure it.

> My understanding of 'fetch' versus 'copy_content_into' is that
> 'copy_content_into' was meant to make the target identical to this (sort
> of like clone) including overwriting whatever was already there. (So it
> sets things like 'no-working-trees', 'shared-storage', etc).
>
> Versus "fetch()" which is just meant to fetch revision information and
> store it. It also has the directions turned around
> "source.copy_content_into(target)" versus "target.fetch(source)".
>
> I think that means that 'fetch()' thinks of 'target' as local and source
> as remote, I'm not sure if 'copy_content_into' thinks of source as local
> or not.

There is some historical cruft here which we could clean out, maybe in
the next release.  I think as an api we basically just want "fetch
revisions" and "copy configuration" (like shared-storage, etc.)  If
the caller happens to know that the destination is empty then maybe we
can have an option to give that hint.  I'll look at that for the next
release.

> v- I'm pretty sure you want "%(error_code)s" not error_cde
>    This is one of the reasons you should have an entry in
> "test_errors.py" so that you are sure the formatting is correct.
>
> > +    _fmt = "Smart Server encountered error %(error_cde)s: %(error_msg)s."
> > +
> > +    def __init__(self, error_code, error_msg=''):
> > +        self.error_code = error_code
> > +        self.error_msg = error_msg
> > +
> >

Good point.  I changed this to a different error
UnexpectedSmartServerResponse which does have such a test.

> >      def copy_content_into(self, destination, revision_id=None):
> > -        self._ensure_real()
> > -        return self._real_repository.copy_content_into(
> > -            destination, revision_id=revision_id)
> > +        # get a tarball of the remote repository, and copy from that into the
> > +        # destination
> > +        from bzrlib import osutils
> > +        import tarfile
> > +        import tempfile
>
> v- You really should be using a cStringIO here. But even better would be
> having the ability to write it to a temporary file rather than storing
> the whole thing in memory.
>
> I think you are just running into the problem because Remote* all deals
> in byte strings, so you don't have the option.
>
> But imaging branching a Moz repository, and you get a 3GB tarball in memory.

I will keep it in a temporary file, and change the Remote layer to
allow taking and receiving message bodies to and from them.

> ^- Also, if we are just looking for speed, "tar xjf -" is a lot faster
> than "tar.extractall". At least from my testing, tarfile is pretty slow.
> (I was looking at using compressed .tar.bz2 files for creating a
> memoized test object, but it was faster to do Branch.initialize() and a
> commit, rather than extracting from a tarball).

I'll just add a note about that for now.

> v- We would save quite a bit of time if we just wrote a function to take
> the existing repository and copy data into it. It also would give us an
> opportunity to strip out unreferenced data (which is what we do now when
> requesting a specific revision).

Copying it to a temporary directory gives us the chance to strip
unneeded data, but it's not very useful as we don't pass the desired
revision id at the moment.  It seems like maybe we should just pack
the repository directly from its current location if no revision id is
specified, or if one is given then extract to a temporary directory
and go from there.

This is rather unsatisfactory to me compared to having a format that
can slide more easily in and out of the repository, but it does seem
like the practical thing for now.
> > +    def _tarball_of_dir(self, dirname, compression, tarfile_name):
> > +        tarball = tarfile.open(tarfile_name, mode='w:' + compression)
> > +        try:
> > +            # The tarball module only accepts ascii names, and (i guess)
> > +            # packs them with their 8bit names.  We know all the files
> > +            # within the repository have ASCII names so the should be safe
> > +            # to pack in.
> > +            dirname = dirname.encode(sys.getfilesystemencoding())
> > +            # python's tarball module includes the whole path by default so
> > +            # override it
> > +            assert dirname.endswith('.bzr')
> > +            tarball.add(dirname, '.bzr') # recursive by default
> > +        finally:
> > +            tarball.close()
>
> In the 'export' code, I use 'utf8' encoding. I think that is much more
> reasonable. Than using "mbcs" encoding on windows.
>
> I think we can just define that all tarballs have filenames encoded in utf8.

Because it's just a tarball of the repository contents itself, we know
it will always have ascii names.  (At least all our current formats do
and I think we'd be crazy to ever do otherwise...)

We do just need the fsencoding to look up whereever tempfile decided
to put the tarball...

> v- This is probably easier to understand if it was in BASE64, but not a
> big deal. You could even do it as:

OK

I was aiming to have this in 0.16, and it may still squeak in if
someone helps me with another review today.  It seems like what is
needed is

 - make this be active when branching, without breaking the BzrDir.sprout tests
 - change the
 - avoid copying to a temporary directory on the server when

-- 
Martin