Call for testing: cvs2bzr

Greg Ward greg at gerg.ca
Thu Aug 20 04:11:26 BST 2009


On Wed, Aug 19, 2009 at 8:08 PM, Ian Clatworthy
<ian.clatworthy at canonical.com> wrote:
> bzr fast-import will handle blobs being defined once and reused over and
> over again. The trouble is that it doesn't know which ones get reused
> unless it does two passes, so it acts conservatively and keeps all of
> them in memory. Fine for small imports but lousy for large ones. Reusing
> mark idrefs or using inline blobs solves the problem implicitly.
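
For anyone who hasn't stared at a fast-import stream, the two things
Ian mentions look roughly like this (the mark, path and contents here
are made up for illustration).  A blob can be defined once under a
mark:

    blob
    mark :1
    data 14
    Hello, world!

and then pulled into any later commit by referring to that mark:

    M 100644 :1 README

or its data can be inlined right where the commit needs it:

    M 100644 inline README
    data 14
    Hello, world!

The problem Ian describes is that a single-pass importer, on seeing
the marked blob, has no way to tell whether ":1" will be referenced
once or a thousand commits later.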

IMHO, you really need to be able to handle ~100k commits and ~1 GB of
source code to be taken seriously.  (I only use those figures because
that happens to be the scale of the CVS conversion I'm working on this
summer.)  And I'm pretty sure keeping all blobs in memory will not
work for a conversion of that size.

So you either need two passes, or my hack of writing each blob to a
separate file, or my (unimplemented) idea of storing offsets into a
single blob file, or something similar.

>> (My "clever" idea for handling blobs: keep a dict mapping blob mark to
>> file offset.  Then when we need a blob, seek to that offset and read
>> the required number of bytes.  Never got around to implementing this,
>> and I'm not sure if it would save much I/O.  Fewer writes I suppose.)
>
> stdin as the data stream might be a problem, though.

Two passes over the dump would also rule out stdin.  Reading a
fast-import stream from stdin sounds nice, but IMHO it's too damn hard
to be worth the effort.  git can presumably get away with it because
the fast-import format was designed to fit git's storage format.  bzr
and hg do not have that luxury, so our fast-importers have to do ...
something else.  And storing all blobs in memory is just not scalable.
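
For what it's worth, the offset idea above would look something like
this in Python (a rough sketch only -- the class and method names are
invented, not part of cvs2bzr or bzr fast-import):

    class BlobSpillFile(object):
        """Append each blob to one big spill file and remember its
        offset, so it can be re-read later with a seek instead of
        being held in memory."""

        def __init__(self, path):
            self._file = open(path, 'w+b')   # read/write binary
            self._offsets = {}               # mark -> (offset, length)

        def store(self, mark, data):
            self._file.seek(0, 2)            # append at end of file
            offset = self._file.tell()
            self._file.write(data)
            self._offsets[mark] = (offset, len(data))

        def fetch(self, mark):
            offset, length = self._offsets[mark]
            self._file.seek(offset)
            return self._file.read(length)

        def close(self):
            self._file.close()

So each blob costs one write up front, and a commit that needs it
later costs a seek and a read rather than a chunk of memory.  The dict
itself stays small -- one (offset, length) pair per mark instead of
the blob contents.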

Anyways... getting off topic for the cvs2svn list.  Sorry.

Greg


