bzr and biiig repo

John Arbash Meinel john at arbash-meinel.com
Tue Dec 18 18:43:54 GMT 2007


Alexander Belchenko wrote:
> This is a continuation of the recent story about Vlad Adamenko, with more details.
> 
> His current project is about 40GB in roughly 80K files, with an svn history
> of more than 1K revisions.
> 
> AFAIK bzr is currently unable to handle such big repositories?
> 
> But there is a strange thing about the checkout problem on the Windows
> machine and the MemoryError. If he does a full checkout or branches from the
> Linux server over sftp, the operation fails with MemoryError and the bzr
> process eats more than 750MB of RAM. But when he does a lightweight checkout,
> the process finishes successfully and bzr eats about 50MB of RAM. Why?

Did you see my patch for sftp? He could try that. It seems our current sftp
implementation buffers a significant amount of data before flushing it up to
the next layer.

In my simple testing I was able to drop RAM consumption from 390MB to 40MB.
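
For illustration only (this is a sketch of the buffering pattern, not the
actual patch, and iter_file_chunks is a made-up name): the idea is to read
from the remote file object in bounded chunks and hand each chunk up to the
next layer immediately, instead of accumulating the whole file first.

def iter_file_chunks(fileobj, chunk_size=64 * 1024):
    # Peak memory stays around chunk_size instead of the full file size,
    # because each chunk can be written out before the next one is read.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# e.g. copying a remote file to disk without holding it all in memory:
# with open(local_path, 'wb') as out:
#     for chunk in iter_file_chunks(remote_file):
#         out.write(chunk)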

> 
> When I recommended that Vlad try bzr+ssh, he got the same MemoryError, but
> in this case on the server side. He said the error message complained about
> an incorrect response from the server:
> 
> vlad at VFX:$ bzr up
> bzr at server's password:
> bzr: ERROR: Could not understand response from smart server: ('error',
> 'out of memory')
> 

Well, at the moment we still don't have streaming hpss fetching. So it creates
a large "pack" in memory and sends it across. Andrew has work that will at
least update the client to allow the server to stream data to it (at the moment
the API requires that we know how many total bytes are going to be sent). Then
we can look more at how to get the server to stream the data, rather than
having it cache everything in RAM before it is sent.
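
To make the difference concrete, here is a hedged sketch (not the real hpss
wire format; send_buffered, send_streamed and the framing are invented for
illustration) of buffering a whole pack versus streaming it in length-prefixed
chunks:

import struct

def send_buffered(sock, build_pack):
    # Current-style transfer: the whole pack is built in memory so the
    # total byte count is known before anything goes on the wire.
    data = build_pack()
    sock.sendall(struct.pack('>Q', len(data)))
    sock.sendall(data)

def send_streamed(sock, iter_pack_chunks):
    # Streaming-style transfer: each chunk is length-prefixed, so neither
    # side needs the total size up front; a zero-length chunk terminates.
    for chunk in iter_pack_chunks():
        sock.sendall(struct.pack('>I', len(chunk)))
        sock.sendall(chunk)
    sock.sendall(struct.pack('>I', 0))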


> 
> I suggested filing a bug report, but in the case of the lightweight checkout
> it is not clear what exactly should go in the bug report.

With a lightweight checkout, it doesn't have to stream the history, and it is
probably correctly set up to just read one file at a time. (Note we are looking
into switching to iter_files_bytes, which is a bit more likely to try to cache
too much, but we'll get there.)
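
As a rough sketch of why that matters (tree, get_file_text and write_file are
placeholders here, not bzrlib's real API), the two memory profiles look like
this:

def materialize_all(tree, file_ids, write_file):
    # Everything cached first: peak memory is roughly the sum of all
    # file contents, which is what blows up on a 40GB tree.
    texts = dict((fid, tree.get_file_text(fid)) for fid in file_ids)
    for fid, text in texts.items():
        write_file(fid, text)

def materialize_one_at_a_time(tree, file_ids, write_file):
    # One file's contents in memory at a time: peak memory is roughly
    # the size of the largest single file.
    for fid in file_ids:
        write_file(fid, tree.get_file_text(fid))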

In general, there is a tension between making things faster by buffering, and
not consuming all memory.

> 
> Vlad is willing to help in solving the MemoryError problem, but he is not
> familiar with the bzrlib codebase, and I can't help much here, because some
> parts of bzrlib are hard for me to understand (or maybe I don't have enough
> time to keep digging deeper and deeper).
> 
> Vlad also asks about nested tree support.

AFAIK, there are a couple of patches from Andrew and one from me to buffer
more logically, rather than buffering everything completely.

Having someone around saying "hey, this breaks because of too much RAM" is a
decent way of poking holes in bzr. It is unfortunate that Python doesn't itself
have a way to track memory consumption, so it is hard to write a test which
says: "use no more than XX memory to do YY".
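
About the closest the standard library gets is peak RSS from getrusage; a
rough, Unix-only sketch of such a check might look like the following (coarse,
since ru_maxrss is a process-wide peak and its units differ between
platforms):

import resource

def assert_peak_rss_below(limit_bytes, func, *args, **kwargs):
    func(*args, **kwargs)
    # ru_maxrss is kilobytes on Linux (bytes on Mac OS X), and it includes
    # memory used before func ran, so this is only an approximation.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    assert peak < limit_bytes, (
        "peak RSS %d bytes exceeded limit %d" % (peak, limit_bytes))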

John
=:->


