What to do with large repositories?

Abdulaziz Ghuloum aghuloum at gmail.com
Fri May 11 09:13:29 BST 2007


Greetings,

First, I would like to thank the team for putting together such an  
excellent revision control system.  It has been a great development  
tool.  I have recommended it to all my friends and even required the  
students to use it in class to submit their assignments (but that's  
another story).  I have used bzr to maintain my papers, dissertation  
(in progress), and my projects.  Let me say "thanks" again.

The main project that I have been working on for the last few years
is a native-code incremental compiler for Scheme.  It is nearing
maturity and I am contemplating releasing it as free software.  I
was thinking of publishing the development branch (on Launchpad,
perhaps) for two reasons: it would eliminate the hassle of having
to export specific releases manually (when I can just push), and
it would make it easy for interested parties to grab it and stay up
to date with the development head.  The problem is that the
repository is HUGE (almost 300 MB now, in 600 revisions).

The main contributor to the size of the repository is the executable  
binary image that I have to maintain with every revision.  The  
executable is revisioned for a simple reason: without it, the source  
code is useless.  Every revision is self-hosting (the compiler can  
compile the source, and is compiled by it) so you can go back in time  
to any revision and get a working compiler (of that revision) and its  
source code.  Because the system is under heavy development (features  
and capabilities change very quickly), the source code of one  
revision may not be compilable using previous or later compilers!

The size of an executable is around 2 MB in the current system.
Judging from the size of the repository, each revision adds about
500 KB on average (a good hacking weekend inflates the repository
by ~20 MB).  Most people are probably not interested in the
executables (though I would like to keep them as historical
records).  They are probably interested only in the current revision
and maybe the source code of some earlier ones (not the executables).
Surely someone has been in a similar situation before, and I have
seen a proposal at http://bazaar-vcs.org/HistoryHorizon that attempts
to address this very issue (is it still being considered?).

So, is there a workaround that allows me to publish the development
branch without asking people to download 300 MB of useless data?  Or
should I just forget about it and only publish tgz snapshots?  Any
advice would be greatly appreciated.

Also, have you considered using different "diff" tools for different
types of files?  Executable binaries follow predictable patterns, and
tools like bsdiff (http://www.daemonology.net/bsdiff/) generate very
compact binary diffs (80 KB for my 2 MB file).  There is even a
Python module for it at
http://starship.python.net/crew/atuining/cx_bsdiff/index.html .
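
To put the figures quoted in this message in perspective, here is a
back-of-the-envelope sketch in Python.  It is only an estimate under
stated assumptions: the 80 KB patch size was measured for one pair of
executables and is treated here as typical for every revision, the
executable is assumed to change in every revision, and the source
files are ignored.  The function names are made up for illustration
and are not part of bzr or bsdiff.

```python
# Rough size estimate using the approximate figures quoted in this message.
KB = 1024
MB = 1024 * KB

REVISIONS = 600
REPO_SIZE = 300 * MB          # current repository size (approximate)
EXECUTABLE_SIZE = 2 * MB      # size of one executable image (approximate)
BSDIFF_PATCH_SIZE = 80 * KB   # assumption: typical patch size per revision


def average_growth_per_revision(repo_size, revisions):
    """Average number of bytes each revision adds to the repository."""
    return repo_size / revisions


def estimated_executable_history(full_size, patch_size, revisions):
    """One full copy of the executable plus one binary patch for each
    later revision (hypothetical storage scheme, for illustration)."""
    return full_size + patch_size * (revisions - 1)


avg = average_growth_per_revision(REPO_SIZE, REVISIONS)
est = estimated_executable_history(EXECUTABLE_SIZE, BSDIFF_PATCH_SIZE, REVISIONS)

print(f"current average growth: {avg / KB:.0f} KB per revision")
print(f"executable history with binary patches: {est / MB:.1f} MB")
```

If the assumptions hold even roughly, the executable history would
take a small fraction of the current repository size.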

Thank you very much.

Aziz,,,


