What to do with large repositories?
Abdulaziz Ghuloum
aghuloum at gmail.com
Fri May 11 09:13:29 BST 2007
Greetings,
First, I would like to thank the team for putting together such an
excellent revision control system. It has been a great development
tool. I have recommended it to all my friends and even required my
students to use it in class to submit their assignments (but that's
another story). I have used bzr to maintain my papers, dissertation
(in progress), and my projects. Let me say "thanks" again.
The main project that I have been working on for the last few years
is a native-code incremental compiler for Scheme. It is nearing
maturity and I am contemplating releasing it as free software. I
was thinking of publishing the development branch (on Launchpad,
perhaps) for two reasons: it would spare me the hassle of exporting
specific releases manually (when I can just push), and it would make
it easy for interested parties to grab it and stay up to date with
the development head. The problem is that the repository is HUGE
(almost 300 MB now, in 600 revisions).
The main contributor to the size of the repository is the executable
binary image that I have to maintain with every revision. The
executable is revisioned for a simple reason: without it, the source
code is useless. Every revision is self-hosting (the compiler can
compile the source, and is compiled by it) so you can go back in time
to any revision and get a working compiler (of that revision) and its
source code. Because the system is under heavy development (features
and capabilities change very quickly), the source code of one
revision may not be compilable using previous or later compilers!
The size of an executable is around 2 MB in the current system.
Judging from the size of the repository, each revision adds about
500 KB on average (a good hacking weekend inflates the repository by
~20 MB). Most people are probably not interested in the executables
(though I would like to keep them as historical records). They are
probably more interested in the current revision only, and maybe the
source code of some earlier revisions (not the executables).
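
As a rough check of the figures above, here is a small back-of-the-envelope sketch. It assumes decimal units (1 MB = 1000 KB) and takes the rounded numbers quoted in this message at face value; the variable names are only illustrative.

```python
# Rounded figures quoted in this message (decimal units assumed).
REPO_SIZE_KB = 300 * 1000    # "almost 300 MB"
REVISIONS = 600              # "600 revisions"
EXECUTABLE_KB = 2 * 1000     # "around 2 MB" per executable image
WEEKEND_GROWTH_KB = 20 * 1000  # "~20 MB" in a good hacking weekend

# Average growth of the repository per revision.
avg_growth_kb = REPO_SIZE_KB / REVISIONS

# How many average-sized revisions a 20 MB weekend corresponds to.
weekend_revisions = WEEKEND_GROWTH_KB / avg_growth_kb

# Stored delta per revision relative to the full executable size.
delta_fraction = avg_growth_kb / EXECUTABLE_KB

print(avg_growth_kb)      # 500.0
print(weekend_revisions)  # 40.0
print(delta_fraction)     # 0.25
```

So, under these assumptions, each revision costs roughly a quarter of a full executable image, which suggests the existing deltas save something but not very much on binaries.
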
Surely someone has had a similar situation before, and I have seen a
proposal at http://bazaar-vcs.org/HistoryHorizon that attempts to
address this very issue (is this still being considered?).
So, is there a workaround that allows me to publish the development
branch without asking people to download 300 MB of useless data? Or
should I just forget about it and only publish tgz snapshots? Any
advice would be greatly appreciated.
Also, have you considered using different "diff" tools for different
types of files? Executable binaries follow predictable patterns, and
tools like bsdiff (http://www.daemonology.net/bsdiff/) generate very
compact binary diff files (80 KB for my 2 MB file). There is even a
Python module for it at
http://starship.python.net/crew/atuining/cx_bsdiff/index.html .
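
To make the idea concrete, here is a minimal sketch of choosing a delta tool per file type. It is not how bzr stores revisions; all the function names are hypothetical, the NUL-byte check is only a common heuristic for spotting binary content, and the default binary handler is a stand-in that compresses the whole file rather than computing a real binary diff. A bsdiff binding could presumably be passed in its place through the binary_delta parameter.

```python
import difflib
import zlib
from typing import Callable, Tuple

Delta = Callable[[bytes, bytes], bytes]


def looks_binary(data: bytes, probe: int = 8192) -> bool:
    """Heuristic: content with a NUL byte near the start is treated as binary."""
    return b"\0" in data[:probe]


def text_delta(old: bytes, new: bytes) -> bytes:
    """Line-based unified diff, the usual choice for source files."""
    old_lines = old.decode("utf-8", "replace").splitlines(keepends=True)
    new_lines = new.decode("utf-8", "replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines)).encode("utf-8")


def fallback_binary_delta(old: bytes, new: bytes) -> bytes:
    """Stand-in only: stores the new file compressed, ignoring the old one."""
    return zlib.compress(new)


def make_delta(old: bytes, new: bytes,
               binary_delta: Delta = fallback_binary_delta) -> Tuple[str, bytes]:
    """Pick a delta strategy based on whether either version looks binary."""
    if looks_binary(old) or looks_binary(new):
        return "binary", binary_delta(old, new)
    return "text", text_delta(old, new)


# Source files go through the line-based diff ...
kind, delta = make_delta(b"(define x 1)\n", b"(define x 2)\n")
print(kind, len(delta))

# ... while an executable image is routed to the binary handler.
kind, delta = make_delta(b"\x7fELF\x00old", b"\x7fELF\x00new")
print(kind, len(delta))
```

The point of the pluggable handler is that the storage layer would not need to know which tool produced a delta, only which kind it is when applying it.
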
Thank you very much.
Aziz,,,