Container repository data ordering

Aaron Bentley aaron.bentley at utoronto.ca
Wed Jul 18 15:06:40 BST 2007


The new container-based repositories will group data according to sets
of revisions.  Each container will provide the data for a set.

Since repositories are read more often than they are written, it makes
sense to optimize for read.  Ideally, we would be able to extract all
the data we need in a single pass with no seeking.

There are four main datatypes in a repository:
- file versions
- inventory data
- revision data
- signatures

For reconstruction, inventory data must come before file texts, because
we cannot know which file versions we need until we've examined the
inventories.

Some commands such as "log" and "missing" can use revision data without
reference to any other data.  By contrast, almost every use case for
inventory data includes using file versions.

Signatures can only be verified once the inventory and revision have
been read, so they shouldn't come before the inventory or revision data.

Therefore, I propose this ordering:
1. revision data (to be nice to log)
2. inventory data
3. signatures
4. file versions
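
As a rough sketch, that ordering amounts to a sort key applied when
writing a container.  The record kinds and the function here are
illustrative assumptions, not Bazaar's actual container API:

```python
# Illustrative only: the kind names and this helper are assumptions,
# not part of Bazaar's real container code.
RECORD_ORDER = {"revision": 0, "inventory": 1, "signature": 2,
                "file-version": 3}

def order_records(records):
    """Sort (kind, payload) records into the proposed container order.

    Python's sort is stable, so the original relative order within
    each kind is preserved.  A reader that only wants revision data
    (e.g. "log") can stop after the first group.
    """
    return sorted(records, key=lambda record: RECORD_ORDER[record[0]])
```

A reader doing "log" would then consume only the leading run of
revision records and skip the rest of the container.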

This sequence should work well for tree-wide operations like "merge",
"revert", "status", "diff" and "checkout".  It won't provide as much
locality-of-reference as the current format does for single-file
commands like "cat" and "annotate".  I think that's inevitable.

File versions raise an additional ordering problem:
in what order should texts be built?

The current knit format orders data by file, and orders versions from
oldest to newest.  Deltas are intended to be applied in forward order.

What are our requirements?
1. Building recent versions should be fast
2. We should not hold more than one fulltext in memory at a time

The desire for fast recent versions motivates the suggestion that
patching should be done backwards: to generate an old version, you get
the fulltext of a new version and apply patches to it to turn it into
the old one.

However, the desire to avoid holding more than one fulltext in memory
means that once we have read a fulltext, we cannot read another one
until all the relevant patches have been read.  So if the fulltext
comes first, the relevant patches must come immediately afterward.
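
To make the constraint concrete, here is a minimal sketch of backward
patching that holds only one text in memory.  The delta representation
(line-range replacement operations) is a toy assumption for
illustration, not the real knit or container delta format:

```python
# Toy delta format: a delta is a list of (start, stop, replacement)
# operations on a list of lines, given in descending order of start
# so earlier replacements don't shift later indices.

def apply_delta(lines, delta):
    """Apply one delta to a list of lines, returning a new list."""
    lines = list(lines)
    for start, stop, replacement in delta:
        lines[start:stop] = replacement
    return lines

def build_old_version(newest_fulltext, deltas_newest_first, steps):
    """Walk backwards from the newest fulltext.

    Each delta turns a version into its parent, so only one text is
    ever held in memory; the deltas can be streamed straight from the
    container in the order they were stored.
    """
    text = list(newest_fulltext)
    for delta in deltas_newest_first[:steps]:
        text = apply_delta(text, delta)
    return text
```

Note that the deltas are consumed in exactly the order proposed above:
fulltext first, then its chain of backward deltas immediately after.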

This means that while reading recent revisions will have low CPU
requirements, it will have higher IO requirements, because it can only
stop reading the container once it has read (or seeked past) all the
fulltexts it needs, and (almost) all the deltas that are based on them.
Reducing the size of the delta chain will reduce the amount of data
that needs to be read for recent revisions, at the cost of increasing
the amount of data that needs to be read for older revisions.

By contrast, a forwards-patching format would allow deltas to be written
for new versions, minimizing the number of fulltexts written.  Some
containers could contain only deltas, reducing the need for repository
compaction.  Forwards-patching therefore reduces storage space and IO
at the cost of increased CPU.  It also means that building recent
revisions will be slower than building snapshotted revisions, though
it's not clear how much slower it would be.
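
For contrast, a minimal sketch of forwards patching, again assuming a
toy line-range delta representation rather than the real knit delta
format.  New versions are appended as deltas against their parents, so
reconstructing the newest version means applying the whole chain
forward from the oldest fulltext:

```python
# Toy delta format: a delta is a list of (start, stop, replacement)
# operations on a list of lines, in descending order of start.

def build_recent_version(oldest_fulltext, deltas_oldest_first):
    """Reconstruct the newest version by patching forward.

    The CPU cost is set by the length of the delta chain, not by how
    recent the target version is -- which is why forwards patching
    makes recent versions the most expensive ones to build.
    """
    text = list(oldest_fulltext)
    for delta in deltas_oldest_first:
        for start, stop, replacement in delta:
            text[start:stop] = replacement
    return text
```

Only one text is held in memory here too, but every read of a recent
version pays for the full chain of patch applications.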

Aaron



More information about the bazaar mailing list