split knit indexes by date ranges (history horizon)

Sat Jan 27 01:26:45 GMT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I also want to share my thoughts about the history horizon feature
(tonight I'm in a creative mode). I have been thinking about it for a
very long time, and here are my thoughts. I am trying to find a simple
solution to implement, because in many cases a simple solution works fast.

When I joined bzr I read this article by James Blackwell:
http://jblack.linuxguru.net/node/7

And the more I think about history horizon, the more often I
return to this article.

bzr has a unique feature: it can tell the date of a revision by
looking only at its revision-id. That's not strictly true, but apart
from revisions imported from baz, bzr always tries to generate a new id
with the date inside. And this is a really good idea.
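
To illustrate, here is a minimal sketch in Python of reading the date
back out of a revision-id. It assumes the usual bzr-generated layout
<committer>-<YYYYMMDDHHMMSS>-<random suffix>; the function name
revision_date is made up for this example and is not part of bzrlib.
Ids that do not follow the layout (e.g. imported ones) give None.

```python
import re
from datetime import datetime

# Assumed layout of a bzr-generated id:
#   <committer>-<YYYYMMDDHHMMSS>-<random suffix>
_REVID_DATE = re.compile(r"-(\d{14})-[^-]+$")


def revision_date(revision_id):
    """Return the datetime embedded in a revision id, or None.

    Hypothetical helper for illustration; not a bzrlib function.
    """
    match = _REVID_DATE.search(revision_id)
    if match is None:
        # Foreign id (e.g. imported from baz): no date to extract.
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # 14 digits that do not form a valid timestamp.
        return None
```

The None result is the case where some other, slower way of finding the
revision date would be needed.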

The weak point, from the network point of view, for projects that
have a big old history is our current format of knit indexes.
It tries to hold the whole index in one file. As a result we have about
800KB of inventory.kndx and a similar size for revisions.kndx.
And I assume that in a project with a mandatory gpg revision signing
policy signatures.kndx will have a similar size.

Each of those kndx files is the map for the corresponding knit file.
And these files are really big.

So my idea is to split the knit files and their indexes based
on dates. Depending on project activity and size, each knit
could correspond to some range of dates. E.g. for the bzr project itself
a split per quarter or per month is near optimal.

To store the list of parts bzr needs another meta-index that specifies
the range of dates and the name of the corresponding knit/kndx, e.g.

inventory-0 2005/03/01 2005/09/01
inventory-1 2005/09/02 2005/12/31
...
inventory-n 2007/01/01

The last entry should probably be an open range, because the actual
range could vary depending on knit file size.
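
A rough sketch of how such a meta-index could be read and searched, in
Python. The line format (name, start date, optional end date) is taken
from the example above; the function names parse_meta_index and
find_part are hypothetical, and treating the end date as inclusive is
my assumption.

```python
import bisect
from datetime import date, datetime


def _parse_day(text):
    # Dates in the meta-index are assumed to be written as YYYY/MM/DD.
    return datetime.strptime(text, "%Y/%m/%d").date()


def parse_meta_index(text):
    """Parse 'name start [end]' lines into (start, end, name) tuples.

    end is None for an open range (the last, still growing part).
    """
    parts = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        end = _parse_day(fields[2]) if len(fields) > 2 else None
        parts.append((_parse_day(fields[1]), end, fields[0]))
    parts.sort(key=lambda part: part[0])
    return parts


def find_part(parts, day):
    """Return the name of the part whose range contains day, or None."""
    starts = [part[0] for part in parts]
    i = bisect.bisect_right(starts, day) - 1
    if i < 0:
        return None          # day is before the first part
    start, end, name = parts[i]
    if end is not None and day > end:
        return None          # day falls into a gap between parts
    return name


META = """\
inventory-0 2005/03/01 2005/09/01
inventory-1 2005/09/02 2005/12/31
inventory-n 2007/01/01
"""
parts = parse_meta_index(META)
```

With this, find_part(parts, date(2007, 1, 27)) selects inventory-n, so
only that one kndx/knit pair would have to be read.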

Because a typical operation like push, pull, commit or merge
of 1 new revision doesn't need the complete history, but only
the recent part, bzr will need to read only the meta-index and one
kndx/knit pair. Furthermore, to support history horizon it's enough
to keep in the local repository only the latest of the split parts
(inventory.xxxx/revisions.xxxx etc.).

At the start of this post I mentioned the date inside the revision id.
Because we split knits by dates, we have a very simple
method to look up the right part of the split repository.
Of course it will not work with revision ids that were not
created by bzr itself... But there probably exist some additional
ways to obtain a fast lookup of the revision date.

Keeping only part of the whole repository in the local repository
also means that the other pieces will be copied to the local repository
on demand.
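
The on-demand copying could look roughly like this in Python. This is
only a sketch under my own assumptions: PartCache and the fetch callable
are invented names standing in for the local storage and for whatever
transport would download a part from the remote repository.

```python
class PartCache:
    """Holds the locally available parts and fetches others on demand.

    Hypothetical illustration; not an existing bzrlib class.
    """

    def __init__(self, local_parts, fetch):
        self._local = dict(local_parts)  # part name -> locally stored data
        self._fetch = fetch              # callable: part name -> data

    def get(self, name):
        # Only go to the remote side when the part is missing locally.
        if name not in self._local:
            self._local[name] = self._fetch(name)
        return self._local[name]

    def local_names(self):
        return sorted(self._local)


fetched = []


def fake_fetch(name):
    # Stand-in for a real download; records which parts were requested.
    fetched.append(name)
    return b"data:" + name.encode("ascii")


cache = PartCache({"inventory-n": b"recent"}, fake_fetch)
```

Here cache.get("inventory-n") is answered locally, while
cache.get("inventory-0") triggers exactly one fetch and the part is kept
afterwards.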

2 questions that are unclear to me:

1) How the initial clone of a branch should produce a branch with
limited history.
2) To recreate the working tree bzr doesn't need the entire history and
the entire knit of each file. So the files' knits should also be split
in a similar way (by date ranges). But because files in the working tree
change at random dates, each file needs its own meta-index.

This approach adds one more file per knit storage, but could help
to save room on the local disk and bandwidth for operations with remote
servers.

- --
Alexander
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFuqpVzYr338mxwCURAiCnAJ9dNJGcvH+IjO9AoZ/glM9jYQV9oQCcCyJ8
KXDbYF6iUKaBfOHi3WHM2oQ=
=1oXQ
-----END PGP SIGNATURE-----