Introduction to history deltas
Martin Pool
mbp at sourcefrog.net
Wed Dec 7 06:14:48 GMT 2005
On 6 Dec 2005, Aaron Bentley <aaron.bentley at utoronto.ca> wrote:
> Robert Collins wrote:
> | On Tue, 2005-12-06 at 08:58 -0500, Aaron Bentley wrote:
> |>With bzr.dev, the crossover point is 196 revisions. Downloading 197
> |>revisions with history deltas would be 197 downloads, but with per-file
> |>knits, it would be 196. (That is, if you have revno 1276, and you're
> |>updating to the current 1473.)
However, fetching from the knit files requires you to work out which
sections of the knits you want to fetch. In the previous design for
knits, that meant reading the entire index of each changed knit file,
and those indexes can get too large for that to be cheap.
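To make that concrete (the index layout below is a made-up stand-in,
not the real knit index format), slicing a knit means reading its whole
index first, and only then asking for the byte ranges you actually
want:

    def read_index(index_path):
        # Map version-id -> (offset, length); the whole index is read
        # even if only a few versions are wanted.
        entries = {}
        with open(index_path) as f:
            for line in f:
                version_id, offset, length = line.split()
                entries[version_id] = (int(offset), int(length))
        return entries

    def ranges_to_fetch(index_path, wanted_versions):
        entries = read_index(index_path)
        return [entries[v] for v in wanted_versions if v in entries]

So the per-file cost is at least one full index read, however few
versions you want out of it.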
> | And for remote access - such as
> | determining the history of a single file - wheeeoo. That really bites
> | for arch as every revision has to be downloaded.
>
> (unless you trust log headers, which of course you can't)
Perhaps you can't trust them for Arch, but there's no reason why this
format cannot have reliable descriptions of which files were modified in
which revision, and what they were modified from.
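For example (just a sketch, not a proposal for the actual format), each
revision could carry a small record naming the files it modified and
the text each was modified from; a remote "which revisions touched this
file" query then only needs those records, not the full texts:

    class RevisionRecord(object):
        def __init__(self, revision_id, modified):
            self.revision_id = revision_id
            # modified maps file-id -> the text it was modified from
            self.modified = modified

    def revisions_touching(records, file_id):
        # Answer "which revisions touched this file?" from metadata alone.
        for record in records:
            if file_id in record.modified:
                yield record.revision_id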
> | If our data is grouped
> | by revision we have to download every touched revision. if our data is
> | grouped by fileid we can pull down just one file.
On the other hand, if the data is grouped by revision we have a very
reasonable way to cache the revisions in the form in which they come
over the network, and in a way that can be shared between remote
branches. It will likely also work with http caching.
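Roughly (the URL layout here is invented for illustration): if the unit
we fetch is the unit we store, the downloaded bytes can be dropped
straight into a shared cache keyed by revision id, and an ordinary http
cache sees the same stable URLs:

    import os
    import urllib.request

    def fetch_revision(base_url, revision_id, cache_dir):
        path = os.path.join(cache_dir, revision_id)
        if os.path.exists(path):
            # Already fetched, perhaps for a different branch sharing
            # this cache.
            with open(path, 'rb') as f:
                return f.read()
        url = '%s/revisions/%s' % (base_url, revision_id)
        data = urllib.request.urlopen(url).read()
        with open(path, 'wb') as f:
            # Store exactly the bytes that came over the wire.
            f.write(data)
        return data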
> I think this depends on how data is indexed. You could certainly have a
> file index that enabled you to download only the relevant sections of
> the relevant revisions. But you must be using a latency-beating
> strategy for this to work properly. The converse applies, too-- with a
> latency-beating strategy, you can have per-revision indices that enable
> you to download only the relevant sections of the relevant files.
>
> In a sense, filesystems are one-dimensional: their only axis is file
> path. Revision control systems are two-dimensional-- they have
> "revision" and "pathname" axes, and some operations traverse the
> filename axis (e.g. export), while other operations traverse the
> revision axis (e.g. annotate).
There are four main possibilities: group all changes in a revision,
group all changes to a file, keep one change file per file per
revision, or keep one big file with all of the history. By indexing the
files you can get random-access behaviour within a file similar to that
of opening single files. However, there are some differences in
performance:
Operating on many small files is slower than getting all the data you
want from one large file [*].
Remote operations need to be expressed in terms of physical files, so
there is some latency for each one, and indexes may not help very much.
This may indicate that the indexes should be used for the operations
that are expected to be common on local trees, and the grouping into
files should be biased towards remote operations.
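As a sketch of the remote case (the URL and index shape are invented,
and it assumes the server honours Range requests): with one large file
plus an index, a fetch becomes a few range requests against a single
resource rather than one request per small file, so the per-request
latency is paid far fewer times.

    import urllib.request

    def fetch_ranges(url, ranges):
        # ranges is a list of (offset, length) pairs taken from an index.
        chunks = []
        for offset, length in ranges:
            req = urllib.request.Request(url)
            req.add_header('Range',
                           'bytes=%d-%d' % (offset, offset + length - 1))
            chunks.append(urllib.request.urlopen(req).read())
        return chunks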
[*] I did a mini benchmark of writing one line to each of N files vs
writing N lines to one file; for small N the speeds are comparable, but
for large N the many-small-files case can be a hundred times slower.
This may be because opening each file uses up a fixed amount of memory
in both the kernel and Python -- for example, we end up using 4kB of
disk cache for each line.
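The benchmark was along these lines (exact numbers will vary a lot with
the filesystem and OS; only the shape of the comparison matters):

    import os
    import time

    def many_small_files(directory, n):
        # Write one line to each of N separate files.
        start = time.time()
        for i in range(n):
            with open(os.path.join(directory, 'f%d' % i), 'w') as f:
                f.write('one line\n')
        return time.time() - start

    def one_large_file(path, n):
        # Write N lines to a single file.
        start = time.time()
        with open(path, 'w') as f:
            for i in range(n):
                f.write('one line\n')
        return time.time() - start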
--
Martin