Container repository data ordering

Aaron Bentley aaron.bentley at utoronto.ca
Fri Jul 20 23:23:27 BST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> [On Thu, 2007-07-19 at 10:35 -0400, Aaron Bentley wrote:
> 
> 
> 
>> Robert Collins wrote:
> Ok, lets consider two cases. Disk and HTTP.

> For disk in our new format, we will get the addresses to get data from
> from the texts index, (which is hopefully small). Then its up to us to
> seek past the dvd or not - because we choose the layout. A single readv
> will do what we want.

readv is implemented via seek, right?  This agrees with "you'll have to
seek or read past 10 DVD images on average".

As for whether we pick the order, I suspect we'll want a deterministic
order, so that when we read a stacked repository, the files are in the
same order in each repository.

> if we choose to consider the indices untrustworthy it will need to all
> be streamed to us, and then pick it apart locally. (Which is perhaps the
> scenario you are concerned about?)

Yes.  In my initial email, I said "Ideally, we would be able to extract
all the data we need in a single pass with no seeking"

It is not really lack of faith in indices-- it's desire to avoid
roundtrips.  Obviously there are cases where doing another roundtrip is
or isn't useful.  If the 1-byte file happened to be before any DVDs, it
might be fastest to do it in a single pass.

> I was assuming that we had an index - which is what I've been working
> towards, and am nearly there.

You were assuming seeking, but seeking sucks.  Latency, etc.  I was
looking at the design with "single pass, no seeking" in mind.

>>> My proposed format above allows both forward and backwards patching.
>> Supporting only a single direction is more simpler and saves space.  Why
>> do you want this?
> 
> Right
> now we have 'compression parent' and 'left most ancestor' conflated by
> using a flag that says 'full-text' or 'line-delta'.

Okay, I thought you were proposing a bidirectional delta format.  I have
no problem with making build-dependencies independent of ancestry.

>>> Containers with only deltas drive repository compaction because they
>>> drive disk seeks, which are expensive.
>> If delta chains are kept together with the fulltexts, then the
>> compaction you describe will increase the distance between fulltexts of
>> different files, making overall access speed slower.
> 
> Slower than what - whats the benchmark you are measuring against here?

For the purpose of checkout, slower than they would otherwise be.
Consider two containers:

Container 1
|A1|A2|A3|A4|B1|B2|B3|B4

Container 2
|A5|A6|A7|A8|B5|B6|B7|B8


Say we want to check out A1 and B1:  We have to read A1, seek past 3
records, and read B1.  That's pretty good locality of reference,
especially since A2-A4 are probably a delta chain.

Now say we combine containers 1 and 2.  Since container 2 is
deltas-only, its records immediately follow the related records in the
old container

New Container:
|A1|A2|A3|A4|A5|A6|A7|A8|B1|B2|B3|B4|B5|B6|B7|B8

Now, in order to check out A1 and B1, we read A1, seek past 7 records,
and read B1.

I think that for reasons like this, it makes sense to keep only records
which we'll want access to frequently in the "first" container, however
that's determined.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGoTXf0F+nu1YWqI0RAuf2AJwOGUIV4tZzZbstYXH3EMnthy1X9gCfUOHB
+5lBOQaVLtDte8Rps5jtkhU=
=ZZsB
-----END PGP SIGNATURE-----



More information about the bazaar mailing list