Container repository data ordering
Robert Collins
robertc at robertcollins.net
Fri Jul 20 01:02:06 BST 2007
On Thu, 2007-07-19 at 10:35 -0400, Aaron Bentley wrote:
> Robert Collins wrote:
> > On Wed, 2007-07-18 at 10:06 -0400, Aaron Bentley wrote:
> >> This sequence should work well for tree-wide operations like "merge",
> >> "revert", "status", "diff" and "checkout". It won't provide as much
> >> locality-of-reference as the current format does for single-file
> >> commands like "cat" and "annotate". I think that's inevitable.
> >
> > Hmmm, I think it should have no worse locality of reference, and
> > possibly better than the current formats, once you consider disk block
> > allocation.
>
> I don't think that makes sense. Say you have a one-byte file, and 20
> DVD images in your current revision. To cat the one-byte file, you'll
> have to seek or read past 10 DVD images on average.
Ok, let's consider two cases: disk and HTTP.
For disk, in our current format, the DVD images may or may not lie between
the inventory and the one-byte file, and may or may not be in the same
block group as the one-byte file, but cannot (I think; not entirely sure
here) be in the same block as the inventory. If the DVDs are between the
inventory and the one-byte file, we have to seek past them (a head seek,
not a file seek). If they are not between them, we don't have to wait as
long for the head seek.
For disk in our new format, we will get the addresses to read from the
texts index (which is hopefully small). Then it's up to us whether we seek
past the DVDs or not, because we choose the layout. A single readv will do
what we want.
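For concreteness, a rough sketch of that lookup-then-readv step (the
texts_index mapping and the get_record_bytes helper are invented for
illustration; the transport.readv(path, offsets) call is modelled on
bzrlib's Transport API):

    def get_record_bytes(transport, pack_name, texts_index, wanted_keys):
        # texts_index is assumed to map (file_id, revision_id) keys to
        # (offset, length) pairs within the pack file.
        offsets = [texts_index[key] for key in wanted_keys]
        # One readv fetches exactly the byte ranges we asked for, in the
        # order we choose; the DVD images in between are never read.
        data_by_offset = dict(transport.readv(pack_name, offsets))
        return [data_by_offset[offset] for offset, length in offsets]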
For direct use over HTTP we clearly pay at least the same as for disk,
plus whatever encoding and roundtrip times HTTP adds. So with the current
format we pay, at a minimum: a roundtrip to get index data, then a
roundtrip to readv the inventory, then a roundtrip to get the index for
the one-byte fileid, and finally a roundtrip to get the text and delta
data.
In the new format, before we start demand-paging the index, it will be
something like:
a roundtrip to get the texts index,
a roundtrip to readv the inventory data needed,
a roundtrip to readv the one-byte file's text & deltas.
-or-
if we choose to consider the indices untrustworthy, it will all need to be
streamed to us, and then picked apart locally. (Which is perhaps the
scenario you are concerned about?)
> >> What are our requirements?
> >> 1. Building recent versions should be fast
> >> 2. We should not hold more than one fulltext in memory at a time
> >
> > This is for tree-building?
>
> No, this is to move in the direction of "stop requiring full memory
> copies of files" from your "planned-performance-changes" document. If
> the text-extraction routines can't guarantee they're not holding more
> than one fulltext in memory, there's no way we can achieve that.
>
> To actually achieve "stop requiring full memory copies", the fulltext
> would need to come after the deltas, whatever the build order, so that
> the deltas could be applied to the fulltext while streaming it in.
So for regular texts it's largely a matter of 'shrug'. For exceptional
cases perhaps we want a tape-sort approach on disk, and just kick that in
based on text size: more than 1MB of data, go to disk (see the sketch
below).
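That threshold logic might look something like this (a minimal sketch; the
names are illustrative, not a proposed bzrlib API, and the 1MB figure is
just the one suggested above):

    import tempfile
    from io import BytesIO

    SPILL_THRESHOLD = 1024 * 1024  # 1MB, the figure suggested above

    def text_buffer(expected_size):
        """Return a file-like object to accumulate a reconstructed text.

        Small texts stay in memory; anything over the threshold is spilled
        to a temporary file so we never hold a huge fulltext in RAM.
        """
        if expected_size > SPILL_THRESHOLD:
            return tempfile.TemporaryFile()
        return BytesIO()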
> > For merge I presume we want to get 2-3
> > fulltexts out at once?
>
> We may. If we're serious about not holding fulltexts in memory, perhaps
> we would dump them to disk. I don't know. It seems like a tough goal
> to achieve. And versioning .isos doesn't seem like it should be a big
> concern of ours. I would sacrifice better .iso handling for better
> kernel-tree handling in a heartbeat.
I agree that it's tough, and in some areas we would have to pay a big
tradeoff to reduce memory usage now. I think performance should come
first; and as for ISOs, as long as they work, and we *head in the
direction* of having bzr not thrash when checking them in or out, then I'm
entirely happy.
> > I think requirement 2 above could be explained a bit more; it doesn't
> > seem obvious to me that that is a requirement that is a problem.
>
> I'm just trying to comply with the roadmap.
Thank you. I guess I didn't tie the two together. The roadmap talks about
trying to avoid needing even a single fulltext; aiming for just one is
nearly as hard as needing none, but doesn't solve the 'versioning stupidly
large files' problem.
But let's go with one; it's a good figure to have in mind.
> >> However, the desire to avoid holding more than one fulltext in memory
> >> means that once we have read a fulltext, we cannot have another fulltext
> >> until all the relevant patches have been read. So if the fulltext comes
> >> first, this suggests that the relevant patches must come immediately
> >> afterward.
> >
> > Agreed.
>
> I think there's an error in my logic, though. Even if we
> backwards-patch, we are not obligated to put the fulltext before the
> delta chain. If we put it after the delta chain, then we can read all
> relevant deltas into memory before we hit the fulltext.
>
> If we have the relevant deltas in memory before we encounter the
> fulltext, we can apply the delta to the fulltext in a streaming fashion.
> So this layer at least could meet the goal of "stop requiring full
> memory copies".
Cool.
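To spell that out: with the delta hunks already in memory, applying them
to a fulltext that is still streaming in is straightforward. A minimal
sketch (the hunk representation here is invented for illustration, not the
knit delta format):

    def apply_line_delta_streaming(fulltext_lines, hunks):
        """Yield patched lines without holding the whole fulltext.

        fulltext_lines is an iterator over the basis fulltext's lines, and
        hunks is an in-memory list of (start, end, new_lines) tuples sorted
        by position, meaning: replace basis lines [start, end) with
        new_lines.
        """
        pos = 0
        for start, end, new_lines in hunks:
            # Pass through unchanged lines up to the start of this hunk.
            while pos < start:
                yield next(fulltext_lines)
                pos += 1
            # Discard the replaced region of the basis text...
            while pos < end:
                next(fulltext_lines)
                pos += 1
            # ...and emit the replacement lines from the delta.
            for line in new_lines:
                yield line
        # Pass through everything after the last hunk.
        for line in fulltext_lines:
            yield line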
> >> This means that while reading recent revisions will have low CPU
> >> requirements, it will have higher IO requirements, because it can only
> >> stop reading the container once it has read (or seeked past) all the
> >> fulltexts it needs, and (almost) all the deltas which are based on them.
> >
> > I don't see how this increased IO load follows. Perhaps an example will
> > help?
> > given revisions A:[], B:[A], C:[B], D:[A], E:[D,C]
> > In my head we have something like:
> > record | Fileid | revision | compression-basis-list | DATA |
> > 0 | X | E | [] | ... |
> > 1 | X | D | [E] | ... |
> > 2 | X | C | [E] | ... |
> > 3 | X | B | [C] | ... |
> > 4 | X | A | [D] | ... |
> > 5 | Y | E | [] | ... |
> > 6 | Y | D | [E] | ... |
> > ...
> >
> > That is, to build the tree E, you read record 0 and record 5 to get the
> > texts of file X and Y.
> >
> > In terms of disk cache and head activity this is paging in 2 4K pages
> > (or maybe only one 4K page).
>
> I don't understand how you're getting any information about size here.
> How do you know that record 3 is not 300GB long?
I was assuming that we had an index - which is what I've been working
towards, and am nearly there.
> > To build the tree D, you read records 0 and 1, then 5 and 6, and so on.
> >
> > On average, to access an arbitrary text in a file, whether you go
> > new-old or old-new you page in 1 fulltext and 1/2 the delta chain length
> > worth of patches.
>
> Yes, but in your example, you have to read or seek past record 4 in
> order to start reading records for Y. If you're not interested in X,
> then this is wasted effort.
Seeking 300GB forward within a file is roughly equivalent to opening the
next file on disk, if the current file is 300GB long. So it's not wasted
any more than closing one file and opening the other would be.
> There are many cases where we are interested in only a subset of files.
> But we still have to seek to get at the data we want, or for http, we
> may simply have to read until we get it.
>
> The issue is that http doesn't let us seek in mid-stream. So once we've
> started reading the container, we can't seek. But we don't know what
> records we want until we've processed inventory. So we don't know what
> records we want to skip until we've already started reading the container.
Right. I think we will end up working in readv's of the container: one
readv for the inventory data, then another readv for all the deltas and
fulltexts we want.
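Spelled out, that two-pass pattern would look something like this (again a
sketch: parse_inventory, revision_of, and the index mappings are
hypothetical stand-ins, and transport.readv is modelled on bzrlib's
Transport API):

    def read_tree_texts(transport, pack_name, inv_index, texts_index,
                        revision_id, wanted_file_ids):
        # Pass 1: one readv for the inventory data we need.
        inv_bytes = [data for offset, data in
                     transport.readv(pack_name, [inv_index[revision_id]])]
        inventory = parse_inventory(inv_bytes)  # hypothetical helper
        # Decide which text records (deltas + fulltexts) we actually want.
        keys = [(file_id, inventory.revision_of(file_id))  # hypothetical
                for file_id in wanted_file_ids]
        # Pass 2: one readv for all of those records at once.
        text_offsets = [texts_index[key] for key in keys]
        return list(transport.readv(pack_name, text_offsets))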
> >> Reducing the size of the delta chain will reduce the amount of data
> >> that needs to be read for recent revisions, at the cost of increasing
> >> the amount of data that needs to be read for older revisions.
> >
> > Reducing the delta chain will *increase* the amount of data to be read
> > to reconstruct recent revisions, AIUI.
>
> It will be reduced for those revisions contained in the reduced delta
> chain, because we'll reach the end of the delta chain sooner and can
> start more productively reading the next set of file data.
That is true in the case where we don't have a seek() method, too. When we
seek, the skipped data never gets pulled off disk (for local access), and
remotely we should be able to readv.
> >> By contrast, a forwards-patching format would allow deltas to be written
> >> for new versions, minimizing the number of fulltexts written.
> >
> > My proposed format above allows both forward and backwards patching.
>
> Supporting only a single direction is simpler and saves space. Why
> do you want this?
To give us the opportunity to optimise repositories statically. Right
now we have 'compression parent' and 'left most ancestor' conflated by
using a flag that says 'full-text' or 'line-delta'.
If we decouple these, and say:
- history parents (revision id list)
- compression parents (revision id list)
- compressor type (0 = knit line delta, 1 = not yet defined)
then we can:
- compress forwards or backwards as makes the most sense at the time:
  during commit we probably want to make a forward delta or no delta;
  during pack we might want to make a backward delta to get the best
  space saving.
- use mpdiffs (define compressor type 1)
- have someone experiment with other compressors while other folk work
  on different parts of the format
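To make the decoupling concrete, an index record along those lines might
look roughly like this (a sketch only; the field names, and the extra
offset and length fields, are mine rather than a proposed format):

    from collections import namedtuple

    # One entry per (file_id, revision_id) key in the texts index.  History
    # parents and compression parents are recorded separately, so the
    # compression direction can change (e.g. after a repack) without
    # touching the ancestry data.
    IndexRecord = namedtuple('IndexRecord', [
        'history_parents',      # revision id list (ancestry)
        'compression_parents',  # revision id list (delta basis; [] = fulltext)
        'compressor_type',      # 0 = knit line delta, 1 = not yet defined
        'offset',               # byte offset of the record in the container
        'length',               # byte length of the record
    ])

    # A text committed as a forward delta against its history parent P:
    #   IndexRecord(['P'], ['P'], 0, 4096, 517)
    # The same text after a repack, stored as a backward delta against a
    # newer revision Q, with its ancestry unchanged:
    #   IndexRecord(['P'], ['Q'], 0, 8192, 402)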
> >> Some
> >> containers could contain only deltas, therefore reducing the need for
> >> repository compaction.
> >
> > Containers with only deltas drive repository compaction because they
> > drive disk seeks, which are expensive.
>
> If delta chains are kept together with the fulltexts, then the
> compaction you describe will increase the distance between fulltexts of
> different files, making overall access speed slower.
Slower than what? What's the benchmark you are measuring against here?
-Rob
--
GPG key available at: <http://www.robertcollins.net/keys.txt>.