[RFC] Bundles as repositories

Tue Jun 19 03:52:40 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Fri, 2007-06-15 at 03:20 -0400, Aaron Bentley wrote:

>> This is a good description of what I'm working on now, but I think that
>> checking preludes should always be done.
> 
> I think that checking depends on the context.
> 
> Places to check the prelude are when we can reasonably expect that it's
> being used as 'previewed data':
> 'bzr patch BUNDLE' or 'cat BUNDLE | bzr patch'
> 
> When it's being used the way any foreign branch is used, then we should
> not need to check the prelude:
> 'bzr pull BUNDLE'
> 'bzr missing BUNDLE'
> 'bzr merge BUNDLE'
> 'bzr diff BUNDLE'

I think that anything that has a prelude should have its prelude
checked.  If we don't want to do that, we shouldn't include preludes at all.

> Checking of preludes is conceptually hard simply because of line
> endings: email transmission will munge line endings and thus prelude
> checking cannot be binary based; it has to do whitespace tolerant diffs
> and other such complications.

I think it's conceptually quite simple.  It's like doing
case-insensitive comparisons by turning both inputs into lowercase.

Do the absolute maximum damage that can be done to it via whitespace
munging, then record the sha1 sum of the munged prelude.

Apply the same damage on the other side, and compare the resulting sha1.
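
A minimal sketch of that scheme, assuming "maximum damage" is modelled as
removing every whitespace character. The names munge_whitespace and
prelude_sha1 are hypothetical and are not part of bzrlib:

```python
import hashlib


def munge_whitespace(text):
    """Remove all whitespace, the worst-case damage assumed in this sketch."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, CR, LF), so joining the pieces drops all of it.
    return "".join(text.split())


def prelude_sha1(text):
    """Return the sha1 hex digest of the munged prelude."""
    return hashlib.sha1(munge_whitespace(text).encode("utf-8")).hexdigest()


# The sender records prelude_sha1(prelude) in the bundle; the receiver
# recomputes it from the prelude as received and compares the two digests.
sent = "+ foo = 1\n+ bar = 2\n"
received = "+ foo = 1\r\n+ bar = 2\r\n"  # line endings munged by email
tampered = "+ foo = 1\r\n+ bar = 3\r\n"
print(prelude_sha1(sent) == prelude_sha1(received))  # whitespace-only change
print(prelude_sha1(sent) == prelude_sha1(tampered))  # content change
```

Like lowercasing before a case-insensitive comparison, this deliberately
ignores whitespace-only differences, so two preludes that differ only in
spacing or line endings hash the same.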

> If the patch content is going to be shown
> in another fashion (for instance pull can generate a patch as it goes to
> show - I think we've had this requested as a feature) then checking the
> prelude is duplicate effort.

No, it's not, because even if people have that feature turned on, they
may not pay close attention to the diff that pull produces.

> The first thought that comes to mind is that the data section of the
> bundle should always be binary only; that is the data shouldn't change
> if you have or don't have a prelude (this is why I've been calling it a
> prelude - it comes before :)). This would make checking preludes
> something that cannot be disabled by toggling a flag in the content - it
> will always happen according to whatever policy we have agreed on/the
> user has set.

I propose we have two variants of the format.

One variant is for human consumption, has a prelude and a base64 wrapper
on the data, and has its prelude checked by default.

The other is not for human consumption, has no prelude, and is not
base64 wrapped.  That way, it's hard to mistake one format for the other.

The data, whether base64 wrapped or not, does not change.
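
To illustrate that last point, here is a rough sketch in which the
human-readable variant is a prelude plus a base64 wrapper around the same
binary data. The separator line and the function names are assumptions
made up for this example; they are not the actual bundle format:

```python
import base64

# Hypothetical marker between the prelude and the wrapped data.
SEPARATOR = "\n# begin data\n"


def wrap_for_humans(prelude, data):
    """Build the human-readable variant: prelude, then base64-wrapped data."""
    encoded = base64.encodebytes(data).decode("ascii")
    return prelude + SEPARATOR + encoded


def unwrap(text):
    """Recover the binary data from the human-readable variant."""
    _prelude, _sep, encoded = text.partition(SEPARATOR)
    return base64.decodebytes(encoded.encode("ascii"))


data = bytes(range(256))  # stand-in for the binary bundle data
wrapped = wrap_for_humans("=== modified file 'foo.py'", data)
print(unwrap(wrapped) == data)
```

The unwrapped bytes are identical to what the non-human variant would
carry directly, so only the container differs between the two.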

>>> AIUI we want bundles to have the following properties:
>>>  - compact representation
>>>  - able to be used without their contained data being added to
>>> repositories
>> ^^^ This was not one of my goals.
> 
> Do you object to it being a goal?

If we're talking about the 1.0alpha format I've been working on, yes.
Doing that before merging it would probably mean missing the release
window for 0.18.

As another format, or future work, that would be fine.  However, there
is tension between the desire for compactness and the desire to use them
in-place, because extraction speed with no snapshots would be glacial.

>>>  - fast to create
>>>  - fast to extract data from
>> I'm trying to accomplish fast installation, (e.g. of knit records), not
>> fast extraction of fulltexts.  And I'm specifically choosing size over
>> speed, because of how bundles are usually used.
> 
> I think these goals are aligned; fast installation if you do not ship
> ready-to-use repository data (e.g. knit gz hunks) implies creating a
> fulltext and doing a regular knit insert as quickly as possible. Unless
> I've missed something:).

Single-parent MPDiffs ought to be easy to convert into knit deltas
without extracting any fulltexts.  You'll pay the cost of gzipping, but
not of file comparison.

And heck, I haven't ruled out bundling knit hunks either.

> I'm also thinking that it would be nice if everything for a merge
> directive could live in-branch ready to be used.

I'm confused.  As opposed to living in the repository?

> I think we can do that based on the percentage of the file we've read
> both more accurately than a number of items count; and without forcing a
> pre-calculation step to bundle creation.

I don't know that it would be more accurate.  It's not uncommon for a
bundle prelude to comprise 75% or more of the bundle file.  The actual
data will be more expensive to read, I assume.
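
A toy calculation of the concern, with made-up numbers: the sizes and the
relative cost of reading data versus prelude are purely hypothetical and
are not measurements.

```python
def progress_after_prelude(prelude_bytes, data_bytes, data_cost_per_byte):
    """Return (byte_fraction, work_fraction) once the prelude has been read.

    Reading one prelude byte is taken to cost 1 unit of work; reading one
    data byte costs data_cost_per_byte units (an assumed figure).
    """
    total_bytes = prelude_bytes + data_bytes
    byte_fraction = prelude_bytes / total_bytes
    total_work = prelude_bytes + data_bytes * data_cost_per_byte
    work_fraction = prelude_bytes / total_work
    return byte_fraction, work_fraction


# Prelude is 75% of the file; data assumed 10x as costly per byte.
by_bytes, by_work = progress_after_prelude(750, 250, 10)
print(by_bytes, by_work)
```

Under these assumptions a bytes-read indicator would show 75% complete
when well under a quarter of the work has been done.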

> Generally speaking I think we
> want to move away from percentage indicators and towards 'amount of work
> done' reporting; where we happen to have more information let's display
> it, but let's not do excess work [unless it's key to that part of the UI].

It's not important now, but I'd like to get a better idea of what you
mean later.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGd0T40F+nu1YWqI0RAjQGAJ0bk45mnBIPHS+69i0jfEUMj4oJuwCfcUyT
Ht83ceKXMFfIyNguO4yohrw=
=uLMd
-----END PGP SIGNATURE-----