[RFC][2.0] large file support design

Robert Collins robert.collins at canonical.com
Fri Jun 5 05:34:54 BST 2009

This is a proposed change for 2.0.

I think I have enough angles covered that this can be done.

This design aims to meet the following criteria:
- improve our behaviour without locking us into long term support for
the particular mechanism this approach uses
- incremental deployment. We can start with just the very core using it
and extend awareness deeper through the code base later.
 + This allows us to get the core in and polish in point releases.
- policy decoupled from mechanism - it is optional to store any given
text in fragmented form
 + this means we can tune the rules for when to fragment a stored object
- does not alter inventory contents
 + this means that if this design is wrong we won't invalidate nested
sha1 sums and the like.
- intended for gc formats and up only.
- robust against existing content that may look like metadata


Allow any stored bytestring in a VersionedFiles object to be fragmented
in the CHK VF. This requires adjusting some APIs and storing extra
metadata in every record in our VF stores. By fragmenting in the lowest
level store we can get some early wins, and we don't need a long test
cycle to be sure it's a good answer, as we would with a higher level
semantic approach.

Fragment page

This is a serialised list of fragments, in CHK key format, that should
be combined to make up a larger document. 
SIG ::= "chkfragment:" CR
START ::= uint64_t
LENGTH ::= uint32_t
CHKREF ::= sha1:hexdigest


Currently some objects - those in the chk VF - have a per-object header
with the following characteristics:
 - it's part of the string stored and hashed by the VF
 - the VF is ignorant of the header
 - repositories know about the header as part of the inventory storage.

To allow substituting large texts with many smaller ones
semi-transparently we need the VF layer to be able to do the
substitution. Accordingly we will add:
 - a one-byte 'kind' to be stored with every text. This could in
principle be considered part of the bytes, but that would alter the sha1
and make validation harder, as well as being a layering issue given the
goal of having it be transparent.
 - we define two initial kinds:
   0x00: bytes
   0x01: Fragment page
 - the sha1 of a text needs two variants: logical and raw.
   - for kind 0x00 it is the sha1 of the bytes following the kind (so
     that current inventories are not destabilised)
   - for all other kinds it is the kind and the following bytes (
     so different sorts of lists with the same formatting are given
     different sha1s)
   - the logical sha1 of a fragment page is the sha1 obtained by
     hashing the bytes produced by expanding the fragments the page
     points at. (NB: fragments can point at more fragments, forming a
     tree.)
 - content collisions (storing a bytesequence identical to a fragment
   page with 0x01 prepended) are possible. If a user creates a file
   on disk with this content it will either get replaced with an
   existing fragment page, or will prevent such a fragment page being
   stored. We can, if we care to, catch such texts when they are being
   introduced, by checking whether the text starts with 0x01 followed
   by the known fragment-page signature.
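The two sha1 variants above can be sketched as follows. The helper names
`stored_sha1` and `logical_sha1` are hypothetical, and `expand` stands
in for whatever machinery joins the fragments a page (transitively)
points at:

```python
import hashlib

KIND_BYTES = 0x00          # plain bytes
KIND_FRAGMENT_PAGE = 0x01  # fragment page

def stored_sha1(kind, payload):
    # Raw sha1: for kind 0x00 hash only the payload, so sha1s of
    # existing plain texts (and hence inventories) are unchanged; for
    # all other kinds include the kind byte, so different sorts of
    # lists with identical formatting get different sha1s.
    if kind == KIND_BYTES:
        return hashlib.sha1(payload).hexdigest()
    return hashlib.sha1(bytes([kind]) + payload).hexdigest()

def logical_sha1(kind, payload, expand=None):
    # Logical sha1: for a fragment page, hash the fully expanded
    # content rather than the page itself; `expand` is a hypothetical
    # callback mapping page bytes to the reassembled text.
    if kind == KIND_FRAGMENT_PAGE:
        return hashlib.sha1(expand(payload)).hexdigest()
    return stored_sha1(kind, payload)
```

Note how a kind-0x00 text and a kind-0x01 page with the same payload
hash differently at the raw level, which is exactly the property the
collision discussion above relies on.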


Current VF APIs are focused on pure bytes-in, bytes-out storage. This is
great for layering and allowing optimisation. The existing APIs will
transparently expand content that has been fragmented.

insert_record_stream could choose to fragment every (say) 2MB.
get_record_stream will combine fragments into one larger chunked text.

A small API change - adding an 'expand_fragments=False' parameter to
get_record_stream - would allow parts of bzrlib that are aware of
fragmentation to avoid gathering all the fragments up front. For
instance, diff could work one fragment at a time.
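A minimal sketch of the split-on-insert / join-on-read behaviour
described above. `fragment` and `expand` are illustrative stand-ins for
what insert_record_stream and get_record_stream would do internally, not
actual bzrlib code:

```python
FRAGMENT_SIZE = 2 * 1024 * 1024  # e.g. fragment every 2MB on insert

def fragment(text, size=FRAGMENT_SIZE):
    # Split a large text into (start, length, chunk) pieces; the
    # (start, length) pairs are what a fragment page would record.
    pieces = []
    offset = 0
    while offset < len(text):
        chunk = text[offset:offset + size]
        pieces.append((offset, len(chunk), chunk))
        offset += len(chunk)
    return pieces

def expand(pieces):
    # What reading with expand_fragments=True amounts to: join the
    # fragments back into the logical text, in offset order.
    return b"".join(chunk for _, _, chunk in sorted(pieces))
```

With expand_fragments=False a caller such as diff would consume the
pieces one at a time instead of calling the join.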


The gc_optimal sort order for compression should take fragmentation into
consideration: something like putting all the first fragments, then all
the second fragments, etc., for a given file id together.
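One possible sort key realising that ordering. The record shape
(file_id, revision_id, fragment_index) is assumed purely for
illustration:

```python
def gc_optimal_key(record):
    # Hypothetical record shape: (file_id, revision_id, fragment_index).
    file_id, revision_id, fragment_index = record
    # Group by file, then by fragment position across revisions, so all
    # first fragments of a file's versions sit next to each other for
    # the groupcompress delta chain, then all second fragments, etc.
    return (file_id, fragment_index, revision_id)
```

Sorting with this key interleaves fragments of successive versions of
the same file, keeping similar data adjacent for compression.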

Alternatives and questions

- should fragment pages cache the sha1 of the bytes that they eventually
resolve to?
- If we drop the requirement not to destabilise existing inventories, we
could prevent user files from ever colliding with fragment pages or
other such kinds by always including the kind in the sha1.
However, that would mean the sha1 of combined fragments would need
special handling (include the first kind, but not the later ones). I
think the current compromise is reasonable.