Rev 2508: (robertc) Create the top level changes-list from the London sprint for reference. in http://people.ubuntu.com/~robertc/baz2.0/integration

Wed Jun 6 02:07:25 BST 2007

At http://people.ubuntu.com/~robertc/baz2.0/integration

------------------------------------------------------------
revno: 2508
revision-id: robertc at robertcollins.net-20070606010722-21uhn0868sm5m2he
parent: pqm at pqm.ubuntu.com-20070605164810-ay1hxyvqofffy0me
parent: robertc at robertcollins.net-20070606010158-aqtdldnzo5bj5z74
committer: Robert Collins <robertc at robertcollins.net>
branch nick: integration
timestamp: Wed 2007-06-06 11:07:22 +1000
message:
  (robertc) Create the top level changes-list from the London sprint for reference.
added:
  doc/developers/planned-performance-changes.txt plannedperformancech-20070604053752-bnjdhako613xfufb-1
modified:
  doc/developers/performance-roadmap.txt performanceroadmap.t-20070507174912-mwv3xv517cs4sisd-2
    ------------------------------------------------------------
    revno: 2485.4.10
    revision-id: robertc at robertcollins.net-20070606010158-aqtdldnzo5bj5z74
    parent: robertc at robertcollins.net-20070604053802-cx9toxyasgfip83n
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: roadmap
    timestamp: Wed 2007-06-06 11:01:58 +1000
    message:
      Review feedback.
    ------------------------------------------------------------
    revno: 2485.4.9
    revision-id: robertc at robertcollins.net-20070604053802-cx9toxyasgfip83n
    parent: robertc at robertcollins.net-20070604040720-c5ti0k49w0ye8zcl
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: roadmap
    timestamp: Mon 2007-06-04 15:38:02 +1000
    message:
      Create the top level changes-list from the London sprint for reference.
=== added file 'doc/developers/planned-performance-changes.txt'

--- a/doc/developers/planned-performance-changes.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/planned-performance-changes.txt	2007-06-06 01:01:58 +0000
@@ -0,0 +1,174 @@
+Planned changes to the bzr core
+-------------------------------
+
+Delivering the best possible performance requires changing the bzr core design
+from that present in 0.16. Some of these changes are incremental and can be
+done with no impact on disk format. Many of them however do require changes to
+the disk format, and these can be broken into two sets of changes, those which
+are sufficiently close to the model bzr uses today to interoperate with the
+0.16 disk formats, and those that are not able to interoperate with the 0.16
+disk formats - specifically some planned changes may result in data which
+cannot be exported to bzr 0.16's disk formats and then imported back to the new
+format without losing critical information. If/when this takes place it will be
+essentially a migration for users to switch from their bzr 0.16 repository to a
+bzr that supports them. We plan to batch all such changes into one large
+'experimental' repository format, which will be complete stable and usable
+before we migrate it to become a supported format. Getting new versions of bzr
+in widespread use at that time will be very important, otherwise the user base
+may be split in two - users that have upgraded and users that have not.
+
+The following changes are grouped according to their compatability impact:
+library only, disk format but interoperable, disk format interoperability
+unknown, and disk format, not interoperable.
+
+Library changes
+===============
+
+These changes will change bzrlib's API but will not affect the disk format and
+thus do not pose a significant migration issue.
+
+ * For our 20 core use cases, we plan to add targeted API's to bzrlib that are
+   repository-representation agnostic. These will instead reflect the shape of
+   data access most optimal for that case.
+
+ * Deprecate 'versioned files' as a library concept. Instead of asking for
+   information about a file-over-time as a special case, we will move to an API
+   that assumes less coupling between the historical information and the
+   ability to obtain texts/deltas etc. Specifically, we need to remove all
+   API's that act in terms of on disk representation except those within a
+   given repository implementation.
+
+ * Create a validator for revisions that is more amenable to use by other parts
+   of the code base than just the gpg signing facility. This can be done today
+   without changing disk, possibly with a performance hit until the disk
+   formats match the validatory logic. It will be hard to tell if we have the
+   right routine for that until all the disk changes are complete, so while
+   this is a library only change, its likely one that will be delayed to near
+   the end of the process.
+
+ * Add an explicit API for managing cached annotations. While annotations are
+   considered a cache this is not exposed in such a way that cache operations
+   like 'drop the cache' can be performed. On current disk formats the cache is
+   mandatory, but an API to manage would allow refreshing of the cache (e.g.
+   after ghosts are filled in in baz conversions).
+
+ * Use the _iter_changes API to perform merges. This is a small change that may
+   remove the need to use inventories in merge, making a dramatic difference to
+   merge performance once the tree shape comparison optimisations are
+   implemented.
+
+ * Create a network-efficient revision graph API. This is the logic at the
+   start of push and pull operations, which currently scales O(graph size).
+   Fixing the scaling can be done, but there are tradeoffs to latency and
+   performance to consider, making it a little tricky to get right.
+
+ * Working tree disk operation ordering. We plan to change the order in which
+   some operations are done (specifically TreeTransform ones) to improve
+   performance. There is already a 66% performance boost in that area going
+   through review.
+
+ * Stop requiring full memory copies of files. Currently bzr requires that it
+   can hold 3 copies of any file its versioning in memory. Solving this is
+   tricky, particularly without performance regressions on small files, but
+   without solving it versioning of .iso and other large objects will continue
+   to be extremely painful.
+
+ * Add an API for per-file graph access that alllows incremental access and is
+   suitable for on-demand generation if desired.
+
+ * Repository stacking API. Allowing multiple databases to be stacked to give a
+   single 'repository' will allow implementation of some long desired features
+   like history horizons, and bundle usage where the bundle is not added to the
+   local repository just to examine its contents.
+
+ * Revision data manipulation API. We need a single streaming API for adding
+   data to or getting it from a repository. This will need to allow hints such
+   as 'optimise for size', or 'optimise for fast-addition' to meet the various
+   users planned, but it is a core part of the library today, and its not
+   sufficiently clean to let us simplify/remove a lot of related code today.
+
+Interoperable disk changes
+==========================
+
+ * New container format to allow single-file description of multiple named
+   objects. This will provide the basis for transmission of revisions over the
+   network, the new bundle format, and possibly a new repository format as
+   well.
+
+ * Separate the annotation cache from the storage of actual file texts and make
+   the annotation style, and when to do it, configurable. This will reduce data
+   sent over the wire when repositories have had 'needs-annotations' turned
+   off, which very large trees may choose to do - generating just-in-time
+   annotations may be desirable for those trees (even when performing
+   annotation based merges).
+
+ * Repository disk operation ordering. The order that tasks access data within
+   the repository and the layout of the data should be harmonised. This will
+   require disk format changes but does not inherently alter the model, so its
+   straight forward to export from a repository that has been optimised in this
+   way to a 0.16 based repository.
+
+ * Inventory representation. An inventory is a logical description of the shape
+   of a version controlled tree. Currently we operate on the whole inventory as
+   a tree broken down per directory, but we store it as a flat file. This scale
+   very poorly as even a minor change between inventories requires us to scan
+   the entire file, and in large trees this is many megabytes of data to
+   consider. We are investigating the exact form, but the intent is to change
+   the serialisation of inventories so that comparing two inventories can be
+   done in some smaller time - e.g. O(log N) scaling. Whatever form this takes,
+   a repository that can export it directly will be able to perform operations
+   between two historical trees much more efficiently than the current
+   repositories.
+
+ * Delta storage optimisation. We plan to change the delta storage logic to use
+   a binary delta like xdelta rather than using line based deltas from python.
+   These binary deltas could be done along ancestry ordering, or other
+   arbitrary patterns chosen for their intended use. Line based deltas will
+   still be created for cached annotations. This is still under some discussion.
+   http://bazaar-vcs.org/PerformanceRoadmap/Xdelta
+
+ * Greatest distance from origin cache. This is a possible change to introduce,
+   but it may be unnecessary - listed here for completeness till it has been
+   established as [un]needed.
+
+Possibly non-interoperable disk changes
+=======================================
+
+ * Removing of derivable data from the core of bzr. Much of the data that bzr
+   stores is derivable from the users source files. For instance the
+   annotations that record who introduced a line. Given the full history for a
+   repository we can recreate that at any time. We want to remove the
+   dependence of the core of bzr on any data that is derivable, because doing
+   this will give us the freedom to:
+
+   * Improve the derivation algorithm over time.
+   * Deal with bugs in the derivation algorithms without having 'corrupt
+     repositories' or such things.
+
+   However, some of the data that is technically derived, like the per-file
+   merge graph, is both considered core, and can be generated differently when
+   certain circumstances arive, by bzr 0.16. Any change to the 'core' status of
+   that data will discard data that cannot be recreated and thus lead to the
+   inability to export from a format where that is derived data to bzr 0.16's
+   formats without errors occuring in those circumstances. Some of the data
+   that may be considered for this includes:
+
+   * Per file merge graphs
+   * Annotations
+
+Non-interoperable disk changes
+==============================
+
+ * Drop the per-file merge graph 'cache' currently held in the FILE-ID.kndx
+   files. A specific case of removing derivable data, this may allow smaller
+   inventory metadata and also make it easier to allow two different trees (in
+   terms of last-change made, e.g. if one is a working tree) to be compared
+   using a hash-tree style approach.
+
+ * Use hash based names for some objects in the bzr database. Because it would force
+   total-knowledge-of-history on the graph revision objects will not be namable
+   via hash's and neither will revisio signatures. Other than that though we
+   can in principle use hash's e.g. SHA1 for everything else. There are many
+   unanswered questions about hash based naming related to locality of
+   reference impacts, which need to be answered before this becomes a definite
+   item.

=== modified file 'doc/developers/performance-roadmap.txt'
--- a/doc/developers/performance-roadmap.txt	2007-06-04 00:51:54 +0000
+++ b/doc/developers/performance-roadmap.txt	2007-06-04 05:38:02 +0000
@@ -7,6 +7,8 @@
 
 .. include:: performance-roadmap-rationale.txt
 
+.. include:: planned-performance-changes.txt
+
 Analysis of use cases
 #####################