Rev 2539: (robertc) Propose integration ordering for the performance changes. in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Wed Jun 20 04:37:28 BST 2007

At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 2539
revision-id: pqm at pqm.ubuntu.com-20070620033726-baiap8oniaidhdf1
parent: pqm at pqm.ubuntu.com-20070620002213-fvt1s1yu2iujulio
parent: robertc at robertcollins.net-20070620030958-6ou886tyo5zpc3u4
committer: Canonical.com Patch Queue Manager<pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Wed 2007-06-20 04:37:26 +0100
message:
  (robertc) Propose integration ordering for the performance changes.
added:
  doc/developers/planned-change-integration.txt plannedchangeintegra-20070619004702-i1b3ccamjtfaoq6w-1
modified:
  .bzrignore                     bzrignore-20050311232317-81f7b71efa2db11a
  Makefile                       Makefile-20050805140406-d96e3498bb61c5bb
  doc/developers/performance-roadmap.txt performanceroadmap.t-20070507174912-mwv3xv517cs4sisd-2
  doc/developers/performance.dot performance.dot-20070527173558-rqaqxn1al7vzgcto-3
  doc/developers/planned-performance-changes.txt plannedperformancech-20070604053752-bnjdhako613xfufb-1
    ------------------------------------------------------------
    revno: 2522.3.3
    merged: robertc at robertcollins.net-20070620030958-6ou886tyo5zpc3u4
    parent: robertc at robertcollins.net-20070619005518-r2n8pmtgnf9lq9yo
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: integration
    timestamp: Wed 2007-06-20 13:09:58 +1000
    message:
      Handle dot not being installed
    ------------------------------------------------------------
    revno: 2522.3.2
    merged: robertc at robertcollins.net-20070619005518-r2n8pmtgnf9lq9yo
    parent: robertc at robertcollins.net-20070619004822-wsop5g2arwu1lti4
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: roadmap
    timestamp: Tue 2007-06-19 10:55:18 +1000
    message:
      Make gdfo cache under discussion in the graph
    ------------------------------------------------------------
    revno: 2522.3.1
    merged: robertc at robertcollins.net-20070619004822-wsop5g2arwu1lti4
    parent: pqm at pqm.ubuntu.com-20070612021742-uetsy3g747iq3xkk
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: roadmap
    timestamp: Tue 2007-06-19 10:48:22 +1000
    message:
      Draft proposed integration order for performance changes.
=== added file 'doc/developers/planned-change-integration.txt'

--- a/doc/developers/planned-change-integration.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/planned-change-integration.txt	2007-06-19 00:48:22 +0000
@@ -0,0 +1,141 @@
+Integration of performance changes
+==================================
+
+To deliver a version of bzr with all our planned changes will require
+significant integration work. Minimally each change needs to integrate with
+some aspect of the bzr version it's merged into, but in reality many of these
+changes while conceptually independent will in fact have to integrate with the
+other changes we have planned before we can have a completed system.
+
+Additionally changes that alter disk formats are inherently more tricky to
+integrate because we will often need to alter apis throughout the code base to
+expose the increased or reduced model of the preferred disk format.
+
+The dot file performance.dot graphs out the dependencies to let us make
+accurate assessments of the changes needed in terms of code and API, hopefully
+minimising the number of different integration steps we have to take, while
+giving us a broad surface area for development. It's based on a summary in the
+next section of this document of the planned changes with their expected
+collaborators and dependencies. Where a command is listed, the expectation is
+that all uses of that command - local, remote, dumb transport and smart
+transport are being addressed together.
+
+
+The following provides a summary of the planned changes and their expected
+collaborators within the code base, along with an estimate of whether they are
+likely to require changes to their collaborators to be considered 'finished'.
+
+ * Use case target APIs: Each of these is likely to alter the Tree interface.
+   Some few of them focus on Branch and will alter Branch and Repository
+   accordingly. As they are targeted APIs we can make deep changes all the way
+   down the stack to the underlying representation to make it all fit well.
+   Presenting a top level API for many things will be possible now as long as
+   the exposed data is audited for things we plan to make optional, or remove:
+   Such things cannot be present in the final API. Writing these APIs now will
+   provide strong feedback to the design process for those things which are
+   considered optional or removable, so these APIs should be implemented
+   before removing or making optional existing data.
+ 
+ * Deprecating versioned files as a supported API: This collaborates with the
+   Repository API but can probably be done by adding a replacement API for
+   places where the versioned-file api is used. We may well want to keep a
+   concept of 'a file over time' or 'inventories over time', so the existing
+   repository model of exposing versioned file objects may be ok; what we need
+   to ensure we do is remove the places in the code base where you create or
+   remove or otherwise describe manipulation of the storage by knit rather than
+   talking at the level of file ids and revision ids. The current
+   versioned-file API would be a burden for implementors of a blob based
+   repository format, so the removal of callers, and deprecation of those parts
+   of the API should be done before creating a blob based repository format.
+
+ * Creating a revision validator: Revision validators may depend on storage
+   layer changes to inventories so while we can create a revision validator
+   API, we cannot create the final one until we have the inventory structural
+   changes completed.
+ 
+ * Annotation caching API: This API is a prerequisite for new repository
+   formats. If written after they are introduced we may find that the
+   repository is lacking in functionality, so the API should be implemented
+   first.
+
+ * _iter_changes based merging: If the current _iter_changes_ API is
+   insufficient, we should know about that before designing the disk format for
+   generating fast _iter_changes_ output.
+
+ * Network-efficient revision graph API: This influences what questions we will
+   want to ask a local repository very quickly; as such it's a driver for the
+   new repository format and should be in place first if possible. It's probably
+   not sufficiently different to local operations to make this a hard ordering
+   though.
+
+ * Working tree disk ordering: Knowing the expected order for disk operations
+   may influence the needed use case specific APIs, so having a solid
+   understanding of what is optimal - and why - and whether it is pessimal on
+   non-Linux platforms is rather important.
+
+ * Be able to version files greater than memory in size: This cannot be
+   achieved until all parts of the library which deal with user files are able
+   to provide access to files larger than memory. Many strategies can be
+   considered for this - such as temporary files on disk, memory mapping etc.
+   We should have enough of a design laid out that developers of repository and
+   tree logic are able to start exposing apis, and considering requirements
+   related to them, to let this happen.
+
+ * Per-file graph access API: This should be implemented on top of or as part
+   of the newer API for accessing data about a file over time. It can be a
+   separate step easily; but as it's in the same area of the library should not
+   be done in parallel.
+  
+ * Repository stacking API: The key dependency/change required for this is that
+   repositories must individually be happy with having partial data - e.g. many
+   ghosts. However the way the API needs to be used should be driven from the
+   command layer in, because it's unclear at the moment what will work best.
+
+ * Revision stream API: This API will become clear as we streamline commands.
+   On the data insertion side commit will want to generate new data. The
+   commands pull, bundle, merge, push, possibly uncommit will want to copy
+   existing data in a streaming fashion.
+ 
+ * New container format: It's hard to tell what the right way to structure the
+   layering is. Probably having smooth layering down to the point that code
+   wants to operate on the containers directly will make this more clear. As
+   bundles will become a read-only branch & repository, the smart server wants
+   streaming-containers, and we are planning a pack based repository, it
+   appears that we will have three different direct container users. However,
+   the bundle user may in fact be fake - because it really is a repository.
+
+ * Separation of annotation cache: Making the disk changes to achieve this
+   depends on the new API being created. Bundles probably want to be
+   annotation-free, so they are a form of implementation of this and will need
+   the on-demand annotation facility.
+
+ * Repository operation disk ordering: Dramatically changing the ordering of
+   disk operations requires a new repository format. We have most of the
+   analysis done to be able to specify the desired ordering, so it should be
+   possible to write such a format now based on the container logic, but
+   without any of the inventory representation or delta representation changes.
+   This would for instance involve pack combining ordering the existing diffs
+   in reverse order.
+
+ * Inventory representation: This has a dependency on what data is
+   dropped from the core and what is kept. Without those changes being known we
+   can implement a new representation, but it won't be a final one. One of the
+   services the new inventory representation is expected to deliver is one of
+   validators for subtrees -- a means of comparing just subtrees of two
+   inventories without comparing all the data within that subtree.
+
+ * Delta storage optimisation: This has a strict dependency on a new repository
+   format. Optimisation takes many forms - we probably cannot complete the
+   desired optimisations under knits though we could use xdelta within a
+   knit-variation. 
+
+ * Greatest distance from origin cache: The potential users of this exist
+   today, it is likely able to be implemented immediately, but we are not sure
+   that it's needed anymore, so it is being shelved.
+
+ * Removing derivable data: It's very hard to do this while the derived data is
+   exposed in APIs but not used by commands. Implementing the targeted APIs
+   for our core use cases should allow us to remove accidental use of derived
+   data, making only explicit uses of it visible, and isolating the impact of
+   removing it, allowing us to experiment sensibly. This covers both dropping
+   the per-file merge graph and the hash-based-names proposals.

=== modified file '.bzrignore'
--- a/.bzrignore	2007-05-09 15:36:06 +0000
+++ b/.bzrignore	2007-06-19 00:48:22 +0000
@@ -35,3 +35,4 @@
 ./pretty_docs
 ./api
 doc/**/*.htm
+doc/developers/performance.png

=== modified file 'Makefile'
--- a/Makefile	2007-06-08 03:25:44 +0000
+++ b/Makefile	2007-06-20 03:09:58 +0000
@@ -113,7 +113,7 @@
 man1/bzr.1: $(MAN_DEPENDENCIES)
 	python generate_docs.py -o $@ man
 
-ALL_DOCS = $(htm_files) $(MAN_PAGES) doc/developers/HACKING.htm $(dev_htm_files)
+ALL_DOCS = $(htm_files) $(MAN_PAGES) doc/developers/HACKING.htm $(dev_htm_files) doc/developers/performance.png
 docs: $(ALL_DOCS)
 
 copy-docs: docs
@@ -127,7 +127,12 @@
 # clean produced docs
 clean-docs:
 	python tools/win32/ostools.py remove $(ALL_DOCS) \
-	$(HTMLDIR) $(PRETTYDIR) doc/bzr_man.txt
+	$(HTMLDIR) $(PRETTYDIR) doc/bzr_man.txt doc/developers/performance.png
+
+
+# build a png of our performance task list
+doc/developers/performance.png: doc/developers/performance.dot
+	@dot -Tpng $< -o$@ || echo "Dot not installed; skipping generation of $@"
 
 
 # make bzr.exe for win32 with py2exe

=== modified file 'doc/developers/performance-roadmap.txt'
--- a/doc/developers/performance-roadmap.txt	2007-06-06 07:45:14 +0000
+++ b/doc/developers/performance-roadmap.txt	2007-06-19 00:48:22 +0000
@@ -18,6 +18,8 @@
 
 .. include:: planned-performance-changes.txt
 
+.. include:: planned-change-integration.txt
+
 Analysis of use cases
 #####################
 

=== modified file 'doc/developers/performance.dot'
--- a/doc/developers/performance.dot	2007-05-27 18:47:54 +0000
+++ b/doc/developers/performance.dot	2007-06-19 00:55:18 +0000
@@ -1,32 +1,132 @@
 /* ESTIMATES ARE VERY ROUGH APPROXIMATIONS */
-digraph performance {
-  gdfo_api -> gdfo_cache;
-  gdfo_api -> gdfo_usage;
-  gdfo_api[label="GDFO API\n1 day"];
-  gdfo_cache[label="GDFO Cache\n1 week"];
-  gdfo_usage[label="GDFO Usage\n3 days"];
-  data_collation[label="Data co-location API\n1 month"];
+strict digraph performance {
+  /* completed node list */
+  node[color="green"];
+  add_analysis[label="Work required analysis for add"];
+  branch_analysis[label="Work required analysis for branch"];
+  bundle_analysis[label="Work required analysis for creating a bundle"];
+  wt_disk_order[label="Working Tree disk ordering\n6-8 weeks"];
+
+  /* uncompleted node list - add new tasks here */
+  node[color="blue"];
+  annotate_analysis[label="Work required analysis for annotate"];
+  status_analysis[label="Work required analysis for status"];
+  commit_analysis[label="Work required analysis for commit"];
+  fetch_analysis[label="Work required analysis for push/pull"];
+  log_analysis[label="Work required analysis for log"];
+  log_path_analysis[label="Work required analysis for log of selected paths."];
+  diff_analysis[label="Work required analysis for diff"];
+  diff_path_analysis[label="Work required analysis for diff of selected paths"];
+  revert_analysis[label="Work required analysis for revert"];
+  revert_path_analysis[label="Work required analysis for revert of selected paths"];
+  merge_analysis[label="Work required analysis for merge"];
+  uncommit_analysis[label="Work required analysis for uncommit"];
+  missing_analysis[label="Work required analysis for missing"];
+  update_analysis[label="Work required analysis for update"];
+  cbranch_analysis[label="Work required analysis for cbranch"];
+
+  add_api_stack[label="Targeted API stack for add"];
+  branch_api_stack[label="Targeted API stack for branch"];
+  bundle_api_stack[label="Targeted API stack for creating a bundle"];
+  annotate_api_stack[label="Targeted API stack for annotate"];
+  status_api_stack[label="Targeted API stack for status"];
+  commit_api_stack[label="Targeted API stack for commit"];
+  fetch_api_stack[label="Targeted API stack for push/pull"];
+  log_api_stack[label="Targeted API stack for log"];
+  log_path_api_stack[label="Targeted API stack for log of selected paths."];
+  diff_api_stack[label="Targeted API stack for diff"];
+  revert_api_stack[label="Targeted API stack for revert"];
+  revert_path_api_stack[label="Targeted API stack for revert of selected paths"];
+  merge_api_stack[label="Targeted API stack for merge"];
+  uncommit_api_stack[label="Targeted API stack for uncommit"];
+  missing_api_stack[label="Targeted API stack for missing"];
+  update_api_stack[label="Targeted API stack for update"];
+  cbranch_api_stack[label="Targeted API stack for cbranch"];
+
+  data_collation[label="Stream API for inserting/obtaining revision data.\n1 month"];
   repository_stacking[label="Repository stacking API\n2 months"];
-  bundle_container[label="Bundle container format\n2 weeks"]
+  new_container[label="New container format\n2 weeks"]
   xdelta[label="Xdelta sanity/learning\n2 weeks"];
   xdelta_imp[label="Xdelta implementation\n1 week"];
-  xdelta -> xdelta_imp;
   q_splitting[label="Question radix directory splitting\n2 weeks"];
-  i_splitting[label="Implement inventory splitting\n6-8 weeks"]
-  q_splitting -> i_splitting;
-  get_weave[label="deprecate get_weave\n1 week"];
-  per_file_graph -> get_weave;
-  per_file_graph[label="Access for per-file graph data\n1 days"];
-  repo_apis[label="For each use case, build targeted repo agnostic APIs\n10-40 days"];
-  rev_validators[label="Revision validators (use in GPG sigs etc) may poly\n3 days"];
+  i_splitting[label="Inventory storage changed to answer what-changed quickly\n6-8 weeks"]
+  per_file_graph[label="Provide an API for per-file graph data rather than physical storage coupled knits api.\n1 day"];
+  deprecate_versionedfile_api[label="Deprecate the public API for access to physical knit storage."];
   anno_cache[label="Annotations become a cache:\n logically separate data\n2 weeks"]
   anno_regen[label="Annotation regeneration\n"];
   anno_kinds[label="Different styles of annotation"];
-  anno_regen -> anno_kinds;
-  anno_cache -> anno_regen;
   memory_copies[label="Stop requiring full memory copies of files"];
-  wt_disk_order[label="Working Tree disk ordering\n6-8 weeks"];
   repo_disk_order[label="Repository disk ordering\n1 month"];
+  pack_repository[label="Pack based repository format"];
   graph_api[label="Network-efficient revision-graph API\n3 week"];
   iter_merge[label="iter_changes based merge\n2 days"];
+  validators[label="Build new validators for revisions and trees."];
+
+  /* under discussion/optional */
+  node[color="yellow"];
+  hash_names[label="Use hashes as names for some objects\n(to reduce tracking metadata and ease interoperability)."];
+  gdfo_api[label="GDFO API\n1 day"];
+  gdfo_cache[label="GDFO Cache\n1 week"];
+  gdfo_usage[label="GDFO Usage\n3 days"];
+
+  /* dependencies */
+  gdfo_api -> gdfo_cache;
+  gdfo_api -> gdfo_usage;
+  xdelta -> xdelta_imp;
+  q_splitting -> i_splitting;
+  per_file_graph -> deprecate_versionedfile_api;
+  anno_regen -> anno_kinds;
+  anno_cache -> anno_regen;
+  add_analysis -> add_api_stack;
+  annotate_analysis -> annotate_api_stack -> anno_cache;
+  annotate_api_stack -> per_file_graph -> graph_api;
+  annotate_api_stack -> memory_copies;
+  annotate_api_stack -> hash_names;
+  branch_analysis -> branch_api_stack -> repository_stacking;
+  branch_api_stack -> memory_copies;
+  bundle_analysis -> bundle_api_stack -> data_collation;
+  bundle_api_stack -> repository_stacking;
+  bundle_api_stack -> validators;
+  bundle_api_stack -> graph_api;
+  bundle_api_stack -> memory_copies;
+  bundle_api_stack -> new_container;
+  bundle_analysis -> hash_names;
+  cbranch_analysis -> cbranch_api_stack;
+  commit_analysis -> commit_api_stack -> data_collation;
+  commit_api_stack -> per_file_graph;
+  commit_api_stack -> validators;
+  commit_api_stack -> memory_copies;
+  commit_api_stack -> hash_names;
+  diff_analysis -> diff_api_stack;
+  diff_api_stack -> memory_copies;
+  diff_path_analysis -> diff_api_stack -> i_splitting;
+  diff_api_stack -> hash_names;
+  fetch_analysis -> fetch_api_stack -> data_collation;
+  fetch_api_stack -> repository_stacking;
+  fetch_api_stack -> graph_api;
+  fetch_api_stack -> memory_copies;
+  fetch_api_stack -> hash_names;
+  repository_stacking -> graph_api;
+  hash_names -> i_splitting;
+  log_analysis -> log_api_stack -> i_splitting;
+  log_path_analysis -> log_path_api_stack;
+  log_path_api_stack -> per_file_graph;
+  merge_analysis -> merge_api_stack -> iter_merge -> i_splitting;
+  merge_api_stack -> memory_copies;
+  missing_analysis -> missing_api_stack -> repository_stacking;
+  new_container -> pack_repository;
+  pack_repository -> xdelta_imp;
+  pack_repository -> repo_disk_order;
+  per_file_graph -> hash_names;
+  repository_stacking -> pack_repository;
+  repository_stacking -> new_container;
+  revert_analysis -> revert_api_stack -> data_collation;
+  revert_path_analysis -> revert_path_api_stack;
+  revert_api_stack -> memory_copies;
+  status_analysis -> status_api_stack;
+  status_api_stack -> memory_copies;
+  uncommit_analysis -> uncommit_api_stack -> data_collation;
+  uncommit_api_stack -> graph_api;
+  update_analysis -> update_api_stack;
+  update_api_stack -> memory_copies;
 }

=== modified file 'doc/developers/planned-performance-changes.txt'
--- a/doc/developers/planned-performance-changes.txt	2007-06-06 07:45:14 +0000
+++ b/doc/developers/planned-performance-changes.txt	2007-06-19 00:48:22 +0000
@@ -93,7 +93,7 @@
  * New container format to allow single-file description of multiple named
    objects. This will provide the basis for transmission of revisions over the
    network, the new bundle format, and possibly a new repository format as
-   well.
+   well. [Core implemented] 
 
  * Separate the annotation cache from the storage of actual file texts and make
    the annotation style, and when to do it, configurable. This will reduce data
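
As an illustration of how the dependency edges in performance.dot could be
turned into a candidate integration order, here is a minimal sketch in Python.
It is not part of the commit. It assumes that an edge "a -> b" means "a should
land before b", it handles only plain "a -> b -> c;" edge statements (not the
full dot grammar), and the names parse_edges and integration_order are
hypothetical helpers introduced for this example. Only a handful of edges are
copied from the diff above.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A few edge statements copied from performance.dot; the real file has more.
DOT_EDGES = """
  xdelta -> xdelta_imp;
  new_container -> pack_repository;
  pack_repository -> xdelta_imp;
  pack_repository -> repo_disk_order;
  repository_stacking -> pack_repository;
  repository_stacking -> new_container;
"""


def parse_edges(text):
    """Yield (before, after) pairs, expanding chains such as a -> b -> c."""
    for statement in text.split(";"):
        nodes = [n.strip() for n in statement.split("->")]
        # Skip anything that is not a simple chain of identifiers.
        if len(nodes) < 2 or not all(re.fullmatch(r"\w+", n) for n in nodes):
            continue
        yield from zip(nodes, nodes[1:])


def integration_order(text):
    """Return one ordering in which every task follows its prerequisites."""
    sorter = TopologicalSorter()
    for before, after in parse_edges(text):
        # Assumption: "before -> after" means 'before' is a prerequisite.
        sorter.add(after, before)
    return list(sorter.static_order())


if __name__ == "__main__":
    for task in integration_order(DOT_EDGES):
        print(task)
```

Several orderings can satisfy the same edges, so the output is one valid
order rather than the order. If the edges ever formed a cycle,
static_order() would raise graphlib.CycleError instead of returning.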