Rev 2539: (robertc) Propose integration ordering for the performance changes. in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Canonical.com Patch Queue Manager pqm at pqm.ubuntu.com
Wed Jun 20 04:37:28 BST 2007


At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 2539
revision-id: pqm at pqm.ubuntu.com-20070620033726-baiap8oniaidhdf1
parent: pqm at pqm.ubuntu.com-20070620002213-fvt1s1yu2iujulio
parent: robertc at robertcollins.net-20070620030958-6ou886tyo5zpc3u4
committer: Canonical.com Patch Queue Manager<pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Wed 2007-06-20 04:37:26 +0100
message:
  (robertc) Propose integration ordering for the performance changes.
added:
  doc/developers/planned-change-integration.txt plannedchangeintegra-20070619004702-i1b3ccamjtfaoq6w-1
modified:
  .bzrignore                     bzrignore-20050311232317-81f7b71efa2db11a
  Makefile                       Makefile-20050805140406-d96e3498bb61c5bb
  doc/developers/performance-roadmap.txt performanceroadmap.t-20070507174912-mwv3xv517cs4sisd-2
  doc/developers/performance.dot performance.dot-20070527173558-rqaqxn1al7vzgcto-3
  doc/developers/planned-performance-changes.txt plannedperformancech-20070604053752-bnjdhako613xfufb-1
    ------------------------------------------------------------
    revno: 2522.3.3
    merged: robertc at robertcollins.net-20070620030958-6ou886tyo5zpc3u4
    parent: robertc at robertcollins.net-20070619005518-r2n8pmtgnf9lq9yo
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: integration
    timestamp: Wed 2007-06-20 13:09:58 +1000
    message:
      Handle dot not being installed
    ------------------------------------------------------------
    revno: 2522.3.2
    merged: robertc at robertcollins.net-20070619005518-r2n8pmtgnf9lq9yo
    parent: robertc at robertcollins.net-20070619004822-wsop5g2arwu1lti4
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: roadmap
    timestamp: Tue 2007-06-19 10:55:18 +1000
    message:
      Make gdfo cache under discussion in the graph
    ------------------------------------------------------------
    revno: 2522.3.1
    merged: robertc at robertcollins.net-20070619004822-wsop5g2arwu1lti4
    parent: pqm at pqm.ubuntu.com-20070612021742-uetsy3g747iq3xkk
    committer: Robert Collins <robertc at robertcollins.net>
    branch nick: roadmap
    timestamp: Tue 2007-06-19 10:48:22 +1000
    message:
      Draft proposed integration order for performance changes.
=== added file 'doc/developers/planned-change-integration.txt'
--- a/doc/developers/planned-change-integration.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/planned-change-integration.txt	2007-06-19 00:48:22 +0000
@@ -0,0 +1,141 @@
+Integration of performance changes
+==================================
+
+To deliver a version of bzr with all our planned changes will require
+significant integration work. Minimally each change needs to integrate with
+some aspect of the bzr version it's merged into, but in reality many of these
+changes, while conceptually independent, will in fact have to integrate with the
+other changes we have planned before we can have a completed system.
+
+Additionally, changes that alter disk formats are inherently trickier to
+integrate because we will often need to alter APIs throughout the code base to
+expose the increased or reduced model of the preferred disk format.
+
+The dot file performance.dot graphs out the dependencies to let us make
+accurate assessments of the changes needed in terms of code and API, hopefully
+minimising the number of different integration steps we have to take, while
+giving us a broad surface area for development. It's based on a summary in the
+next section of this document of the planned changes with their expected
+collaborators and dependencies. Where a command is listed, the expectation is
+that all uses of that command - local, remote, dumb transport and smart
+transport - are being addressed together.
+
+
+The following provides a summary of the planned changes and their expected
+collaborators within the code base, along with an estimate of whether they are
+likely to require changes to their collaborators to be considered 'finished'.
+
+ * Use case target APIs: Each of these is likely to alter the Tree interface.
+   Some few of them focus on Branch and will alter Branch and Repository
+   accordingly. As they are targeted APIs we can make deep changes all the way
+   down the stack to the underlying representation to make it all fit well.
+   Presenting a top level API for many things will be possible now as long as
+   the exposed data is audited for things we plan to make optional, or remove:
+   Such things cannot be present in the final API. Writing these APIs now will
+   provide strong feedback to the design process for those things which are
+   considered optional or removable, so these APIs should be implemented
+   before removing or making optional existing data.
+ 
+ * Deprecating versioned files as a supported API: This collaborates with the
+   Repository API but can probably be done by adding a replacement API for
+   places where the versioned-file API is used. We may well want to keep a
+   concept of 'a file over time' or 'inventories over time', so the existing
+   repository model of exposing versioned file objects may be ok; what we need
+   to ensure we do is remove the places in the code base where you create or
+   remove or otherwise describe manipulation of the storage by knit rather than
+   talking at the level of file ids and revision ids. The current
+   versioned-file API would be a burden for implementors of a blob based
+   repository format, so the removal of callers, and deprecation of those parts
+   of the API should be done before creating a blob based repository format.
+
+ * Creating a revision validator: Revision validators may depend on storage
+   layer changes to inventories so while we can create a revision validator
+   API, we cannot create the final one until we have the inventory structural
+   changes completed.
+ 
+ * Annotation caching API: This API is a prerequisite for new repository
+   formats. If written after they are introduced we may find that the
+   repository is lacking in functionality, so the API should be implemented
+   first.
+
+ * _iter_changes based merging: If the current _iter_changes_ API is
+   insufficient, we should know about that before designing the disk format for
+   generating fast _iter_changes_ output.
+
+ * Network-efficient revision graph API: This influences what questions we will
+   want to ask a local repository very quickly; as such it's a driver for the
+   new repository format and should be in place first if possible. It's probably
+   not sufficiently different to local operations to make this a hard ordering
+   though.
+
+ * Working tree disk ordering: Knowing the expected order for disk operations
+   may influence the needed use case specific APIs, so having a solid
+   understanding of what is optimal - and why - and whether it is pessimal on
+   non-Linux platforms is rather important.
+
+ * Be able to version files greater than memory in size: This cannot be
+   achieved until all parts of the library which deal with user files are able
+   to provide access to files larger than memory. Many strategies can be
+   considered for this - such as temporary files on disk, memory mapping etc.
+   We should have enough of a design laid out that developers of repository and
+   tree logic are able to start exposing APIs, and considering requirements
+   related to them, to let this happen.
+
+ * Per-file graph access API: This should be implemented on top of or as part
+   of the newer API for accessing data about a file over time. It can be a
+   separate step easily; but as it's in the same area of the library should not
+   be done in parallel.
+  
+ * Repository stacking API: The key dependency/change required for this is that
+   repositories must individually be happy with having partial data - e.g. many
+   ghosts. However the way the API needs to be used should be driven from the
+   command layer in, because it's unclear at the moment what will work best.
+
+ * Revision stream API: This API will become clear as we streamline commands.
+   On the data insertion side commit will want to generate new data. The
+   commands pull, bundle, merge, push, possibly uncommit will want to copy
+   existing data in a streaming fashion.
+ 
+ * New container format: It's hard to tell what the right way to structure the
+   layering is. Probably having smooth layering down to the point that code
+   wants to operate on the containers directly will make this more clear. As
+   bundles will become a read-only branch & repository, the smart server wants
+   streaming-containers, and we are planning a pack based repository, it
+   appears that we will have three different direct container users. However,
+   the bundle user may in fact be fake - because it really is a repository.
+
+ * Separation of annotation cache: Making the disk changes to achieve this
+   depends on the new API being created. Bundles probably want to be
+   annotation-free, so they are a form of implementation of this and will need
+   the on-demand annotation facility.
+
+ * Repository operation disk ordering: Dramatically changing the ordering of
+   disk operations requires a new repository format. We have most of the
+   analysis done to be able to specify the desired ordering, so it should be
+   possible to write such a format now based on the container logic, but
+   without any of the inventory representation or delta representation changes.
+   This would for instance involve pack combining ordering the existing diffs
+   in reverse order.
+
+ * Inventory representation: This has a dependency on what data is
+   dropped from the core and what is kept. Without those changes being known we
+   can implement a new representation, but it won't be a final one. One of the
+   services the new inventory representation is expected to deliver is one of
+   validators for subtrees -- a means of comparing just subtrees of two
+   inventories without comparing all the data within that subtree.
+
+ * Delta storage optimisation: This has a strict dependency on a new repository
+   format. Optimisation takes many forms - we probably cannot complete the
+   desired optimisations under knits though we could use xdelta within a
+   knit-variation. 
+
+ * Greatest distance from origin cache: The potential users of this exist
+   today and it can likely be implemented immediately, but we are not sure
+   that it's needed anymore, so it is being shelved.
+
+ * Removing derivable data: It's very hard to do this while the derived data is
+   exposed in APIs but not used by commands. Implementing the targeted APIs
+   for our core use cases should allow us to remove accidental use of derived
+   data, making only explicit uses of it visible, and isolating the impact of
+   removing it, allowing us to experiment sensibly. This covers both dropping
+   the per-file merge graph and the hash-based-names proposals.

=== modified file '.bzrignore'
--- a/.bzrignore	2007-05-09 15:36:06 +0000
+++ b/.bzrignore	2007-06-19 00:48:22 +0000
@@ -35,3 +35,4 @@
 ./pretty_docs
 ./api
 doc/**/*.htm
+doc/developers/performance.png

=== modified file 'Makefile'
--- a/Makefile	2007-06-08 03:25:44 +0000
+++ b/Makefile	2007-06-20 03:09:58 +0000
@@ -113,7 +113,7 @@
 man1/bzr.1: $(MAN_DEPENDENCIES)
 	python generate_docs.py -o $@ man
 
-ALL_DOCS = $(htm_files) $(MAN_PAGES) doc/developers/HACKING.htm $(dev_htm_files)
+ALL_DOCS = $(htm_files) $(MAN_PAGES) doc/developers/HACKING.htm $(dev_htm_files) doc/developers/performance.png
 docs: $(ALL_DOCS)
 
 copy-docs: docs
@@ -127,7 +127,12 @@
 # clean produced docs
 clean-docs:
 	python tools/win32/ostools.py remove $(ALL_DOCS) \
-	$(HTMLDIR) $(PRETTYDIR) doc/bzr_man.txt
+	$(HTMLDIR) $(PRETTYDIR) doc/bzr_man.txt doc/developers/performance.png
+
+
+# build a png of our performance task list
+doc/developers/performance.png: doc/developers/performance.dot
+	@dot -Tpng $< -o$@ || echo "Dot not installed; skipping generation of $@"
 
 
 # make bzr.exe for win32 with py2exe

=== modified file 'doc/developers/performance-roadmap.txt'
--- a/doc/developers/performance-roadmap.txt	2007-06-06 07:45:14 +0000
+++ b/doc/developers/performance-roadmap.txt	2007-06-19 00:48:22 +0000
@@ -18,6 +18,8 @@
 
 .. include:: planned-performance-changes.txt
 
+.. include:: planned-change-integration.txt
+
 Analysis of use cases
 #####################
 

=== modified file 'doc/developers/performance.dot'
--- a/doc/developers/performance.dot	2007-05-27 18:47:54 +0000
+++ b/doc/developers/performance.dot	2007-06-19 00:55:18 +0000
@@ -1,32 +1,132 @@
 /* ESTIMATES ARE VERY ROUGH APPROXIMATIONS */
-digraph performance {
-  gdfo_api -> gdfo_cache;
-  gdfo_api -> gdfo_usage;
-  gdfo_api[label="GDFO API\n1 day"];
-  gdfo_cache[label="GDFO Cache\n1 week"];
-  gdfo_usage[label="GDFO Usage\n3 days"];
-  data_collation[label="Data co-location API\n1 month"];
+strict digraph performance {
+  /* completed node list */
+  node[color="green"];
+  add_analysis[label="Work required analysis for add"];
+  branch_analysis[label="Work required analysis for branch"];
+  bundle_analysis[label="Work required analysis for creating a bundle"];
+  wt_disk_order[label="Working Tree disk ordering\n6-8 weeks"];
+
+  /* uncompleted node list - add new tasks here */
+  node[color="blue"];
+  annotate_analysis[label="Work required analysis for annotate"];
+  status_analysis[label="Work required analysis for status"];
+  commit_analysis[label="Work required analysis for commit"];
+  fetch_analysis[label="Work required analysis for push/pull"];
+  log_analysis[label="Work required analysis for log"];
+  log_path_analysis[label="Work required analysis for log of selected paths."];
+  diff_analysis[label="Work required analysis for diff"];
+  diff_path_analysis[label="Work required analysis for diff of selected paths"];
+  revert_analysis[label="Work required analysis for revert"];
+  revert_path_analysis[label="Work required analysis for revert of selected paths"];
+  merge_analysis[label="Work required analysis for merge"];
+  uncommit_analysis[label="Work required analysis for uncommit"];
+  missing_analysis[label="Work required analysis for missing"];
+  update_analysis[label="Work required analysis for update"];
+  cbranch_analysis[label="Work required analysis for cbranch"];
+
+  add_api_stack[label="Targeted API stack for add"];
+  branch_api_stack[label="Targeted API stack for branch"];
+  bundle_api_stack[label="Targeted API stack for creating a bundle"];
+  annotate_api_stack[label="Targeted API stack for annotate"];
+  status_api_stack[label="Targeted API stack for status"];
+  commit_api_stack[label="Targeted API stack for commit"];
+  fetch_api_stack[label="Targeted API stack for push/pull"];
+  log_api_stack[label="Targeted API stack for log"];
+  log_path_api_stack[label="Targeted API stack for log of selected paths."];
+  diff_api_stack[label="Targeted API stack for diff"];
+  revert_api_stack[label="Targeted API stack for revert"];
+  revert_path_api_stack[label="Targeted API stack for revert of selected paths"];
+  merge_api_stack[label="Targeted API stack for merge"];
+  uncommit_api_stack[label="Targeted API stack for uncommit"];
+  missing_api_stack[label="Targeted API stack for missing"];
+  update_api_stack[label="Targeted API stack for update"];
+  cbranch_api_stack[label="Targeted API stack for cbranch"];
+
+  data_collation[label="Stream API for inserting/obtaining revision data.\n1 month"];
   repository_stacking[label="Repository stacking API\n2 months"];
-  bundle_container[label="Bundle container format\n2 weeks"]
+  new_container[label="New container format\n2 weeks"]
   xdelta[label="Xdelta sanity/learning\n2 weeks"];
   xdelta_imp[label="Xdelta implementation\n1 week"];
-  xdelta -> xdelta_imp;
   q_splitting[label="Question radix directory splitting\n2 weeks"];
-  i_splitting[label="Implement inventory splitting\n6-8 weeks"]
-  q_splitting -> i_splitting;
-  get_weave[label="deprecate get_weave\n1 week"];
-  per_file_graph -> get_weave;
-  per_file_graph[label="Access for per-file graph data\n1 days"];
-  repo_apis[label="For each use case, build targeted repo agnostic APIs\n10-40 days"];
-  rev_validators[label="Revision validators (use in GPG sigs etc) may poly\n3 days"];
+  i_splitting[label="Inventory storage changed to answer what-changed quickly\n6-8 weeks"]
+  per_file_graph[label="Provide an API for per-file graph data rather than physical storage coupled knits api.\n1 day"];
+  deprecate_versionedfile_api[label="Deprecate the public API for access to physical knit storage."];
   anno_cache[label="Annotations become a cache:\n logically separate data\n2 weeks"]
   anno_regen[label="Annotation regeneration\n"];
   anno_kinds[label="Different styles of annotation"];
-  anno_regen -> anno_kinds;
-  anno_cache -> anno_regen;
   memory_copies[label="Stop requiring full memory copies of files"];
-  wt_disk_order[label="Working Tree disk ordering\n6-8 weeks"];
   repo_disk_order[label="Repository disk ordering\n1 month"];
+  pack_repository[label="Pack based repository format"];
   graph_api[label="Network-efficient revision-graph API\n3 week"];
   iter_merge[label="iter_changes based merge\n2 days"];
+  validators[label="Build new validators for revisions and trees."];
+
+  /* under discussion/optional */
+  node[color="yellow"];
+  hash_names[label="Use hashes as names for some objects\n(to reduce tracking metadata and ease interoperability)."];
+  gdfo_api[label="GDFO API\n1 day"];
+  gdfo_cache[label="GDFO Cache\n1 week"];
+  gdfo_usage[label="GDFO Usage\n3 days"];
+
+  /* dependencies */
+  gdfo_api -> gdfo_cache;
+  gdfo_api -> gdfo_usage;
+  xdelta -> xdelta_imp;
+  q_splitting -> i_splitting;
+  per_file_graph -> deprecate_versionedfile_api;
+  anno_regen -> anno_kinds;
+  anno_cache -> anno_regen;
+  add_analysis -> add_api_stack;
+  annotate_analysis -> annotate_api_stack -> anno_cache;
+  annotate_api_stack -> per_file_graph -> graph_api;
+  annotate_api_stack -> memory_copies;
+  annotate_api_stack -> hash_names;
+  branch_analysis -> branch_api_stack -> repository_stacking;
+  branch_api_stack -> memory_copies;
+  bundle_analysis -> bundle_api_stack -> data_collation;
+  bundle_api_stack -> repository_stacking;
+  bundle_api_stack -> validators;
+  bundle_api_stack -> graph_api;
+  bundle_api_stack -> memory_copies;
+  bundle_api_stack -> new_container;
+  bundle_analysis -> hash_names;
+  cbranch_analysis -> cbranch_api_stack;
+  commit_analysis -> commit_api_stack -> data_collation;
+  commit_api_stack -> per_file_graph;
+  commit_api_stack -> validators;
+  commit_api_stack -> memory_copies;
+  commit_api_stack -> hash_names;
+  diff_analysis -> diff_api_stack;
+  diff_api_stack -> memory_copies;
+  diff_path_analysis -> diff_api_stack -> i_splitting;
+  diff_api_stack -> hash_names;
+  fetch_analysis -> fetch_api_stack -> data_collation;
+  fetch_api_stack -> repository_stacking;
+  fetch_api_stack -> graph_api;
+  fetch_api_stack -> memory_copies;
+  fetch_api_stack -> hash_names;
+  repository_stacking -> graph_api;
+  hash_names -> i_splitting;
+  log_analysis -> log_api_stack -> i_splitting;
+  log_path_analysis -> log_path_api_stack;
+  log_path_api_stack -> per_file_graph;
+  merge_analysis -> merge_api_stack -> iter_merge -> i_splitting;
+  merge_api_stack -> memory_copies;
+  missing_analysis -> missing_api_stack -> repository_stacking;
+  new_container -> pack_repository;
+  pack_repository -> xdelta_imp;
+  pack_repository -> repo_disk_order;
+  per_file_graph -> hash_names;
+  repository_stacking -> pack_repository;
+  repository_stacking -> new_container;
+  revert_analysis -> revert_api_stack -> data_collation;
+  revert_path_analysis -> revert_path_api_stack;
+  revert_api_stack -> memory_copies;
+  status_analysis -> status_api_stack;
+  status_api_stack -> memory_copies;
+  uncommit_analysis -> uncommit_api_stack -> data_collation;
+  uncommit_api_stack -> graph_api;
+  update_analysis -> update_api_stack;
+  update_api_stack -> memory_copies;
 }

=== modified file 'doc/developers/planned-performance-changes.txt'
--- a/doc/developers/planned-performance-changes.txt	2007-06-06 07:45:14 +0000
+++ b/doc/developers/planned-performance-changes.txt	2007-06-19 00:48:22 +0000
@@ -93,7 +93,7 @@
  * New container format to allow single-file description of multiple named
    objects. This will provide the basis for transmission of revisions over the
    network, the new bundle format, and possibly a new repository format as
-   well.
+   well. [Core implemented] 
 
  * Separate the annotation cache from the storage of actual file texts and make
    the annotation style, and when to do it, configurable. This will reduce data
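
The dependency edges added to performance.dot above lend themselves to a
mechanical ordering check. The following is a minimal sketch only; it is not
part of this commit or of bzr. It assumes that each edge "a -> b" means "a
should land before b" (an interpretation, consistent with edges such as
new_container -> pack_repository), copies a handful of edges from the patch,
and uses the standard-library graphlib module (Python 3.9 or later) to print
one valid integration order. The helper names parse_edges and
integration_order are hypothetical.

```python
from graphlib import TopologicalSorter

# A few edges copied from the "dependencies" section of performance.dot.
DOT_EDGES = """
  bundle_analysis -> bundle_api_stack -> data_collation;
  bundle_api_stack -> new_container;
  repository_stacking -> new_container;
  repository_stacking -> pack_repository;
  new_container -> pack_repository;
  pack_repository -> xdelta_imp;
  pack_repository -> repo_disk_order;
  xdelta -> xdelta_imp;
"""


def parse_edges(dot_text):
    """Return (before, after) pairs, expanding chains such as a -> b -> c."""
    edges = []
    for line in dot_text.splitlines():
        if "->" not in line:
            continue  # skip blank lines and plain node declarations
        names = [name.strip() for name in line.strip().rstrip(";").split("->")]
        edges.extend(zip(names, names[1:]))
    return edges


def integration_order(edges):
    """Order tasks so that every 'before' task precedes its 'after' task.

    Assumption: an edge a -> b is read as "a must be integrated before b".
    graphlib raises CycleError if the edges contain a cycle.
    """
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)  # 'before' is a predecessor of 'after'
    return list(sorter.static_order())


order = integration_order(parse_edges(DOT_EDGES))
print("\n".join(order))
```

Several orders can satisfy the same edges, so the printed sequence is one
possibility rather than the proposed plan; the text of
planned-change-integration.txt remains the authoritative description.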



