Rev 2551: (mbp) design notes for commit and uncommit in file:///home/pqm/archives/thelove/bzr/%2Btrunk/

Tue Jun 26 05:55:38 BST 2007

At file:///home/pqm/archives/thelove/bzr/%2Btrunk/

------------------------------------------------------------
revno: 2551
revision-id: pqm at pqm.ubuntu.com-20070626045533-qni2zzf3cv41c1t8
parent: pqm at pqm.ubuntu.com-20070625174647-pocypsjmp861qgv7
parent: mbp at sourcefrog.net-20070626041504-sxdk4kjal60vnbxl
committer: Canonical.com Patch Queue Manager<pqm at pqm.ubuntu.com>
branch nick: +trunk
timestamp: Tue 2007-06-26 05:55:33 +0100
message:
  (mbp) design notes for commit and uncommit
added:
  doc/developers/dirstate.txt    dirstate.txt-20070618020404-cdhv0ecgrukomemg-2
  doc/developers/scratch.txt     scratch.txt-20070618020404-cdhv0ecgrukomemg-3
  doc/developers/uncommit.txt    uncommit.txt-20070621042721-4clw8ucb9u9yda2h-1
renamed:
  doc/developers/performance-commit.txt => doc/developers/commit.txt performancecommit.tx-20070606061633-4y4rawskx5ejb99w-1
modified:
  doc/developers/performance-roadmap.txt performanceroadmap.t-20070507174912-mwv3xv517cs4sisd-2
  doc/developers/performance-use-case-analysis.txt performanceusecasean-20070508045640-zneiu1yzbci574c6-2
  doc/developers/commit.txt      performancecommit.tx-20070606061633-4y4rawskx5ejb99w-1
    ------------------------------------------------------------
    revno: 2513.1.7
    merged: mbp at sourcefrog.net-20070626041504-sxdk4kjal60vnbxl
    parent: mbp at sourcefrog.net-20070626032049-whv447fmgyj2lf19
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Tue 2007-06-26 14:15:04 +1000
    message:
      fix another broken doc link
    ------------------------------------------------------------
    revno: 2513.1.6
    merged: mbp at sourcefrog.net-20070626032049-whv447fmgyj2lf19
    parent: mbp at sourcefrog.net-20070621074424-ayfp2ycf98edc9ma
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Tue 2007-06-26 13:20:49 +1000
    message:
      Remove broken link to status design document
    ------------------------------------------------------------
    revno: 2513.1.5
    merged: mbp at sourcefrog.net-20070621074424-ayfp2ycf98edc9ma
    parent: mbp at sourcefrog.net-20070621054143-ut9cw6zbkt3vwdzc
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Thu 2007-06-21 17:44:24 +1000
    message:
      More analysis of commit
    ------------------------------------------------------------
    revno: 2513.1.4
    merged: mbp at sourcefrog.net-20070621054143-ut9cw6zbkt3vwdzc
    parent: mbp at sourcefrog.net-20070621042747-e3g0tdn8if750mv5
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Thu 2007-06-21 15:41:43 +1000
    message:
      More commit documentation
    ------------------------------------------------------------
    revno: 2513.1.3
    merged: mbp at sourcefrog.net-20070621042747-e3g0tdn8if750mv5
    parent: mbp at sourcefrog.net-20070618065424-awsn4t4tv2bi4okt
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Thu 2007-06-21 14:27:47 +1000
    message:
      More commit specs
    ------------------------------------------------------------
    revno: 2513.1.2
    merged: mbp at sourcefrog.net-20070618065424-awsn4t4tv2bi4okt
    parent: mbp at sourcefrog.net-20070618020420-3e205xzlugq682j9
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Mon 2007-06-18 16:54:24 +1000
    message:
      Remove duplicated commit use case documentation
    ------------------------------------------------------------
    revno: 2513.1.1
    merged: mbp at sourcefrog.net-20070618020420-3e205xzlugq682j9
    parent: pqm at pqm.ubuntu.com-20070606144200-rmsd3gyelimh8kal
    committer: Martin Pool <mbp at sourcefrog.net>
    branch nick: doc
    timestamp: Mon 2007-06-18 12:04:20 +1000
    message:
      (in progress) analysis of commit
=== added file 'doc/developers/dirstate.txt'

--- a/doc/developers/dirstate.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/dirstate.txt	2007-06-18 02:04:20 +0000
@@ -0,0 +1,3 @@
+Don't really need the hashes of the current versions - just knowing
+whether they've changed or not will generally be enough - and just the
+mtime and ctime of a point in time may be enough?

=== added file 'doc/developers/scratch.txt'
--- a/doc/developers/scratch.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/scratch.txt	2007-06-18 02:04:20 +0000
@@ -0,0 +1,30 @@
+"use-case apis" - means not just the api called by the Command layer, but
+also the stack going down.
+
+top level - an atomic api - whole thing should either commit or not
+
+needs to come back and interact with the user to get the commit message, if
+not previously specified
+ 
+hooks that will modify the wt before commit takes place;
+line endings - actually not an issue for commit because we propose to always commit what's in the tree and just transform them when reading back out
+ 
+problem with commit builder api - you need to build a whole tree in it.
+
+takes contents of a workingtree, logically puts it into a branch and the
+branch's repository
+point is that it should only store the new things in the tree which are 
+not already in the tree
+* possibly updates the workingtree after the commit to tell it about the new
+  basis revision
+
+if you commit only a single file, you should only need to tell it about that 
+single file.
+
+if implemented with dirstate present, will still make good use of dirstate.
+
+requirements wrt branches - bound branches do complicate it, so does bzr-svn. 
+
+commit is an operation between the tree and the branch.
+
+api should only be told about content changes by the tree.

=== added file 'doc/developers/uncommit.txt'
--- a/doc/developers/uncommit.txt	1970-01-01 00:00:00 +0000
+++ b/doc/developers/uncommit.txt	2007-06-21 04:27:47 +0000
@@ -0,0 +1,21 @@
+Uncommit Performance Notes
+==========================
+
+Specification of uncommit
+-------------------------
+
+``uncommit`` removes revisions from the head of a branch.  (By default, only
+the very latest revision is removed, but optionally more can be taken.)
+Uncommit does not affect the repository (garbage collection is a separate
+step and not done by default).  The working tree is not logically
+modified (revert is a different operation), except as described below
+about merges.
+
+Uncommit can be performed on either a branch or a working tree (and
+implicitly its branch.)
+
+If the uncommitted revisions includes one or more merges, after the
+uncommit those revisions are in the working tree's list of pending merges,
+because their tree changes are still present in the tree.
+
+For a bound branch, uncommit fails unless the local branch is up to date.

=== renamed file 'doc/developers/performance-commit.txt' => 'doc/developers/commit.txt'
--- a/doc/developers/performance-commit.txt	2007-06-06 06:17:45 +0000
+++ b/doc/developers/commit.txt	2007-06-21 07:44:24 +0000
@@ -1,16 +1,23 @@
 Commit Performance Notes
 ========================
 
-.. contents::
+Changes to commit
+-----------------
+
+We want to improve the commit code in two phases.
+
+Phase one is to have a better separation from the format-specific logic,
+the user interface, and the general process of committing.
+
+Phase two is to have better interfaces by which a good workingtree format
+can efficiently pass data to a good storage format.  If we get phase one
+right, it will be relatively easy and non-disruptive to bring this in.
+
+
 
 Commit: The Minimum Work Required
 ---------------------------------
 
-This is Martin Pool's email to the mailing list on the minimum work that
-commit needs to do. Be sure to check the mailing list archives 
-(https://lists.ubuntu.com/archives/bazaar/2007q2/025791.html)
-for follow-up comments from Robert and others ...
-
 Here is a description of the minimum work that commit must do.  We
 want to make sure that our design doesn't cost too much more than this
 minimum.  I am trying to do this without making too many assumptions
@@ -57,7 +64,7 @@
 the current scheme we get the file id, last-revision from the
 dirstate, look into the knit for that text, extract that text in
 total, generate a delta, then store that into the knit.  Most delta
-operations are O(n2) to O(n3) in the size of the modified files.
+operations are O(n**2) to O(n**3) in the size of the modified files.
 
 1.4 - Cache annotation information for the changes: at the moment this
 is done as part of the delta storage.  There are some flaws in that
@@ -260,3 +267,368 @@
 sooner you can stop evaluating the 99% that you don't care about, the
 less work you do.
 
+
+Avoiding work: avoiding reading parent data
+-------------------------------------------
+
+We would like to avoid the work of reading any data about the parent
+revisions.  We should at least try to avoid reading anything from the
+repository; we can also consider whether it is possible or useful to hold
+less parent information in the working tree.
+
+When a commit of selected files is requested, the committed snapshot is a
+composite of some directories from the parent revision and some from the
+working tree.  In this case it is logically necessary to have the parent
+inventory information.
+
+If file last-change information or per-file graph information is stored
+then it must be available from the parent trees.
+
+If the Branch's storage method does delta compression at commit time it
+may need to retrieve file or inventory texts from the repository.
+
+It is desirable to avoid roundtrips to the Repository during commit,
+particularly because it may be remote.  If the WorkingTree can determine
+by itself that a text was in the parent and therefore should be in the
+Repository that avoids one roundtrip per file.  
+
+There is a possibility here that the parent revision is not stored, or not
+correctly stored, in the repository the tree is being committed into, and
+so the committed tree would not be reconstructable.  We could check that
+the parent revision is present in the inventory and rely on the invariant
+that if a revision is present, everything to reconstruct it will be
+present too.
+
+
+Code structure
+--------------
+
+Caller starts a commit
+
+>>> Branch.commit(from_tree, options)
+
+This creates a CommitBuilder object matched to the Branch, Repository and
+Tree.  It can vary depending on model differences or by knowledge of what
+is efficient with the Repository and Tree.  Model differences might
+include whether no-text-change merges need to be reported, and whether the 
+
+The basic CommitBuilder.commit structure can be 
+
+1. Ask the branch if it is ready to commit (up to date with master if
+   any.)
+
+2. Ask the tree if it is ready to commit to the branch (up to date with
+   branch?), no conflicts, etc
+
+3. Commit changed files; prototype implementation:
+
+   a. Ask the working tree for all committable files; for each it should
+      return the per-file parents, stat information, kind, etc.
+
+   b. Ask the repository to store the new file text; the repository should
+      return the stored sha1 and new revision id.
+
+4. Commit changed inventory
+
+5. Commit revision object
+
+
+
+
+
+
+
+
+
+Complications of commit
+-----------------------
+
+Bazaar (as of 0.17) does not support selective-file commit of a merge;
+this could be done if we decide how it should be recorded - is this to be
+stored as an overall merge revision; as a preliminary non-merge revisions;
+or will the per-file graph diverge from the revision graph.
+
+There are several checks that may cause the commit to be refused, which
+may be activated or deactivated by options.
+
+* presence of conflicts in the tree
+
+* presence of unknown files
+
+* the working tree basis is up to date with the branch tip
+
+* the local branch is up to date with the master branch, if there 
+  is one and --local is not specified
+
+* an empty commit message is given, 
+
+* a hook flags an error
+
+* a "pointless" commit, with no inventory changes
+
+Most of these require walking the tree and can be easily done while
+recording the tree shape.  This does require that it be possible to abort
+the commit after the tree changes have been recorded.  It could be ok to
+either leave the unreachable partly-committed records in the repository,
+or to roll back.
+
+Other complications:
+
+* when automatically adding new files or deleting missing files during
+  commit, they must be noted during commit and written into the working
+  tree at some point
+
+* refuse "pointless" commits with no file changes - should be easy by
+  just refusing to do the final step of storing a new overall inventory
+  and revision object
+
+* heuristic detection of renames between add and delete (out of scope for
+  this change)
+
+* pushing changes to a master branch if any
+
+* running hooks, pre and post commit
+
+* prompting for a commit message if necessary, including a list of the
+  changes that have already been observed
+
+* if there are tree references and recursing into them is enabled, then
+  do so
+
+Commit needs to protect against duplicated file ids 
+
+
+Updates that need to be made in the working tree, either on conclusion
+of commit or during the scan, include
+
+* Changes made to the tree shape, including automatic adds, renames or
+  deletes
+
+* For trees (eg dirstate) that cache parent inventories, the old parent
+  information must be removed and the new one inserted
+
+* The tree hashcache information should be updated to reflect the stat
+  value at which the file was the same as the committed version, and the
+  content hash it was observed to have.  This needs to be done carefully to
+  prevent inconsistencies if the file is modified during or shortly after
+  the commit.  Perhaps it would work to read the mtime of the file before we
+  read its text to commit.
+
+
+Interface stack
+---------------
+
+The commit api is invoked by the command interface, and copies information
+from the tree into the branch and its repository, possibly updating the
+WorkingTree afterwards.
+
+The command interface passes:
+
+* a commit message (from an option, if any),
+* or an indication that it should be read interactively from the ui object;
+* a list of files to commit
+* an option for a dry-run commit
+* verbose option, or callback to indicate 
+* timestamp, timezone, committer, chosen revision id
+* config (for what?)
+* option for local-only commit on a bound branch
+* option for strict commits (fail if there are unknown or missing files)
+* option to allow "pointless" commits (with no tree changes)
+ 
+(This is rather a lot of options to pass individually and just for code tidyness maybe some of them should be combine into objects.)
+
+>>> Branch.commit(from_tree, message, files_to_commit, ...)
+
+There will be different implementations of this for different Branch
+classes, whether for foreign branches or Bazaar repositories using
+different storage methods.
+
+Most of the commit should occur during a single lockstep iteration across
+the workingtree and parent trees.  The WorkingTree interface needs to
+provide methods that give commit all it needs.  Some of these methods
+(such as answering the file's last change revision) may be deprecated in
+newer working trees and there we have a choice of either calculating the
+value from the data that is present, or refusing to support commit to
+newer repositories.
+
+For a dirstate tree the iteration of changes from the parent can easily be
+done within its own iter_changes.
+
+Dirstate inventories may be most easily updated in a single operation at
+the end; however it may be best to accumulate data as we proceed through
+the tree rather than revisiting it at the end.
+
+Showing a progress bar for commit may not be necessary if we report files
+as they are committed.  Alternatively we could transiently show a progress
+bar for each directory that's scanned, even if no changes are observed.
+
+This needs to collect a list of added/changed/removed files, each of which
+must have its text stored (if any) and containing directory updated.  This
+can be done by calling Tree._iter_changes on the source tree, asking for
+changes 
+
+In the 0.17 model the commit operation needs to know the per-file parents
+and per-file last-changed revision.  
+
+(In this and other operations we must avoid having multiple layers walk
+over the tree separately.  For example, it is no good to have the Command
+layer walk the tree to generate a list of all file ids to commit, because
+the tree will also be walked later.  The layers that do need to operate
+per-file should probably be bound together in a per-dirblock iterator,
+rather than each iterating independently.)
+
+Branch->Tree interface
+----------------------
+
+The Branch commit code needs to ask the Tree what should be committed, in
+terms of changes from the parent revisions.  If the Tree holds all the
+necessary parent tree information itself it can do it single handed;
+otherwise it may need to ask the Repository for parent information.
+
+This should be a streaming interface, probably like iter_changes returning
+information per directory block.
+
+The interface should not return a block for directories that are
+recursively unchanged.
+
+The tree's idea of what is possibly changed may be more conservative than
+that of the branch.  For example the tree may report on merges of files
+where the text is identical to the parents: this must be recorded for
+Bazaar branches that record per-file ancestry but is not necessary for all
+branches.  If the tree is responsible for determining when directories
+have been recursively modified then it will report on all the parents of
+such files.  There are several implementation options:
+
+1. Return all files and directories the branch might want to commit, even
+if the branch ends up taking no action on them.
+
+2. When starting the iteration, the branch can specify what type of change
+is considered interesting.
+
+Since these types of changes are probably (??) rare compared to files that
+are either completely unmodified or substantially modified, the first may
+be the best and simplest option.
+
+The branch needs to build an inventory to commit, which must include
+unchanged files within changed directories.  This should be returned from
+the working tree too.  Repositories that store per-directory inventories
+will want to build and store these from the lowest directories up.
+For 0.17 format repositories with an all-in-one inventory it may be
+easiest to accumulate inventory entries in arbitrary order into an
+in-memory Inventory and then serialize it.
+
+It ought to be possible to commit any Tree into a Branch, without
+requiring a WorkingTree; the commit code should cope if the tree is not
+interested in updating hashcache information or does not have a
+``last_revision``.
+
+
+Information from the tree to repository
+---------------------------------------
+
+The main things the tree needs to tell the Branch about are:
+
+* A file is modified from its parent revision (in text, permissions,
+  other), and so its text may need to be stored.  
+  
+  Files should also be reported if they have more than one unique parent
+  revision, for repositories that store per-file graphs or last-change
+  revisions.  Perhaps this behaviour should be optional.
+
+  **XXX:** are renames/deletions reported here too?
+
+* The complete contents of a modified directory, so that its inventory
+  text may be stored.  This should be done after all the contained files
+  and directories have been reported.  If there are unmodified files, 
+  or unselected files carried through from 
+
+  XXX: Actually perhaps not grouped by directory, but rather grouped 
+  appropriately for the shape of inventory storage in the repository. 
+
+  In a zoomed-in checkout the workingtree may not have all the shape data
+  for the entire tree.
+
+* A file is missing -- could cause either automatic removal or an aborted
+  commit.
+
+* Any unknown files -- can cause automatic addition, abortion of a strict
+  commit, or just reporting.
+
+
+Information from the repository to the tree
+-------------------------------------------
+
+After the commit the tree needs to be updated to the new revision.  Some
+information which was accumulated during the commit must be made available
+to the workingtree.  It's probably reasonable to hold it all in memory and
+allow the workingtree to get it in whatever order it wants.
+
+* A list of modified entries, and for each one:
+
+  * The stat values observed when the file was first read.
+
+  * The hash of the committed file text.
+
+  * The file's last-change revision, if appropriate.
+
+  This should include any entries automatically added or removed.
+
+This might be construed as an enhanced version of ``set_parent_trees``.
+We can avoid a stat on each file by using the value that was observed when
+it was first read.
+ 
+
+
+Selective commit
+----------------
+
+For a partial commit the directory contents may need to contain a mix of
+entries from the working tree and parent trees.  This code probably
+shouldn't live in a specific tree implementation; maybe there should be a
+general filter that selects paths from one tree into another?
+
+However, the tree walking code does probably need to know about selected
+paths to avoid examining unselected files or directories.
+
+We never refuse selective file commits (except of merges).  
+
+
+
+Common commit code
+------------------
+
+What is common to all commit implementations, regardless of workingtree or
+repository format?
+
+* Prompting for a commit message?
+* Strictness/conflict checks?
+* Auto add/remove?
+
+How should this be separated?
+
+
+
+Order of traversal
+------------------
+
+For current and contemplated Bazaar storage formats, we can only finally
+commit a directory after its contained files and directories have been
+committed.
+
+The dirstate workingtree format naturally iterates by directory in order
+by path, yielding directories before their contents.  This may also be the
+most efficient order in which to stat and read the files.
+
+One option would be to construe the interface as a visitor which reports
+when files are detected to be changed, and also when directories are
+finished.
+
+
+Open question: per-file graphs
+------------------------------
+
+**XXX:** If we want to retain explicitly stored per-file graphs, it would
+seem that we do need to record per-file parents.  We have not yet finally
+settled that we do want to remove them or treat them as a cache.  This api
+stack is still ok whether we do or not, but the internals of it may
+change.

=== modified file 'doc/developers/performance-roadmap.txt'
--- a/doc/developers/performance-roadmap.txt	2007-06-19 00:48:22 +0000
+++ b/doc/developers/performance-roadmap.txt	2007-06-26 04:55:33 +0000
@@ -43,3 +43,7 @@
 .. include:: merge-scaling.txt
 
 .. include:: bundle-creation.txt
+
+.. include:: commit.txt
+
+.. include:: uncommit.txt

=== modified file 'doc/developers/performance-use-case-analysis.txt'
--- a/doc/developers/performance-use-case-analysis.txt	2007-06-05 08:02:04 +0000
+++ b/doc/developers/performance-use-case-analysis.txt	2007-06-18 02:04:20 +0000
@@ -1,3 +1,7 @@
+.. This document describes _how_ to do use case analyses and what we want
+.. to get out of them; for the specific cases see the files referenced by
+.. performance-roadmap.txt 
+
 Analysing a specific use case
 =============================
 
@@ -49,7 +53,7 @@
    - The time to get bzr ready to begin the use case.
 
 - scaling: how does performance change when any of the follow aspects
-  of the system are ratched massively up or down:
+  of the system are ratcheted massively up or down:
 
    - number of files/dirs/symlinks/subtrees in a tree (both working and 
      revision trees)
@@ -100,7 +104,7 @@
 Many performance problems only become visible when changing the scaling knobs
 upwards to large trees. On small trees its our baseline performance that drives
 incremental improvements; on large trees its the amount of processing per item
-that drives performance. A significant goal therefore is to keep the amouont of
+that drives performance. A significant goal therefore is to keep the amount of
 data to be processed under control. Ideally we can scale in a sublinear fashion
 for all operations, but we MUST NOT scale even linearly for operations that
 invoke a latency multiplier. For example, reading a file on disk requires