Rev 2486: Incremental push-pull performance anlysis draft. in file:///home/robertc/source/baz/roadmap/
robertc at robertcollins.net
Wed May 9 16:36:14 BST 2007
revision-id: robertc at robertcollins.net-20070509153606-m556z2rb3d76b2f3
parent: robertc at robertcollins.net-20070508041401-rreosr07ptz41d6c
committer: Robert Collins <robertc at robertcollins.net>
branch nick: roadmap
timestamp: Thu 2007-05-10 01:36:06 +1000
Incremental push-pull performance anlysis draft.
=== added file 'doc/developers/incremental-push-pull.txt'
--- a/doc/developers/incremental-push-pull.txt 1970-01-01 00:00:00 +0000
+++ b/doc/developers/incremental-push-pull.txt 2007-05-09 15:36:06 +0000
@@ -0,0 +1,242 @@
+This use case covers pulling in or pushing out some number of revisions which
+is typically a small fraction of the number already present in the target
+repository. Pushing and pulling are defined as branch level operations for ease
+of interaction with VCS systems that have no repository abstraction (such as
+bzr-svn or GNU Arch) but within bzrlib's core they are currently the
+responsibility of the Repository object.
+A push or pull operation must:
+ * Copy all the data to reconstruct the selected revisions in the target
+ branch. This is the goal of push and pull after all.
+ * Reject corrupt data. As bzr has no innate mechanism for discarding corrupted
+ data, corrupted data should not be incorporated accidentally.
+Factors which should add work for push/pull
+ * Baseline overhead: The time to connect to both branches.
+ * Actual new data in the revisions being pulled (drives the amount of data to
+ move around, includes the commit messages etc)
+ * Number of revisions in the two repositories (scaling affects the
+ determination of what revisions to move around).
+1. New data is identified in the source repository.
+2. That data is read from the source repository.
+3. The same data is verified and written to the target repository in such a
+ manner that its not visible to readers until its ready for use.
+New data identification
+We have a single top level data object: revisions. Everything else is
+subordinate to revisions, so determining the revisions to propogate should be
+all thats needed. This depends on revisions with partial data - such as those
+with no signature - being flagged in some efficient manner.
+We could do this in two manners: determine revisions to sync and signatures to sync in two passes, or change the 'value' of a revision implicitly when the signature is different. E.g. by using merkle hash trees with the signature data a separate component the signatures will naturally be identified to sync.
+We want to only exchange data proportional to the number of new revisions and
+signatures in the system though. One way to achieve this for revisions is to
+walk the graph out from the desired tips until the surface area intersection is
+found. For signatures a set difference seems to be needed as there is no DAG of signatures: the presence of one has no implications on the presence of another, so a full pass over the set of signatures would be required to confirm no new signatures are needed (let alone replaced signatures).
+IFF we can determine 'new revisions' and 'new signatures' without full graph access then we can scale acceptable for push and pull.
+Ghosts are revisions which are not present in a particular repository. Filling ghosts refers to removing ghosts in the target repository when the ghost is present in the source repository. Filling ghosts can be either an explicit or implicit action. The common case is no ghosts.
+Set synchronisation approaches
+A set synchronisation approach is one which synchronises two sets without
+regard for innate structure. This can be very efficient but requires adding a
+new node to be processed with every commit. Caching of the results of the
+various set based syncs I've seen is possible but because the data structures
+look different depending on the tip revision being synced up to the cache needs
+to be very complex. I recommend not using such an approach for the common case
+pull because of the failure to scale. We can use such an approach for
+synchronisation of new signatures and ghosts, which should be an explicit
+option in both cases.
+DAG synchronisation approaches
+A DAG based approach to synchronistion is one that uses the DAG structure to
+determine the difference in present nodes. It can as a result operate from the
+tip of the DAG backwards. A dag based approach should allow incremental access
+to data and not require a full-graph scan for incremental operations.
+File level scaling
+We should read roughly as much of the revision level graph as is needed from
+each repository to determine the node difference. If requested we should
+perform a detailed scan to pick up ghost revisions and revisions which have had
+signatures added. This should not be the default as it requires full history
+access in both cases.
+Expected file IO and access pattern:
+ * Common case: repo with many branches of one project, to the same.
+ 1. Source and Target branch tips read.
+ 2. Find the tip of each branch in their repo (will require reading some of
+ the revision graph but is typically near the end of the graph).
+ 3. Read and parse increasing amounts of the revision graph until one is
+ found to be a subset of the other, or a complete list of revisions to be
+ transmitted is created.
+ * Uncommon cases:
+ 1. Repositories with many projects or branches which are very old may
+ require reading a lot of unrelated graph data.
+ 1. Initial push/pull scenarios should not require reading an entire graph.
+ 1. Get branch tips.
+ 2. Determine one sided graph difference. To avoid obtaining a full graph over
+ the wire this needs to be done without reference to the full graph, and
+ with some logarthmic scaling algorithm. There are several already available
+ for this.
+With ghost and new-signature detection:
+ * File IO access pattern will read the entire graph on the 'target' side - if
+ no ghosts are present then stop, otherwise seek the new revisions on the
+ source side with the regular algorithm and also explicitly search for the
+ ghost points from the target; plus a set difference search is needed on
+ * Semantic level can probably be tuned, but as its also complex I suggest
+ deferring analysis for optimal behaviour of this use case.
+When transferring information about a revision the graph of data for the
+revision is walked: revision -> inventory, revision -> matching signature,
+inventory -> file ids:revision pairs.
+File level scaling
+As we're reading already committed data, as long as nothing is mutating data on
+disk reading should be race free. We will:
+ - read each revision object
+ - read the matching inventory delta
+ - attempt to read a signature object
+ - parse the inventory delta
+ - read the fileid:revisionid compressed chunk for each line in the inventory
+Theres no point validating that the data read is valid, as transmission through to the client writing the data might invalidate it; we need to validate before we write.
+Given that we have established the revisions needed, a single API call should
+suffice to obtain all data; the API should present the data in such an order
+that it can be validated as it arrives and thus not require large scale
+buffering on disk. Specifically each item of data should be validatable (e.g.
+for some file data we want the fileid:revisionid:validationhash + content).
+Data Verification and writing
+New data written to a repository should be completed intact when it is made
+visible. This suggests that either all the data for a revision must be made
+atomically visible (e.g. by renaming a single file) or the leaf nodes of the
+reference graph must become visible first.
+Data is referred to via the following graph:
+revision -> revision
+revision -> signature
+revision -> inventory
+inventory -> fileid:revisionid
+fileid:revisionid -> fileid:revisionid
+Data is verifiable via a different ordering:
+signature -> revision -> inventory -> fileid:revisionid texts.
+We dont gpg verify each revision today; this analysis only speaks to hash
+verification of contents.
+To validate a revision we need to validate the data it refers to. But to
+validate the contents of a revision we need the new texts in the inventory for
+the revision - to check a fileid:revisionid we need to know the expected sha1
+of the full text and thus also need to read the delta chain to construct the
+text as we accept it to determine if its valid. Providing separate validators
+for the chosen representation would address this.
+e.g: For an inventory entry FILEID:REVISIONID we store the validator of the
+full text :SHA1:. If we also stored the validator of the chosen disk
+representation (:DELTASHA1:) we could validate the transmitted representation
+without expanding the delta in the common case. If that failed we could expand
+the delta chain and try against the full text validator, and finally fail. As
+different delta generators might generate different deltas, :DELTASHA1: should
+not become part of the revision validator, only the inventory disk encoding. In
+a related manner a transmission format that allowed cheap validation of content
+without applying locally stored deltas would be advantageous because no local
+reads would be incurred to validate new content. For instance, always sending a
+full text for any file, possibly with a delta-chain when transmitting multiple
+revisionids of the file, would allow this. (git pack-files have this property).
+A single-file local format would allow safe atomic addition of data while
+allowing optimisal transmission order of data. Failing this the validation of
+data should be tuned to not require reading local texts during data addition
+even in the presence of delta chains. We should have transmission-validators
+separate from content validators that allow validation of the delta-transmitted
+form of objects.
+File level scaling
+* Every new file text requires transmission and local serialisation.
+* Every commit requires transmission and storage of a revision, signature and inventory.
+Thus 4000 commits to a 50000 path tree of 10 files on averages requires (with
+knits) between 26 writes (2*(3+10)) and 80006 (2*(4000*10 + 3)) writes. In all
+cases there are 4000 * 13 distinct objects to record.
+Grouping data by fileid, content and metadata, gives the figures above.
+* File per full identifier (fileid:revisionid:meta|content): 104000
+* Delta-chain per object: object id count * constant overhead per object id
+ (26 -> 80006)
+* Collation/pack file: 1
+Performance for these depends heavily on implementation:
+ - Using full ids we could name by validator or by id, giving best performance
+ that depends on either receiving data in validator order or in id order.
+ - using delta-chain per object we get least seek overhead and syscall overhead
+ if we recieve in topological order within the object id, and object ids in
+ lexical order.
+ - Using a collation/pack file we can stream it into place and validate as we go,
+ giving near ideal performance.
+The api for writing new data recieved over the network will need to be geared
+to the transmission and local storage method. What we need is for the
+transmission method to reasonably closely match the desired write ordering
+locally. This suggests that once we decide on the best local storage means we
+should design the api.
=== added file 'doc/developers/performance-use-case-analysis.txt'
--- a/doc/developers/performance-use-case-analysis.txt 1970-01-01 00:00:00 +0000
+++ b/doc/developers/performance-use-case-analysis.txt 2007-05-09 15:36:06 +0000
@@ -0,0 +1,108 @@
+Analysing a specific use case
+The analysis of a use case needs to provide as outputs:
+ * The functional requirements that the use case has to satisfy.
+ * The file level operations and access patterns that will give the best
+ * A low friction API which will allow the use case to be implemented.
+ * The release of bzr (and thus the supported features) for which the analysis
+ was performed. The feature set of bzr defines the access patterns and data
+ required to implement any use case. So when we add features, their design
+ changes the requirements for the parts of the system they alter, so we need
+ to re-analyse use cases when bzr's feature set changes. If future plans are
+ considered in the analysis with the intention of avoiding rework, these
+ should also be mentioned.
+Performing the analysis
+The analysis needs to be able to define the characteristics of the
+involved disk storage and APIs. That means we need to examine the data
+required for the operation, in what order it is required, on both the
+read and write sides, and how that needs to be presented to be
+consistent with our layering.
+As a quick example: 'annotation of a file requires the file id looked up
+from the tree, the basis revision id from the tree, and then the text of
+that fileid-revisionid pair along with the creating revision id
+allocated to each line, and the dotted revision number of each of those
+revision ids.' All three of our key domain objects are involved here,
+but we haven't defined any characteristics of the api or disk facilities
+yet. We could then do that by saying something like 'the file-id lookup
+should degrade gracefully as trees become huge. The tree basis id should
+be constant time. Retrieval of the annotated text should be roughly
+constant for any text of the same size regardless of the number of
+revisions contributing to its content. Mapping of the revision ids to
+dotted revnos could be done as the text is retrieved, but its completely
+fine to post-process the annotated text to obtain dotted-revnos.'
+What factors should be considered?
+Obviously, those that will make for an extremely fast system :). There
+are many possible factors, but the ones I think are most interesting to
+design with are:
+- baseline overhead:
+ - The time to get bzr ready to begin the use case.
+- scaling: how does performance change when any of the follow aspects
+ of the system are ratched massively up or down:
+ - number of files/dirs/symlinks/subtrees in a tree (both working and
+ revision trees)
+ - size of any particular file
+ - number of elements within a single directory
+ - length of symlinks
+ - number of changes to any file over time
+ (subordinately also the number of merges of the file)
+ - number of commits in the ancestry of a branch
+ (subordinately also the number of merges)
+ - number of revisions in a repository
+ - number of fileids in a repository
+ - number of ghosts in a given graph (revision or per-file)
+ - number of branches in a repository
+ - number of concurrent readers for a tree/branch/repository
+ - number of concurrent writers for objects that support that.
+ - latency to perform file operations (e.g. slow disks, network file systems,
+ our VFS layer and FTP/SFTP/etc)
+ - bandwidth to the disk storage
+ - latency to perform semantic operations (hpss specific)
+ - bandwidth when performing semantic operations.
+- locality of reference: If an operation requires data that is located
+ within a small region at any point, we often get better performance
+ than with an implementation of the same operation that requires the
+ same amount of data but with a lower locality of reference. Its
+ fairly tricky to add locality of reference after the fact, so I think
+ its worth considering up front.
+Using these factors, to the annotate example we can add that its
+reasonable to do two 'semantic' round trips to the local tree, one to
+the branch object, and two to the repository. In file-operation level
+measurements, in an ideal world there would be no more than one round
+trip for each semantic operation. What there must not be is one round
+trip per revision involved in the revisionid->dotted number mapping, nor
+per each revision id attributed to a line in the text.
+Not all the items mentioned above are created equal. The analysis should
+include the parameters considered and the common case values for each - the
+optimisation should be around the common cases not around the exceptions.
+For instance, we have a smart server now; file level operations are relatively
+low latency and we should use that as the common case. At this point we intend
+to preserve the performance of the dumb protocol networking, but focus on
+improving network performance via the smart server and thus escape the
+file-level operation latency considerations.
+Many performance problems only become visible when changing the scaling knobs
+upwards to large trees. On small trees its our baseline performance that drives
+incremental improvements; on large trees its the amount of processing per item
+that drives performance. A significant goal therefore is to keep the amouont of
+data to be processed under control. Ideally we can scale in a sublinear fashion
+for all operations, but we MUST NOT scale even linearly for operations that
+invoke a latency multiplier. For example, reading a file on disk requires
+finding the inode for the file, then the block with the data and returning the
+contents. Due to directory grouping logic we pay a massive price to read files
+if we do not group the reads of files within the same directory.
=== modified file '.bzrignore'
--- a/.bzrignore 2007-04-12 20:27:42 +0000
+++ b/.bzrignore 2007-05-09 15:36:06 +0000
@@ -34,3 +34,4 @@
=== modified file 'doc/developers/performance-roadmap.txt'
--- a/doc/developers/performance-roadmap.txt 2007-05-07 17:49:30 +0000
+++ b/doc/developers/performance-roadmap.txt 2007-05-09 15:36:06 +0000
@@ -7,4 +7,9 @@
.. include:: performance-roadmap-rationale.txt
+Analysis of use cases
+.. include:: performance-use-case-analysis.txt
+.. include:: incremental-push-pull.txt
More information about the bazaar-commits