performance analysis of commit

Robert Collins robertc at robertcollins.net
Thu May 10 07:56:30 BST 2007


On Thu, 2007-05-10 at 16:38 +1000, Martin Pool wrote:
> Here is a description of the minimum work that commit must do.  We
> want to make sure that our design doesn't cost too much more than this
> minimum.  I am trying to do this without making too many assumptions
> about the underlying storage, but am assuming that the ui and basic
> architecture (wt, branch, repo) stays about the same.

Excellent. This is precisely what I was hoping we would do during the
start of the roadmap.

> [After discussion I will put this into doc/developer.  Please do let
> me know if you think this is useful, or if you like it pick a case and
> do one yourself.]
> 
> The basic purpose of commit is to
> 1 - create and store a new revision based on the contents of the working tree
> 2 - make this the new basis revision for the working tree
> 
> We can do a selected commit of only some files or subtrees.
> 
> The best performance we could hope for is:
>  - stat each versioned selected working file once
>  - read from the workingtree and write into the repository any new file texts
>  - in general, do work proportional to the size of the shape (eg
> inventory) of the old and new selected trees, and to the total size of
> the modified files

In terms of total work this is true. I think that phrasing this as best
performance is a little unclear - in terms of performance we can change
the time radically by interactions between stat order, file content
reading and repository writing. So I'd say 'the minimal work we could
hope to do is' - and add a note about updating the working tree
description.

> In more detail:
> 
> 1.0 - Store new file texts: if a versioned file contains a new text
> there is no avoiding storing it.  To determine which ones have changed
> we must go over the workingtree and at least stat each file.  If the
> file is modified since it was last hashed, it must be read in.
> Ideally we would read it only once, and either notice that it has not
> changed, or store it at that point.
> 
> On the other hand we want new code to be able to handle files that are
> larger than will fit in memory.  We may then need to read each file up
> to two times: once to determine if there is a new text and calculate
> its hash, and again to store it.

Perhaps we for files that miss the stat check we could both store and
hash it at the same time, as the common case for a file that is not in
the stat cache is one that has been modified and will need storage. In
the uncommon case we would have a stored text that we can discard/let
get gc'd/some other thing.

> In fact there are some user interface niceties that complicate this:
> 
> 3 - Before starting the commit proper, we prompt for a commit message
> and in that commit message editor we show a list of the files that
> will be committed: basically the output of bzr status.  This is
> basically the same as the list of changes we detect while storing the
> commit, but because the user will sometimes change the tree after
> opening the commit editor and expect the final state to be committed I
> think we do have to look for changes twice.  Since it takes the user a
> while to enter a message this is not a big problem as long as both the
> status summary and the commit are individually fast.

I thought we deferred this now until the end of the commit process?
(Look for message_callback in commit.py). We have not followed through
on this to actually reduce work, but in principle the information needed
for the commit prompt is gatherable during the commit logic that creates
the new inventory. There is some interactions here with respect to
unknowns and ignored files. I think we should aim for no-rework though.

> 5 - Bazaar currently allows for missing files to be automatically
> marked as removed at the time of commit.  Leaving aside the ui
> consequences, this means that we have to update the working inventory
> to mark these files as removed.  Since as discussed above we always
> have to rewrite the dirstate on commit this is not substantial, though
> we should make sure we do this in one pass, not two.  I have
> previously proposed to make this behaviour a non-default option.

Agreed. I think the hg way of having -A both auto add and auto remove
would be nice to copy.

> We may need to run hooks or generate signatures during commit, but
> they don't seem to have substantial performance consequences.

Generating a signature requires a manifest of the tree; thats O(tree
size) always (as opposed to O(size of named paths)). 

> If one wanted to optimize solely for the speed of commit I think
> hash-addressed  file-per-text storage like in git (or bzr 0.1) is very
> good.  

Agreed.

> Remarkably, it does not need to read the inventory for the
> previous revision. For each versioned file, we just need to get its
> hash, either by reading the file or validating its stat data.

Well we do have per-file merge graphs that we maintain. While its true
that this does not imply reading the inventories for the previous
revisions, it does imply some historical-referencing of data.

> Variations on this are possible.  Rather than writing a single file
> into the repository for each text, we could fold them into a single
> collation or pack file.  That would create a smaller number of files
> in the repository, but looking up a single text would require looking
> into their indexes rather than just asking the filesystem.
> 
> Rather than using hashes we can use file-id/rev-id pairs as at
> present, which has several consequences pro and con.

Or a combination: index by hash, name by file-id:rev-id. I think we
should examine this closely in London.

This analysis looks very nice and extremely useful to have. It would be
good to drill into the IO and API implications a bit more.

Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20070510/ae9bca6b/attachment.pgp 


More information about the bazaar mailing list