dirstate2 comments

Mon May 31 07:09:16 BST 2010

I had a look at the dirstate2 branch too.  The full diff is attached
for ease of reading by others.  I want to start just looking at the
format documentation inline.

+class IndirectedDirState(AbstractDirState):
+    """DirState implemented using an indirection pointer and hash-named files.
+
+    Unlike DirState, IndirectedDirState is nearly atomic on the file system.
+    The base name of the dirstate specifies a directory containing a 'current'
+    file, and many content files named with their hash. Updates to
+    IndirectedDirState are very robust and do not lock out other readers.

Perhaps it's useful to document what our expectations of the
filesystem are; we want them to be lower than for current dirstate and
I think we want the following:

 * we never use OS locks: writers may block each other out but readers do not
 * however, we also cannot count on being able to
delete/rename/overwrite a file while it's being read (because this
doesn't work on windows)
 * if a file is deleted, renamed, or renamed-over while being read, we
can't make any assumptions about what will happen: the reader may keep
reading the old file (typical on unix) or the attempted update may
fail (typical on windows) or the reader may start reading from the new
file (possible on network filesystems)
 * readers shouldn't need to physically write (ie they shouldn't be
required to recover from an interrupted write to proceed)

This seems to imply that readers must be able to cope with the file
being yanked from under them, and writers must be able to cope with
attempts to rename/delete/overwrite any particular file being at least
momentarily denied.

and maybe performance constraints (they kind of follow from building
on dirstate)
 * looking up any single entry in the dirstate takes work proportional
ln(size_of_tree) + a constant overhead
 * reading the whole state is O(size of tree) + constant overhead

+
+    This is intended to be used with a WorkingTree object that has an explicit
+    LockDir lock to mutually exclude semantic writers. Thus there are two forms
+    of changes that can be made: changes from logical read-locks, which are
+    limited to stat cache updates, and changes from logical write-locks which
+    can add file ids, change parents etc. The design and semantics therefore
+    assume that semantic changes are externally limited to only one process
+    at a time, and that stat cache updates may happen anytime.

OK, and what's worth calling out here is that the changes made during
a logical read lock can proceed without holding the lock dir.  They
can silently cancel themselves if they find the dirstate was updated
since their lock scope began, and they can also be silently discarded
by any writer that overlaps with them.

+    Disk layout
+    -----------
+    basepath/
+      format = "bzr IndirectDirState 1"
+      current = "$MD5HASH\\n".
+      $MD5HASH.check = "$OTHERMD5HASH\\n". Used when checking if current should
+        be replaced.
+      $MD5HASH.delete = "$OTHERMD5HASH\\n". Used when a state file is
queued for
+        deletion.
+      $MD5HASH = a dirstate state file

whose hash is as in its name?

+      $MD5HASH.current = a candidate replacement for current, about to be
+        moved into place.

Does that mean $MD5HASH.current just contains the string $MD5HASH\n?

Perhaps it would be easier to read if you say there is:

 * one format marker: 'format'
 * any number of dirstate state files, named by their md5 hash
 * several pointer files, each of which contain an md5hash followed by
a newline:
     current => $m: the current state is in state file $m
     $a.check = $b: meaning what?
     $a.current => $b: ?
     $a.delete => $b: ?

So 'check' seems to mean: this is the current file, moved away so that
it can be updated.  The named state should exist.  The update can be
rolled back by moving this back to current.

Like John, I don't find these names very obvious.  Perhaps we could
have "last", "next", "delete".

+
+    Invariants
+    ----------
+
+    Only one of the following statements is true at any time:
+     - There is a file 'current'
+     - There is a file '*.check'
+     - There is a file pair 'MD5HASH.current + MD5HASH.delete'
+    This invariant gives us an unambiguous pointer at any point in time.

Yay, though this description is not quite unambiguous :-)  I think you mean:

 - if there's a current file, that is the current state
 - if there's exactly one file ending in .check, that points to the
current state?
 - if there's a file pair of that kind then what

+    File system arbitrary-ordering can in principle break this invariant, as
+    we don't have fbarriers. Excluding that, a crash after renaming a .check
+    to .delete leaves the dirstate with no current, forcing recovery, and
+    preventing the creation of more .delete files. A crash after renaming
+    MD5HASH.current to current means that we cannot have both MD5HASH.current
+    and MD5HASH.delete, keeping the pointer unambiguous.

There's no 'in principle" about it: machines or processes do stop
leaving files behind all the time.  In particular it seems very common
to have files either missing or zero-length.  So I'd like to see a
description of how we'd interpret any logically possible state, not
just any state that should occur, and if necessary how we clean up
from those that shouldn't occur.

This description doesn't really make sense at this point in the
document because you haven't described what the pointers mean yet.

+
+    Operations
+    ----------

For ease of reading I would put the read state first as it seems like
the shallow end.

+
+    The insert sequence is:
+     - statefilename = md5(content)

ok, so you have to either accumulate the whole thing in memory, or in
a temporary file, then hash it.

+     - write new statefilename
+     - write statefilename.current

containing the hash?

+     - rename current statefilename.check
+       If this step fails, then another process is attempting to write, or
+       the dirstate is in need of recovery. If a logical write is being
+       attempted, spinning is recommended; if a state cache update is in
+       progress, rollback is the most appropriate action.
+     - At this point we're in a mutex - no other processes can enter this
+       state without performing recovery. Checks such as 'was the current
+       pointer changed concurrently' can be performed here. This is needed
+       so that stat-cache only changes ('logical read-lock') can rollback
+       if the current pointer was changed (so that they do not overwrite
+       changes from a logical write-lock).
+     [rollback]
+      - rename statefilename.check current if current was renamed succesfully.
+        nb: other processes may be reading the .check file and thus this step
+        requires a small spinloop.
+      - remove statefilename.current
+      - remove statefilename
+     [complete]
+      - rename statefilename.check statefilename.delete
+        nb: other processes may be reading the .check file and thus this step
+        requires a small spinloop.
+      - rename statefilename.current current
+      - remove the old statefile named in statefilename.delete
+      - remove statefilename.delete

Why rename it rather than just deleting it in one go?

Experience suggests that the .delete file is not going to function
very well as a barrier marker if the machine stops abruptly.

Is the point of it perhaps to distinguish between a file that's not
referenced because it's just been created, and a file that's not
referenced because we wanted to delete it but failed?

+    Read sequence:
+     - get the current pointer: read current, or if it is absent *.check (
+       handling a listed-but-gone when reading .check file as an absent pointer
+       - that is loop).
+     - read the pointer file, or if it is absent loop on getting a pointer

I assume the first one means: if there is no current file, look at any
.check files that are present (in arbitrary order?) until you find one
whose file exists?

It seems we should handle the empirically common case of a file being
present but empty.

+
+    Closing a dirstate - lazy deletion of in-use MD5HASH files.
+     - consult the current pointer, if it does not refer to the last file read
+       then attempt to delete that file, as it is no longer in use.

This is unclear but I think you mean: if I'm exiting a write(read?)
lock that started by reading state $a, and the $current at time of
exiting != $a, I should try to delete $a.

+
+    Recovery:
+     - if there is a current, delete everything but the format and the file
+       current points at.
+     - elif there is a MD5HASH .check and no current, mv the .check to current
+       and then remove the rest.
+     - elif there is a MD5HASH .delete with a MD5HASH.current, mv the
+       MD5HASH.current to current and remove the rest. If there is more than
+       one, bail and seek human help. (See invariants)

When does the recovery code run?

What is the human supposed to do?  istm it would be useful to at least
document what they should do or think about.  Then it may turn out
that some of it can be automated, or it may get to a point where we
should just rebuild the dirstate from the last commit - that would
lose uncommitted metadata, but at this point we would conclude the fs
crash had actually irretrievably lost it.

It seems like somewhere in recovery we might want to check that the
files actually have the purported hash before we use them or delete
anything else?

+
+    Commentary
+    ----------
+
+    The 'current' file contains a hash which is the current actual dirstate
+    to read. 'current' may contain the special value INITIALIZING\\n during
+    creation. Updates have two forms - stat cache updates (when a logical
+    read lock is in place), and semantic updates - when a write lock is in
+    place. The decision about whether to update is taken while current is
+    temporarily renamed to the NEWMD5HASH.check. If it will be replaced, it
+    is replaced and then renamed to .delete to indicate that it should be
+    deleted by later readers doing recovery.

Perhaps it's worth describing how the write-during-logical-read works.

+    After successfully updating 'current', the old state contained in the
+    content file named in '$hash.delete' should be removed. If the removal
+    errors due to other processing holding the file open, the errors can be
+    ignored.
+    Finally, when closing a dirstate state file, 'current' should be checked,
+    and if it is different to the dirstate file that was read initially, an
+    attempt to remove that dirstate file should be made - last one out close
+    the doors.
+
+    When reading an IndirectedDirState, check for a format file to determine
+    if the dirstate is a known format. If there is no current file, it is
+    either in-progress or an interrupted update has occured. Readers should
+    look for *.check to get the last-written dirstate, and read that
+    instead.  Concurrency concerns here mean that by the time .check is
+    processed current may have been updated and the old state file removed, so
+    a short loop is recommended.
+
+    Content files can possibly flip-flop by being restored to an identical
+    state, we may add a random component to guard against that as it can make
+    deleting unsafe.

Maybe.  It'd be nice to avoid it.

+    :ivar _last_pointer: The MD5HASH of the last opened state file.
+    """

On 20 May 2010 17:20, Robert Collins <robertc at robertcollins.net> wrote:
> On Thu, May 20, 2010 at 2:23 PM, John Arbash Meinel
> <john at arbash-meinel.com> wrote:
>
> Thanks for the eyeballs John.
>
>> Robert recently wanted to look at his dirstate2 branch, and wanted to
>> get some feedback on it. I've looked at it, but I won't say I've been
>> fully thorough.
>>
>> 1) Currently incomplete. Just reading what the docs mention, versus what
>> the code does.  However, the docs seem to outline something that would work.
>
> Yes, its minimal at the moment. I wanted to get early discussion
> before committing deeply to an implementation. I've merged all of
> trunk in just now, and will spend tomorrow poking at this - it would
> be nice to work better for emacs users, given their patience with
> savannah and sftp.

+1

>> 2) Initial thoughts is that it seems overly complicated. *Especially* on
>> platforms with atomic rename. Which also brings up the question of how
>> to make sure it is adequately tested on platforms where you can rename
>> open files. Maybe you've already solved this, or maybe it's why this
>> isn't fully complete yet.
>
> On this point specifically, my theory was that it wouldn't matter if
> you can do atomic renames - its correctness doesn't depend on renames
> over open files failing, its simply designed to not depend on open
> files renames succeeding.
>
>> 3) Depending on renaming files in/out of the way to get correct behavior
>> seems like it could be extra problematic on NFS. Then again, things may
>> not be worse than where we already are.
>
> We often fail completely on NFS at the moment with the dependency on
> OSLocks; even when we work we're likely to be running into cache
> coherency issues.
>
>> 4) I was a bit confused by the naming scheme ($MD5.check is the pointer
>> to the last file, while $MD5 is the new content, and $MD5.current is the
>> pointer to $MD5.) I don't have anything immediate, but giving an
>> extension to the new content, and possibly name the pointers as '.ptr'
>> or something, might make it a bit clearer what is what.
>
> Sure, we can do that.
>
>> 5) Adding a new directory under '.bzr/checkout' to hold this stuff seems
>> superfluous, especially since it seems to have yet-another format file.
>> Maybe I misunderstood, but it would be nice to just use .bzr/checkout.

Also, this doesn't seem to be documented in the IndirectedDirState class.

I agree it would be nicer to just put it in /checkout and make sure
the names are sufficiently well defined to support gc safely.

> This scheme has the potential to leave cruft behind and need garbage
> collecting; thats simpler if we can whitelist a small, constant set of
> files and then grab everything else. That was my main motivation for
> putting it in a subdir. The format marker was so that I could change
> things without worry about accidentally using an old test dir.

>> 6) If we are revisiting how dirstate works, what about splitting out the
>> stat file again. It seems that file is what gives us the most difficulty
>> (since it can be updated in a read lock). If we made it more clearly
>> separate, we might avoid some of the problems entirely. IIRC the main
>> desire to not have it was performance, but I wonder the real cost would
>> be if we structured it appropriately. Having to repeat all the paths
>> again would be bad, but maybe we can do better. IIRC we also do
>> something indexed on stat rather than by path, maybe that would be
>> useful here...
>
> I don't want to bite off too much; splittout of stat is, of course,
> possible - but its orthogonal to getting rid of the need for oslocks
> entirely, which is what windows needs for reliable operation.

Perhaps we can do this by simply checking the statcache is always
built during logical write operations, and then just not doing it
during logical reads.  We don't strictly need a new format.

I wonder how hard, or how bad it would be for performance, to just
throw away stat cache updates during a logical read lock.  It might be
an interesting experiment, and it seems like it might not be much
slower in realistic cases.  It would avoid an amount of complexity to
do with these low-priority transactions, and it would avoid some
operations that we know are risky on network filesystems that aren't
quite coherent.

-- 
Martin <http://launchpad.net/~mbp/>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: submit.diff
Type: text/x-patch
Size: 79439 bytes
Desc: not available
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20100531/b2ec2581/attachment-0001.bin