[RFC] making TreeTransform more resistant to unexpected failures / writing problems

Thu Aug 9 18:19:01 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Martin Pool wrote:
>> On 8/9/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
>>> Anyway, what I would actually like to see is for TreeTransform to try to
>>> rename-into-place and if that fails, it can rename it 'to the side' and
>>> mark the file as conflicted.
> 
> We don't do atomic renames here; we always delete the original before
> installing the new one.  This means that if the old file cannot be
> deleted, it will fail on delete, not on rename.
> 
> Also, I'm not clear on which file you're renaming to the side.  It
> sounds like you're renaming the existing file, but will that work?  I'm
> not clear on that.

Well for case conflicts, TT is failing during the rename into place.
Which is the only thing I've had experience with. So some of this is
just because of a bit of misunderstanding.

I also didn't realize that you actually nuke the target file. I thought
you renamed it out of the way, renamed a new one in place, and then
deleted the out of the way one.
According to you, you are deleting all targets, and then renaming all
new entries into place. Which is fine, but just a different sequence
than I was thinking.

> 
>>> This probably has some purity issues for Aaron, as he has always wanted
>>> to detect all conflicts in the beginning, so that by the time he gets
>>> around to actually doing the updates, he knows what will happen.
> 
> I certainly do.  For one thing, it's nice to know what applying a
> transform will actually do.  We don't say "change this tree in this
> way...maybe", and I don't know why you'd want to.

Because of reality... There are times when you cannot do what was
requested, so rather than failing with an abort, you should instead
record what you were actually able to accomplish, and have a way to let
people know what happened.

> 
> But on a more practical level
> - renames frequently require moving file ids, also.
> - renames can produce conflicts with other files.
> - renames can produce conflicts with other file ids
> - the THIS, BASE, OTHER conflict code is at a higher level than
>   TreeTransform.
> 
>>> However, it works for case conflicts
> 
> I think that's a really horrible way to handle case conflicts.  Our
> standard code for detecting duplicate file names should be handling that
> case.
> 

Well, there are a lot of edge cases that you will start to have to
cover. Which is *already* handled by the host OS.

Specifically, you have to be aware of when you are creating 2 files with
similar names (so there is no target files yet, but you are trying to
create both Makefile and makefile).
You have to be aware of when one of the target files already exist, but
not the other one. (Makefile is present on disk, and you are trying to
create 'makefile').
On Mac, you have to not only be aware of case conflicts, but also
unicode normalization conflicts. (You are trying to create u'\xe5' ==
'\xc3\xa5', but u'a\u030a' == 'a\xcc\x8a' already exists on disk).

I could see doing an os.lstat() for the case when the target already
exists on disk (since the os will let you know if that collides with an
existing filename). It would handle both case and unicode normalization.
However, when you add 2 files that differ only by case/normalization on
Linux (which is allowed if a bit broken).

Now, on Mac TT won't actually die because it successfully renames over
an existing file. But it leaves you in a pretty weird state. (I think
the commit code will say everything is fine, because it can stat both
u'\xe5' *and* u'a\u030a', but status will say u'\xe5' is missing because
it uses a different algorithm.)

>> , as well as locked files, or files
>>> showing up while you are renaming into place. (I guess that means it
>>> also needs to conflict if the '_apply_deletions' step fails).
> 
> Depends on the failure.  If the file disappears before I can remove it,
> should that be a failure?

It depends what you mean by failure. Right now on Windows, if you have a
symlink introduced, or a case-conflict, as it is updating your tree, it
goes *puke* and leaves you in a very unknown state. The
inventory/dirstate is does not reflect the actual filesystem, and
getting yourself out of this might be possible with a revert, but that
(at a minimum) causes all uncommitted changes to be lost.

> 
>>> Ultimately, I would rather not rollback the entire transaction, instead,
>>> just mark a few more files as conflicted, and put at least .THIS files
>>> around.
> 
> I think if we did that, we'd need to do another round of conflict
> resolution.  And to do that, we'd need to edit the transform to reflect
> the operations that did succeed.  Which gets really gross and slow.

Well, you only really need to do it for the ones that fail. And with any
luck that number is going to be much smaller than the total transform.
(If it isn't, everything is pretty broken anyway).

> 
>>  We could retry the replacement but doing it immediately after will
>> probably find that it's still in use.  It might be better to do all
>> the files that we can, then come back and retry the ones that were in
>> use.
> 
> Well, "all we can" may be a small set.  TreeTransform is designed to
> work around ordering issues, so removing the files out-of-order breaks
> its design.
> 
>> The other thing to consider is that on Windows, it could be better to
>> write over the files in place rather than renaming the new file into
>> place.
> 
> On Windows, you can't do atomic rename.  You must delete the target file
> first, and that's the step that's failing.
> 
> There are a few ways to interpret your suggestion of overwriting:
> 
> overwrite-on-TT.create_file
> ---------------------------
> The most efficient way to overwrite files directly is when
> TreeTransform.create_file is called.
> 
> This breaks everything.  Transforms are not supposed to have any effect
> until they are applied.  We would need to make substantial changes to
> every caller.
> 
> This doesn't address cases where we're trying to delete the locked file,
> not modify it.
> 
> overwrite-on-TT.apply_insertions
> --------------------------------
> Instead of renaming the file from limbo into place, we can
> unconditionally read the limbo file, and use its contents to overwrite
> the target file.  This is slow.  It also doesn't address cases where
> we're trying to delete the locked file, not modify it.
> 
> conditionally-overwrite-on-TT.apply_insertions
> ----------------------------------------------
> We could try our current approach, and if it fails, try overwriting.
> This would reduce the performance impact, but reduce test coverage.  It
> also doesn't address cases where we're trying to delete the locked file,
> not modify it.
> 
>> I think many programs that hold the file open are not denying
>> writers, but just refusing to let it be renamed or deleted.
> 
> I'm a bit confused about whether renames are verboten when a file is
> held open.

When you open a file it locks the whole path to that file. So if I open:
bzr/foo/bar/baz.py

then the directories 'bzr', 'bzr/foo', ... and the file 'baz.py' cannot
be renamed or deleted.

Windows doesn't have the concept of an inode. It identifies things by
their path. So when you open something, that path cannot change, or it
cannot find the thing you were talking about.

> 
>> That has
>> some risk that we'll overwrite the only copy of the file, so we might
>> like to make a backup first.
> 
> To summarize:
> 
> Rollback
> --------
> - simple change
> - fails safe
> - succeeds in everything except deleting the old file, if old file can
>   be renamed.
> - slightly slower, because it will need to rename deleted files out of
>   the way, and delete them at the very end.
> - may be implemented with journalling, so hard interruptions are
>   recoverable.
> 
> Rename-aside
> ------------
> - changes programming model
> - fails unsafe if a file cannot be renamed aside
> - produces conflicts when locked files are encountered
> - slower, due to repeated conflict handling
> 
> Overwrite-on-TT.create_file
> ---------------------------
> - changes programming model
> - succeeds if file can be overwritten, but not deleted
> - fails unsafe if file cannot be overwritten
> - fails if a locked file must be deleted (not modifed)
> - increases the window where interruptions/failures will leave tree in
>   bad state
> 
> Overwrite-on-TT.apply_insertions
> --------------------------------
> - succeeds if file can be overwritten, but not deleted
> - fails unsafe if file cannot be overwritten
> - fails if a locked file must be deleted (not modifed)
> - slow
> 
> Conditional-overwrite-on-TT.apply_insertions
> --------------------------------------------
> - succeeds if file can be overwritten, but not deleted
> - fails unsafe if file cannot be overwritten
> - complex, and slightly slower
> 
> I still like rollback the best.  I could be convinced about
> Conditional-overwrite-on-TT.apply_insertions, but it's not a complete
> solution, because it doesn't address cases where we're trying to delete
> the locked file.  But it could be combined with rollback.

The only reason I don't like 'rollback' is that you cannot ever get past
it. So if you only work on Windows, and someone else changes your
codebase on Linux, you can't merge them until *they* do something about it.

If instead it generated a conflict, you could resolve that locally,
commit, and be on your way.

> 
> Both rollback and rename-aside require the ability to rename locked
> files.  Rollback will simply fail to delete the locked files at the end,
> whereas rename-aside produces conflicts.  So I think rollback is better.
> 
> Rollback is unique in that it addresses other failure scenarios.  And if
> rollback was based on journalling, it could even be done if Bazaar was
> interrupted uncleanly.
> 
> Aaron

Rollback is a lot better than failing in the middle and leaving the WT
in an unknown state. So I would certainly BB:approve it. I'm not sure
that it is the best option. But it is better than what we have now.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGu0yFJdeBCYSNAAMRAgDiAKCDynWO0j9N71aF5+At0RN4CU+vlwCdFw1E
HqJ3Kfyu0k83odO/oK0hk3U=
=xva4
-----END PGP SIGNATURE-----