[RFC] dirstate and deletes

Tue Sep 12 15:04:44 BST 2006

Aaron Bentley wrote:
> Robert Collins wrote:

...

>>> I've got a good proposal I think:
>>>
>>> When a file is unversioned in a dirstate, we change the versioned-type
>>> to 'u' for unversioned/unknown, but preserve the parents data. We can
>>> now tell that this file was in whichever parents by its current path. 
>>>
>>> If a new file is versioned at the same place, we consider the original
>>> file 'deleted', and move to the original proposal of nuking the filename
>>> field and putting it at the end of the dirstate.
>>>
>>> So this is just fine tuning - saying that we want to manage
>>> 'unversioned' and 'deleted' as separate steps on the path to oblivion.
>>>
>>> Thoughts ?

My quick summary: I think it is reasonable. Another possibility would
be to never squash it, and just recognize that there may be more than
one entry for a given path. But since dirstate is really optimized for
'filesystem path => internal information' lookups, and not 'file id =>
filesystem path/info', I think your proposal works well.

> 
> I don't understand the advantage of doing it in two steps.  Could you
> explain that a bit more?

The specific issue is how to represent a regular 'delete' in a dirstate
file. Right now a dirstate is a dir-sorted list of all the contents of
the working tree, coupled with basic info, and parent information.

The record layout I worked out was:

[directory, name, kind, file_id, size, stat, sha], [[parent info]]

Yes, it mirrors the filesystem, but it needs to, because the hash-cache
is being folded into the dirstate file. So we need the 'stat' to know if
an object's sha hash is out of date. And kind defines whether 'sha' is a
sha or a symlink target, or nothing for a directory.

Anyway, because it is sorted by (directory, name), then when you do
'walkdirs()' you stay in sync with the filesystem.
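
To make the layout concrete, here is a rough Python sketch of one record and
the sort order. The field names come from the layout above; the kind strings,
the placeholder stat/sha values, and the helper names (sort_key, sha_is_stale)
are made up for illustration and are not the real dirstate code.

```python
from collections import namedtuple

# Field names follow the layout above; the [[parent info]] list is left out.
Entry = namedtuple('Entry', 'directory name kind file_id size stat sha')

def sort_key(entry):
    # Sort by (directory, name) so entries group by containing directory.
    return (entry.directory, entry.name)

def sha_is_stale(entry, current_stat):
    # The cached sha can only be trusted while the stat value is unchanged.
    return entry.stat != current_stat

# Placeholder values throughout; only the ordering matters here.
entries = sorted([
    Entry('d', 'f', 'file', 'f-id', 0, 'stat-1', 'sha-of-f'),
    Entry('', 'g', 'file', 'g-id', 0, 'stat-2', 'sha-of-g'),
    Entry('', 'd', 'directory', 'd-id', 0, 'stat-3', ''),
], key=sort_key)
order = [(e.directory, e.name) for e in entries]
```

Everything in the root directory comes first, followed by the contents of 'd',
which is the grouping a directory-by-directory walk would expect.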

Robert's comment is about how to handle either a plain 'rm directory/name'
or a 'bzr rm directory/name'.

The issue is you still need a place to store the [parent info] even
though the file has been removed. Because when you go to commit, you
want to find that it isn't there, but it used to be.

I think leaving the path alone, and just marking that the file has been
unversioned is a good step. We still need to track it, because commands
are going to want to know what the state of that file is.

As for Robert's suggestion that 'if a new file/dir/symlink is added in the
same location, wipe the path, and shove it to the end':

I'm okay with that... I wonder, though, if it would be better to allow
for multiple entries next to each other.

But dirstate is really designed to answer the question: "I have this
file/dir/symlink at this specific path, what is its state?" Deleted
items don't really contribute anything to that part of the system.
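
The two-step handling discussed above (mark as unversioned in place, and only
blank the path and move the entry to the end once the path is reused) might
look roughly like this. It is only a sketch: entries are plain dicts, 'path'
stands in for (directory, name), and the function names are hypothetical.

```python
def unversion(entries, path):
    """Step 1: keep the path and parent data, flip the kind to 'u'."""
    for entry in entries:
        if entry['path'] == path and entry['kind'] != 'u':
            entry['kind'] = 'u'
            return entry
    raise KeyError(path)

def add(entries, path, kind, file_id):
    """Version a new file; step 2 triggers if the path was unversioned."""
    for entry in entries:
        if entry['path'] == path and entry['kind'] == 'u':
            entry['path'] = None  # nuke the filename field
            break
    new_entry = {'path': path, 'kind': kind, 'file_id': file_id, 'parents': []}
    # Live entries stay sorted by path; path-less ones are parked at the end.
    live = [e for e in entries if e['path'] is not None] + [new_entry]
    dead = [e for e in entries if e['path'] is None]
    live.sort(key=lambda e: e['path'])
    entries[:] = live + dead
    return new_entry

entries = [{'path': 'd/f', 'kind': 'f', 'file_id': 'f-1',
            'parents': ['parent info for f-1']}]
unversion(entries, 'd/f')
kind_after_unversion = entries[0]['kind']
add(entries, 'd/f', 'f', 'f-2')
```

After both calls the new 'f-2' entry sits at 'd/f', and the old 'f-1' entry is
at the end with no path but with its parent info intact, so commit can still
see that it used to exist.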

Thinking about 'bzr commit directory', though. It would be nice if we
could leave a reference so that we can bisect for 'directory' and not
have to worry about the rest of the file.

The same is true for handling renames. Interestingly enough, we get a
really weird failure there right now. Specifically:

$ bzr init
$ mkdir d; touch d/f
$ bzr add
$ bzr commit -m init
$ bzr mv d/f g
$ bzr st
renamed:
  d/f => g
$ bzr commit -m 'rename' d
bzr: ERROR: no changes to commit. use --unchanged to commit anyhow

A second ago, I was trying it on a different tree, and I got:
bzr: ERROR: parent_id {s-20060829161817-a8qjwtu2p39ot1dh-1} not in inventory

('s' would be equivalent to 'd' in that scenario). Oddly enough, I can
reproduce it every time *in that branch*, but I can't seem to reproduce
it in a new test branch.

> 
> Also, "delete" in the context of trees usually means os.unlink, so I'd
> rather use a different term.
> 
> Aaron

I will admit that dirstate is not structured well for answering 'what
path had this file id in the parent inventory', if that path is not the
same in the current inventory.

That is one of the reasons I've thought about breaking the dirstate out
into records for each parent, rather than all-in-one. However, we haven't
really looked into seeking issues, etc. But dirstate isn't really
streamed anyway. It is read in one big chunk, and then split up in memory.

And from the timings that I remember, it went something like:

0: 120ms, 1: 195ms, 2: 300ms, 3: 400ms

That is no parents, 1 parent, 2 parents, 3 parents. (0, 1 and 2 each
have a specialized code path; 3+ uses a different one.)

Anyway, for a kernel-sized inventory (22K files), it takes approx 100ms
to read in. And each parent record is approximately the same size as a
standard record (they have the same number of elements, which is the
important part). So it should take the same amount of time to read in
2 separate files as it does to read 1 large one.

The major difference, though, would be when doing partial operations,
which may be a reason to leave it all-in-one. Bisect can grab a full
directory subset of the 22K files in about 6ms. I really wanted to write
an 'extract directory recursive' which would recurse down the tree
grabbing everything that was a child of a specific directory.
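
For what it's worth, an 'extract directory recursive' over an in-memory sorted
list could be sketched like this. It is not the existing bisect code; the
function names are hypothetical, and it assumes the sort key splits the
directory into path components. With a plain string sort, a sibling such as
'd-x' would land between 'd' and 'd/sub', and the subtree would no longer be
one contiguous run.

```python
import bisect

def dir_key(directory):
    """Split 'a/b' into ('a', 'b'); the tree root '' becomes ()."""
    return tuple(directory.split('/')) if directory else ()

def sort_key(record):
    # record[0] is the directory, record[1] the name.
    return (dir_key(record[0]), record[1])

def extract_directory_recursive(records, keys, directory):
    """Return every record stored anywhere below 'directory'."""
    prefix = dir_key(directory)
    # Bisect to the first record inside 'directory' itself...
    start = bisect.bisect_left(keys, (prefix, ''))
    end = start
    # ...then walk forward while we are still underneath it.
    while end < len(keys) and keys[end][0][:len(prefix)] == prefix:
        end += 1
    return records[start:end]

records = sorted([
    ('', 'd', 'directory'),
    ('', 'd-x', 'directory'),
    ('', 'g', 'file'),
    ('d', 'f', 'file'),
    ('d', 'sub', 'directory'),
    ('d/sub', 'h', 'file'),
    ('d-x', 'z', 'file'),
], key=sort_key)
keys = [sort_key(r) for r in records]
subtree = extract_directory_recursive(records, keys, 'd')
```

Note that the record for 'd' itself lives in its parent directory, so it is
not part of the returned slice; only the contents of 'd' and 'd/sub' are.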

Robert, if you need something like that, just let me know, because I
would be interested in writing it.

John
=:->
