[RFC] Tree.iter_changes

Wed Sep 20 23:33:29 BST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

John Arbash Meinel wrote:
> Aaron Bentley wrote:
> 
>>John Arbash Meinel wrote:
>>
>>So from my POV, an interface that only supports diffing against parents
>>is too limited.
> 
> 
> I realize it is limited, but it is a common case that our data storage
> model (both dirstate and our current .knit format) heavily optimizes.

I agree that it's valuable to optimize this case, but I think the
proposed iter_changes can support that.

>>>>For example, dirstate lets us iterate through the whole inventory, and
>>>>compare against all parents as we go.
>>
>>AFAIK, commit is the only command that can make use of that functionality.
>
> Sure, but it would be helpful for 'bzr status' and 'bzr diff' to not
> have to create a complete inventory, just for the 3 files that have changed.

That's not what I meant.  I meant that most commands only compare two
trees, and so comparing against more than one parent efficiently doesn't
help them.

I agree that comparing against one parent efficiently is a win for other
commands like diff, but I think we can support maximum efficiency doing
that for iter_changes, as it stands.

>>The patch implements the functionality as an InterTree method.  So if
>>the basis tree is a special type, e.g. a BasisTree or ParentTree, then
>>we can implement an optimizer for that case, to use dirstate in the most
>>efficient way possible.
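
To illustrate the optimizer idea quoted above, here is a minimal sketch of picking a comparison strategy from the type of the basis tree. It is not the code from the patch: Tree, BasisTree, InterBasisTree, register_optimiser and the dict-based entries are all hypothetical stand-ins, and the "optimized" path only delegates where a real one would consult dirstate.

```python
class Tree:
    """Hypothetical tree: maps file_id -> (path, text_sha1)."""

    def __init__(self, entries):
        self.entries = dict(entries)


class BasisTree(Tree):
    """Hypothetical marker type for a tree that is a parent of the working tree."""


class InterTree:
    """Generic comparison between two trees, with pluggable optimisers."""

    _optimisers = []  # shared registry of specialised subclasses

    def __init__(self, source, target):
        self.source = source
        self.target = target

    @classmethod
    def register_optimiser(cls, optimiser):
        cls._optimisers.append(optimiser)

    @classmethod
    def get(cls, source, target):
        # Use the first registered optimiser that accepts this pair of trees,
        # otherwise fall back to the generic implementation.
        for optimiser in cls._optimisers:
            if optimiser.is_compatible(source, target):
                return optimiser(source, target)
        return InterTree(source, target)

    def iter_changes(self):
        # Generic path: walk every file id known to either tree.
        ids = sorted(set(self.source.entries) | set(self.target.entries))
        for file_id in ids:
            old = self.source.entries.get(file_id)
            new = self.target.entries.get(file_id)
            if old != new:
                yield (file_id, old, new)


class InterBasisTree(InterTree):
    """Selected when the source is a BasisTree (assumption for this sketch)."""

    @staticmethod
    def is_compatible(source, target):
        return isinstance(source, BasisTree)

    def iter_changes(self):
        # A real optimiser would read dirstate here; the sketch just delegates.
        return super().iter_changes()


InterTree.register_optimiser(InterBasisTree)

old_entries = {"a-id": ("a.txt", "sha-a"), "b-id": ("b.txt", "sha-b")}
new_entries = {"a-id": ("a.txt", "sha-a"), "b-id": ("b.txt", "sha-b2")}

generic = InterTree.get(Tree(old_entries), Tree(new_entries))
fast = InterTree.get(BasisTree(old_entries), Tree(new_entries))
```

Callers only ever ask InterTree.get() for a comparison object, so a faster implementation can be added later without changing them.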

> I like the approach. But I think we need a way to get at this stuff
> without having to build the inventory.

I don't understand.  Why would that require building the inventory?

By default, it only emits information about modified files, and it does
not emit inventory entries, just tuples.
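
As a rough sketch of that behaviour (not the patch itself), the generator below compares two trees and yields one plain tuple per changed file. The dict-based tree model, the tuple layout and the include_unchanged flag are assumptions made for illustration only.

```python
def iter_changes(source, target, include_unchanged=False):
    """Yield one plain tuple per file that differs between two trees.

    Each tree is modelled as a dict mapping file_id to a
    (path, text_sha1, executable) tuple. This is an illustrative
    stand-in, not bzrlib's Tree class.
    """
    missing = (None, None, None)
    for file_id in sorted(set(source) | set(target)):
        old = source.get(file_id)
        new = target.get(file_id)
        if old == new and not include_unchanged:
            continue  # unmodified files are skipped by default
        old_path, old_sha, old_exec = old if old is not None else missing
        new_path, new_sha, new_exec = new if new is not None else missing
        yield (
            file_id,
            (old_path, new_path),            # path in each tree
            old_sha != new_sha,              # did the content change?
            (old is not None, new is not None),  # present in each tree?
            (old_exec, new_exec),            # executable bit in each tree
        )


basis = {
    "a-id": ("a.txt", "sha-a", False),
    "b-id": ("b.txt", "sha-b", False),
    "c-id": ("c.sh", "sha-c", False),
}
working = {
    "a-id": ("a.txt", "sha-a", False),
    "b-id": ("b.txt", "sha-b2", False),  # content modified
    "c-id": ("c.sh", "sha-c", True),     # made executable
    "d-id": ("d.txt", "sha-d", False),   # newly added
}

changes = list(iter_changes(basis, working))
```

No per-file objects are created here: a caller such as status or diff receives only tuples, and only for the three entries that differ.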

> It is also useful for even stuff like 'log -v', which wants the changed
> file list for each revision.

Well, there's always the file_ids_altered_by_revision_ids hack, but
doing it right for a comparison interface requires another inventory
format, I think.

> As an example of how much we could benefit from it...
> 
> Consider a tree with 10,000 files. In any given commit, you probably
> only change < 10 of them. With your method, you would need to extract
> the basis tree (possibly by doing a bunch of patch applications to the
> inventory texts), 

The inputs are trees, but I don't think it's necessary to generate the
inventories.

> and then reading the full text, and creating 10,000
> InventoryEntry objects at 3us per python object. 

This method certainly doesn't need InventoryEntry objects.  It does emit
data such as name and executability, but those are essential to the task.

> And contrast that to the time it takes to just extract the 10 single
> line changes from the .knit file, and print them out.

I think this API can accommodate that kind of optimization.

> This is some of the motivation of dirstate. To create tuples, and
> process as much as possible in tuple form, before we get into Inventory
> objects.

Yes, I already understood that object creation can be expensive, which
is why iter_changes outputs only tuples.

Aaron
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFFEcG50F+nu1YWqI0RAqe2AJ4ujXukxRwvwJVZIUjnmt6vbqW6pQCdE0Od
zXmNlOahfoXvlmTCtmUdDOo=
=V0pH
-----END PGP SIGNATURE-----