[RFC] Removing the inventory concept from Bazaar.
John Arbash Meinel
john at arbash-meinel.com
Wed May 9 13:56:00 BST 2007
Aaron Bentley wrote:
> Daniel Silverstone wrote:
>> On Tue, 2007-05-08 at 09:22 -0400, Aaron Bentley wrote:
>>> Inspired by Robert's post, I started thinking about improving our
>>> scaling in large trees. I decided to take the extreme approach of
>>> removing inventories as a concept, to see what broke.
>>> It seems to work, so I've started a draft spec here:
>> Your third assumption is that you can add a noop record to every
>> unaffected knit index on commit
> Yes, but I had hoped that the constant factors would be small enough
> that we could get away with it.
> The motivation for no-op commits is so that we can look up the content
> of a file-id for a given revision efficiently. Currently, this
> operation is done using the inventory, so it scales O(n).
> Robert and I have been exploring some alternative approaches to looking
> up content efficiently. He proposed maintaining a file-id ->
> revision-id file. I think maybe we can leverage the revision history to
> find the latest version of a file.
>> Won't this mean that commit time will be proportional to size of tree
>> not number of modifications?
> Yes, but I can imagine a very, very small constant factor, because we'd
> only be adding a knit index entry.
>> Also, won't this mean that a push will always have to update every knit
>> index? Seriously increasing the potential number of round trips?
> That is more worrying.
What about just a "changes" file in general? It would scale with
O(changes) (N revisions * changes per revision), rather than O(tree).
It does mean that really, really old files would take a while (because N
revisions grows big before that file is mentioned).
file-id => last-modified-revision could be cached, but it is
parameterized by revision so you would either only cache one tree, or it
becomes versioned, and you kind of end back at an inventory.
The primary advantage would be if it was sorted by file-id, so you
could bisect into it (and not have to read the whole file to get one
entry).
But if you are versioning this file, then you have to extract it, which
means you had to read it.
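To make the bisect idea concrete, here is a rough sketch (illustrative
names and data, not real Bazaar API) of looking up one file-id in a
changes map kept sorted by file-id:

```python
import bisect

def find_last_modified(sorted_entries, file_id):
    """Binary-search a list of (file_id, revision_id) pairs sorted by file_id.

    Cost is O(log N) comparisons instead of reading the whole file.
    """
    # (file_id,) sorts before any (file_id, revision_id) tuple, so
    # bisect_left lands on the first entry for this file-id, if any.
    i = bisect.bisect_left(sorted_entries, (file_id,))
    if i < len(sorted_entries) and sorted_entries[i][0] == file_id:
        return sorted_entries[i][1]
    return None

entries = sorted([
    ("file-a", "rev-10"),
    ("file-b", "rev-3"),
    ("file-c", "rev-7"),
])

print(find_last_modified(entries, "file-b"))  # rev-3
```

Of course, as noted above, if the map itself is versioned you still
pay to extract it before you can bisect.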
We should also consider this in light of some alternative repository
formats.
I wonder if just splitting up the inventory by directory is going to be
an easier win than getting rid of inventory entirely.
Then it scales as O(log(N)) rather than O(N), since you don't need to
read portions of the inventory that are not involved.
The downside is that if all you have is a 'file-id' then you don't know
what directory it is in, so you don't know what sub-inventory to check.
However, if you have either a file-id and path, you could guess that the
file hasn't moved, and in the first pass read just a portion. If that
hint fails, then you read the whole inventory. You get mostly O(log(N))
with an occasional O(N).
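A sketch of that hinted lookup (all names and layout here are
illustrative, not an actual Bazaar format):

```python
def lookup(by_dir, file_id, hint_dir=None):
    """by_dir maps directory path -> {file_id: entry}.

    Try the hinted sub-inventory first; if the file moved, fall back
    to scanning every directory (the occasional O(N) case).
    """
    if hint_dir is not None:
        entry = by_dir.get(hint_dir, {}).get(file_id)
        if entry is not None:
            return entry                 # fast path: hint was right
    for entries in by_dir.values():      # slow path: file moved
        if file_id in entries:
            return entries[file_id]
    return None

inv = {
    "src": {"id-1": ("src/main.py", "rev-4")},
    "doc": {"id-2": ("doc/guide.txt", "rev-2")},
}
print(lookup(inv, "id-2", hint_dir="src"))  # stale hint, falls back
```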
Another possibility presents itself, though: if you have the file-id
and parent-file-id, you could be in O(1). We could either
1) Always update all directory entries (like you propose for files, but
we get a nice reduction if we only do directories). This means a lookup
of (parent-id, file-id) => information is always O(1), because we know
where to go for parent-id.
2) Only update directories when their children change. Now you have to
find the last change of the directory, which means you may need to know
its parent directory, etc. (we could use file-id + [parent-ids] back to
the root).
Or we could just cache the file-id => last-modified for directories.
Again, similar to what you proposed, but rather than doing it for every
file, we only do it for their containing dirs.
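Option 1) amounts to a flat map keyed on (parent-id, file-id); since
directory entries are always updated, the key is always present for
the revision you are looking at. A toy sketch (invented ids, not a
real format):

```python
# Hypothetical per-revision index: because every directory entry is
# rewritten on commit, (parent-id, file-id) resolves in one hash lookup.
index = {
    ("dir-root", "dir-src"): "rev-9",
    ("dir-src", "file-main"): "rev-9",
}

def last_modified(parent_id, file_id):
    """O(1) lookup, at the cost of touching all directory entries on commit."""
    return index.get((parent_id, file_id))

print(last_modified("dir-src", "file-main"))  # rev-9
```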
Some more thoughts...
The reason we use file-ids is because we have a file at path X now, and
we want to find the same file in another revision. Without file-ids, you
need to look at all the deltas between the revisions, in case one of
them renamed the file. It is also because renaming a file and creating
a new one in its place should not cause a false hit.
Doing it the way hg decided to means that when they want to care about
renames, they scale by O(num_delta_revisions), since they have to check
each delta for a rename.
I'm curious whether that is cheaper/more expensive than O(num_files).
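For comparison, the hg-style walk looks roughly like this (a sketch of
the idea only, not hg's actual data structures): each delta is checked
for a rename touching the path, so the cost grows with the number of
revisions walked, not the number of files.

```python
def trace_path(path, deltas):
    """Follow a path backwards through deltas without file-ids.

    deltas is a newest-first list of {new_path: old_path} rename maps;
    every delta must be inspected, hence O(num_delta_revisions).
    """
    for renames in deltas:
        path = renames.get(path, path)  # undo a rename if one applies
    return path

deltas = [
    {"b.txt": "a.txt"},  # newest delta renamed a.txt -> b.txt
    {},
    {"a.txt": "orig.txt"},
]
print(trace_path("b.txt", deltas))  # orig.txt
```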