'bzr status' stats each file multiple times

John A Meinel john at arbash-meinel.com
Sun Dec 4 14:54:21 GMT 2005


John A Meinel wrote:
> Michael Ellerman wrote:
>> On Fri, 2 Dec 2005 23:35, John A Meinel wrote:
>>> [snip]
>>> I've noticed that "bzr status" on a clean tree seems to take quite a
>>> while.
>>> [snip] 
>> Me too.
>>
>> On my tree, it seems to recompute the sha1 for about 1/3 rd of
>> the files every time. It then caches them, but never writes the hashcache?
>> I also don't see the point of calling hashcache.scan() prematurely?
> 
> The idea of calling hashcache.scan() early is that you can stat in
> inode order, which, according to git, should be faster. And who would
> know better than kernel hackers. :)
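Stat'ing in inode order can be sketched like this (a minimal illustration only, not the actual hashcache.scan() code; the helper name is made up):

```python
import os

def paths_in_inode_order(directory):
    """Return the file paths in 'directory' sorted by inode number,
    so a full scan touches the disk in roughly allocation order
    (the trick borrowed from git)."""
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        entries.append((os.lstat(path).st_ino, path))
    entries.sort()
    return [path for _ino, path in entries]
```

A scan would then lstat (and, when needed, re-sha1) the files in the returned order instead of directory order.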
> 
> Now, the other issue is that we still re-stat the file when we come back
> later, just to make sure that it hasn't changed in the meantime.
> 
> I think we should have a timeout value, so that if I stat'd the file
> within the last X seconds, I assume the file hasn't changed.
> We can even go one better and do a double check: if the file's
> mtime is older than Y seconds, and I have stat'd the file within the
> last X seconds, don't stat again.
> 
> I could see using Y=1 day and X=5 seconds.
> So if the file is more than 1 day old, we don't really expect it to
> change frequently, certainly not within the last 5 seconds.
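The two-threshold rule above could look something like this (a sketch; the function name, the constants, and how the cache tracks its last stat time are all invented for illustration):

```python
import time

STAT_TIMEOUT = 5          # X: trust a stat made within the last 5 seconds
OLD_FILE_AGE = 24 * 3600  # Y: a file untouched for a day rarely changes now

def needs_restat(mtime, last_stat_time, now=None):
    """Return True if the file should be stat'd again.

    Skip the re-stat only when the file is old (its mtime is more than
    Y seconds ago) *and* we already stat'd it within the last X seconds.
    """
    if now is None:
        now = time.time()
    file_is_old = (now - mtime) > OLD_FILE_AGE
    recently_statted = (now - last_stat_time) < STAT_TIMEOUT
    return not (file_is_old and recently_statted)
```

A recently modified file is always re-stat'd, so the only window where a change can be missed is the few seconds while status itself is running.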
> 
> All it really means is that if you start a "bzr status" and something
> changes *while* you are running status, it won't show up until the next
> status.
> 
> "bzr commit" also won't pick up changes that occur while it is
> committing. *However* there is a nice property that whenever you commit,
> you do compute the sha1 sum from the actual lines that you are
> committing. You don't pay any attention to the cache at that point.
> Actually, at that point you do a sha1 check multiple times on the same
> strings loaded in memory. Specifically at line 574 in inventory.py:
> 
>    new_lines = work_tree.get_file(self.file_id).readlines()
>    self._add_text_to_weave(new_lines, file_parents, weave_store,
>                            transaction)
>    self.text_sha1 = sha_strings(new_lines)
>    self.text_size = sum(map(len, new_lines))
> 
> So we read the text, add it to the weave (which computes a sha1 sum),
> and then we compute the sha1 sum, to add it to the inventory entry.
> sha1 is relatively cheap (compared with operations like adding a text
> to a weave file), so it doesn't really matter.
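Since both the weave add and the inventory entry end up hashing the same lines, one could in principle compute the digest once and reuse it. A hypothetical sketch (sha_strings here is reimplemented for illustration; bzr's own lives in its osutils module):

```python
import hashlib

def sha_strings(strings):
    """One running sha1 digest over all the lines, equivalent to
    hashing their concatenation."""
    s = hashlib.sha1()
    for line in strings:
        s.update(line)
    return s.hexdigest()

# Compute once, then reuse for both the weave and the inventory entry:
new_lines = [b'hello\n', b'world\n']
text_sha1 = sha_strings(new_lines)
text_size = sum(map(len, new_lines))
```

As the text says, the duplicate hashing is cheap next to the weave insertion, so this is a tidiness win more than a performance one.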
> 
>> This makes it a bit quicker.
> 
> Actually, on my tree, I have much worse behavior: I saw it stat each
> file multiple times.
> I think what is happening is that we have multiple code paths creating
> a working_tree, each of which opens a new hash_cache, and they can get
> out of sync.
> I'm going to test creating a weakref dictionary of hashcache paths, and
> see if it is getting accessed more than once.

I'm using a plain dictionary, just because I realized you probably do
want to keep it in memory. There really is no reason to re-read the file
multiple times.
Except maybe when you "get()" one, it should re-stat its cache file,
and if that has been updated on disk, it should re-read it.
That way, if you have two bzr executables running on the same directory
at the same time, they will stay a little more in sync.

Though it is just a cache, and as such doesn't have to stay perfectly in
sync.
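The per-process cache of HashCache objects keyed by path could be as simple as the following sketch (the stand-in class is invented; the real HashCache would also want the re-stat-on-get check described above):

```python
# Module-level registry: one HashCache per tree base directory.
_hashcache_instances = {}

class FakeHashCache:
    """Stand-in for bzr's HashCache, for illustration only."""
    def __init__(self, basedir):
        self.basedir = basedir
        # The real class would read its on-disk cache file here.

def get_hashcache(basedir):
    """Return the single HashCache for basedir, creating it on first
    use, so opening the working tree three times does not trigger
    three separate scans."""
    cache = _hashcache_instances.get(basedir)
    if cache is None:
        cache = FakeHashCache(basedir)
        _hashcache_instances[basedir] = cache
    return cache
```

A plain dict keeps the instance alive for the life of the process, which is the point here; a weakref dictionary would instead let it be dropped and re-read between uses.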

But I did see that the HashCache for the same directory is opened 3
times, which means that without any sort of optimization, it was trying
to stat all files 3 times, just because each open created a new hashcache.

John
=:->


