[MERGE] less-shaing during status

Robert Collins robertc at robertcollins.net
Thu Sep 25 22:25:04 BST 2008


On Thu, 2008-09-25 at 12:16 -0500, John Arbash Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Robert Collins wrote:
> > (Refreshed to avoid NEWS conflicts)
> > 
> 
>          if (stat_value.st_mtime < self._cutoff_time
> - -            and stat_value.st_ctime < self._cutoff_time):
> +            and stat_value.st_ctime < self._cutoff_time
> +            and len(entry[1]) > 1
> +            and entry[1][1][0] != 'a'):
> +                # Could check for size changes for further optimised
> +                # avoidance of sha1's. However the most prominent case of
> +                # over-shaing is during initial add, which this catches.
> +            link_or_sha1 = self._sha1_file(abspath)
>              entry[1][0] = ('f', link_or_sha1, stat_value.st_size,
>                             executable, packed_stat)
> 
> ^- I would probably change the "!= 'a'" into "== 'f'". I don't see why
> we would cache if it was a rename, or kind change.

Ok, I'll do this. Probably needs some very careful profiling to
distinguish between these possibilities.

> I also don't see a harm in
> 
>   and entry[1][1][1] == stat_value.st_size
> 
> But if you've done the profiling, it may be a case we don't need to
> check regularly. (It helps the.. I've changed 20 items in my tree, on an
> ongoing basis and keep running 'bzr st', but sha-ing 20 items is usually
> not an expensive case.)

Its another 4 lookups was all; I haven't exhaustively profiled this case
because I don't know what _really_ is common and whats not.

> 
> === modified file 'bzrlib/dirstate.py'
...
> >                      # Target details is updated at update_entry
> time
> >                      if self.use_filesystem_for_exec:
> >                          # We don't need S_ISREG here, because we are sure
> 
> ^- I'll note that we have "osutils.sha_file_by_name" which makes the
> inner portion of this a bit cleaner. It also uses raw os.open(), which I
> think has a bit less overhead (no buffering in python, etc.)
>
> Unless this 'sha_file' is off of the Dirstate object for tesing purposes.

Note that we need to fstat to put content in the cache. And
sha_file_by_name doesn't do that. We'd need a new version of that
function, and it will have one dedicated user. I didn't feel it was
worth it.

> I wonder if we shouldn't add a check that the existing cached value
> missed, and the file won't actually be sha'd. Does it actually change
> the record in the dirstate (and set the IN_MEMORY_MODIFIED)?
> 
> It would mean that for:
> 
> bzr add foo
> bzr st
> 
> We wouldn't have to write out the dirstate again.

Well, there is a test that it still caches the kind and size. So there
are really:
 - hits
 - misses-we-do-not-need-to-sha-cache
 - misses-we-do-sha-cache-on

I think there is room for further improvements, certainly.

Thanks for the review.
-Rob

-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080926/c36eda7a/attachment.pgp 


More information about the bazaar mailing list