[MERGE] Streamline _walkdirs_utf8 for utf8 file systems, reducing time to traverse a mozilla tree from 1s to .6 seconds. (Robert Collins)

John Arbash Meinel john at arbash-meinel.com
Wed Sep 10 18:13:48 BST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> So, this is an updated patch of the one John tweaked to work on windows.
> The windows stuff has not been changed further, so it should still work
> fine.
> 
> Its faster :)
> 
> Before, I see 1000ms to walkdirs a mozilla tree, with it, I see 612ms.
> In 'st' terms, thats 3.84 seconds down to 3.34. Or half a second saved,
> ~14% total time reduction.
> 
> Enjoy.
> 
> -Rob
> 
> 

You are doing a bunch of PyOS_snprintf calls, but not actually using the
pathstart or abspath strings.

For each directory you go into, you end up with a 'chdir' in, and then a
'chdir' back to the original directory. That seems *really* inefficient. Can
you instead change the outer loop so that either:

a) Pass in a file handle to 'fchdir()' back to
b) Save the location at the beginning of looping, and chdir back at the end of
looping?

Perhaps because if walkdirs is a generator, the calling code isn't guaranteed
to consume the whole thing (though I'll mention that with python 2.5 you could
use a try/finally which *will* get called when the generator dies.)

So probably (a) is the better solution. Considering that every time you do
another readdir, it is going to go back to the same directory.

As you mention, this is only 14% reduction, and how much of that is directly
because of the chdir() and how much of it is other tuning? If you feel it is
good and import, I can accept it. It just feels... unclean, to change the
global state. If it is 10/14% then I'm fine with it, if it is 2/14% then I'm
not so sure. It seems like something we should come back to, rather than
something we should depend explicitly on now.

Anyway, given you probably feel strongly towards it, I'll review the rest:

+cdef class _Stat:
+    """Represent a 'stat' result."""
+
+    cdef readonly int st_mode
+    # nanosecond time definitions use MACROS, fucking up te ability to have a
+    # variable called st_mtime. Yay Ulrich, thanks a lot.
+    cdef readonly time_t _ctime

^- I believe we try to avoid profanity, and regardless "the" is misspelled.

Also, why not just add:

cdef extern "readdir.h":
  pass

And in 'readdir.h':
#undef st_mtime

You also have a lot of commented out lines like:
+    # cdef readonly double st_atime

and
+        #return repr((self.st_mode, 0, 0, 0, 0, 0, self.st_size, self.st_atime,
+        #             self.st_mtime, self.st_ctime))


Also, regardless I think it would be good to unify our extension "_Stat" objects.

A few other things...

You tend to use "int" in a lot of places where you should be using Py_ssize_t.
And for most cases "st_size" is actually a 64-bit integer (or PyLong as you
prefer).

You are still using:

dot = ord('.')

instead of c'.', not a huge issue, but I feel like:

            if not (name[0] == dot and (
                (name[1] == 0) or
                (name[1] == dot and name[2] == 0))
                ):

is harder to read than

            if not (name[0] == c'.' and (
                (name[1] == 0) or
                (name[1] == c'.' and name[2] == 0))
                ):
because I don't have to double check what the 'dot' variable is.


So the important things are

1) Clean up the snprintf stuff if you aren't going to be using it.
2) Clean up types. Use Py_ssize_t for python API's, 'off_t' for stat objects, etc.
3) Make sure we really need to do chdir(). If you are confident, then we can
do so.
4) Consider finding a way to not have to continually reset the cwd using a
path. Such as using 'fchdir()'. Also, we know the inode for the directory, as
we found it in the previous pass. Is there any way we could use that to speed
up the initial chdir(dir_to_look_in)?

I also wonder if using a relative chdir() would be faster than using an
"absolute" path each time.

5) Alternatively, perhaps we make the api for DirReaders require that callers
use a 'finalize()' style statement. And *that* would reset the current working
directory.


4 & 5 are more directions for future improvement. 1 & 2 are pretty important.
3 is an open question.

BB:resubmit

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIyABMJdeBCYSNAAMRAr8/AKDBUsmdqvFi9Wl/NnevBoXx6rWLiACg2U5w
akcGUyX3nzJ3KEuSLt7XdBY=
=L3kj
-----END PGP SIGNATURE-----



More information about the bazaar mailing list