Benefits of walkdirs in utf8 instead of unicode

John Arbash Meinel john at arbash-meinel.com
Wed Feb 28 18:13:28 GMT 2007


I've been profiling commands like 'bzr status' with the new dirstate
code, and I've been seeing a lot of calls to the decode/encode functions
in encodings/utf_8.py. When you look closer, you can see that a lot of
these calls are actually happening because of os.listdir() calls.

And since dirstate itself has all the paths in utf8, on systems that use
utf8 filesystem encoding, we don't really need to decode just to encode
again.

So I wrote _walkdirs_utf8, with the idea that on platforms with either
ASCII or UTF-8 filesystem encoding we can just walk the paths without
any encode/decode steps. On Windows, we can walk in Unicode and encode
each path into utf8. I'm thinking that for all other encodings we'll
just do something like the Windows walk.

Walking the raw paths means that we might also get some paths which are
not valid utf8. But I'm expecting those to be taken care of at a higher
level. For example, "bzr add" can fail if trying to add a non-utf8 path,
but _iter_changes can just ignore the fact.
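
To illustrate the "handle it at a higher level" idea, here is a small
sketch. The helper names (is_valid_utf8, check_addable, unknown_paths)
are made up for this example and are not bzr functions; the sketch only
shows a strict caller rejecting a bad path while a lenient caller treats
paths as opaque bytes.

```python
def is_valid_utf8(path_bytes):
    """Return True if the raw path decodes cleanly as utf8."""
    try:
        path_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_addable(path_bytes):
    """Strict caller (in the spirit of 'add'): refuse non-utf8 paths."""
    if not is_valid_utf8(path_bytes):
        raise ValueError("cannot add non-utf8 path: %r" % (path_bytes,))
    return path_bytes


def unknown_paths(walked_paths, known_paths):
    """Lenient caller: compare raw bytes and never decode anything."""
    known = set(known_paths)
    return [p for p in walked_paths if p not in known]
```

Here the byte string b"caf\xe9" (Latin-1, not utf8) would be rejected by
check_addable() but would pass through unknown_paths() untouched.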

There is a pretty big difference in performance. On the large Mozilla
tree (5,766 directories, approx 55k total entries), a plain
walkdirs(u'.') call takes 2.5s, while a _walkdirs_utf8('.') call takes
1.75 - 1.8s.

For a launchpad sized tree (~4600 entries) the difference is 145ms
versus 100ms.

As a comparison,
% python -m timeit -s "from subprocess import call, PIPE; import os;" \
  -s "devnull = open('/dev/null', 'rb+')" \
  "call(['find', '.', '-size', '+0'], stdout=devnull)"

seems to say that find takes about 38ms ('time find ...' reports 60ms).
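
For anyone who wants to try a similar comparison from Python, a rough
harness could look like the one below. It is a sketch for Python 3 that
times the standard os.walk() with a unicode versus a bytes starting
path; it does not reproduce walkdirs() or _walkdirs_utf8(), and the
helper names count_entries and best_walk_ms are hypothetical.

```python
import os
import timeit


def count_entries(top):
    """Walk the tree under top and return the total number of entries."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(top):
        total += len(dirnames) + len(filenames)
    return total


def best_walk_ms(top, repeat=3):
    """Return the best wall-clock time, in ms, of `repeat` full walks."""
    timings = timeit.repeat(lambda: count_entries(top), number=1, repeat=repeat)
    return min(timings) * 1000.0
```

Calling best_walk_ms(u'.') and best_walk_ms(b'.') in the same tree gives
the two numbers to compare. As with the timings here, repeated runs
mostly measure the cached case.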

So how about a summary table (num = number of entries, times in ms):

              num   find  walkdir   _utf8  os.walk
bzr.dev      1018   11.6     29.4    22.8     33.6
launchpad    4638   39      143      99      167
mozilla     54775  600     2210    1740     2590

I don't think we'll really get to the speed of find, though it would be
nice. I'm a little surprised that we aren't a lot faster than os.walk,
but then again all of these timings are with everything already cached,
so the change in walk pattern doesn't really show up yet. Also,
walkdirs() is returning a lot more information (so we don't have to stat
the files again, etc).

I'll let you know how it affects _iter_changes().

John
=:->