[MERGE] Fast _walkdirs for win32
John Arbash Meinel
john at arbash-meinel.com
Thu Jul 17 15:56:21 BST 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Alexander Belchenko wrote:
| You can ignore my mail if you wish, but I simply can't pass over this
| patch.
| Thank you for writing this, I knew bzr must have this long ago,
| but I was unable to write such code myself.
I'm always happy to hear from you Alexander. I actually wanted to point
you in this direction, because I realized that FindFirstFileW isn't
going to exist in Win98. I don't really feel a pressing need to support
it, but we *could* do some work to use TCHAR and FindFirstFile.
I have the feeling it is going to confuse Pyrex a lot to have something
that might be a plain 'char' and might be a 'wchar_t'.
As the fallback code already works on Win98, I felt like that would be fine.
|
| John Arbash Meinel пишет:
|> Attached is a patch which follows up on my earlier work to use the
|> windows native FindFile APIs for walkdirs. It is a pretty good fit,
|> because walkdirs wants to return the stat information along with the
|> paths, and that is what FindFile* does. (In fact, it uses FindFile* for
|> each os.stat, and another one for the os.listdir.)
| [...]
|> It seems that the overhead of creating a FindFirstFile for every stat is
|> pretty horrendous. Specifically, for a tree of 9k entries, --lsprof puts
|> 3.5s/5s to nt.lstat().
|
| And penalty on this much higher on FAT32, as I discover in the past.
|
|> Anyway, this isn't urgent for 1.6, but it is a rather nice improvement,
|> so I'd like to get it in sooner rather than later. (Otherwise I'm going
|> to be tempted to run something other than bzr.dev as my primary 'bzr',
|> which I'm not keen on doing.)
|
| Because 1.6 is anyway out of regular schedule,
| IMO your patch should be merged into 1.6.
|
| I have some minor comments on your patch itself.
|
| First at all it does not compile for me with Pyrex 0.9.6.3.
| What version of Pyrex you're used?
0.9.8.2
|
| 1) My Pyrex don't understand inplace xor (^=).
| I'm forced to replace them with explicit xor operator:
|
| cdef int _get_mode_bits(WIN32_FIND_DATAW *data):
| cdef int mode_bits
|
| mode_bits = 0100666 # writeable file, the most common
| if data.dwFileAttributes & FILE_ATTRIBUTE_READONLY ==
| FILE_ATTRIBUTE_READONLY:
| mode_bits = mode_bits ^ 0222 # remove the write bits
| if data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY ==
| FILE_ATTRIBUTE_DIRECTORY:
| # Remove the FILE bit, set the DIR bit, and set the EXEC bits
| mode_bits = mode_bits ^ 0140111
| return mode_bits
No problem here, I'll be happy to switch it.
|
| 2) But anyway it failed because linker cannot find Win32 API functions:
|
...
| bzrlib\_walkdirs_win32.pyd : fatal error LNK1120: 4 unresolved externals
| Building of "bzrlib._walkdirs_win32" extension failed, will use the
| Python version instead
|
|
| I found that if I compile _walkdirs_win32.c with custom setup.py
| (without involving Pyrex at all)
| I get the same error. When I looked at the generated C code I saw this
| definition:
|
| #ifndef WIN32
| #define __stdcall
| #define __cdecl
| #endif
|
| If I manually add
|
| #define WIN32_LEAN_AND_MEAN
| #include <windows.h>
|
| before mentioned lines *then* module compiled successfully without any
| warning or error.
| I don't know is it Pyrex bug or not.
|
Well, on my version I see:
#ifndef WIN32
~ #ifndef __stdcall
~ #define __stdcall
~ #endif
~ #ifndef __cdecl
~ #define __cdecl
~ #endif
#endif
So it first checks if they are already defined before it clears them out.
I'll see about passing in WIN32 through the compilers, but in the short
term, maybe just upgrading pyrex is a better option.
I'm also using mingw32 to do my compiles, rather than VC7, but either
should certainly work.
...
| + cdef int GetLastError()
| +
| + # Wide character functions
| + DWORD wcslen(WCHAR *)
|
| ^-- You don't need to use cdef's in this block, because you're using
| "cdef extern from" construct at the top. They are don't seem to harm
| though.
I'm not sure which block you mean, but I'm directly accessing the
members, so I have to tell Pyrex what their types are.
|
| [...]
| +cdef __int64 _get_size(WIN32_FIND_DATAW *data):
| + # Pyrex casts a DWORD into a PyLong anyway, so it is safe to do << 32
| + # on a DWORD
| + return ((<__int64>data.nFileSizeHigh) << 32) + data.nFileSizeLow
|
| ^-- Your comment here is misleading IMO. You explicitly cast nFileSizeHigh
| to 64-bit integer, so it's definitely safe to shift lower 32 bits
| to higher half. And you return from function __int64 value, so there is no
| DWORD anymore.
Yeah, that is an old comment. I'll fix it. It was actually true before
the <__int64> cast, though. As it would convert nFileSizeHigh to a
PyLong and then you can << it as much as you like.
|
| [...]
| + def _get_files_in(self, directory, relprefix):
|
| ^-- I'd like to suggest to add some docstring here.
| Although it's not public interface, but there 'directory'
| expected to be unicode, and 'relprefix' is utf-8 string.
| It's unclear for casual reader (like me).
Added. It is mostly inherited from the "_walkdirs_utf8" api, but I'll
expand its docstring here.
|
| + cdef WIN32_FIND_DATAW search_data
| + cdef HANDLE hFindFile
| + cdef int last_err
| + cdef WCHAR *query
| + cdef int result
| +
| + top_star = directory + '*'
| +
| + dirblock = []
| +
| + query = PyUnicode_AS_UNICODE(top_star)
| + hFindFile = FindFirstFileW(query, &search_data)
| + if hFindFile == INVALID_HANDLE_VALUE:
| + # Raise an exception? This path doesn't seem to exist
| + raise WindowsError(GetLastError(), top_star)
|
| ^-- Umm, I see this code path is not tested explicitly?
| Who will catch this exception?
| I assume this is just safeguard and this should never happens because
| _walkdirs_utf8_win32_find_file supposed to be used on real WT always?
|
Actually, it *is* tested explicitly. In "tests__walkdirs_win32" I have a
test for:
~ def test_missing_dir(self):
~ e = self.assertRaises(WindowsError, list,
~ self.walkdirs_utf8(u'no_such_dir'))
~ self.assertEqual(errno.ENOENT, e.errno)
~ self.assertEqual(3, e.winerror)
~ self.assertEqual((3, u'no_such_dir/*'), e.args)
The portion that isn't tested explicitly is:
~ last_err = GetLastError()
~ if last_err != ERROR_NO_MORE_FILES:
~ raise WindowsError(last_err)
Because I didn't actually know how to trigger it.
| +
| + try:
| + result = 1
| + while result:
| + # Skip '.' and '..'
| + if _should_skip(&search_data):
| + result = FindNextFileW(hFindFile, &search_data)
| + continue
| + name_unicode = _get_name(&search_data)
| + name_utf8 = PyUnicode_AsUTF8String(name_unicode)
|
| ^-- here you're using explicit suffixes to differentiate unicode/utf-8
|
| + relpath = relprefix + name_utf8
| + abspath = directory + name_unicode
|
| ^-- but here is not :-(
| Probably it's just consistency with other walkdir walkers...
| But it's still confusing (me personally).
|
Well, for other walkers, 'abspath' is actually an 8-bit path. It is just
meant to be *a* path that you can pass to the OS functions to grab the
actual file. (Like open(), or lstat(), etc.)
I'm mostly using name_(utf8,unicode) because in both cases it is just
the filename. I just went ahead and got rid of the local variables that
I wasn't using elsewhere anyway. The return type is documented in the
_get_files_in() docstring, so it should be okay.
Also, technically _get_files_in() *is* a standard Python function, so
you *can* call it from plain python. I didn't think the call overhead
would be too much, as it is only 1x per directory (not for every file).
| + PyList_Append(dirblock,
| + (relpath, name_utf8,
| + self._get_kind(&search_data),
| + self._get_stat_value(&search_data),
| + abspath))
| +
| + result = FindNextFileW(hFindFile, &search_data)
| + # FindNextFileW sets GetLastError() == ERROR_NO_MORE_FILES
| when it
| + # actually finishes. If we have anything else, then we have a
| + # genuine problem
| + last_err = GetLastError()
| + if last_err != ERROR_NO_MORE_FILES:
| + raise WindowsError(last_err)
| + finally:
| + result = FindClose(hFindFile)
| + if result == 0:
| + last_err = GetLastError()
| + pass
|
| ^-- you don't need 'pass' here, perhaps.
| You pick GetLastError here but don't use it.
| Does it's just for clearing Windows' internal variable?
No, I just didn't know what to do with it. I'll just raise a
WindowsError and realize I don't know what is actually going on there. :)
I'd rather not raise an exception if there is already an exception
pending, though. So I'll just document that something could be done
there, and leave it for now.
|
| + return dirblock
|
| [...]
|
| +def _walkdirs_utf8_win32_find_file(top, prefix=""):
| + """Implement a version of walkdirs_utf8 for win32.
| +
| + This uses the find files api to both list the files and to stat them.
| + """
| + return Win32Finder(top, prefix=prefix)
|
| ^-- You have almost identical copy of this function in the osutils.py --v
| I see copy in osutils used only for testing.
| May be there is possible way to avoid this duplication somehow?
|
Yeah, before submitting I decided to get rid of the one in osutils.py
| === modified file 'bzrlib/osutils.py'
| --- bzrlib/osutils.py 2008-06-11 03:56:46 +0000
| +++ bzrlib/osutils.py 2008-07-17 04:49:57 +0000
| @@ -1190,8 +1191,14 @@
| pass to os functions to affect the file in question. (such as
| os.lstat)
| """
| fs_encoding = _fs_enc.upper()
| - if (sys.platform == 'win32' or
| - fs_encoding not in ('UTF-8', 'US-ASCII', 'ANSI_X3.4-1968')): #
| ascii
| + if (sys.platform == 'win32'):
| + try:
| + from bzrlib._walkdirs_win32 import
| _walkdirs_utf8_win32_find_file
| + except ImportError:
| + return _walkdirs_unicode_to_utf8(top, prefix=prefix)
| + else:
| + return _walkdirs_utf8_win32_find_file(top, prefix=prefix)
| + if (fs_encoding not in ('UTF-8', 'US-ASCII', 'ANSI_X3.4-1968')): #
| ascii
| return _walkdirs_unicode_to_utf8(top, prefix=prefix)
| else:
| return _walkdirs_fs_utf8(top, prefix=prefix)
|
| ^-- Windows 98 does not have Unicode API for FindFirstFileW and Co.
| Although I remember clearly that bzr does not support Windows 98
| explicitly,
| you could change your check for platform so it will be win98-friendly:
|
| + if win32utils.winver == 'Windows NT':
| + try:
| + from bzrlib._walkdirs_win32 import
| _walkdirs_utf8_win32_find_file
| + except ImportError:
| + return _walkdirs_unicode_to_utf8(top, prefix=prefix)
| + else:
| + return _walkdirs_utf8_win32_find_file(top, prefix=prefix)
|
| win32utils.winver == 'Windows NT' on non-windows platforms will be False
| as well.
|
Do we strictly depend on win32utils?
Also, do we get 'Windows NT' for stuff like Vista and beyond?
There is the variable "os.name" which gives me just 'nt' on Vista. Can
you confirm what it is on other platforms?
| But of course such change is entirely for your taste.
|
I'm happy to bring in small changes that allow supporting Win98. I
wasn't going to base the whole work around it.
|
| I don't dig into tests details, but I'm like your pyrex code.
| I don't know is my vote worth something, but +1 from me anyway.
|
|
| And something indirectly related to your pyrex code.
|
| About a year(?) ago I was experimenting on bringing my code for fake
| cygwin-like
| symlinks (win32symlink plugin) into bzrlib. I was failed in the past
| because my code required another stat call for every file, and I found
that
| this slowdown walkdir operations for ~80% on FAT32 and ~15% on NTFS.
|
| Now with your very carefully optimized pyrex extension fake symlinks
| have another
| chance for life. To differentiate fake symlinks from regular files
| SYSTEM file
| attribute is used. So in your pyrex code only one function
| _get_mode_bits needs
| to be tweaked. (And also providing support in some osutils functions too).
|
That would certainly be possible. If you want to look into it a bit, I
would be interested in reviewing the code.
Note, however, that we would need similar functionality in a plain
python function. I suppose it doesn't need to perform extra fast, but it
should at least perform ok on small projects.
Actually, there was one thing I didn't really like about all of this,
which is that I try importing the function every time. I'll work up a
patch to change that, which should also make it easier to switch to a
win98 function.
John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkh/XZUACgkQJdeBCYSNAANtqQCfQcMziphaitupnAADcBqBqXA/
Xd8AoK+c4jZbw0P6F22oItg3aTpOPsEZ
=FrX1
-----END PGP SIGNATURE-----
More information about the bazaar
mailing list