[MERGE] fast read-dir support for unix too

Robert Collins robertc at robertcollins.net
Tue Sep 23 00:04:26 BST 2008


On Mon, 2008-09-22 at 17:56 -0400, John Arbash Meinel wrote:
> 
> I *do* wonder if doing:
> 
>   property st_XXX:
>     def __get__(self):
>       return self._st.st_XXX
> 
> Is faster/slower than:
> 
> 
> cdef class _Stat:
> 
>    cdef readonly int st_XXX
> 
> 
> I know the former is translated into an explicit function, while the 
> latter is
> just put into the 'members' struct.
> It might be worth benchmarking.
> 
> On the flip side, having an explicit wrapping of a 'struct stat' may
> be 
> a great
> way to have other pyrex code access the raw 'struct stat' directly.

The property is likely slightly slower; I have other work to remove the
use of stat itself throughout the higher layers and instead have an
opaque fingerprint; Thats down the track though. For now I'll carefully
setup those types according to the C definition. I use the properties
because of the C macros that cause utter chaos - doing cdef readonly int
st_XXX fails. Also using the stat struct as I do is less copying in the
inner loop.

> ...
> 
> +    cdef _kind_from_mode(self, int mode):
> +        # in order of frequency:
> +        if S_ISREG(mode):
> +            return self._file
> +        if S_ISDIR(mode):
> +            return self._directory
> +        if S_ISCHR(mode):
> +            return self._chardev
> +        if S_ISBLK(mode):
> +            return self._block
> +        if S_ISLNK(mode):
> +            return self._symlink
> +        if S_ISFIFO(mode):
> +            return self._fifo
> +        if S_ISSOCK(mode):
> +            return self._socket
> +        return self._unknown
> 
> ^- I would assume symlinks are more common in working trees than CHR
> and 
> BLK
> devices. Not that it is a huge issue after you've handled files and
> directories.

Yeah, how about I just say 'files and dirs are respectively most
common'.


> I notice that you pushed "lstat" into _read_dir(). Did you decide
> that 
> stat-ing
> in inode order was not a sufficient win to continue using it? Does
> that 
> point
> us to simplifying the _read_dir api?

It wasn't a sufficient win on hot-cache to make me cry losing it; I
intend to reinstate it; I'm happy to do so in this branch, or a later
one; I'd prefer a later one because having smaller deltas is easier for
me to work with.

> I'll note that the docstring is now incorrect:
> cdef _read_dir(path):
>      """Like os.listdir, this reads the contents of a directory.
> 
>      :param path: the directory to list.
>      :return: a list of (sort_key, basename) tuples.
>      """
> 
> ^- as _read_dir is now returning:
> PyList_Append(result, (entry.d_ino, entry.d_name, None,
>      statvalue, None))

Thanks.

> I'll note that it is a little bit ugly to create a tuple and modify it
> *in-place*, but as long as it is contained in the C code, and you 
> haven't
> called hash(tuple), etc, it is probably fine to do. Perhaps we should 
> make it a
> bit clearer that we are doing so?

I disagree on 'ugly' - its what the C layer _does_ to create tuples.
(You make it, and the populate it). The ugliness here is IMO being
half-python and half-CPython.

> +            # We have inode, name, None, statvalue, None
> +            # inode -> path_from_top
> +            # direct concat - faster than operator +.
> 
> # We have (inode, name, None, statvalue, None), which we will replace 
> with
> # (prefix_path, name, kind, statvalue, top_path)

I'll rephrase it because I think you're saying it's not clear at first
read. But we're not replacing one tuple with another, we're editing the
tuple we're creating - and I don't want confusion on that score!

> Otherwise the read_dir() function looks like it is properly
> streamlined 
> on the
> standard path. We use direct C access for all of the common path code.

Theres quite a bit more we can do on that path I think, but its quite
tedious.

> I'll note, though, that you no longer return entries in sorted order.
> I 
> guess
> that is okay, since you use a sorted() call in _walkdirs_utf8()
> 
> However, you need to change the DirReader api documentation:
>   def read_dir(self, prefix, top):
>       """Read a specific dir.
> 
>       :param prefix: A utf8 prefix to be preprended to the path 
> basenames.
>       :param top: A natively encoded path to read.
>       :return: A sorted list of the directories contents. Each item 
> contains:
>           (utf8_relpath, utf8_name, kind, lstatvalue, native_abspath)
>       """
>       raise NotImplementedError(self.read_dir)
> 
> I'll also note that we can get rid of the sorted() in
> UnicodeDirReader.read_dir() and the same code in walkdirs_win32.pyx

Thanks, will do.


> Also, things are quite broken for win32, I'm getting infinite
> recursion 
> in
> "file_kind_from_stat_mode_thunk". You are missing this line:
> +            file_kind_from_stat_mode = _file_kind_from_stat_mode
> 
> After you import the correct version from the _readdir_py.py module.
> 
> So we should probably should have test that it can fallback correctly.

Oh hmm. I'll look at a unit test for this.

> Oh, and you broke the _walkdirs_win32 code, because it was expecting
> to 
> grab
> "formats" from osutils.py. This fixes it for me:
> === modified file 'bzrlib/_walkdirs_win32.pyx'
> ...
> I also think it would be reasonable to rename the
> "_walkdirs_win32.pyx" 
> file to
> something like "_readdir_win32_pyx.pyx" to fit the other naming.

will do.

Thanks for the review.

-Rob
-- 
GPG key available at: <http://www.robertcollins.net/keys.txt>.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : https://lists.ubuntu.com/archives/bazaar/attachments/20080923/adc6e8e7/attachment-0001.pgp 


More information about the bazaar mailing list