[MERGE] sftp readv performance tweaks (2x faster)
Martin Pool
mbp at canonical.com
Fri Jul 21 00:23:42 BST 2006
On 17 Jul 2006, John Arbash Meinel <john at arbash-meinel.com> wrote:
> Well, I ran scripts all weekend to figure out how fast I could make sftp
> with my changes. And the attached image shows how it went.
Very nice. Given the improvement, it would be good for 0.9 - do you
have any reservations?
> + @staticmethod
> + def _coalesce_offsets(offsets, limit, fudge_factor):
> + """Yield coalesced offsets.
> +
> + With a long list of neighboring requests, combine them
> + into a single large request, while retaining the original
> + offsets.
> + Turns [(15, 10), (25, 10)] => [(15, 20, [(0, 10), (10, 10)])]
> +
> + :param offsets: A list of (start, length) pairs
> + :param limit: Only combine a maximum of this many pairs
> + Some transports penalize multiple reads more than
> + others, and sometimes it is better to return early.
> + 0 means no limit
> + :param fudge_factor: All transports have some level of 'it is
> + better to read some more data and throw it away rather
> + than seek', so collapse if we are 'close enough'
> + :return: a generator of [(start, total_length,
> + [(inner_offset, length)])]
> + (start, total_length) is where to start, and how much to read
> + and (inner_offset, length) is how those chunks should be
> + split up.
These lists are getting to a size/complexity where they might be better
as trivial objects rather than a list - with say .start, .total_length,
.ranges. Unless there's a compelling performance reason, it's better not
to store objects with multiple fields inside a list or tuple. It might
make the tests a bit more readable too.
This couldn't be a generator then, but does it get any value from being
one? Since you sort the input it can't really stream anyway.
Just a thought - if you think it's better this way you can leave it.
> + """
> + last_end = None
> + cur_start = None
> + cur_total = None
> + cur_subranges = []
> +
> + for start, size in offsets:
> + end = start + size
> + if (last_end is not None
> + and start <= last_end + fudge_factor
> + and start >= cur_start
> + and (limit <= 0 or len(cur_subranges) < limit)):
> + cur_total = end - cur_start
> + cur_subranges.append((start-cur_start, size))
> else:
> - if (len (combined_offsets) < 50 and
> - combined_offsets[-1][0] + combined_offsets[-1][1] == offset):
> - # combatible offset:
> - combined_offsets.append([offset, size])
> - else:
> - # incompatible, or over the threshold issue a read and yield
> - pending_offsets.appendleft((offset, size))
> - for result in do_combined_read(combined_offsets):
> - yield result
> - combined_offsets = []
> - # whatever is left is a single coalesced request
> - if len(combined_offsets):
> - for result in do_combined_read(combined_offsets):
> - yield result
> + if cur_start is not None:
> + yield (cur_start, cur_total, cur_subranges)
> + cur_start = start
> + cur_total = size
> + cur_subranges = [(0, size)]
> + last_end = end
> +
> + if cur_start is not None:
> + yield (cur_start, cur_total, cur_subranges)
> +
> + return
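For anyone following along, here is the coalescing logic from the hunk
above pulled out as a standalone sketch - same algorithm, just renamed so
it can be run on its own. Note it assumes the offsets are already sorted
by start position, since the caller sorts before calling:

```python
def coalesce_offsets(offsets, limit=0, fudge_factor=0):
    """Yield (start, total_length, [(inner_offset, length)]) tuples.

    Neighbouring (start, length) requests are combined into one larger
    read when they are within fudge_factor bytes of each other, up to
    limit sub-ranges per combined read (0 means no limit).
    """
    last_end = None
    cur_start = None
    cur_total = None
    cur_subranges = []

    for start, size in offsets:
        end = start + size
        if (last_end is not None
                and start <= last_end + fudge_factor
                and start >= cur_start
                and (limit <= 0 or len(cur_subranges) < limit)):
            # Close enough to the pending read: extend it and record
            # where this chunk sits inside the combined result.
            cur_total = end - cur_start
            cur_subranges.append((start - cur_start, size))
        else:
            # Too far away (or over the limit): flush the pending read
            # and start a new one.
            if cur_start is not None:
                yield (cur_start, cur_total, cur_subranges)
            cur_start = start
            cur_total = size
            cur_subranges = [(0, size)]
        last_end = end

    if cur_start is not None:
        yield (cur_start, cur_total, cur_subranges)
```

The docstring's example works as advertised:
list(coalesce_offsets([(15, 10), (25, 10)])) gives
[(15, 20, [(0, 10), (10, 10)])].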
>
> def get_multi(self, relpaths, pb=None):
> """Get a list of file-like objects, one for each entry in relpaths.
>
> === modified file 'bzrlib/transport/sftp.py'
> --- bzrlib/transport/sftp.py 2006-06-30 15:13:19 +0000
> +++ bzrlib/transport/sftp.py 2006-07-17 18:14:33 +0000
> @@ -426,25 +434,6 @@
> except (IOError, paramiko.SSHException), e:
> self._translate_io_exception(e, path, ': error retrieving')
>
> - def get_partial(self, relpath, start, length=None):
> - """
> - Get just part of a file.
> -
> - :param relpath: Path to the file, relative to base
> - :param start: The starting position to read from
> - :param length: The length to read. A length of None indicates
> - read to the end of the file.
> - :return: A file-like object containing at least the specified bytes.
> - Some implementations may return objects which can be read
> - past this length, but this is not guaranteed.
> - """
> - # TODO: implement get_partial_multi to help with knit support
> - f = self.get(relpath)
> - f.seek(start)
> - if self._do_prefetch and hasattr(f, 'prefetch'):
> - f.prefetch()
> - return f
> -
Was this just no longer used?
--
Martin
More information about the bazaar mailing list