On Lists and Iterables
Colin Watson
cjwatson at ubuntu.com
Sun Dec 17 21:52:14 UTC 2017
On Sun, Dec 17, 2017 at 12:22:20PM +0100, Xen wrote:
> Neal McBurnett schreef op 16-12-2017 18:16:
> > For more on the rationale for changes related to iterators see
> > http://portingguide.readthedocs.io/en/latest/iterators.html
>
> That entire rationale is only explained with one word "memory consumption".
>
> So now you are changing the design of your _grammar_ just so that the
> resulting code will use less memory.
>
> That is the job of the compiler, not the developer.
I don't think the document above does a particularly good job of
explaining it, and I think you've fundamentally misunderstood things,
perhaps by extrapolating too much from toy examples.
zip() takes iterables as its inputs; concrete lists are only one kind of
iterable. Iteration constructs are very widespread in non-trivial
Python code, and it's common to make use of iterators to express
constructions where you can cheaply extract a few elements but it would
be expensive to extract them all.
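A toy sketch of that distinction (the `naturals` generator here is mine, for illustration): Python 3's zip() pulls elements on demand, so you can pair a short list against a source that could never be materialised in full.

```python
def naturals():
    """An infinite iterator: building it as a concrete list is impossible."""
    n = 0
    while True:
        yield n
        n += 1

# zip() in Python 3 is lazy, so pairing an infinite iterator with a
# three-element list only ever pulls three elements from each input.
result = list(zip(naturals(), ["a", "b", "c"]))
print(result)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```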
For example, I spend most of my time working on database-backed web
applications, which is a very popular application for Python. In that
context, it's commonplace to make database queries via functions that
return iterators and do lazy loading of results. You then iterate over
these to build a page of results (which can use things like LIMIT and
OFFSET when compiling its SQL queries), and you render and return that.
If you accidentally call something that consumes the whole input
iterable in the process, then it's going to do a *lot* of database
traffic for some queries, and it doesn't take much of that to utterly
destroy the performance of your application.
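To make that concrete with a self-contained sketch (the `query_rows` generator below is a stand-in of my own invention, not a real database API; real ORM cursors behave analogously): taking one page from a lazy result set touches only that page's rows, while an accidental full materialisation would touch all of them.

```python
import itertools

FETCHED = 0  # counts how many "rows" have actually been pulled

def query_rows():
    """Hypothetical stand-in for a lazy database cursor: each yielded
    row represents work done against the database."""
    global FETCHED
    for i in range(1_000_000):
        FETCHED += 1
        yield {"id": i}

# Rendering one page of results: take only what's needed (cf. LIMIT 20).
page = list(itertools.islice(query_rows(), 20))
print(FETCHED)  # 20 -- only the visible page's rows were fetched

# By contrast, list(query_rows()) would pull all million rows.
```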
This is not something that the compiler can optimise, because the
*contract* of zip et al in Python 2 was that they would consume their
entire inputs (up to the shortest one in the case of zip, anyway); iteration is
visible to the program and can have external side-effects, and it's not
something that can be quietly optimised out given the design of the
language. Talking about memory consumption of the result is relevant in
some cases, sure, but it's certainly not the whole story; what often
matters is the work involved in materialising the whole iterable, and
that can be very significant indeed.
In Python 2, there were many functions that took iterables as input and
returned concrete lists, consuming the entire inputs in the process. In
most cases there were versions of these that operated in a lazy fashion
and returned iterables instead, but they were generally hidden off in
the itertools module and less obvious compared to the built-in versions.
Effectively, the language did the wrong thing by default.
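For reference, the Python 2 pairs looked like this (the Python 2 names are shown in comments only, since they no longer exist in Python 3), and Python 3 keeps just the lazy behaviour under the familiar names:

```python
# Python 2 eager built-in  <->  Python 2 lazy equivalent:
#   zip(a, b)              <->  itertools.izip(a, b)
#   map(f, xs)             <->  itertools.imap(f, xs)
#   filter(p, xs)          <->  itertools.ifilter(p, xs)
#   range(n)               <->  xrange(n)
#   d.items()              <->  d.iteritems()

# In Python 3 the familiar name is the lazy one:
lazy = map(str, range(3))
print(lazy)            # a map object, not a list
eager = list(lazy)     # materialisation is now an explicit step
print(eager)           # ['0', '1', '2']
```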
Python 3 changes these around to give preference to the versions that
take iterables as input and return iterables as output, and says that if
you want a list then you have to use list() or similar to get one. This
reduces the cognitive load of the language, because now instead of
remembering the different names for the list-returning and
iterable-returning versions of various things, you only have to remember
one version and the general rule that you use list() to materialise a
whole iterable into a list (which was already useful for other things
even in Python 2). It makes the language simpler to learn, because
there are fewer rules and they compose well; and it makes it easier to
do what's usually the right thing. This comes at the cost of a bit of
porting effort for some code that started out in Python 2, of which
there'll be less and less as time goes on.
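The single rule in practice (dictionary chosen just as an example):

```python
d = {"spam": 1, "eggs": 2}

# One spelling, returning a lazy view over the dictionary...
view = d.items()
print(type(view).__name__)  # dict_items

# ...and one general rule whenever a concrete list is actually wanted:
snapshot = list(d.items())
print(snapshot)  # [('spam', 1), ('eggs', 2)]
```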
To put it another way: "don't perform operations on collections of
unbounded size" is pretty much the number one rule for webapps that I've
picked up over the last few years, and Python 3 takes this lesson and
applies it to the core language.
Toy examples involving zip([1, 2], [3, 4]) and the like miss the point
because they simplify too much. This family of functions is almost
always used in iteration constructs, usually "for ... in" or a
comprehension, and in those common cases the programmer doesn't have to
change anything at all. In cases where they do need to change
something, it has the useful effect of highlighting that something a
little unusual may be going on, rather than hiding behaviour that's
potentially catastrophic at scale behind an innocuous-looking built-in
function.
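For instance, this loop (names and values invented for the example) is written identically in Python 2 and Python 3; the `for` statement consumes the zip lazily either way, so no porting change is needed:

```python
names = ["ham", "eggs"]
prices = [1.5, 2.0]

# Unchanged between Python 2 and 3: the loop drives the iterator itself.
lines = []
for name, price in zip(names, prices):
    lines.append("{}: {}".format(name, price))
print(lines)  # ['ham: 1.5', 'eggs: 2.0']
```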
> Meanwhile Python 3.4 can be excessively slower than 2.7. SO WHERE'S THE
> GAIN?
It will no doubt depend on the benchmark, and rather than cherry-picking
a single one it's likely more interesting to look at either a wide range
of benchmarks, or at the specific application in question.
Counterpoint, which also links to much more data:
https://lwn.net/Articles/725114/
--
Colin Watson [cjwatson at ubuntu.com]