[RFC] Repository.get_file_texts API and planning for it

Wed Aug 15 15:38:19 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> Robert Collins wrote:
>> Aaron and I chatted on IRC earlier today. We'd like to add an interface
>> which will extract a number of file texts from a repository in whatever
>> manner is best for that repository, calling back to a tree transform (or
>> possibly a callback ?) for each text.
> 
> In order to maintain our current set of abstractions, it seems like this
> will need a corresponding method on RevisionTree:
> 
> def create_files([(file_id, trans_id)])
> 
> Which would have a naive implementation on Tree.
> 
> Also, it kinda sucks that DirStateRevisionTree is not a RevisionTree,
> because we have to implement it there, too.
> 
>> This needs changes to Repository - adding the new method (we need a good
>> name
> 
> Repository.create_files ?
> 
>> - its not a regular getter, because it calls back with the text, or
>> the text lines, or a file object - we should decide what is least
>> friction here too).
> 
> Iterables of bytes is a very convenient one.  Text lines is nice only
> when working with text.  File objects have high API demands, but even
> strings are iterables of bytes.

You've said this in the past, and while I agree it is convenient, it has
some odd performance characteristics. Specifically (edited for clarity):

STRING="'this is a test of writelines with a fairly long string. The
content of this string does not really matter, it is just meant to be
fairly long because file texts are long'"

% TIMEIT -s "from cStringIO import StringIO" \
          "sio = StringIO(); sio.writelines($STRING)"
100000 loops, best of 3: 19.7 usec per loop

% TIMEIT -s "from cStringIO import StringIO" \
         "sio = StringIO(); sio.writelines([$STRING])"
1000000 loops, best of 3: 1.82 usec per loop

% TIMEIT -s "from cStringIO import StringIO" \
         "sio = StringIO(); sio.writelines(iter([$STRING]))"
100000 loops, best of 3: 2.14 usec per loop

% TIMEIT -s "from cStringIO import StringIO" \
         -s "sio_text = StringIO($STRING)" \
          "sio = StringIO(); sio_text.seek(0); sio.writelines(sio_text)"
1000000 loops, best of 3: 3.17 usec per loop

This isn't a perfect comparison; I really should have read in a real
file and timed that. However, you can see that:

a) StringIO.writelines() is happy to take any iterable. It handles a
string, a list, an iterator and another file-like object without any
problems.

b) Using a plain string instead of any other format is about 10x slower,
and I would assume this only gets *worse* with longer strings: iterating
over a string has to create N single-byte strings, while all of the
others get to work on some sort of chunk. This test is also biased, in
that there is a single line rather than multiple lines. However, even in
the worst case, splitting into lines still means fewer chunks to write
by a factor of the average line length (40 chars?).
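
To make the chunk counts in (b) concrete, here is a small sketch. It
assumes Python 3 and io.StringIO rather than the cStringIO used in the
timings above (iterating a Python 3 str yields one-character strings,
which is analogous to the single-byte strings described here). The
helper name count_chunks is hypothetical, purely for illustration.

```python
import io


def count_chunks(iterable):
    """Count how many separate pieces writelines() would be handed."""
    return sum(1 for _ in iterable)


text = "first line\nsecond line\nthird line\n"

# A bare string iterates one character at a time.
as_string = count_chunks(text)
# A single-entry list hands over the whole text as one chunk.
as_single_list = count_chunks([text])
# A list of lines hands over one chunk per line.
as_lines = count_chunks(text.splitlines(True))
# A file-like object also iterates line by line.
as_file = count_chunks(io.StringIO(text))

print(as_string, as_single_list, as_lines, as_file)
```

The bare string produces as many chunks as it has characters, while
every other form produces at most one chunk per line.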

Now, to be a little more fair, I went ahead and redid the tests using a
real file text as the source:

% TIMEIT -s "from cStringIO import StringIO" \
         -s "lines = open('builtins.py', 'rb').readlines()" \
         "sio = StringIO(); sio.writelines(lines)"
1000 loops, best of 3: 554 usec per loop

% TIMEIT -s "from cStringIO import StringIO" \
         -s "lines = open('builtins.py', 'rb').readlines()" \
         -s "text = ''.join(lines)" \
         "sio = StringIO(); sio.writelines(text)"
100 loops, best of 3: 15 msec per loop

% TIMEIT -s "from cStringIO import StringIO" \
         -s "lines = open('builtins.py', 'rb').readlines()" \
         -s "text = StringIO(''.join(lines))" \
         "sio = StringIO(); text.seek(0); sio.writelines(text)"
1000 loops, best of 3: 1.18 msec per loop

% TIMEIT -s "from cStringIO import StringIO" \
         -s "lines = open('builtins.py', 'rb').readlines()" \
         -s "text = [''.join(lines)]" \
         "sio = StringIO(); sio.writelines(text)"
10000 loops, best of 3: 132 usec per loop

So the order, from fastest to slowest, is:
  132us writelines([single_string])
  554us writelines([lots of strings])
 1180us writelines(StringIO(single_string))
15000us writelines(single_string)
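
For anyone who wants to rerun the four cases as a script rather than
from the shell, a rough sketch follows. It makes several assumptions
that differ from the runs above: Python 3 with io.StringIO instead of
cStringIO, a synthetic text instead of builtins.py, and the timeit
module called directly. Absolute numbers will not match the ones quoted
here; only the relative ordering is of interest.

```python
import io
import timeit

# Synthetic stand-in for a real file text (an assumption, not builtins.py).
lines = ["line %d of a synthetic file text\n" % i for i in range(2000)]
text = "".join(lines)
text_file = io.StringIO(text)


def write_from_file():
    # Rewind the source, then let writelines() iterate it line by line.
    text_file.seek(0)
    io.StringIO().writelines(text_file)


cases = {
    "writelines([single_string])": lambda: io.StringIO().writelines([text]),
    "writelines([lots of strings])": lambda: io.StringIO().writelines(lines),
    "writelines(StringIO(single_string))": write_from_file,
    "writelines(single_string)": lambda: io.StringIO().writelines(text),
}

# Best of 3 runs of 20 calls each, reported as seconds per call.
timings = {}
for name, func in cases.items():
    timings[name] = min(timeit.repeat(func, number=20, repeat=3)) / 20

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print("%10.1f us  %s" % (seconds * 1e6, name))
```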

So *if* you already have a single string in memory, it is *vastly*
better to just wrap it in a list with a single entry and return that.
It is more than 100x faster than passing the bare string (132us versus
15000us). Passing back a StringIO of the string is about 9x slower than
the single-entry list, most likely because it parses through the data to
find the '\n' characters and creates smaller strings as it goes, which
also means more write calls overall.

So while I agree that an iterable of bytes is a convenient and very
adaptable API, we really don't want to be passing a plain string to
it.
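
One way a producer could guard against this is sketched below. It
assumes Python 3 and io.BytesIO, and the names as_chunks and write_text
are hypothetical helpers, not anything that exists in bzrlib.

```python
import io


def as_chunks(data):
    """Hypothetical helper: wrap a bare string so it travels as one chunk."""
    if isinstance(data, (bytes, str)):
        return [data]
    # Lists, iterators and file-like objects are already chunked.
    return data


def write_text(out, data):
    """Write data to out without iterating a bare string piece by piece."""
    out.writelines(as_chunks(data))


buf = io.BytesIO()
write_text(buf, b"one\ntwo\n")              # bare string, wrapped first
write_text(buf, [b"three\n", b"four\n"])    # already a list of chunks
result = buf.getvalue()
```

The caller still sees an "iterable of bytes" interface, but a text that
is already a single string in memory is handed over in one piece.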

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGww/bJdeBCYSNAAMRAkSpAJ0YvHaMNJwAxGKdo45ZR4yis0JF+QCgjzI8
xnh5gl8n5wZCOMqiozZckFM=
=Z/ri
-----END PGP SIGNATURE-----