[MERGE] (robertc) Cap the amount of data we write in a single IO during local path pack operations to fix bug 255656. (Robert Collins)
John Arbash Meinel
john at arbash-meinel.com
Fri Aug 15 17:27:51 BST 2008
Robert Collins wrote:
> On Thu, 2008-08-14 at 23:08 -0500, John Arbash Meinel wrote:
>
>
>> This at least feels like it should be a helper function akin to
>> "osutils.pumpfile".
>
> Ok, I'll put one [tested] together.
>
>> BB:tweak
>>
>> The actual loop seems fine, though I would wonder about buffer() versus
>> just bytes[start:end]. (I realize there is at least 1 copy in using
>> slicing, but I also don't think we need to be using a 5MB buffer here.)
>
> bytes[start:end] does a memcpy. buffer does not.
>
> -Rob
As I said "1 copy using slicing". I understand that buffer works without
copying. I'm quite curious to probe deeper and see how it works.
Specifically, does it cache its hash value? That property comes in handy
for PatienceDiff, and GroupCompress. It looks like it does have a cached
hash, and it further looks like it uses the same hash as string objects:
  x = *p << 7;
  while (--len >= 0)
      x = (1000003*x) ^ *p++;
  x ^= a->ob_size;
  if (x == -1)
      x = -2;
Which is good, and important for groupcompress and patiencediff. (If
hash(buffer(x, start, len)) gave a different hash than
hash(x[start:start+len]) then we would have trouble in places.)
It still requires an object creation. There are also other limitations,
such as:
>>> y = buffer('foo')
>>> z = buffer('bar')
>>> ''.join([y, z])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string, buffer found
So buffer() objects aren't supported by str.join(). That may be a big
issue.
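(As an aside -- and this is a sketch assuming a modern interpreter, where memoryview replaced buffer() -- bytes.join accepts arbitrary buffer-protocol objects, so this particular limitation is specific to str.join with buffer:)

```python
x = b'1234567890'
# split the string into zero-copy 2-byte views
parts = [memoryview(x)[i:i + 2] for i in range(0, len(x), 2)]
# bytes.join accepts buffer-protocol objects directly
assert b''.join(parts) == x
```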
My other concern for Group-Compress specifically is that the buffer
object has to hold a reference to the original string. So imagine a
worst-case scenario, where you have 3 10kB texts in 3 different 20MB GC
blocks. If we used buffer() as the way to return chunks, you would end
up holding on to 3*20MB = 60MB of memory for 30kB of actual texts.
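(That pinning behaviour is easy to demonstrate with memoryview, buffer's successor -- again my own sketch, with a small stand-in block rather than a real 20MB one. The view keeps the whole underlying object alive until you copy out the bytes you actually need:)

```python
big = bytes(1024 * 1024)           # stand-in for one large GC block
chunk = memoryview(big)[:10240]    # 10kB slice, no copy made
assert chunk.obj is big            # the view pins the entire block
small = bytes(chunk)               # an explicit copy...
chunk.release()                    # ...lets the block be freed
assert len(small) == 10240
```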
Note also that buffer(unicode, 2, 3) works, but it exposes the
underlying unicode implementation:
>>> n = buffer(u'12324234', 2, 4)
>>> str(n)
'2\x003\x00'
Anyway, I certainly think that buffer should be explored. It has the
potential to be quite interesting for saving memory copies. I'm just
concerned that buffer() isn't quite enough like a str() for what we need.
You also can't do:
>>> import cStringIO
>>> sio = cStringIO.StringIO()
>>> x = '1234567890'
>>> y = [buffer(x, start, 2) for start in xrange(len(x))]
>>> sio.writelines(y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: expected string or Unicode object, buffer found
Which I find odd, because
>>> sio.write(y[0])
works fine.
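(For comparison, a hedged sketch on a modern interpreter: io.BytesIO, the cStringIO successor, accepts buffer-protocol objects in both write() and writelines(), so the inconsistency is gone there:)

```python
import io

sio = io.BytesIO()
x = b'1234567890'
y = [memoryview(x)[i:i + 2] for i in range(0, len(x), 2)]
sio.writelines(y)            # accepted, unlike cStringIO + buffer
assert sio.getvalue() == x
```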
John
=:->