[MERGE] TreeTransform avoids many renames when constructing trees

Tue Jun 5 15:16:54 BST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Aaron Bentley wrote:
> John Arbash Meinel wrote:
>> John Arbash Meinel has voted +1.
>> Status is now: Approved
>> Comment:
>> Adding self.rename_count doesn't seem strictly necessary. Is it just for
>> testing?
> 
> Yes.  I thought that was a planned direction for future work, e.g. to
> make sure we don't open too many transports, take too many locks, etc.
> Do I misunderstand?

It didn't seem strictly necessary, but I'm okay with it for testing. It doesn't
seem to add a lot of overhead.

> 
>> I find names like "frexpar" to be more confusing than just saying
>> "first_dir" or "second_dir". Having more meaningful names would make it
>> easier to figure out what the tests are doing.
> 
> Yeah, I've moved away from the cutesy test suites, but most style guides
> recommend retaining the style of surrounding code, which in this case
> was wizard-of-oz-based.  I can make it more straightforward.

Well, I'm fine with biasing local consistency over "correctness", as I don't
expect you to rename everything.

> 
>> I would be more intrigued if you could create a target directory which
>> claims to be a child of oz-id and neither one had a name yet.
> 
> I'm not sure what you mean.  You mean a triply-nested path, like
> foo/bar/baz?  Creating the contents for all three first, then assigning
> the name?

well, I was just meaning 'foo/bar' where 'foo' has not been named yet, but
'bar' has. Ultimately if it is an uncommon code path, we don't need to optimize
for it. It is just the common case of 50,000 files being built in a single
large tree that we need to make faster.

> 
>> Just doing
>> it at the top level means you haven't tried to do any nesting yet. (It
>> may not be possible to create a child in a directory that doesn't have a
>> name).
> 
> Well, it's possible to create a child that has a name in a directory
> that has no name.
> 
>> The more I look at it, I don't understand why "test_change_parent()" can
>> have the final file in the correct directory, but "test_reuse_name()"
>> can't.
> 
> In theory it could, but creating elphaba2 while elphaba1 is occupying
> that limbo path causes elphaba2 to be created at the top level of limbo.
>  I don't consider it worthwhile to rename top-level limbo files into
> their parents before Transform.apply().  If we did that, we would be
> renaming a file before Transform.apply() to avoid renaming the file
> during Transform.apply().

Well, you are doing that with the test_change_parent test. So I'm confused why
you aren't doing it consistently. Either always do it, or never, at least from
what I can see.

> 
>> Also, we don't really need the top-level directories to have their final
>> names until the rename. Since we have to "os.rename('limbo/*', 'wt/*')"
>> anyway. I don't know whether this simplifies your code or not. But you
>> could use "oz == 'new-2'" until you go to "rename limbo/new-2 => oz"
> 
> Yes, that's what I do.

Ok, you just had written comments like:

# limbo/oz => oz

and other comments like

# limbo/new-2 => foo

So it seemed like 'oz' was getting its final name in limbo but 'foo' wasn't.
You may want to revisit the comments a little bit.

> 
>> You don't actually have to build up the temporary list here:
>> -            for trans_id, kind in self._new_contents.iteritems():
>> -                path = self._limbo_name(trans_id)
>> +            entries = [(self._limbo_name(t), t, k) for t, k in
>> +                       self._new_contents.iteritems()]
>> +            for path, trans_id, kind in sorted(entries, reverse=True):
> 
>> You could do it as a generator, which saves a bit of memory. Or you
>> could do it as:
>> entries = sorted((self._limbo_name(t), t, k)
>>                  for t, k in self._new_contents.iteritems(),
>>                  reverse=True)
> 
>> I don't think it is strictly a big deal, but this is O(tree) so it is
>> nice to avoid extra copies when possible.
> 
> AIUI, it is faster to iterate through a list than a generator.

Well, you are creating a list to pass it to sorted, which generates another
list. I do believe that
  foo = sorted([list expression])
is better than
  foo = [list expression]
  foo.sort()
But I haven't tested anything between:
  foo = sorted(generator)
versus
  foo = sorted([list expression])

The quick test is:

TIMEIT -s "y = range(100000); import random; random.shuffle(y)" \
  "x = sorted([n for n in y])"
10 loops, best of 3: 747 msec per loop

  "x = sorted(n for n in y)"
10 loops, best of 3: 764 msec per loop

Which indicates to me they are pretty much equivalent. (I would guess the
variation is actually because of random.shuffle() more than because of
list/generator).

Interestingly at 10,000 entries the times are 50.0ms and 49.2ms favoring the
list form. (Just sorting y is 43.9ms). So there may be a bias, but it is a
small one. (If I had to guess, it might be that sorted() can pre-allocate the
length of the list rather than growing it as new entries come in).

> 
>> I'm guessing TT._limbo_name() could be optimized a bit (like have it do
>> 1 lookup with a try/except KeyError rather than 2 with a "x in y: return
>> y[x]". But that sort of depends on whether x is usually present or not.
>> And from what I could see, it is a 50/50 mix. (You use it the first time
>> you create something, and then when you are renaming everything into place)
> 
> Yeah, could be.  I wanted to defer optimization until these changes were
> merged, to reduce the size of this patch.
> 
> Thanks for your review.
> 
> Aaron

Certainly. And other than some minor doc things, I think it is fine to come is
as in. Enough that I wouldn't block it for 0.17 as-is.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGZXBWJdeBCYSNAAMRAjcLAKCGmAwS5lDHeHCZGMDBeeKTSnrWGACfUKq7
KT71oSqBdjQLPRk4umM0MHM=
=NvKJ
-----END PGP SIGNATURE-----