urlutils improvements from hpss

Thu Apr 5 18:16:23 BST 2007

Andrew Bennetts wrote:
> John Arbash Meinel wrote:
>> Martin Pool wrote:
>>> On 3/30/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
>>>> I don't really understand the need for "joinpath" versus "pathjoin".
>>> Did you mean "versus join_url_path"?
>>>
>>> Anyhow, I agree, the module name makes it clear enough.
>>>
>> No, I mean we have 2 functions. "osutils.joinpath" and
>> "osutils.pathjoin". Where "pathjoin" is inherited from os.path.join, but
>> then "joinpath" seems to be doing the same thing, only disallowing '..',
>> '', and None in the paths.
> 
> Also, osutils.pathjoin varies behaviour depending on the host platform, which is
> not something we want to do with URLs.  Joining URL paths should work
> identically on all platforms.
> 

I agree, though I thought that osutils.pathjoin() should function
approximately the same, regardless of the platform. (Since internally we
use all '/' separated paths)

>> The naming is close enough to make it seem like they should be the same
>> function. And they do almost the same thing, but one has a subtle
>> difference.
>>
>> For urlutils I simplified it to urlutils.join().
>>
>> But now there is urlutils.joinpath(), and from a naive glance, I don't
>> see a big difference between the two.
> 
> urlutils.join joins a full URL, and one or more path parts of a URL, returning a new
> URL.
> 
> urlutils.joinpath joins a path part of a URL, and one or more path parts of a
> URL, returning a new path part of a URL.

maybe 'urlutils.join_partial'? Or "join_fragments"?

> 
> There are various places where we want to just join paths in URLs, rather than
> full URLs, e.g. in the chrooting code and remote path calculations.
> 
> You can work around the lack of urlutils.pathjoin by using urlutils.join with
> "fake:///" or similar ugliness, but that's hardly clear.
> 
> There are many places in bzrlib that I think could use urlutils.pathjoin (e.g.
> Transport._combine_paths, and various other reimplemenations of the same idea).
> Having a single implementation in an obvious place is a first step to reducing
> this duplication.
> 
>> What is also weird to me is that urlutils.join() does this in the interior:
>>
>>             path = '/'.join(path)
>>             path = joinpath(path, arg)
>>             path = path.split('/')
> 
> It makes sense for urlutils.join to delegate part of its task to
> urlutils.pathjoin, rather than duplicating the logic.

I agree, but it seems like both of these should be using a helper which
works in the same way that they internally do (on lists), rather than
munging to get a string, to pass to a command which converts to a list,
and then converts back to a string, so that we can split it back into a
list.

> 
>> so it takes ['foo', 'bar', 'baz']  => 'foo/bar/baz' passes it to
>> 'joinpath' and then splits it back up. At best this is inefficient.
>>
>> Especially considering that 'joinpath()' is defined first as:
>>
>> def joinpath(base, *args):
>>   path = base.split('/')
>>   ...
> 
> I'm happy for performance to be improved, so long as there's a real problem
> here.  There could well be, I haven't done benchmarking, although I think it's
> safe to say this isn't the bottleneck for hpss yet :)

It isn't so much the performance as seeming really ugly.

> 
> My main concern when writing this was correctness.  These functions proved quite
> fiddly to get passing tests.
> 
>> So that means the above becomes:
>>
>>
>>    path = '/'.join(path)
>>    path = path.split('/')
>>    # operate on extra argument
>>    path = '/'.join(path)
>>    path = path.split('/')
> 
> Sure.  There's probably a reasonable win to be had by extracting out a
> _joinpath_segments variant that assumes the path.split('/') has already been
> done, and then urlutils.join could call that directly.
> 
> Another possibility is to have a URL class, which these could be methods of.  A
> URL instance could have this sort of parsing cached.  The added overhead of
> using a custom object instead of str for representing URLs isn't clearly a
> performance win, although it's probably a clarity win.  It would also be more
> work, especially considering backwards compability with strings would be needed
> for some time.

...

>> I suppose I would like to know what was wrong with "urlutils.join()"
>> that it couldn't be used rather than "urlutils.joinpath()".
> 
> The problem is that urlutils.join does something else, so is at best awkward to
> use in places that can use urlutils.joinpath directly.  I'm not sure why this
> was unclear, does the docstring need to be made clearer.
> 
> I wanted to be able to join two path parts (for lack of better terminology) of a
> URL without adding "fake:///" to one and then stripping it again, or worse
> reimplementing the same logic as urlutils.join.  I note that e.g.
> Transport._combine_paths does this.  SFTPTransport._remote_path duplicates most
> of the same logic (although with the added wrinkle of /~/), and I seem to recall
> there were other places I've seen too.  I've personally had to fix the same bugs
> over and over again because of this duplication already.  Extracting the
> relevant logic into somewhere reusable (and independently testable!) seemed like
> an obvious step.
> 
> -Andrew.

Probably if it was called something other than "joinpath" or "pathjoin"
I would feel better about it. Simply because those names are already
overloaded (too much, IMO).

I agree that having a separate function for it is very reasonable and an
overall good thing. (I certainly wouldn't advocate doing any weird hacks
like 'fake:///').

John
=:->