urlutils improvements from hpss
John Arbash Meinel
john at arbash-meinel.com
Thu Apr 5 18:16:23 BST 2007
Andrew Bennetts wrote:
> John Arbash Meinel wrote:
>> Martin Pool wrote:
>>> On 3/30/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
>>>> I don't really understand the need for "joinpath" versus "pathjoin".
>>> Did you mean "versus join_url_path"?
>>>
>>> Anyhow, I agree, the module name makes it clear enough.
>>>
>> No, I mean we have 2 functions. "osutils.joinpath" and
>> "osutils.pathjoin". Where "pathjoin" is inherited from os.path.join, but
>> then "joinpath" seems to be doing the same thing, only disallowing '..',
>> '', and None in the paths.
>
> Also, osutils.pathjoin varies behaviour depending on the host platform, which is
> not something we want to do with URLs. Joining URL paths should work
> identically on all platforms.
>
I agree, though I thought that osutils.pathjoin() should function
approximately the same, regardless of the platform. (Since internally we
use all '/' separated paths)
>> The naming is close enough to make it seem like they should be the same
>> function. And they do almost the same thing, but one has a subtle
>> difference.
>>
>> For urlutils I simplified it to urlutils.join().
>>
>> But now there is urlutils.joinpath(), and from a naive glance, I don't
>> see a big difference between the two.
>
> urlutils.join joins a full URL, and one or more path parts of a URL, returning a new
> URL.
>
> urlutils.joinpath joins a path part of a URL, and one or more path parts of a
> URL, returning a new path part of a URL.
maybe 'urlutils.join_partial'? Or "join_fragments"?
>
> There are various places where we want to just join paths in URLs, rather than
> full URLs, e.g. in the chrooting code and remote path calculations.
>
> You can work around the lack of urlutils.pathjoin by using urlutils.join with
> "fake:///" or similar ugliness, but that's hardly clear.
>
> There are many places in bzrlib that I think could use urlutils.pathjoin (e.g.
> Transport._combine_paths, and various other reimplemenations of the same idea).
> Having a single implementation in an obvious place is a first step to reducing
> this duplication.
>
>> What is also weird to me is that urlutils.join() does this in the interior:
>>
>> path = '/'.join(path)
>> path = joinpath(path, arg)
>> path = path.split('/')
>
> It makes sense for urlutils.join to delegate part of its task to
> urlutils.pathjoin, rather than duplicating the logic.
I agree, but it seems like both of these should be using a helper which
works in the same way that they internally do (on lists), rather than
munging to get a string, to pass to a command which converts to a list,
and then converts back to a string, so that we can split it back into a
list.
>
>> so it takes ['foo', 'bar', 'baz'] => 'foo/bar/baz' passes it to
>> 'joinpath' and then splits it back up. At best this is inefficient.
>>
>> Especially considering that 'joinpath()' is defined first as:
>>
>> def joinpath(base, *args):
>> path = base.split('/')
>> ...
>
> I'm happy for performance to be improved, so long as there's a real problem
> here. There could well be, I haven't done benchmarking, although I think it's
> safe to say this isn't the bottleneck for hpss yet :)
It isn't so much the performance as seeming really ugly.
>
> My main concern when writing this was correctness. These functions proved quite
> fiddly to get passing tests.
>
>> So that means the above becomes:
>>
>>
>> path = '/'.join(path)
>> path = path.split('/')
>> # operate on extra argument
>> path = '/'.join(path)
>> path = path.split('/')
>
> Sure. There's probably a reasonable win to be had by extracting out a
> _joinpath_segments variant that assumes the path.split('/') has already been
> done, and then urlutils.join could call that directly.
>
> Another possibility is to have a URL class, which these could be methods of. A
> URL instance could have this sort of parsing cached. The added overhead of
> using a custom object instead of str for representing URLs isn't clearly a
> performance win, although it's probably a clarity win. It would also be more
> work, especially considering backwards compability with strings would be needed
> for some time.
...
>> I suppose I would like to know what was wrong with "urlutils.join()"
>> that it couldn't be used rather than "urlutils.joinpath()".
>
> The problem is that urlutils.join does something else, so is at best awkward to
> use in places that can use urlutils.joinpath directly. I'm not sure why this
> was unclear, does the docstring need to be made clearer.
>
> I wanted to be able to join two path parts (for lack of better terminology) of a
> URL without adding "fake:///" to one and then stripping it again, or worse
> reimplementing the same logic as urlutils.join. I note that e.g.
> Transport._combine_paths does this. SFTPTransport._remote_path duplicates most
> of the same logic (although with the added wrinkle of /~/), and I seem to recall
> there were other places I've seen too. I've personally had to fix the same bugs
> over and over again because of this duplication already. Extracting the
> relevant logic into somewhere reusable (and independently testable!) seemed like
> an obvious step.
>
> -Andrew.
Probably if it was called something other than "joinpath" or "pathjoin"
I would feel better about it. Simply because those names are already
overloaded (too much, IMO).
I agree that having a separate function for it is very reasonable and an
overall good thing. (I certainly wouldn't advocate doing any weird hacks
like 'fake:///').
John
=:->
More information about the bazaar
mailing list