urlutils improvements from hpss

Mon Apr 2 01:22:48 BST 2007

John Arbash Meinel wrote:
> Martin Pool wrote:
> > On 3/30/07, John Arbash Meinel <john at arbash-meinel.com> wrote:
> >> I don't really understand the need for "joinpath" versus "pathjoin".
> > 
> > Did you mean "versus join_url_path"?
> > 
> > Anyhow, I agree, the module name makes it clear enough.
> > 
> 
> No, I mean we have 2 functions. "osutils.joinpath" and
> "osutils.pathjoin". Where "pathjoin" is inherited from os.path.join, but
> then "joinpath" seems to be doing the same thing, only disallowing '..',
> '', and None in the paths.

Also, osutils.pathjoin varies behaviour depending on the host platform, which is
not something we want to do with URLs.  Joining URL paths should work
identically on all platforms.

> The naming is close enough to make it seem like they should be the same
> function. And they do almost the same thing, but one has a subtle
> difference.
> 
> For urlutils I simplified it to urlutils.join().
> 
> But now there is urlutils.joinpath(), and from a naive glance, I don't
> see a big difference between the two.

urlutils.join joins a full URL, and one or more path parts of a URL, returning a new
URL.

urlutils.joinpath joins a path part of a URL, and one or more path parts of a
URL, returning a new path part of a URL.

There are various places where we want to just join paths in URLs, rather than
full URLs, e.g. in the chrooting code and remote path calculations.

You can work around the lack of urlutils.pathjoin by using urlutils.join with
"fake:///" or similar ugliness, but that's hardly clear.

There are many places in bzrlib that I think could use urlutils.pathjoin (e.g.
Transport._combine_paths, and various other reimplemenations of the same idea).
Having a single implementation in an obvious place is a first step to reducing
this duplication.

> What is also weird to me is that urlutils.join() does this in the interior:
> 
>             path = '/'.join(path)
>             path = joinpath(path, arg)
>             path = path.split('/')

It makes sense for urlutils.join to delegate part of its task to
urlutils.pathjoin, rather than duplicating the logic.

> so it takes ['foo', 'bar', 'baz']  => 'foo/bar/baz' passes it to
> 'joinpath' and then splits it back up. At best this is inefficient.
> 
> Especially considering that 'joinpath()' is defined first as:
> 
> def joinpath(base, *args):
>   path = base.split('/')
>   ...

I'm happy for performance to be improved, so long as there's a real problem
here.  There could well be, I haven't done benchmarking, although I think it's
safe to say this isn't the bottleneck for hpss yet :)

My main concern when writing this was correctness.  These functions proved quite
fiddly to get passing tests.

> So that means the above becomes:
> 
> 
>    path = '/'.join(path)
>    path = path.split('/')
>    # operate on extra argument
>    path = '/'.join(path)
>    path = path.split('/')

Sure.  There's probably a reasonable win to be had by extracting out a
_joinpath_segments variant that assumes the path.split('/') has already been
done, and then urlutils.join could call that directly.

Another possibility is to have a URL class, which these could be methods of.  A
URL instance could have this sort of parsing cached.  The added overhead of
using a custom object instead of str for representing URLs isn't clearly a
performance win, although it's probably a clarity win.  It would also be more
work, especially considering backwards compability with strings would be needed
for some time.

> Obviously I missed this when Robert proposed this patch. Though it is
> the 'hpss-urlutils' patch. So it is fairly recent.
> 
> I'm all for code re-use, but something is seriously borked with this
> code path. I sure hope we don't call 'urlutils.join()' with any sort of
> frequency. (Though I think we do it for every Transport call, which
> means it happens at least 2 times per knit we process).

Well, I'm all for faster code, so long as it actually matters :)

> I suppose I would like to know what was wrong with "urlutils.join()"
> that it couldn't be used rather than "urlutils.joinpath()".

The problem is that urlutils.join does something else, so is at best awkward to
use in places that can use urlutils.joinpath directly.  I'm not sure why this
was unclear, does the docstring need to be made clearer.

I wanted to be able to join two path parts (for lack of better terminology) of a
URL without adding "fake:///" to one and then stripping it again, or worse
reimplementing the same logic as urlutils.join.  I note that e.g.
Transport._combine_paths does this.  SFTPTransport._remote_path duplicates most
of the same logic (although with the added wrinkle of /~/), and I seem to recall
there were other places I've seen too.  I've personally had to fix the same bugs
over and over again because of this duplication already.  Extracting the
relevant logic into somewhere reusable (and independently testable!) seemed like
an obvious step.

-Andrew.