[MERGE][bug #120697] Don't force http cache revalidation

Wed Nov 21 08:19:10 GMT 2007

>>>>> "john" == John Arbash Meinel <john at arbash-meinel.com> writes:

    john> Aaron Bentley wrote:
    >> Vincent Ladeuil wrote:
    aaron> It was needed because of broken proxies that would return stale data,
    aaron> causing breakage in fetch, unless we did that.
    >> 
    >>> So the bug report is wrong and we shouldn't fix that behavior or
    >>> not ?
    >> 

    aaron> To me, it's a tough call.  You get really strange
    aaron> errors when your proxy is returning stale data, so I'd
    aaron> be inclined to leave it on by default, and allow
    aaron> people to disable it.

By 'on' you mean bypassing the cache to always get fresh data
from the origin server ?

I agree with that.

    >> 
    >> Fortunately, packs don't reuse names, so breaking fetch is much less of
    >> a concern.
    >> 
    >> Aaron

    john> Also, I think we are better about making sure we fetch
    john> specific data.

I agree.

    john> Back when we used Weaves, we assumed that a referenced
    john> weave would contain everything we needed. So we just
    john> downloaded the whole file, and pushed it into the local
    john> repository (overtop of whatever we used to have).

    john> With knits, we are expecting a specific set of
    john> revisions. So if the .kndx doesn't have them (because
    john> it was being cached), then we would know that something
    john> was wrong. So we should abort, rather than copying
    john> broken data. (If the .knit data wasn't downloaded, we
    john> should likewise be aware that something was wrong and
    john> abort.)

That's how I understand it and that may enable us to work with
cached data.

    john> As I mentioned in the other email, what can happen is a
    john> branch won't look like it has been updated. But that is
    john> a relatively minor thing.

I can't find that other mail you're referring to.

    john> I see that we could do a few things.

    john> 1) Default to allowing caches to work transparently.

I find that too risky.

    john> 2) Allow users to add a BZR_NO_HTTP_CACHES=1, which
    john> will set the Pragma's to force a revalidation.

I'd prefer BZR_HTTP_USE_CACHE=1 to enable cache use.

    john> 3) Get extra fancy and supply the Pragma for files
    john> which are small and don't really need to be cached. For
    john> example .bzr/branch/last-revision. And maybe the .kndx
    john> files.

At the transport level ? By adding http specific parameters to
get et al. ? >-/

    john> So the .knit data would still be cached, but we would
    john> be sure to have the latest indexes. We would still need
    john> to be safe in the presence of incomplete data (we know
    john> that we should have bytes 1000-2000 but we only get
    john> 1000-1500 because the cache didn't see the rest.) But
    john> we need that anyway.

Ooook.

Well all in all, I provided that patch because it was trivial to
write and I was curious about why we chose that policy (thanks to
all for the responses so far, now I know).

But reworking our http caching policy (actual: don't cache)
toward a more smart policy seems too risky for no obvious
benefits (considering we try to address network uses by better
formats or hpss).

What are the use cases here ?

- When an initial branch fails we throw away the data collected
  so far.

Allowing cache use may help in that case if we can be sure that
no inconsistency will be introduced. But I don't think we can be
sure enough to take that risk.

- Too much data are transferred on subsequents pulls.

bzr MUST require less and less data. I'd prefer that we to
address that point without messing with cache. Without going into
special http server configurations (which most of our users may
not be able to do), I can't see any smart uses of the http
protocol (even 1.1) that will give us the guarantee that we get
up-to-date data (prove me wrong, you'd be welcome).

Martin said:

    martin> With knits and weaves we modify existing files.  If a
    martin> cache gives us older versions of some files, and
    martin> newer versions of others, we'll see the repository as
    martin> broken.

Good thing we *are* able to detect that !

    martin> However, I think it'd be better to ask the cache
    martin> 'must revalidate' rather than 'no cache'.

I think the following comment in _pycurl.py should ring a bell:

            # There's no way in http/1.0 to say "must
            # revalidate"; we don't want to force it to always
            # retrieve.  so just turn off the default Pragma
            # provided by Curl.

gannotate points to me but only for fun, the real author is
Martin in rev1540.3.11 :)

    martin> It would be dangerous to take this out entirely and
    martin> allow them to use the default age-based heuristic.
    martin> That will be more friendly in the case of the
    martin> original bug.

Well, at a minimum that will require us to check warnings in
cache responses to detect stale data but if we are trying to
address buggy proxies I still think it's to risky to trust
them...

The original bug report said:

,----
|  Bzr is slow enough already, bypassing proxy-caches is not
| good. One could, with lots of goodwill, argue the case for
| initial branching, but re-fetching more than a megabyte
| afterwards for each bzr pull (one in each branch) is stretching
| it.
`----

I think we'd better help the said 'goodwill' by making bzr
download less data when pulling (which we are already trying to
do) rather than exploring weird proxies for which we don't have
test infrastructure support nor knowledge (does somebody remember
which proxy exhibited such a behavior ?).

I attached the actual patch with a comment in launchpad saying:

,----
| Buggy proxies (returning stale data) have been encountered in
| the past.  If a cache gives us older versions of some files,
| and newer versions of others, we'll see the repository as
| broken.  Therefore, bzr explicitly bypass cache by using a
| 'no-cache' directive as Roland rightly deduced.
| 
| The attached patch disables that behavior. Use at your own risk.
`----

If any of you think that we should do more than that*, speak up.

Otherwise I'll set the bug to "Won't Fix" in a few days
explaining that we intend to address the root problem (network
unfriendliness) by other means.

      Vincent

*: For example, implementing the BZR_HTTP_USE_CACHE environment
 variable.