[PING][MERGE] Waiting on locks

Matthieu Moy Matthieu.Moy at imag.fr
Sat Sep 9 15:56:15 BST 2006


John Arbash Meinel <john at arbash-meinel.com> writes:

> David Allouche wrote:
>> Marius Kruger wrote:
>>> +1 for 5 min default timeout:
>>>
>>>     The cronscript use case is interesting, but I think it is rare enough
>>>     that the default should be to block if not forever, then at least for a
>>>     "very long time", like 6 hours.
>>>
>>> * cron scripts are definitely not isolated/rare use cases
>> 
>> I would be the last person to discount the importance of bzr behaving
>> right in automated systems. But in my experience, finding the little
>> trick (a config option for example) to make it behave as expected in
>> such context is acceptable, while it's not acceptable to deliver the
>> best out-of-the-box behaviour for interactive use.

I disagree: when writing a cron job, one would just use "bzr
whatever". No one will search for a somewhat hidden option until the
problem actually occurs. The cron job will run for days, and when a
stale lock appears, it will be very difficult to diagnose.

>> It is my opinion that retrying to take the lock for what is a "very long
>> time" in interactive use is the right out the box behaviour.

I agree, but six hours is more than a "very long time" from a user's
perspective. Maybe 5 minutes is too short, but letting a process try
and fail repeatedly for more than an hour without investigating the
cause of the problem reveals a problem with the user.

>>> * I'd rather let it timeout and maybe retry later 
>>>    (could be specified with commandline parameter like --max-retires=3)
>> 
>> When I said "timeout" I actually meant "time during we'll actively retry
>> taking the lock periodically", for example every 10 seconds.
>
> The current code was designed to sleep for 0.5s between attempts.
> Certainly if we decide to make the overall timeout variable, we could
> make the retry time variable too.

I'd say "double the delay at each retry, starting with ~0.5 second".

If you've been trying several times without success, it means the
server is overloaded, and there's no point in overloading it even
more.
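A minimal sketch of what I mean (this is illustrative, not bzrlib's
actual API; `try_acquire`, `total_timeout` and `initial_delay` are
hypothetical names):

```python
import time

def acquire_with_backoff(try_acquire, total_timeout=300.0, initial_delay=0.5):
    """Retry try_acquire() with exponentially increasing delays.

    try_acquire is a hypothetical callable returning True when the
    lock was obtained.  The delay starts at ~0.5s and doubles on
    each failed attempt, so a busy server sees fewer and fewer
    requests instead of one every half second.
    """
    deadline = time.time() + total_timeout
    delay = initial_delay
    while True:
        if try_acquire():
            return True
        if time.time() + delay > deadline:
            return False  # overall timeout exceeded: give up
        time.sleep(delay)
        delay *= 2  # double the delay at each retry
```

With a 5-minute total timeout this still makes the first few attempts
almost immediately, but only about ten attempts in total.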

Additionally, there might be some firewall QoS policy that blacklists
the client temporarily if it makes repeated attempts. For example, my
French lab's firewall does "if an IP connects more than 5 times in a
minute, block it for one minute (and any retry extends the block
period)". So, any sufficiently stupid brute-force attack is blocked
forever, but still, nmap can scan the machine (because it has a
strategy of not retrying too often when it sees it's being blocked).

Rest assured that as a sysadmin, if I saw a client sending one request
every half second for hours, I'd blacklist it immediately
(fortunately, I'm not a sysadmin ;-).

> It is flexible enough. But there are some api issues. (LockDir tries to
> look like a plain lock to the Branch/Repository, so that the old format
> code doesn't have to be special cased. and LockDir shouldn't manually
> reach out to read the lock timeout information from the config file).

If there's a way to know whether successive failures to obtain the
lock come from the same lock holder, it would be interesting to use
this information. It helps distinguish between a stale lock and an
overloaded server.
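Something along these lines, assuming we can peek at the current lock
holder's identity between attempts (`peek_holder` is a hypothetical
callable, e.g. returning the lock's nonce, or None if the lock is
free):

```python
import time

def classify_lock_failures(peek_holder, attempts=5, delay=1.0):
    """Peek at the current lock holder between failed attempts.

    If every attempt sees the same holder, the lock is likely stale;
    if the holder changes between attempts, other clients are getting
    work done and the server is merely contended.
    """
    holders = []
    for _ in range(attempts):
        holder = peek_holder()
        if holder is not None:
            holders.append(holder)
        time.sleep(delay)
    if not holders:
        return "free"
    return "possibly-stale" if len(set(holders)) == 1 else "contended"
```

In the stale case the right reaction is to report the lock to the
user; in the contended case, to keep backing off.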

-- 
Matthieu



