Friday evening handover, 1/6/2012
Dave Cheney
david.cheney at canonical.com
Sun Jun 3 22:41:42 UTC 2012
> This is actually my fault. We had already agreed a long time ago that
> any non-stable issues from ZooKeeper should result in visible errors,
> and we should fail, because there are edge cases that don't work so
> well in case of a connection reestablishment, and because we need to
> be able to handle harsh scenarios anyway.
>
> I'll fix gozk so it behaves in that way.
Cool.
>> 2. The timeout value passed to zookeeper.Dial() doesn't do anything,
>> I believe this is a bug.
>
> I believe it does, but it's being ignored. See the open function in
> state/open.go. The events in the session channel will tell that things
> are not so well.
Fair enough, I'll have to take your word for it. I set that value to 15e6, and half an hour later the PA was still sitting there waiting to connect.
> I actually think the correct thing to do is to take any unusual state
> as fatal, clean up state properly, and then reestablish our knowledge
> about the whole environment by synchronously stopping background
> activity, closing the state, and redialing in.
Yup, that should be fairly straightforward now: the Provisioner will exit with an error, and we can then close the old state connection, open a new one, build a NewProvisioner and try again.
> Note that you mentioned breaking out of the process and letting
> upstart restart it. This doesn't sound like a good approach because we
> lose the ability to control what is going on. Upstart will quickly
> respawn the service, and will stop after a limit of retries. If it
> retries too many times, it stops retrying. Even if we tweak that
> limit, that's still not great, because we don't want the service
> wildly spinning attempting to connect. Instead, we want to retry,
> forever, in a controlled manner. This is trivial to do from within
> process, in the area we're touching right now.
Understood.
>> 2. The value returned from the various watcher.Stop() should
>> not be of type error, but *state.Error, allowing us to interrogate
>> it. This is a larger change, but arguably more correct.
>
> My feeling is that we should avoid that kind of verification. I've
> dived into the ZooKeeper code before, and there are code paths I'm
> pretty sure can misbehave in edge cases of reconnections.
That is fine with me; I like the pattern of keeping errors opaque. If needed, we can still add a *state.Error later.
>> I would appreciate it if we could find some time as a group to
>> discuss this issue. I don't think it's an immediate problem because
>> the state never appears to break, but in the medium term ZK will
>> get a shim, and then get replaced, so it would be good to address
>> this issue ahead of time.
>
> I'm happy to debate about it again, but this is an immediate problem
> that we have to solve now, because it affects the reliability of the
> implementation in a significant manner.
I'm not looking to debate it, just looking for clarification on what zk will do in the case of errors, as I can't get it to trigger those errors today.
Cheers
Dave