Friday evening handover, 1/6/2012

Sun Jun 3 22:41:42 UTC 2012

> This is actually my fault. We had already agreed a long time ago that
> any non-stable issues from ZooKeeper should result in visible errors,
> and we should fail, because there are edge cases that don't work so
> well in case of a connection reestablishment, and because we need to
> be able to handle harsh scenarios anyway.
> 
> I'll fix gozk so it behaves in that way.

Cool. 

>> 2. The timeout value passed to zookeeper.Dial() doesn't do anything,
>> I believe this is a bug.
> 
> I believe it does, but it's being ignored. See the open function in
> state/open.go. The events in the session channel will tell that things
> are not so well.

Fair enough, I'll have to take your word for it. I tried setting that value to 15e6, and half an hour later the PA was still sitting there waiting to connect.

> I actually think the correct thing to do is to take any unusual state
> as fatal, clean up state properly, and then reestablish our knowledge
> about the whole environment by synchronously stopping background
> activity, closing the state, and redialing in.

Yup, that should be fairly straightforward now. The Provisioner will exit with an error; we can then close the old state connection, open a new one, build a new Provisioner (NewProvisioner), and try again.
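
Roughly, I picture the close/reopen/rebuild cycle looking like the sketch below. This is only an illustration: the conn and worker interfaces, runOnce, supervise and the fake types are hypothetical names I made up for the example, not the real state or Provisioner API, and I'm assuming a nil error from the worker means a clean shutdown.

```go
package main

import (
	"errors"
	"fmt"
)

// conn is a hypothetical stand-in for the state connection.
type conn interface {
	Close() error
}

// worker is a hypothetical stand-in for the Provisioner: Wait blocks
// until the worker stops and returns the error that stopped it.
type worker interface {
	Wait() error
}

// errSession is an illustrative error used by the fakes below.
var errSession = errors.New("session expired")

// runOnce opens a fresh connection, runs one worker on it until the
// worker stops, and always closes the connection before returning.
func runOnce(open func() (conn, error), start func(conn) (worker, error)) error {
	c, err := open()
	if err != nil {
		return err
	}
	defer c.Close()
	w, err := start(c)
	if err != nil {
		return err
	}
	return w.Wait()
}

// supervise treats every error as fatal for the current connection and
// starts over with a new one. It returns the number of restarts once a
// worker stops without error. pause is called between attempts.
func supervise(open func() (conn, error), start func(conn) (worker, error), pause func()) int {
	restarts := 0
	for {
		if err := runOnce(open, start); err == nil {
			return restarts
		}
		restarts++
		pause()
	}
}

// fakeConn counts how many times Close is called.
type fakeConn struct{ closed *int }

func (c fakeConn) Close() error { *c.closed++; return nil }

// fakeWorker stops immediately with a fixed error.
type fakeWorker struct{ err error }

func (w fakeWorker) Wait() error { return w.err }

func main() {
	opened, closed := 0, 0
	open := func() (conn, error) {
		opened++
		return fakeConn{&closed}, nil
	}
	// The first two workers fail; the third stops cleanly.
	start := func(_ conn) (worker, error) {
		if opened < 3 {
			return fakeWorker{errSession}, nil
		}
		return fakeWorker{nil}, nil
	}
	restarts := supervise(open, start, func() {})
	fmt.Println(restarts, opened, closed)
}
```

The point of the shape is that nothing tries to interpret the error: any failure throws away the old connection and everything built on it.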

> Note that you mentioned breaking out of the process and letting
> upstart restarting. This doesn't sound like a good approach because we
> lose the ability to control what is going on. Upstart will quickly
> respawn the service, and will stop after a limit of retries. If it
> retries too many times, it stops retrying. Even if we tweak that
> limit, that's still not great, because we don't want the service
> wildly spinning attempting to connect. Instead, we want to retry,
> forever, in a controlled manner. This is trivial to do from within
> process, in the area we're touching right now.

Understood.
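
For the "retry forever, in a controlled manner" part, one option is a capped, doubling delay between attempts. The following is a minimal sketch under my own assumptions: nextDelay, retryForever, simulate and the 1s/10s bounds are hypothetical and not taken from the existing code, and the doubling policy is just one way to keep the loop from spinning.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDial is an illustrative error for a failed connection attempt.
var errDial = errors.New("cannot connect")

// nextDelay doubles the previous delay, starting at lo and never
// exceeding hi (assumes lo <= hi).
func nextDelay(prev, lo, hi time.Duration) time.Duration {
	if prev < lo {
		return lo
	}
	if prev > hi/2 {
		return hi
	}
	return prev * 2
}

// retryForever calls attempt until it succeeds, sleeping between
// failures. sleep is a parameter so the loop can be exercised without
// waiting; production code would pass time.Sleep.
func retryForever(attempt func() error, sleep func(time.Duration), lo, hi time.Duration) {
	var delay time.Duration
	for attempt() != nil {
		delay = nextDelay(delay, lo, hi)
		sleep(delay)
	}
}

// simulate runs retryForever against an attempt that fails the given
// number of times, and returns the delays it would have slept for.
func simulate(failures int, lo, hi time.Duration) []time.Duration {
	var delays []time.Duration
	attempt := func() error {
		if failures > 0 {
			failures--
			return errDial
		}
		return nil
	}
	record := func(d time.Duration) { delays = append(delays, d) }
	retryForever(attempt, record, lo, hi)
	return delays
}

func main() {
	// Five failed attempts, then success.
	fmt.Println(simulate(5, time.Second, 10*time.Second))
}
```

With those bounds, five failures in a row would wait 1s, 2s, 4s, 8s and then 10s, and every later failure would wait 10s.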

>> 2. The value returned from the various watcher.Stop() should
>> not be of type error, but *state.Error, allowing us to interrogate
>> it. This is a larger change, but arguably more correct.
> 
> My feeling is that we should avoid that kind of verification. I've
> dived into the ZooKeeper code before, and there are code paths I'm
> pretty sure can misbehave in edge cases of reconnections.

That is fine with me; I like the pattern of keeping errors opaque. If needed, we can still add a *state.Error later down the track.

>> I would appreciate it if we could find some time as a group to
>> discuss this issue. I don't think it's an immediate problem because
>> the state never appears to break, but in the medium term ZK will
>> get a shim, and then get replaced, so it would be good to address
>> this issue ahead of time.
> 
> I'm happy to debate about it again, but this is an immediate problem
> that we have to solve now, because it affects the reliability of the
> implementation in a significant manner.

I'm not looking to debate it. I'm just looking for clarification on what zk will do in the case of errors, as I can't get it to trigger those errors today.

Cheers

Dave