Friday evening handover, 1/6/2012

Fri Jun 1 08:54:29 UTC 2012

Hey Dave,

> More PA work today, I believe that the straw man PA is close to
> submission. Looking forward to your review feedback.
(...)
> 1. I can't make the state break, that is, the watchers to close through
> errors from GetW or ExistsW. The underlying C library happily blocks
> until something that responds on a ZK port arrives.

This is actually my fault. We had already agreed a long time ago that
any non-stable issues from ZooKeeper should result in visible errors,
and we should fail, because there are edge cases that don't work so
well in case of a connection reestablishment, and because we need to
be able to handle harsh scenarios anyway.

I'll fix gozk so it behaves in that way.

> 2. The timeout value passed to zookeeper.Dial() doesn't do anything,
> I believe this is a bug.

I believe it does, but it's being ignored. See the open function in
state/open.go. The events in the session channel will tell that things
are not so well.

> As discussed (discussion is occurring concurrently with this email) on
> irc, there are two possible solutions to this.
> 1. Adding something like state.IsDisconnected() so that after a watcher
> channel has been closed, before going around the loop again, we check
> this value (which I imagine would be set as part of the error path for
> GetW/ExistsW). Frank is on the fence about this.

I actually think the correct thing to do is to take any unusual state
as fatal, clean up state properly, and then reestablish our knowledge
about the whole environment by synchronously stopping background
activity, closing the state, and redialing in.

Note that you mentioned breaking out of the process and letting
upstart restarting. This doesn't sound like a good approach because we
lose the ability to control what is going on. Upstart will quickly
respawn the service, and will stop after a limit of retries. If it
retries too many times, it stops retrying. Even if we tweak that
limit, that's still not great, because we don't want the service
wildly spinning attempting to connect. Instead, we want to retry,
forever, in a controlled manner. This is trivial to do from within
process, in the area we're touching right now.

> 2. The value returned from the various watcher.Stop() should
> not be of type error, but *state.Error, allowing us to interrogate
> it. This is a larger change, but arguably more correct.

My feeling is that we should avoid that kind of verification. I've
dived into the ZooKeeper code before, and there are code paths I'm
pretty sure can misbehave in edge cases of reconnections.

> I would appreciate it if we could find some time as a group to
> discuss this issue. I don't think it's an immediate problem because
> the state never appears to break, but in the medium term ZK will
> get a shim, and then get replaced, so it would be good to address
> this issue ahead of time.

I'm happy to debate about it again, but this is an immediate problem
that we have to solve now, because it affects the reliability of the
implementation in a significant manner.

We have several issues today in the Python implementation that come
from the mindset of delaying handling important problems until a
latter moment, and then it becomes a serious, pervasive, and expensive
to solve issue. Let's not fall into the same mistake again.

gustavo @ http://niemeyer.net