Juju/ZK watch re-establishment in presence of errors

Wed May 15 17:54:11 UTC 2013

Hi Torin,

On Thu, May 9, 2013 at 2:06 PM, Torin Sandall <torinsandall at gmail.com> wrote:

> Hi All,
>
> I'm wondering if one of the developers familiar with Juju's use of ZK
> watches can clarify something for me.
>
> I've been running into some problems with Juju machine and unit agents
> involving ZK-related errors (connection loss, operation timeout, etc.)
>
>
I went ahead and fixed up and merged the txzk branch that deals with
connection errors and backoff. It's available from the juju PPA
(ppa:juju/pkgs).

>
>
> One observation I have is that the Juju implementation makes no attempt to
> re-establish watches if there's an error while processing an event.
>

The watch handler should handle any exceptional conditions with respect to
the data it's consuming; connectivity issues are the exception, and those
are handled by txzk. If the watch handlers don't handle data appropriately,
that's a bug.
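
To make the division of responsibility concrete, here is a minimal sketch of
a handler that deals with bad data itself instead of letting the error
escape. It is not code from Juju or txzookeeper: the names parse_unit_state,
on_node_changed, and apply_state, and the JSON node layout with a "state"
key, are all assumptions made up for illustration.

```python
import json
import logging

log = logging.getLogger("watch-sketch")


def parse_unit_state(raw):
    """Decode node content (hypothetical JSON layout); raise ValueError if bad."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or "state" not in data:
        raise ValueError("missing 'state' key")
    return data["state"]


def on_node_changed(raw, apply_state):
    """Hypothetical watch handler that never lets a data error propagate.

    Returns True if the state was applied, False if the content was rejected.
    """
    try:
        state = parse_unit_state(raw)
    except ValueError as exc:
        # Bad data is the handler's problem: log it and wait for the next event.
        log.warning("ignoring malformed node content: %s", exc)
        return False
    apply_state(state)
    return True
```

Connection loss and timeouts are deliberately not caught here; in this
sketch those are left to the layer underneath, which is the role txzk plays
in the description above.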

> In fact some of the low level functions which register the watches even
> remark on this in their docstrings. The problem with this approach is that
> a single failed ZK call can render agents inoperable. I found the
> retry-backoff and related txzookeeper branch however I want to know if
> re-establishing the watches in case of errors is valid or not.
>

The watch re-establishment in the retry-backoff branch (merged today)
addresses the connectivity issues, and on reconnect it triggers the watch
callback/handler. The handlers themselves are set up to refetch current
state and re-evaluate it against their known state, so the watch
re-establishment on reconnect amounts to them firing again, which catches
the agents up with current state (be it zero delta or some significant
delta).
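
The refetch-and-compare pattern described above can be sketched roughly as
follows. This is an illustration only, not the Juju implementation:
MembershipWatcher, fetch_children, on_added, and on_removed are hypothetical
names, and the ZooKeeper read is replaced by a plain callable so the example
runs on its own.

```python
class MembershipWatcher:
    """Hypothetical handler that diffs freshly fetched state against known state."""

    def __init__(self, fetch_children, on_added, on_removed):
        self._fetch = fetch_children  # callable returning the current child names
        self._on_added = on_added
        self._on_removed = on_removed
        self._known = set()

    def fire(self):
        """Run on every watch event, and again after a reconnect."""
        current = set(self._fetch())
        added = current - self._known
        removed = self._known - current
        for name in sorted(added):
            self._on_added(name)
        for name in sorted(removed):
            self._on_removed(name)
        # Remember what we saw so the next firing only reports the delta.
        self._known = current
        return added, removed
```

Because fire() always starts from a fresh read, calling it an extra time
after a reconnect is harmless: if nothing changed while the connection was
down, both sets come back empty, and if a lot changed, the handler sees the
whole difference in one pass.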

hope that helps,

Kapil