Handling disconnects and session expirations
kapil.thangavelu at canonical.com
Wed Oct 19 18:09:45 UTC 2011
One of the items on getting juju ready for production is better
handling of clients becoming disconnected from zookeeper. On the plane
ride back from a sprint, I took some time to write up my
thoughts on the problem, and wrote some unit tests to validate the
assumptions. The support work for this is currently under way in the
Tread lightly, there be dragons here.
An application uses zookeeper by creating a tcp connection to a
zookeeper server. This connection is associated with a server side
session that is replicated throughout the zookeeper cluster.
Ephemeral nodes and watches established by the application
via this connection are associated with its session.
If the session is terminated, due to a connection loss longer than the
session timeout or an explicit close, the associated session resources
are expunged from the servers.
Internally the connection will maintain a heartbeat to the
zookeeper server to keep its session alive. A client associated
with multiple zookeeper servers will attempt to reconnect to
a different server if it can't keep a heartbeat to its current
server. Heartbeat pings are sent every 1/3 of a session timeout
(default session timeout is 10s, so ping is every ~3s).
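
As a quick check of the arithmetic above, a minimal sketch. The function name is hypothetical and not part of any zookeeper binding; it only restates the one-third rule described here.

```python
def ping_interval(session_timeout_s):
    """Heartbeat interval: one third of the negotiated session timeout."""
    return session_timeout_s / 3.0


# With the 10s default mentioned above, pings go out roughly every 3.3s.
print(round(ping_interval(10), 1))
```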
When a connection's session is expired, the connection is dead and all
extant watches are dead. They will never fire successfully. All
further operations on the connection will raise exceptions.
Session events will be generated on the connection for the
client connection as well as duplicated for any extant watches
associated with the client. A session event may be one of
'connecting', 'expiring', 'connected'.
A session event watcher on the connection will be invoked for
session events on the client connection. Extant watches also receive
Post the initial successful connection of the client, additional
session events of 'connecting' and 'connected' are non-fatal. These
events are sent as a result of detecting an abnormality in the client's
connection to the zookeeper server and its attempts to resolve it.
They effectively represent quality of service notifications about the
connection, to allow sophisticated applications to pause or enter a
quiescent period with respect to their connection activity till the connection
Only if the transient disconnect has lasted past the session timeout
will the lack of connectivity trigger a session expiration event. The
amount of time allowed for this is a negotiated value between both the
client and server based on their respective configurations. The
session expiration is only triggered on the client successfully
reconnecting to the zookeeper cluster.
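
The distinction between QOS events and a fatal expiration can be sketched as a small tracker. This is only an illustration under assumptions: the class, its attributes, and the string values of the state constants are placeholders for whatever the client binding actually reports.

```python
# Placeholder state names; a real binding exposes its own constants.
CONNECTING, CONNECTED, EXPIRED = "connecting", "connected", "expired"


class SessionTracker(object):
    """Tracks session events and decides whether the app may proceed."""

    def __init__(self):
        self.paused = False
        self.dead = False

    def on_session_event(self, state):
        if state == EXPIRED:
            # Fatal: the connection and all extant watches are dead.
            self.dead = True
            self.paused = True
        elif state == CONNECTING:
            # QOS notification: hold off on new operations while the
            # client library tries to re-establish its connection.
            self.paused = True
        elif state == CONNECTED and not self.dead:
            # The session survived the transient disconnect.
            self.paused = False
        if self.dead:
            return "fatal"
        return "paused" if self.paused else "ok"
```

An application would feed every session event into `on_session_event` and check `paused` before issuing operations; once `dead` is set, only a new connection helps.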
It's likely that attempting client operations during the period
demarcated by QOS events will result in a connection level error.
A connection error can happen for any operation. However a connection
error is not necessarily fatal, as it may merely represent a transient
connection loss. The server side session for the client may still be
active once the client reconnects. These reconnections are managed
transparently by the libzookeeper client library.
Many of the zk java libraries implement an automated retry option
here, but not all operations can be retried automatically at a library
level. Some require application level awareness of the usage pattern
and of whether the retry is idempotent.
In the case of an extant watch, a watch event delivery attempted by the
server while the client is transiently disconnected
will be resent when the client next connects, as long as its session
is still active.
If the QOS window is small enough, the QOS issue may never
be detected by the client heartbeat or surfaced to the application via a session
event, and the watch delivery will be lost. This is both hard to test
reliably, and rather hard to deal with in practice for applications.
Proper recovery of an expired session, or error handling while
performing an operation is both highly application specific, and
Due to the architectural considerations of the twisted framework, the
txzookeeper client library implements watches as deferreds, which is
the framework's primary mechanism of representing an asynchronous
result. Deferreds are only capable of realizing a single
result/error. As a result session events are not sent to extant
watches but are instead shunted to the client's session callback
handler if set.
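
The single-result constraint can be illustrated without twisted. The sketch below is an assumption-laden stand-in, not txzookeeper code: `OneShot` mimics a deferred that can fire only once, and the hypothetical `Dispatcher` shows why session events go to a session callback rather than to the watch.

```python
class OneShot(object):
    """Stand-in for a deferred: it can realize exactly one result."""

    def __init__(self):
        self.fired = False
        self.result = None

    def fire(self, result):
        if self.fired:
            raise RuntimeError("already fired")
        self.fired = True
        self.result = result


class Dispatcher(object):
    """Routes node events to one-shot watches, session events elsewhere."""

    def __init__(self, session_callback=None):
        self.session_callback = session_callback
        self.watches = {}  # path -> OneShot

    def watch(self, path):
        self.watches[path] = OneShot()
        return self.watches[path]

    def dispatch(self, kind, path=None, payload=None):
        if kind == "session":
            # Firing the watch here would consume its only result, so the
            # event goes to the session callback instead (if one is set).
            if self.session_callback is not None:
                self.session_callback(payload)
            return
        watch = self.watches.pop(path, None)
        if watch is not None:
            watch.fire(payload)
```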
All api calls on the client also utilize deferreds to represent
results (only the async api of libzookeeper is exposed). For
connection level errors the default behavior is to trigger an error on
the corresponding api deferred. If a connection error handler is
defined, it will be invoked and its result or error passed on to the
The 0.8.0 release of txzookeeper features basic session event and
connection error callback facilities on the ZookeeperClient class.
There is some pending work to be done, in light of completing juju support
for handling session expiration and connection errors.
- Session expired events should pass through to their watchers as errors,
this will enable the watch deferreds to be cleared out of the event loop.
We have to be careful here that the callers are aware enough to not
just catch exceptions blindly and retry another operation. The
connection is dead after a session expiration event, and further
operations will just cause a connection exception.
- Operation serialization to allow for application driven retry.
Given that any operation anywhere in the application can fail
arbitrarily with a transient connection error, we should have some
closure or op serialization that annotates connection exceptions and is
passed to the connection error handler, to allow the application to
retry the operation.
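
One way the op serialization item could look, as a rough sketch only. Everything here is hypothetical (the `ConnectionLoss` exception, `AnnotatingClient`, and the synchronous call style); the real client returns deferreds and has its own exception types.

```python
import functools


class ConnectionLoss(Exception):
    """Hypothetical stand-in for a transient connection level error."""
    operation = None  # filled in with a zero-argument retry callable


class AnnotatingClient(object):
    """Wraps a client and attaches the failed operation to the error."""

    def __init__(self, client, error_handler):
        self._client = client
        self._error_handler = error_handler

    def call(self, name, *args, **kwargs):
        op = functools.partial(getattr(self._client, name), *args, **kwargs)
        try:
            return op()
        except ConnectionLoss as error:
            # Serialize the failed operation onto the exception so the
            # application's handler can decide whether to re-run it.
            error.operation = op
            return self._error_handler(error)


class FlakyClient(object):
    """Fails the first call, then succeeds; only for demonstration."""

    def __init__(self):
        self.calls = 0

    def get(self, path):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionLoss(path)
        return "data at %s" % path


def retry_once(error):
    # An application handler that simply replays the annotated operation.
    return error.operation()


client = AnnotatingClient(FlakyClient(), retry_once)
print(client.call("get", "/topology"))
```

The handler receives enough context to replay the call, but the decision to do so stays with the application.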
For connection level errors, there are two paradigms to application
retry, global/automated at the connection level and local to the api
Global requires careful monitoring of zk usage to verify compatibility.
At the application level we don't currently have any uses that
appear problematic for a transparent retry on connection error.
Regarding modifications made by the system and retry.
- Our persistent sequence node creation for state objects
wouldn't update the topology, our centralized node index.
- We don't employ concurrent creation of the same node by multiple clients.
- Our concurrent modifications are always in the context of a retry_change
function which will do a best effort to merge the change to current state
or fail with a concurrency error (StateChanged).
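
For readers unfamiliar with the pattern, here is a minimal sketch of a read, change, conditional-write loop in the spirit of retry_change. It is not juju's implementation: the in-memory `FakeStore`, the `BadVersion` exception, and the attempt limit are all assumptions made for illustration.

```python
class StateChanged(Exception):
    """Raised when the change cannot be merged after repeated conflicts."""


class BadVersion(Exception):
    """Hypothetical stand-in for a failed conditional (versioned) write."""


class FakeStore(object):
    """In-memory node store with per-node versions, for illustration only."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def get(self, path):
        return self.nodes[path]

    def set(self, path, data, version):
        _, current = self.nodes[path]
        if version != current:
            raise BadVersion(path)
        self.nodes[path] = (data, current + 1)


def retry_change_sketch(store, path, change, max_attempts=3):
    """Read, apply change(data), and write back only if nobody else wrote."""
    for _ in range(max_attempts):
        data, version = store.get(path)
        new_data = change(data)
        try:
            store.set(path, new_data, version)
            return new_data
        except BadVersion:
            continue  # someone else wrote first; re-read and re-apply
    raise StateChanged(path)
```

Because each attempt starts from the current state, re-running the whole function after a connection error does not blindly overwrite another client's write.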
The juju default error handler would therefore be able to transparently retry
effectively at the client level. *But* we have to be *very* careful on auto retry
regarding future zk usages and interactions.
As Gustavo has advocated, a better approach than a client level retry
handler here is to handle errors locally at api call sites. The global approach
requires careful consideration of the entire application usage pattern. Local
placement of retry pattern abstractions scales much better with application
changes in usage, i.e. even if a global approach is usable atm, it might not
always be so.
The downside to local usage is that it potentially requires
significant rework. But a local retry abstraction could be as simple
as a retry client facade exposing the same api, and changing call
sites appropriately during an audit, with local call sites free to use
either api as needed.
If the retry fails over several runs (effectively longer than a
session timeout), the retry stops, and the error passes through. The app
will get a session expiration event/exception.
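
A retry facade along these lines might look like the following sketch. It is synchronous for brevity, whereas the real client api returns deferreds, and the `ConnectionLoss` exception, the `RetryClient` name, and the fixed delay are all assumptions rather than txzookeeper features.

```python
import time


class ConnectionLoss(Exception):
    """Hypothetical stand-in for a transient connection level error."""


class RetryClient(object):
    """Facade exposing the wrapped client's api, retrying connection errors."""

    def __init__(self, client, session_timeout, delay=0.5,
                 clock=time.time, sleep=time.sleep):
        self._client = client
        self._session_timeout = session_timeout
        self._delay = delay
        self._clock = clock
        self._sleep = sleep

    def __getattr__(self, name):
        # Only reached for names not defined on the facade itself, so every
        # api method of the wrapped client is picked up automatically.
        method = getattr(self._client, name)

        def retrying(*args, **kwargs):
            deadline = self._clock() + self._session_timeout
            while True:
                try:
                    return method(*args, **kwargs)
                except ConnectionLoss:
                    if self._clock() >= deadline:
                        raise  # past the session timeout: let it through
                    self._sleep(self._delay)

        return retrying
```

Call sites that are safe to retry would use `RetryClient(client, session_timeout).get(path)`, while sites that are not keep calling the plain client. The facade assumes every attribute it proxies is a method.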
For session expiration events, juju agents should just suicide, and
let upstart respawn them. There is some transient state in the unit
agent we need to capture to disk or maybe zk for such a
scenario. Alternatively we could perform an in memory restart,
stopping current activity, closing their connection, opening a new
connection, and proceeding to process current state and reestablish
watches. The downside is the current stop procedures for unit agents
have some zk writes associated to them. I'm leaning towards the suicide
As an added wrinkle for in-memory restarts, session expiration can
manifest as both a session event and as a connection error, i.e. if the
app uses the existing client api and receives an expired session, it
will also potentially receive a connection level error
(SessionExpiredException) from client usage. This can be in addition
to the session expiration event or by itself. In practice I haven't
seen the session event if the connection exception triggers
first. The session events are driven by the heartbeat clock, the
connection level error by api usage.
Reliably testing different scenarios is a mess without mocking,
because of the timing dependencies (heartbeat, session expiration) for
different scenarios. I ended up using a tcp proxy to get better
control to validate the assumptions herein.
Nice blog post on various error handling scenarios (from a java perspective).
Zk docs on error handling