Agent API Versioning and Upgrades

Wed Sep 17 09:15:48 UTC 2014

Hi all,

TL;DR When we introduce a new agent API facade version for a worker and
the change requires an upgrade step (schema changes, data migration), I
propose that we do not keep the old worker code (which uses only the
older API facade version) once the upgrade step is in place. In other
words, we should trust our upgrade logic to do the right thing.

While doing the port ranges work I encountered a common problem, which
I'll try to explain below.

As our agents and workers evolve, we introduce new agent API facade
versions to support the changes. Unlike the client API, for which we
have to support every version since the release of trusty, older agent
API versions can be deprecated sooner. Why? Because upgrades are
synchronized: they run before any worker has a chance to connect to
the API server and request a specific facade version.
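To make the versioning mechanics concrete, here is a minimal Go sketch of an apiserver-side registry keyed by facade name and version. The type names mirror the firewaller facade from the review discussed in the next paragraph, but the method names, the `facadeRegistry` type and its `register`/`get` helpers are hypothetical and do not reflect juju-core's actual facade registration code. The sketch only illustrates why dropping an old version is safe once nothing can still request it.

```go
package main

import "fmt"

// FirewallerAPIBase holds code shared by every Firewaller facade version.
// Names mirror the email; the methods below are made up for illustration.
type FirewallerAPIBase struct{}

func (FirewallerAPIBase) WatchMachines() string { return "watching machines" }

// FirewallerAPIV0 embeds the base and adds the (hypothetical) unit-based port API.
type FirewallerAPIV0 struct{ FirewallerAPIBase }

func (FirewallerAPIV0) WatchUnitPorts() string { return "watching unit ports" }

// FirewallerAPIV1 embeds the base and adds the (hypothetical) machine port range API.
type FirewallerAPIV1 struct{ FirewallerAPIBase }

func (FirewallerAPIV1) WatchOpenedPorts() string { return "watching opened port ranges" }

// facadeRegistry maps a facade name and version to a constructor (hypothetical).
type facadeRegistry map[string]map[int]func() interface{}

func (r facadeRegistry) register(name string, version int, factory func() interface{}) {
	if r[name] == nil {
		r[name] = make(map[int]func() interface{})
	}
	r[name][version] = factory
}

// get returns the requested facade version, or an error when this
// apiserver does not (or no longer) provide it.
func (r facadeRegistry) get(name string, version int) (interface{}, error) {
	factory, ok := r[name][version]
	if !ok {
		return nil, fmt.Errorf("unknown facade %s version %d", name, version)
	}
	return factory(), nil
}

func main() {
	r := facadeRegistry{}
	r.register("Firewaller", 0, func() interface{} { return FirewallerAPIV0{} })
	r.register("Firewaller", 1, func() interface{} { return FirewallerAPIV1{} })

	f, err := r.get("Firewaller", 1)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(f.(FirewallerAPIV1).WatchOpenedPorts())

	// Dropping V0 is safe once an upgrade step guarantees no worker still asks for it.
	delete(r["Firewaller"], 0)
	if _, err := r.get("Firewaller", 0); err != nil {
		fmt.Println(err)
	}
}
```

A worker asking for a version that has been removed gets an error instead of a facade; the synchronized upgrade is what ensures no worker is left asking for it.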

I'm specifically talking about http://reviews.vapour.ws/r/33/, which
introduces a new FirewallerAPIV1 and refactors the existing code into
FirewallerAPIBase (embedded to share common code between V0 and V1)
and FirewallerAPIV0. On the apiserver side there are separate test
suites for V0 and V1, plus a "base" suite with tests common to both
versions. V1 is needed because, once it lands and the worker code is
changed, the firewaller worker will watch opened port ranges on
machines rather than units. This requires schema changes (a new
openedPorts collection) and a migration of existing data (moving the
individual opened ports from each unit document to the new collection,
as port ranges on the unit's assigned machine), which can be
implemented as an upgrade step. Once the upgrade step is in place, the
way upgrades currently work (in both HA and non-HA scenarios)
guarantees that:
 1. The (new) worker using APIV1 will only start after the upgrades
are done (on all state servers).
 2. If not all state servers have finished upgrading shortly after the
upgrade, the worker may connect to an apiserver which does not yet
support APIV1. In that case the worker requests version 1, does not
get it and terminates, which triggers a restart; it will hopefully
connect to another apiserver which supports APIV1 (or keep restarting
until it does, which should take a relatively short time).
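The data migration described above could look roughly like the following in-memory Go sketch. The `unitDoc` and `portRange` types and the `migrateOpenedPorts` function are hypothetical simplifications: a real upgrade step would operate on the actual state collections (and would carry each port's protocol), and this sketch turns each individual port into a single-port range without merging adjacent ports.

```go
package main

import (
	"fmt"
	"sort"
)

// unitDoc is a made-up, simplified unit document: individual opened
// ports live on the unit, as they do before the upgrade.
type unitDoc struct {
	Name      string
	MachineID string
	Ports     []int
}

// portRange is an inclusive range of ports opened by one unit
// (hypothetical shape of an openedPorts entry).
type portRange struct {
	UnitName string
	From, To int
}

// migrateOpenedPorts moves each unit's opened ports into per-machine
// port ranges (the new openedPorts collection, modelled as a map keyed
// by machine ID) and clears them from the unit documents.
func migrateOpenedPorts(units []*unitDoc) map[string][]portRange {
	opened := make(map[string][]portRange)
	for _, u := range units {
		ports := append([]int(nil), u.Ports...) // copy before sorting
		sort.Ints(ports)
		for _, p := range ports {
			// Each single port becomes a one-port range on the unit's machine.
			opened[u.MachineID] = append(opened[u.MachineID], portRange{u.Name, p, p})
		}
		u.Ports = nil // ports now live only in the new collection
	}
	return opened
}

func main() {
	units := []*unitDoc{
		{Name: "wordpress/0", MachineID: "1", Ports: []int{443, 80}},
		{Name: "mysql/0", MachineID: "2", Ports: []int{3306}},
	}
	opened := migrateOpenedPorts(units)
	fmt.Println(opened["1"], opened["2"], units[0].Ports == nil)
}
```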

So, once there's an upgrade step in place, I see no point in keeping
the older worker code (using APIV0 only) around. The only reason to
keep it would be as a fallback for when the worker connects to an
apiserver which does not support APIV1, since the new worker code only
works with APIV1.

If we trust that our upgrade process works correctly (and file bugs
when it doesn't), then after the upgrade we WILL most likely be able to
connect to an upgraded apiserver supporting V1. If not, the worker
will bounce a few times until it does.

I hope all that makes sense; comments are appreciated.

Cheers,
-- 
Dimiter Naydenov <dimiter.naydenov at canonical.com>
juju-core team