CI hates juju, maybe the feelings are mutual

John Meinel john at arbash-meinel.com
Thu Apr 10 12:23:29 UTC 2014


Note that copying off the all-machines log seems to be broken. The command and its output were:

juju --show-log scp -e test-release-aws -- \
    -o 'StrictHostKeyChecking no' -o 'UserKnownHostsFile /dev/null' \
    -i /var/lib/jenkins/cloud-city/staging-juju-rsa \
    0:/var/log/juju/all-machines.log \
    /var/lib/jenkins/jobs/aws-upgrade/workspace/artifacts/all-machines-test-release-aws.log
2014-04-10 03:40:59 INFO juju.cmd supercommand.go:297 running juju-1.18.0-precise-amd64 [gc]
2014-04-10 03:40:59 INFO juju api.go:238 connecting to API addresses: [ec2-54-86-4-94.compute-1.amazonaws.com:17070]
2014-04-10 03:40:59 INFO juju apiclient.go:114 state/api: dialing "wss://ec2-54-86-4-94.compute-1.amazonaws.com:17070/"
2014-04-10 03:40:59 INFO juju apiclient.go:124 state/api: connection established
2014-04-10 03:40:59 ERROR juju.cmd supercommand.go:300 unexpected argument "-o"; extra arguments must be last


It looks like you have to spell it:

juju --show-log scp -e test-release-aws \
    0:/var/log/juju/all-machines.log \
    /var/lib/jenkins/jobs/aws-upgrade/workspace/artifacts/all-machines-test-release-aws.log \
    -o 'StrictHostKeyChecking no' -o 'UserKnownHostsFile /dev/null' \
    -i /var/lib/jenkins/cloud-city/staging-juju-rsa

I don't know if "--" for SCP ever worked, but it appears 1.18 wants a
different spelling.
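
In other words, a minimal sketch of the ordering 1.18 appears to accept
(the option values are the ones from the command above; the angle-bracket
placeholders are mine): positional arguments first, then the extra ssh
options at the end, with no "--" separator.

juju scp -e <environment> <remote-source> <local-target> \
    -o 'StrictHostKeyChecking no' -o 'UserKnownHostsFile /dev/null' \
    -i <identity-file>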

John
=:->



On Thu, Apr 10, 2014 at 3:40 PM, John Meinel <john at arbash-meinel.com> wrote:

> So I did a 1.18.0 to 1.18.1 test here (which should be the r2264 failure
> that you're seeing; I'm using r2266).
>
> I did see it upgrade successfully, but it took a surprisingly long time
> for all the unit agents to come up. The relevant logs (from my point of
> view) are:
> http://paste.ubuntu.com/7230339/
>
> So you can see that at 11:05:01 all of the machine agents notice there is
> an upgrade to be performed. At 11:05:01 machine-1 completes the upgrade
> first and calls SetTools(1.18.1.1).
>
> 11:05:05 unit-mysql-0 is informed that it should upgrade (since machine-1,
> where mysql is, has upgraded)
> 11:05:05 machine-0 has restarted with updated tools, and that bounced the
> other machines, so -1 and -2 also report that they are running 1.18.1.1
> 11:05:07 unit-wordpress-0 is told that it should upgrade
> 11:05:09 unit-wordpress-0 has upgraded and is told that it is now on the
> correct version
> 11:05:38 unit-mysql-0 finally has upgraded.
>
> I have no idea why all of the agents upgraded in <10s while unit-mysql-0
> needed 30s to do the same thing. Perhaps it was running a hook when it was
> first told to upgrade, and that prevented it from restarting at an
> appropriate time.
>
> Note that the mysql charm still has a bug when run in the local provider,
> where it wants to allocate something like 15GB of buffer space, so I had
> to manually hack that back down to 10MB before it would start in the
> first place. I suppose the charm is asking something like "how much
> memory *could* I get?", which is 16GB on my machine.
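>
> Purely as an illustration (not the charm's actual code, and the percentage
> is made up), the kind of sizing logic that would produce ~15GB on a 16GB
> host looks roughly like this:
>
> total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
> # ~90% of a 16GB host is roughly the 15GB seen above, which a local
> # provider (LXC) container cannot actually deliver
> buffer_pool_kb=$((total_kb * 90 / 100))
> echo "innodb_buffer_pool_size = ${buffer_pool_kb}K"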
>
> Anyway, the upgrade worked, but it took an awfully long time.
> I do see this in the unit-mysql-0 log file:
>
> http://paste.ubuntu.com/7230369/
>
> That shows it sees the need for an upgrade, but at almost exactly the
> same time it sees "upgrade needed" it also starts running the
> relation-changed hook, which seems to be trying to stop mysql and has
> these lines:
> 2014-04-10 11:05:08 DEBUG worker.uniter.jujuc server.go:104 hook context
> id "mysql/0:config-changed:6033929919748807197"; dir
> "/var/lib/juju/agents/unit-mysql-0/charm"
> 2014-04-10 11:05:08 INFO juju-log Restart failed, trying again
> 2014-04-10 11:05:08 INFO config-changed stop: Job has already been
> stopped: mysql
> 2014-04-10 11:05:38 INFO config-changed mysql start/running
> 2014-04-10 11:05:38 INFO juju.worker.uniter uniter.go:483 ran
> "config-changed" hook
> 2014-04-10 11:05:38 INFO juju.worker.uniter uniter.go:494 committing
> "config-changed" hook
> 2014-04-10 11:05:38 INFO juju.worker.uniter uniter.go:509 committed
> "config-changed" hook
> 2014-04-10 11:05:38 DEBUG juju.worker.uniter modes.go:394
> ModeConfigChanged exiting
> 2014-04-10 11:05:38 INFO juju.worker.uniter uniter.go:140 unit "mysql/0"
> shutting down: tomb: dying
>
> That definitely makes it seem like when one worker upgrades itself it
> causes the other worker to run a relation-changed hook, which may block
> that worker from doing its own upgrade until the charm thinks things are
> ready. And restarting mysql seems to take 30s (or takes 30s to time out?).
>
> So is it possible that we just aren't waiting long enough? It definitely
> looks like you're doing a lot of waiting in:
>
> http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/job/aws-upgrade/1074/console
> (I see 682 lines of "1.18.1: 0"), and a timeout of 5 minutes.
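>
> To make the "are we waiting long enough" question concrete, a rough
> sketch of a longer wait loop (this assumes juju 1.x status output has
> per-agent "agent-version:" fields; the environment name and the
> 15-minute deadline are just placeholders):
>
> target=1.18.1.1
> deadline=$((SECONDS + 900))   # 15 minutes rather than 5
> # keep polling while any agent still reports a version other than the target
> while juju status -e test-release-aws | grep 'agent-version:' | grep -vq "$target"; do
>     [ "$SECONDS" -ge "$deadline" ] && { echo "upgrade timed out"; exit 1; }
>     sleep 10
> done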
>
> I guess I'm at a loss. I did see it take a long time to upgrade, but it
> did succeed.
> John
> =:->
>
>
>
> On Thu, Apr 10, 2014 at 1:41 PM, John Meinel <john at arbash-meinel.com> wrote:
>
>> Well you used to be able to request a downgrade, but it never actually
>> worked... :)
>> And with the new upgrade steps, we explicitly don't implement the 'back
>> out these changes' logic, which is why things were breaking
>> some-of-the-time on upgrade. I'm not sure what I broke, but it is possible
>> I changed it from "some-of-the-time" to "all-of-the-time" by some perverse
>> logic. I'm pretty sure I tested it locally.
>>
>> I did try to upgrade a WP+MySQL environment from 1.16.6 to
>> lp:juju-core/1.18 (2266). It failed, but because of the WP charm's
>> config-changed hook. This was the error:
>> 2014-04-10 09:37:13 INFO config-changed E: Could not get lock
>> /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
>> 2014-04-10 09:37:13 INFO config-changed E: Unable to lock the
>> administration directory (/var/lib/dpkg/), is another process using it?
>> 2014-04-10 09:37:13 ERROR juju.worker.uniter uniter.go:475 hook failed:
>> exit status 100
>>
>> After doing "juju resolved --retry wordpress/0" everything was happy.
>>
>> I wonder if there is something in 1.18 that is causing apt commands to
>> run after upgrade, and that ends up racing with the config-changed hook.
>>
>> Are we careful to take out the FSLock when doing Apt commands from
>> upgrade?
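>>
>> Purely as an illustration of the kind of serialization I mean (the lock
>> path is made up, and I'm not claiming this is what juju does today): both
>> the upgrade steps and anything a hook runs would have to funnel their
>> apt/dpkg work through the same lock to avoid the dpkg lock error above.
>>
>> APT_LOCK=/var/lib/juju/apt-lock
>> flock "$APT_LOCK" apt-get update
>> flock "$APT_LOCK" apt-get install -y mysql-server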
>>
>> John
>> =:->
>>
>>
>>
>>
>> On Thu, Apr 10, 2014 at 8:16 AM, Curtis Hovey-Canonical
>> <curtis at canonical.com> wrote:
>>
>>> I am exhausted, so I am sending out the barest summary of the Juju-CI
>>> problems I see.
>>>
>>> http://ec2-54-84-137-170.compute-1.amazonaws.com:8080/
>>> ^ In general, if the test doesn't end in -devel or start with walk-,
>>> we require the test to pass.
>>>
>>> lp:juju-core/trunk r2593 could not upgrade because of an exception.
>>> Note that this rev precedes the rev that makes it impossible to
>>> downgrade. I see new revisions queued; maybe trunk will pass while I
>>> sleep.
>>>
>>> lp:juju-core/1.18 r2264 and subsequent revs could not upgrade hp,
>>> joyent, azure, or aws. Some of the agents failed to update. The results
>>> for the different CPCs are consistent. I see the change relates to not
>>> permitting downgrades. Damn, I just submitted a pull request documenting
>>> that you can downgrade.
>>>
>>> --
>>> Curtis Hovey
>>> Canonical Cloud Development and Operations
>>> http://launchpad.net/~sinzui
>>>
>>
>>
>