regression: restore-backup broken by recent commit

Tim Penhey tim.penhey at canonical.com
Fri Feb 24 04:27:34 UTC 2017


OK, I think I got it now...

This is all crazy, and it was indeed triggered by the gorilla/websocket change.

So... what happens on a successful restore is that the server calls
os.Exit(...), and pid 1 then restarts the agent. From the API client's
point of view, however, this is an abnormal closure.

In the rpc layer, I treat a number of websocket close errors as
"normal", but I missed the abnormal closure case, which is code 1006.

I'll update and re-propose to develop.

Huzzah.

Tim



On 24/02/17 16:17, Tim Penhey wrote:
> Hi Curtis (also expanding to juju-dev),
>
> I have been looking into this issue, and the good news is that it
> doesn't appear to be a real problem with gorilla/websocket at all;
> instead, a change in timing exposed an existing issue that hadn't
> surfaced before.
>
> I'll be looking into that issue: the restore command, after
> bootstrapping, doesn't appear to retry if it gets an error like
> "denied: upgrade in progress".
>
> Secondly, I tried to reproduce on lxd, only to find that there is an
> issue with the re-bootstrap on lxd: it just doesn't work.
>
> Then I tried with AWS, to mirror the CI test as closely as possible. I
> didn't hit the same timing issue as before, but instead got a different
> failure with the mongo restore:
>
>   http://pastebin.ubuntu.com/24056766/
>
> I have no idea why juju.txns.stash failed but juju.txns and
> juju.txns.logs succeeded.
>
> Also, a CI run of a develop revision just before the gorilla/websocket
> reversion hit this:
>
> http://reports.vapour.ws/releases/4922/job/functional-ha-backup-restore/attempt/5045#highlight
>
>     cannot create collection "txns": unauthorized mongo access: not
>     authorized on juju to execute command { create: "txns" }
>     (unauthorized access)
>
> Not sure why that is happening either. Seems that the restore of mongo
> is incredibly fragile.
>
> Again, this shows errors in the restore code, but luckily they have
> nothing to do with gorilla/websocket.
>
> Tim
>
> On 23/02/17 04:02, Curtis Hovey-Canonical wrote:
>> Hi Tim, et al.
>>
>> All the restore-backup tests in all the substrates failed with your
>> recent gorilla/websocket commit. The restore-backup command often
>> fails when bootstrap or connection behaviours change. This new bug is
>> definitely a connection failure while the client is driving a
>> restore.
>>
>> We need the develop branch fixed. As the previous commit was blessed,
>> we are certain 2.2-alpha1 was in very good shape before the gorilla
>> change.
>>
>> Restore backup failed websocket: close 1006
>> https://bugs.launchpad.net/juju/+bug/1666898
>>
>> As seen at
>>     http://reports.vapour.ws/releases/issue/5550dda7749a561097cf3d44
>>
>> All the restore-backup tests failed when testing commit
>> https://github.com/juju/juju/commit/f06c3e96f4e438dc24a28d8ebf7d22c76fff47e2
>>
>> We see:
>>
>>     Initial model "default" added.
>>     04:54:39 INFO juju.juju api.go:72 connecting to API addresses:
>>     [52.201.105.25:17070 172.31.15.167:17070]
>>     04:54:39 INFO juju.api apiclient.go:569 connection established to
>>     "wss://52.201.105.25:17070/model/89bcc17c-9af9-4113-8417-71847838f61a/api"
>>     ...
>>     04:55:20 ERROR juju.api.backups restore.go:136 could not clean up
>>     after failed restore attempt: <nil>
>>     04:55:20 ERROR cmd supercommand.go:458 cannot perform restore: <nil>:
>>     codec.ReadHeader error: error receiving message: websocket: close 1006
>>     (abnormal closure): unexpected EOF
>>
>> This is seen in aws, prodstack, and gce.
>>
>


