[ec2] Instances failing in a weird way

Jeremy Edberg jedberg at reddit.com
Tue Jul 21 18:17:43 BST 2009


Actually, I was hoping to get some guidance from Canonical on this
issue, as this bug bit me again Sunday afternoon.

I'm currently in the process of splitting the functions on that
instance into two separate instances so that I can further isolate the
cause of the bug.  Since it bites me every two weeks or so, I should
have more info in about two weeks. :)

On Tue, Jul 21, 2009 at 08:41, Darren Govoni <darren at ontrenet.com> wrote:
> I'm seeing this behavior now (using beta1, though) and created my own AMI
> from it.  Is it fixed in any of the newer AMIs?
>
> On Mon, 2009-07-13 at 16:38 -0700, Jeremy Edberg wrote:
>> Greetings,
>>
>> I had originally written up a report of some odd behavior that I was
>> seeing, until this bug report was pointed out to me (my original
>> write-up is below for all the details):
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/276476
>>
>> Basically, I'm seeing the behavior described in the bug.  My EC2 image
>> is based on the last 64-bit Intrepid beta AMI (unfortunately I didn't
>> write down the AMI ID), and the kernel ID is aki-38c12651.
>>
>> My questions are:
>>
>> Is anyone else seeing this behavior?
>> Does anyone have a workaround?
>> Are there any other official kernels available on EC2, and if so, is
>> there a list of them?  (A possible way to enumerate them is sketched
>> just after these questions.)
>> Does anyone know if/when this bug is going to be fixed?
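>>
>> (On the kernel question: presumably the Amazon-published kernel images
>> can be enumerated with something like
>>
>>   ec2-describe-images -o amazon | awk '$2 ~ /^aki-/'
>>
>> though an authoritative list would still be better.)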
>>
>> Thanks!
>>
>> Jeremy
>>
>> Original report:
>> -----------------
>> I've had two instances go down in the last week in a weird way, and
>> although they were different, I think they might be related.  I was
>> hoping someone could guide me in further investigation of the cause,
>> so that I could prevent the issue from recurring.
>>
>> Both of these are large instances.  One I rebooted to fix the problem;
>> the other I replaced with a new instance and left the old one running
>> for further investigation.
>>
>> When the first instance stopped working correctly, here were the symptoms:
>>
>> Couldn't ssh in (error on the remote was "connection reset by peer")
>> Log files stopped writing
>> Webserver still serving data
>> Existing ssh still worked, but couldn't sudo
>> Touching new files worked
>> Disk was not full
>> lsof reported far fewer open files than the system max
>> DNS worked partially when started by hand, but refused to start via
>> the init script
>> Load was normal
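>>
>> In rough terms, the checks behind that list look like the following
>> sketch (the exact commands may have differed):
>>
>>   df -h                        # no filesystem anywhere near full
>>   lsof | wc -l                 # count of open files
>>   cat /proc/sys/fs/file-max    # system-wide max, well above that count
>>   uptime                       # load average looked normal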
>>
>> The console log showed this:
>>
>> [3989783.026931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4076193.482201] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4076193.482222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4162603.921175] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4162603.921193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4249014.372240] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4249014.372258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4335424.821709] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4335424.821749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4421835.272209] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4421835.272230] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4508245.709979] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4508245.710000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [4594656.156967] INFO: task cron:11171 blocked for more than 120 seconds.
>> [4594656.156987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
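>>
>> The 120-second threshold and the message itself come from the kernel's
>> hung-task watchdog.  A rough sketch of getting more detail, assuming a
>> root shell is still available and sysrq works on this Xen guest:
>>
>>   cat /proc/sys/kernel/hung_task_timeout_secs   # current watchdog timeout (default 120)
>>   echo w > /proc/sysrq-trigger                  # dump stacks of blocked (D-state) tasks
>>   dmesg | tail -n 100                           # read the resulting traces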
>>
>>
>> The instance is still up.  I've since lost my shell, but perhaps I can
>> get some info from the console or underlying OS (via Amazon support).
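>>
>> (Even without a shell, the console output itself can still be pulled
>> with the EC2 API tools, e.g.:
>>
>>   ec2-get-console-output i-xxxxxxxx
>>
>> substituting the real instance ID for i-xxxxxxxx.)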
>>
>> The second instance I rebooted a few days ago.  Its symptoms were similar:
>>
>> Couldn't ssh in
>> Log files not writing
>> Disks not full
>> File handles not out of range
>>
>> The difference is that on this machine, the load just kept climbing.
>> It reached 72 before we rebooted.
>>
>> I saw a lot of entries like this in the console before rebooting:
>>
>> [2333582.440859] INFO: task cron:17810 blocked for more than 120 seconds.
>> [2333582.440864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [2333582.440926] INFO: task cron:18074 blocked for more than 120 seconds.
>> [2333582.440931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [2333582.440984] INFO: task cron:18075 blocked for more than 120 seconds.
>> [2333582.440989] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [2333582.441043] INFO: task cron:18076 blocked for more than 120 seconds.
>> [2333582.441048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [2333582.441102] INFO: task cron:18077 blocked for more than 120 seconds.
>> [2333582.441107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [2333582.441163] INFO: task cron:18081 blocked for more than 120 seconds.
>> [2333582.441168] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>>
>> Just like the other instance.
>>
>> So, my question is, do you know what causes this type of failure, and
>> how can I avoid it in the future?
>>
>> Thanks!
>>
>
>


