[ec2] Instances failing in a weird way

Tue Jul 21 16:41:01 BST 2009

I'm seeing this behavior now as well, on an AMI I created myself from
beta1. Is it fixed in any of the newer AMIs?

On Mon, 2009-07-13 at 16:38 -0700, Jeremy Edberg wrote:
> Greetings,
> 
> I had originally written up a report of some odd behavior that I was
> seeing, until this bug report was pointed out to me (my original
> write-up is below for all the details):
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/276476
> 
> Basically, I'm seeing the behavior described in the bug.  My ec2 image
> is based on the last 64-bit Intrepid beta AMI (unfortunately I didn't
> write down the AMI id) and the kernel ID is aki-38c12651.
> 
> My questions are:
> 
> Is anyone else seeing this behavior?
> Does anyone have a workaround?
> Are there any other official kernels available on ec2, and if so is
> there a list of them?
> Does anyone know if/when this bug is going to be fixed?
> 
> Thanks!
> 
> Jeremy
> 
> Original report:
> -----------------
> I've had two instances go down in the last week in a weird way, and
> although they were different, I think they might be related.  I was
> hoping someone could guide me in further investigation of the cause,
> so that I could prevent the issue from recurring.
> 
> Both of these are large instances.  I rebooted one to fix the
> problem; the other I replaced with a new instance, leaving the
> old one running for further investigation.
> 
> When the first instance stopped working correctly, here were the symptoms:
> 
> Couldn't ssh in (error on the remote was "connection reset by peer")
> Log files stopped writing
> Webserver still serving data
> Existing ssh still worked, but couldn't sudo
> Touching new files worked
> Disk was not full
> lsof reported far fewer open files than the system max
> DNS was working partially if started by hand, but refused to start via
> the init script
> Load was normal
> 
> The console log showed this:
> 
> [3989783.026931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4076193.482201] INFO: task cron:11171 blocked for more than 120 seconds.
> [4076193.482222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4162603.921175] INFO: task cron:11171 blocked for more than 120 seconds.
> [4162603.921193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4249014.372240] INFO: task cron:11171 blocked for more than 120 seconds.
> [4249014.372258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4335424.821709] INFO: task cron:11171 blocked for more than 120 seconds.
> [4335424.821749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4421835.272209] INFO: task cron:11171 blocked for more than 120 seconds.
> [4421835.272230] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4508245.709979] INFO: task cron:11171 blocked for more than 120 seconds.
> [4508245.710000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [4594656.156967] INFO: task cron:11171 blocked for more than 120 seconds.
> [4594656.156987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> 
> The instance is still up.  I've since lost my shell, but perhaps I can
> get some info from the console or underlying OS (via Amazon support).
> 
> I rebooted the second instance a few days ago.  Its symptoms were similar:
> 
> Couldn't ssh in
> Logfiles not writing.
> Disks not full.
> Filehandles not out of range.
> 
> The difference is that on this machine, the load just kept climbing.
> It reached 72 before we rebooted.
> 
> I saw a lot of entries like this in the console before rebooting:
> 
> [2333582.440859] INFO: task cron:17810 blocked for more than 120 seconds.
> [2333582.440864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [2333582.440926] INFO: task cron:18074 blocked for more than 120 seconds.
> [2333582.440931] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [2333582.440984] INFO: task cron:18075 blocked for more than 120 seconds.
> [2333582.440989] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [2333582.441043] INFO: task cron:18076 blocked for more than 120 seconds.
> [2333582.441048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [2333582.441102] INFO: task cron:18077 blocked for more than 120 seconds.
> [2333582.441107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [2333582.441163] INFO: task cron:18081 blocked for more than 120 seconds.
> [2333582.441168] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Just like the other instance.
> 
> So, my question is, do you know what causes this type of failure, and
> how can I avoid it in the future?
> 
> Thanks!
>
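
In case it helps anyone compare console logs, here is a minimal sketch
that groups the "blocked for more than N seconds" lines by task name and
PID and reports the gap between repeats. Python is my own choice, nothing
in the thread prescribes a tool, and the names hung_task_reports,
report_intervals and SAMPLE are hypothetical. The sample lines are copied
from the first instance's console log quoted above, where the cron:11171
reports arrive about 86,410 seconds apart, i.e. roughly once a day.

```python
import re
from collections import defaultdict

# Matches the kernel's hung-task report lines as they appear in the quoted
# console log, e.g.
#   [4076193.482201] INFO: task cron:11171 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<limit>\d+) seconds"
)


def hung_task_reports(log_text):
    """Map (task name, pid) to the list of timestamps at which it was reported."""
    reports = defaultdict(list)
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            key = (match.group("name"), int(match.group("pid")))
            reports[key].append(float(match.group("ts")))
    return dict(reports)


def report_intervals(timestamps):
    """Seconds between consecutive reports for one task."""
    return [round(later - earlier, 3)
            for earlier, later in zip(timestamps, timestamps[1:])]


# Sample lines copied from the first instance's console log in this thread.
SAMPLE = """\
[4076193.482201] INFO: task cron:11171 blocked for more than 120 seconds.
[4076193.482222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4162603.921175] INFO: task cron:11171 blocked for more than 120 seconds.
[4249014.372240] INFO: task cron:11171 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for (name, pid), stamps in hung_task_reports(SAMPLE).items():
        print(name, pid, len(stamps), report_intervals(stamps))
```

Because the pattern is applied with search() rather than match(), it
should also work on lines that still carry the "> " quoting prefix from
an email. Run against the second instance's log, it would list six
different cron PIDs with one report each, all within the same millisecond.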