cpufreqd as standard install?

Sat Mar 3 08:16:39 UTC 2012

On 03/03/2012 12:13 AM, Phillip Susi wrote:
> On 02/29/2012 04:40 PM, John Moser wrote:
>> At full load (encoding a video), it eventually reaches 80C and the
>> system shuts down.
>
> It sounds like you have some broken hardware.  The stock heatsink and 
> fan are designed to keep the cpu from overheating under full load at 
> the design frequency and voltage.  You might want to verify that your 
> motherboard is driving the cpu at the correct frequency and voltage.
>

Possibly.

The only other use case I can think of is when ambient temperature is
hot.  Remember server rooms use air conditioning; I did find that for a
while my machine would quickly overheat if the room temperature was
above 79F, so I kept the room at 75F.  The heat sink was completely
clogged with dust at the time, though, which is why I recently cleaned
and inspected it and checked all the fan speed monitors and motherboard
settings to make sure everything was running as it should.

In any case, if the A/C goes down in a server room, it would be nice to
have CPU frequency scaling kick in and take the clock speed down before
the chip overheats.  Modern servers--for example, the new revision of
the Dell PowerEdge II and III as of 4 or 5 years ago--lean on their
low-power capabilities, and modern data centers use a centralized DC
converter and high voltage (220V) DC mains to reduce power waste,
because electricity is expensive.  It's extremely likely that such
servers offer a clock speed low enough to avoid overheating without air
conditioning, which is an emergency situation.

Of course, not overheating desktops with inadequate cooling or faulty
motherboard behavior is simply a bonus.  Still, I believe in fault
tolerance.

>> I currently have cpufreqd configured to clock to 1.8GHz at 73C, and move
>> to the ondemand governor at 70C.
>
> This need for manual configuring is a good reason why it is not a 
> candidate for standard install.
>

I've attached a configuration that generically uses sensors (i.e. if the
program 'sensors' gives useful output, this works).  It only covers one
core, though (a multi-core system reads the same temperature for all
cores, since the sensor is per-CPU); a file like this is easy to
generate automatically.
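
As a rough illustration of what generating such a file automatically
could look like, here is a minimal Python sketch that emits the three
[Rule] sections from a set of thresholds.  The sensor name temp1 and the
default thresholds are taken from the attached file; the function names
make_rule and make_rules are hypothetical, and the sketch only covers
the [Rule] sections, not [General] or [Profile].

```python
# Sketch: generate cpufreqd [Rule] sections from temperature thresholds.
# "temp1" and the default thresholds mirror the attached configuration;
# make_rule/make_rules are hypothetical helper names.

def make_rule(name, sensor, low, high, profile):
    """Return one [Rule] section as text."""
    return "\n".join([
        "[Rule]",
        f"name={name}",
        f"sensor={sensor}:{low}-{high}",
        f"profile={profile}",
        "[/Rule]",
    ])


def make_rules(sensor="temp1", cool=70, hot=73, too_hot=76, ceiling=100):
    """Build the three rules; the cool..hot range is left unmatched."""
    rules = [
        make_rule("Normal", sensor, 0, cool, "Standard"),
        make_rule("CPU Hot", sensor, hot, too_hot, "Hot"),
        make_rule("CPU Too Hot", sensor, too_hot, ceiling, "Overheating"),
    ]
    return "\n\n".join(rules) + "\n"


if __name__ == "__main__":
    print(make_rules())
```

Which sensor name to use would still have to be picked from whatever
'sensors' reports on the machine in question.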

Mind you, on the topic of automatic generation, 80C is a hard limit.  It
just is.  My machine reports (through sensors) +95.0C as "Critical", but 
my BIOS shuts down the system at +80.0C immediately.  Silicon physically 
does not tolerate temperatures above 80.0C well at all; if a chip claims 
it can run at 95.0C it's lying.  Even SOD-CMOS doesn't tolerate those 
temperatures.

Again, you could also write some generic profiles that detect when the
system is running on battery (UPS, laptop) and make appreciable
adjustments based on how much battery life is left.
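
A minimal Python sketch of that idea, not in cpufreqd syntax: choose a
maximum-frequency cap from the power state.  The function name and
every threshold and percentage below are made up for illustration; real
values would need tuning per machine.

```python
# Sketch: pick a maximum-frequency cap (in percent) from the power state.
# All thresholds and caps are hypothetical illustrations, not
# recommendations.

def max_freq_percent(on_ac, battery_percent):
    """Return the clock cap to apply for the given power state."""
    if on_ac:
        return 100      # mains power: no cap
    if battery_percent > 50:
        return 80       # plenty of battery left: mild cap
    if battery_percent > 20:
        return 50       # getting low: halve the ceiling
    return 25           # nearly empty: stretch what is left


if __name__ == "__main__":
    for state in [(True, 10), (False, 90), (False, 35), (False, 5)]:
        print(state, max_freq_percent(*state))
```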

>> At 73C, the system switches from 1.9GHz to 1.8GHz. Ten seconds later,
>> it's at 70C and switches back to 1.9GHz. 41 seconds after that, it
>> reaches 73C again and switches to 1.8GHz.
>>
>> That means at stock frequency (1.9GHz) with stock cooling equipment, the
>> CPU overheats under full load. Clocked 0.1GHz slower than its rated
>> speed, it rapidly cools. Which is ridiculous; who designed this thing?
>
> This sounds like your motherboard is overvolting the cpu in that 1.9 
> GHz stepping.
>

Possibly, but the settings are all default, with nothing set to
overclock (it has jumper-free overclocking configuration, but "Standard"
is the default option for the clock rate and voltage settings, which I
assume means the CPU supplies them).
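
The 73C/70C switching quoted above is a hysteresis band: throttle at
73C, release at 70C, and keep the current state in between.  A minimal
Python sketch of that logic follows.  It assumes the current profile is
simply kept when no rule matches (an assumption about cpufreqd's
behavior, not something checked here), and it resolves the overlap at
exactly 76C in favor of the more aggressive profile, which is also just
a choice made for this sketch.  The function name is hypothetical.

```python
# Sketch: profile selection with a 70-73C dead band, using the
# thresholds from the attached configuration.
# Assumption: inside the dead band the current profile is kept.

def next_profile(current, temp_c):
    """Return the profile to use at temp_c, given the current profile."""
    if temp_c >= 76:
        return "Overheating"    # 76-100: clamp the clock hard
    if temp_c >= 73:
        return "Hot"            # 73-76: cap at 95%
    if temp_c <= 70:
        return "Standard"       # 0-70: full range
    return current              # 70-73: dead band, keep what we have


if __name__ == "__main__":
    profile = "Standard"
    for temp in [69, 72, 73, 72, 71, 70, 72]:
        profile = next_profile(profile, temp)
        print(temp, profile)
```

Without the dead band the profile would flip on every poll while the
temperature hovers around a single threshold.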

Basically the argument here is between "Supply fault tolerance" and 
"Well your motherboard is [old|poorly designed] so buy a new one."  
That's an excellent argument for hard drives (I have, in fact, suggested 
in the past that Ubuntu monitor hard disks for behavior indicative of 
dying drives--SMART errors, IDE RESET commands because the drive hangs, 
etc--and begin annoying the user with messages about the SEVERE risk of 
extreme data loss if he doesn't back up his data), but really if my 
mobo/CPU is aging and the CPU runs a little hot I'm not going to cry 
when the CPU suddenly burns out and my machine shuts down.  I'll be 
confused, annoyed, but I'll buy a new one--I might buy an entire new 
computer, unaware that just my CPU is broken, and shove the hard drive 
in there.  So there's no harm in allowing the user's hardware to go 
ahead and burn itself out if you think that's what's going on here.

That doesn't mean you can't have a diagnostic center somewhere that the
user can review to see the whole collection:

  "Ethernet:  Lots of garbage [Possibly:  Faulty switch, faulty NIC,
  another computer with a chattering NIC spewing packets]."

  "CPU:  Overheats under high CPU load [Possibly:  Dust-clogged CPU heat
  sink, failing CPU fan, overclocking, failing CPU, failing motherboard
  voltage regulators, buggy motherboard BIOS]."

  "/!\ Hard drive:  Freezes and needs IDE Resets [Possibly:  Dying hard
  drive /!\, dying IDE controller, dying RAID controller] /!\ WARNING:
  SEVERE DATA LOSS POSSIBLE".

Etc.  Looks like you really need a new computer...
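
To make the shape of such a diagnostic entry concrete, here is a
minimal Python sketch: a component, a symptom, a list of possible
causes, and a flag for data-loss risk.  The class name, field names and
output format are all hypothetical; nothing like this exists in Ubuntu
as far as this message is concerned.

```python
# Sketch: one entry in a hypothetical diagnostic center.
from dataclasses import dataclass
from typing import List


@dataclass
class Diagnosis:
    component: str
    symptom: str
    possible_causes: List[str]
    data_loss_risk: bool = False

    def render(self):
        """Format the entry as a single line of text."""
        causes = ", ".join(self.possible_causes)
        line = f"{self.component}: {self.symptom} [Possibly: {causes}]"
        if self.data_loss_risk:
            # Flag the entries the user must not ignore.
            line = f"/!\\ {line} /!\\ WARNING: SEVERE DATA LOSS POSSIBLE"
        return line


if __name__ == "__main__":
    entries = [
        Diagnosis("CPU", "Overheats under high CPU load",
                  ["Dust-clogged CPU heat sink", "failing CPU fan"]),
        Diagnosis("Hard drive", "Freezes and needs IDE Resets",
                  ["Dying hard drive", "dying IDE controller"],
                  data_loss_risk=True),
    ]
    for entry in entries:
        print(entry.render())
```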

Yes I have strange ideas about what a computer should and shouldn't do.  
But then, you know, people run huge racks of computers that fail 
catastrophically if you don't pipe an air conditioning line straight to 
the chassis fan intake (take a look under the cabinet, the floor tile 
directly under each server rack is perforated--the raised floor has A/C 
pumped under it and it vents directly and exclusively into the server 
cabinets).
-------------- next part --------------
# cpufreqd configuration: throttle the CPU based on lm-sensors temperature
# see CPUFREQD.CONF(5) manpage for a complete reference
#
# Note: every Profile below uses the ondemand governor, so that governor
#       must be available on this platform.

[General]
pidfile=/var/run/cpufreqd.pid
poll_interval=0.2
verbosity=4
#enable_remote=1
#remote_group=root
[/General]

[Profile]
name=Standard
minfreq=0%
maxfreq=100%
policy=ondemand
[/Profile]

[Profile]
name=Hot
minfreq=50%
maxfreq=95%
policy=ondemand
[/Profile]

[Profile]
name=Overheating
minfreq=0%
maxfreq=10%
policy=ondemand
[/Profile]

##
# Basic states
##
[Rule]
name=Normal
#acpi_temperature=0-70
sensor=temp1:0-70
#cpu_interval=00-100
profile=Standard
[/Rule]

##
# Special Rules
##
# CPU hot: cap the clock at 95% (about 1.8GHz on a 1.9GHz part).
# Temperatures between 70 and 73 match no rule.
[Rule]
name=CPU Hot
#acpi_temperature=4-5
sensor=temp1:73-76
#cpu_interval=00-100
profile=Hot
[/Rule]

[Rule]
name=CPU Too Hot
#acpi_temperature=50-100
sensor=temp1:76-100
#cpu_interval=00-100
profile=Overheating
[/Rule]