[Bug 2054391] Re: Fix CPU thermal sensors enumeration
Robie Basak
2054391 at bugs.launchpad.net
Wed Jun 26 14:14:44 UTC 2024
[indent snipped]
> + std::string temp_dir_path = base_path[i] + entry->d_name
> + + "/";
> + DIR *temp_dir = nullptr;
> + struct dirent *temp_dir_entry = nullptr;
> + int len_temp_dir_entry = 0;
> + int len_input = strlen("_input");
> +
> + if ((temp_dir = opendir(temp_dir_path.c_str())) != NULL) {
> + while ((temp_dir_entry = readdir(temp_dir)) != NULL) {
> + len_temp_dir_entry = strlen(temp_dir_entry->d_name);
> + if ((len_temp_dir_entry >= len_input
> + && !strcmp(
> + temp_dir_entry->d_name
> + + len_temp_dir_entry
> + - len_input, "_input"))
> + && (!strncmp(temp_dir_entry->d_name, "temp",
> + strlen("temp")))) {
FTR, the suffix matching code here is...suboptimal, especially as
std::string is being used elsewhere in the same code block. I don't see
an error here, but the unnecessary complexity is a risk. I nearly
rejected this asking for it to be rewritten, and it took me a
disproportionately large amount of SRU review time to understand it as
well as the previous suboptimal sensor_mask/mask obtuseness.
--
You received this bug notification because you are a member of Ubuntu
Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/2054391
Title:
Fix CPU thermal sensors enumeration
Status in HWE Next:
New
Status in thermald package in Ubuntu:
Fix Released
Status in thermald source package in Jammy:
Fix Committed
Bug description:
[Impact]
Some CPU sensors are not enumerated, this can make thermald deviates from the correct behavior of the CPU TDP.
[Fix]
Traverse all sensors under hwmon sysfs directory to make sure everything is enumerated.
[Test]
Check the output of thermald. Once the fix is in place, thermal zones that are previously omitted now shows up:
[INFO]Zone 1: AMBF, Active:1 Bind:1 Sensor_cnt:1
To do so
0. get a large machine which will have more thermal zones
1. stop the potentially auto-running service
$ systemctl stop thermald
2. run the daemon in foreground with loglevel to see what is going on.
On many modern systemd (=the large ones) it might not know the CPUid,
to bypass that for the test you can ask it to ignore the check
$ sudo thermald --no-daemon --loglevel=info --ignore-cpuid-check
3. check the output
On init the system will be probed and that will show something like:
...
ZONE DUMP BEGIN
[1718954645][INFO]Zone 2: cpu, Active:1 Bind:0 Sensor_cnt:1
...
[1718954645][INFO]Zone 3: cpu, Active:1 Bind:0 Sensor_cnt:1
...
ZONE DUMP END
In here, on systems with many thermal zones one would before the fix
only see a few, and with the fix more zones.
[Where problems could occur]
Since the new logic traverse the whole hwmon sysfs, the startup time can take slightly longer.
[racb] Existing users' systems may have bad or otherwise irrelevant or
out of scope sensors that may not have been causing misbehaviour due
to being skipped, but after the fix, they would face a regression. I'm
not sure that we can realistically identify such cases though, and it
seems reasonable to favour correct systems over misbehaving ones.
[racb] Similar to my previous point, we may pick up additional sensor
data that we shouldn't do due to inadequate filtering, causing
incorrect behaviour but this time it would be a bug in our filtering
rather than misbehaving existing systems. In mitigation, I see that
the fixed version has been released in Kinetic, so has had some real
world testing, and I see no indication upstream or in Launchpad that
this was a problem in practice.
To manage notifications about this bug go to:
https://bugs.launchpad.net/hwe-next/+bug/2054391/+subscriptions
More information about the Ubuntu-sponsors
mailing list