A Nasty Performance Regression For Some Intel Systems Wound Up In Linux 6.5 Stable
So what's with the Linux 6.5 performance? Thanks to some pretty clear-cut differences in performance going from Linux 6.4 to Linux 6.5, it was a pretty easy and fun bisect. To limit the time spent on this endeavour, I focused on the video encoding regressions and SVT-VP9 in particular...
The video encoding regression was clearly introduced during the Linux 6.5 merge window, and presumably the same issue is behind most if not all of the other affected workloads.
What it ended up bisecting down to is commit 9050a9cd5e4c848e265915d6e7b1f731e6e1e0e6, "powercap: intel_rapl: Cleanup Power Limits support." This clean-up of the Intel Runtime Average Power Limiting (RAPL) code regressed the Core i9 13900K system hard.
While sorting that one through, I then discovered that last week Intel posted a fix for the PL4 setting. That patch cites the above-mentioned commit and explains:
"System runs at minimum performance, once powercap RAPL package domain enabled flag is changed from 1 to 0 to 1.
Setting RAPL package domain enabled flag to 0, results in setting of power limit 4 (PL4) MSR 0x601 to 0. This implies disabling PL4 limit. The PL4 limit controls the peak power. So setting 0, results in some undesirable performance, which depends on hardware implementation.
Even worse, when the enabled flag is set to 1 again. This will set PL4 MSR value to 0x01, which means reduce peak power to 0.125W. This will force system to run at the lowest possible performance on every PL4 supported system.
Setting enabled flag should only affect the "enable" bit, not other bits. Here it is changing power limit. This is caused by a change which assumes that there is an enable bit in the PL4 MSR like other power limits. Although PL4 enable/disable bit is present with TPMI RAPL interface, it is not present with the MSR interface.
There is a rapl_primitive_info defined for non existent PL4 enable bit and then it is used with the commit 9050a9cd5e4c ("powercap: intel_rapl: Cleanup Power Limits support") to enable PL4. This is wrong, hence remove this rapl primitive for PL4. Also in the function rapl_detect_powerlimit(), PL_ENABLE is used to check for the presence of power limits. Replace PL_ENABLE with PL_LIMIT, as PL_LIMIT must be present. Without this change, PL4 controls will not be available in the sysfs once rapl primitive for PL4 is removed."
Ouch.
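To make the failure mode in the quoted patch description concrete, here is a minimal Python model of what happens to the PL4 register when the enabled flag goes from 1 to 0 to 1. It is only a sketch, not kernel code: the 13-bit limit field, the mask used for the bogus "enable" write, and the 350 W starting limit are my assumptions for illustration. The patch text only confirms the resulting values (0, then 0x01) and that 0x01 corresponds to 0.125 W.

```python
# Simplified model of the PL4 regression described in the patch; not kernel code.

# Assumption: the PL4 limit field occupies the low 13 bits of MSR 0x601.
PL4_LIMIT_MASK = 0x1FFF
# Assumption: the non-existent "enable" primitive was defined over bits that
# overlap the limit field starting at bit 0 (hypothetical stand-in mask).
BOGUS_ENABLE_MASK = 0x1FFF
# 1/8 W per count, consistent with the quote's "0x01 means 0.125W".
POWER_UNIT_W = 0.125


def write_field(reg, mask, value):
    """Read-modify-write: replace the bits under `mask` with `value` (shift 0)."""
    return (reg & ~mask) | (value & mask)


def set_enabled(reg, enabled):
    """Buggy behaviour: the 'enable' write lands on the power limit itself."""
    return write_field(reg, BOGUS_ENABLE_MASK, 1 if enabled else 0)


def pl4_watts(reg):
    """Decode the PL4 limit field into watts."""
    return (reg & PL4_LIMIT_MASK) * POWER_UNIT_W


def toggle_sequence(initial_reg):
    """Return the PL4 limit in watts: initially, after disable, after re-enable."""
    history = [pl4_watts(initial_reg)]
    reg = set_enabled(initial_reg, False)   # enabled: 1 -> 0
    history.append(pl4_watts(reg))
    reg = set_enabled(reg, True)            # enabled: 0 -> 1
    history.append(pl4_watts(reg))
    return history


# Hypothetical starting point: raw value 2800 counts = 350 W peak power limit.
print(toggle_sequence(2800))
```

With the hypothetical 350 W starting limit, the model goes from 350 W to 0 (limit disabled) to 0.125 W, which is the sequence the patch describes. The original limit is never restored because nothing in the enable path remembers it.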
The good news is that the patch has been merged and the issue should be fixed as of today's Linux 6.5.3 point release. Still, it's very surprising that such a glaring regression made it into a stable Linux kernel release before being detected by Intel, considering their generally great QA and extensive Linux testing.
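For anyone wanting to check whether a system has been left with a tiny peak power limit, the powercap sysfs interface exposes each RAPL constraint by name. The sketch below looks for the "peak_power" constraint (PL4) in a package domain directory and reports its limit in watts. The 1 W threshold in `looks_throttled` is an arbitrary value I picked for illustration, and I am assuming the clamped limit is visible through sysfs in the same way as a normal one.

```python
from pathlib import Path


def read_peak_power_watts(domain_dir):
    """Return the 'peak_power' (PL4) limit in watts, or None if not exposed."""
    domain = Path(domain_dir)
    for name_file in sorted(domain.glob("constraint_*_name")):
        if name_file.read_text().strip() != "peak_power":
            continue
        # constraint_N_name -> constraint_N_power_limit_uw (value in microwatts)
        limit_file = name_file.with_name(
            name_file.name.replace("_name", "_power_limit_uw")
        )
        return int(limit_file.read_text().strip()) / 1_000_000
    return None


def looks_throttled(domain_dir, threshold_watts=1.0):
    """Heuristic: a PL4 below `threshold_watts` (hypothetical cut-off) is suspicious."""
    watts = read_peak_power_watts(domain_dir)
    return watts is not None and watts < threshold_watts
```

On a typical Intel desktop the package domain lives at /sys/class/powercap/intel-rapl:0, so calling `read_peak_power_watts("/sys/class/powercap/intel-rapl:0")` is the usual starting point. A result of 0.125 would match the broken state described in the patch.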
It's too bad I didn't have the time to test Linux 6.5 earlier on during the development cycle with this Raptor Lake system I commonly use for Intel desktop Linux benchmarking. As always, those that enjoy my consistent Linux hardware testing and performance benchmarking can join Phoronix Premium to view the site ad-free and multi-page articles on a single page, native dark mode, and other benefits. Phoronix tips are also accepted via PayPal and Stripe. Now onto more Linux 6.5~6.6 Git benchmarking on other systems...