Getting the most out of your Intel integrated GPU on Linux

About a year ago, I got a new laptop: a late 2019 Razer Blade Stealth 13.  It sports an Intel i7-1065G7 with the best of Intel's Ice Lake graphics along with an NVIDIA GeForce GTX 1650.  Apart from needing an ACPI lid quirk and the power management issues described here, it’s been a great laptop so far and the Linux experience has been very smooth.

Unfortunately, the out-of-the-box integrated graphics performance of my new laptop was less than stellar.  My first task with the new laptop was to debug a rendering issue in the Linux port of Shadow of the Tomb Raider which turned out to be a bug in the game.  In the process, I discovered that the performance of the game’s built-in benchmark was almost half that of Windows.  We’ve had some performance issues with Mesa from time to time on some games but half seemed a bit extreme.  Looking at system-level performance data with gputop revealed that the GPU clock rate was unable to get above about 60-70% of the maximum in spite of the GPU being busy the whole time.  Why?  The GPU wasn’t able to get enough power.  Once I sorted out my power management problems, the benchmark went from about 50-60% the speed of Windows to more like 104% the speed of Windows (yes, that’s more than 100%).

This blog post is intended to serve as a bit of a guide to understanding memory throughput and power management issues and configuring your system properly to get the most out of your Intel integrated GPU.  Not everything in this post will affect all laptops so you may have to do some experimentation with your system to see what does and does not matter.  I also make no claim that this post is in any way complete; there are almost certainly other configuration issues of which I'm not aware or which I've forgotten.

Update your drivers

This should go without saying but if you want the best performance out of your hardware, running the latest drivers is always recommended.  This is especially true for hardware that has just been released.  Generally, for graphics, most of the big performance improvements are going to be in Mesa but your Linux kernel version can matter as well.  In the case of Intel Ice Lake processors, some of the power management features aren’t enabled until Linux 5.4.

I’m not going to give a complete guide to updating your drivers here.  If you’re running a distro like Arch, chances are that you’re already running something fairly close to the latest available.  If you’re on Ubuntu, the padoka PPA provides versions of the userspace components (Mesa, X11, etc.) that are usually no more than about a week out-of-date but upgrading your kernel is more complicated.  Other distros may have something similar but I’ll leave that as an exercise to the reader.

This doesn’t mean that you need to be obsessive about updating kernels and drivers.  If you’re happy with the performance and stability of your system, go ahead and leave it alone.  However, if you have brand new hardware and want to make sure you have new enough drivers, it may be worth attempting an update.  Or, if you have the patience, you can just wait 6 months for the next distro release cycle and hope to pick them up with a distro update.

Make sure you have dual-channel RAM

One of the big bottleneck points in 3D rendering applications is memory bandwidth.  Most standard monitors run at a resolution of 1920x1080 and a refresh rate of 60 Hz.  A 1920x1080 RGBA (32bpp) image is just shy of 8 MiB in size and, if the GPU is rendering at 60 FPS, that adds up to about 474 MiB/s of memory bandwidth to write out the image every frame.  If you're running a 4K monitor, multiply by 4 and you get about 1.8 GiB/s.  Those numbers are only for the final color image, assume we write every pixel of the image exactly once, and don't take into account any other memory access.  Even in a simple 3D scene, there are other images than just the color image being written such as depth buffers or auxiliary gbuffers, each pixel typically gets written more than once depending on app over-draw, and shading typically involves reading from uniform buffers and textures.  Modern 3D applications typically also have things such as depth pre-passes, lighting passes, and post-processing filters for depth-of-field and/or motion blur.  The result of this is that actual memory bandwidth for rendering a 3D scene can be 10-100x the bandwidth required to simply write the color image.
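As a quick sanity check on those numbers, here is a minimal Python sketch of the arithmetic.  The function names are mine, and the 10-100x multiplier at the end is just the rough estimate from the paragraph above applied to 1080p, not a measurement.

```python
# Back-of-the-envelope bandwidth needed just to write the final color image.
MIB = 1024 ** 2
GIB = 1024 ** 3


def framebuffer_bytes(width, height, bytes_per_pixel=4):
    """Size of one color image in bytes (RGBA at 32bpp is 4 bytes per pixel)."""
    return width * height * bytes_per_pixel


def write_bandwidth(width, height, fps=60, bytes_per_pixel=4):
    """Bytes per second needed to write every pixel exactly once per frame."""
    return framebuffer_bytes(width, height, bytes_per_pixel) * fps


if __name__ == "__main__":
    hd = write_bandwidth(1920, 1080)
    uhd = write_bandwidth(3840, 2160)
    print(f"1080p image:    {framebuffer_bytes(1920, 1080) / MIB:.2f} MiB")
    print(f"1080p @ 60 FPS: {hd / MIB:.1f} MiB/s")
    print(f"4K @ 60 FPS:    {uhd / GIB:.2f} GiB/s")
    # Rough 10-100x estimate for a real 3D scene at 1080p.
    print(f"1080p scene:    {10 * hd / GIB:.1f} to {100 * hd / GIB:.1f} GiB/s")
```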

Because of the incredible amount of bandwidth required for 3D rendering, discrete GPUs use memories which are optimized for bandwidth above all else.  These go by different names such as GDDR6 or HBM2 (current as of the writing of this post) but they all use extremely wide buses and access many bits of memory in parallel to get the highest throughput they can.  CPU memory, on the other hand, is typically DDR4 (current as of the writing of this post) which runs on a narrower 64-bit bus and so the over-all maximum memory bandwidth is lower.  However, as with anything in engineering, there is a trade-off being made here.  While narrower buses have lower over-all throughput, they are much better at random access which is necessary for good CPU memory performance when crawling complex data structures and doing other normal CPU tasks.  When 3D rendering, on the other hand, the vast majority of your memory bandwidth is consumed in reading/writing large contiguous blocks of memory and so the trade-off falls in favor of wider buses.
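To put rough numbers on the bus-width trade-off, theoretical peak bandwidth is just the bus width in bytes multiplied by the transfer rate.  The sketch below uses two illustrative configurations that I picked as assumptions (a single DDR4-3200 channel and a 256-bit GDDR6 interface at 14 GT/s per pin); real-world throughput will be lower than either figure.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Theoretical peak in GB/s (10^9 bytes): bus width in bytes x transfer rate."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


if __name__ == "__main__":
    # Illustrative configurations (assumptions, not measurements).
    ddr4_channel = peak_bandwidth_gb_s(64, 3200e6)  # one 64-bit DDR4-3200 channel
    gddr6_256bit = peak_bandwidth_gb_s(256, 14e9)   # 256-bit GDDR6 at 14 GT/s
    print(f"DDR4-3200, one channel: {ddr4_channel:.1f} GB/s")
    print(f"GDDR6, 256-bit:         {gddr6_256bit:.1f} GB/s")
```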

With integrated graphics, the GPU uses the same DDR RAM as the CPU so it can't get as much raw memory throughput as a discrete GPU.  Some of the memory bottlenecks can be mitigated via large caches inside the GPU but caching can only do so much.  At the end of the day, if you're fetching 2 GiB of memory to draw a scene, you're going to blow out your caches and load most of that from main memory.

The good news is that most motherboards support a dual-channel RAM configuration where, if your DDR units are installed in identical pairs, the memory controller will split memory access between the two DDR units in the pair.  This has similar benefits to running on a 128-bit bus but without some of the drawbacks.  The result is about a 2x improvement in over-all memory throughput.  While this may not affect your CPU performance significantly outside of some very special cases, it makes a huge difference to your integrated GPU which cares far more about total throughput than random access.  If you are unsure how your computer's RAM is configured, you can run “dmidecode -t memory” and see if you have two identical devices reported in different channels.
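If you'd rather not read the dmidecode output by eye, something like the following Python sketch can summarize it.  It assumes the usual “Memory Device” blocks with “Size”, “Locator” and “Bank Locator” fields, and the dual-channel check is only a heuristic of mine (two populated modules of equal size in different slots): locator naming varies by OEM, so treat the answer as a hint rather than proof.

```python
import subprocess


def parse_memory_devices(dmidecode_text):
    """Return one dict of fields per populated 'Memory Device' entry."""
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current entry
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    # Empty slots are reported with a size of "No Module Installed".
    return [d for d in devices
            if d.get("Size") and d["Size"] != "No Module Installed"]


def looks_dual_channel(devices):
    """Heuristic: two or more equal-size modules in different locators."""
    by_size = {}
    for d in devices:
        slot = (d.get("Locator"), d.get("Bank Locator"))
        by_size.setdefault(d["Size"], set()).add(slot)
    return any(len(slots) >= 2 for slots in by_size.values())


def main():
    try:
        result = subprocess.run(["dmidecode", "-t", "memory"],
                                capture_output=True, text=True)
    except OSError as err:
        print(f"could not run dmidecode: {err}")
        return
    devices = parse_memory_devices(result.stdout)
    if not devices:
        print("no populated memory devices found (dmidecode usually needs root)")
        return
    for d in devices:
        print(d.get("Locator"), "|", d.get("Bank Locator"), "|", d["Size"])
    print("looks dual-channel:", looks_dual_channel(devices))


if __name__ == "__main__":
    main()
```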

Power management 101

Before getting into the details of how to fix power management issues, I should explain a bit about how power management works and, more importantly, how it doesn’t.  If you don’t care to learn about power management and are just here for the system configuration tips, feel free to skip this section.

Why is power management important?  Because the clock rate (and therefore the speed) of your CPU or GPU is heavily dependent on how much power is available to the system.  If it’s unable to get enough power for some reason, it will run at a lower clock rate and you’ll see that as processes taking more time or lower frame rates in the case of graphics.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective which can greatly affect power management and your performance.

First, we need to talk about thermal design power or TDP.  There is a lot of misunderstanding on the internet about TDP and we need to clear some of it up.  Wikipedia defines TDP as “the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload.”  The Intel Product Specifications site defines TDP as follows:

Thermal Design Power (TDP) represents the average power, in watts, the processor dissipates when operating at Base Frequency with all cores active under an Intel-defined, high-complexity workload. Refer to Datasheet for thermal solution requirements.

In other words, the TDP value provided on the Intel spec sheet is a pretty good design target for OEMs but doesn’t provide nearly as many guarantees as one might hope.  In particular, there are several things that the TDP value on the spec sheet is not:
  • It’s not the exact maximum power.  It’s an “average power”.
  • It may not match any particular workload.  It’s based on “an Intel-defined, high-complexity workload”.  Power consumption on any other workload is likely to be slightly different.
  • It’s not the actual maximum.  It’s based on when the processor is “operating at Base Frequency with all cores active.” Technologies such as Turbo Boost can cause the CPU to operate at a higher power for short periods of time.

If you look at the Intel Product Specifications page for the i7-1065G7, you’ll see three TDP values: the nominal TDP of 15W, a configurable TDP-up value of 25W and a configurable TDP-down value of 12W.  The nominal TDP (simply called “TDP”) is the base TDP which is enough for the CPU to run all of its cores at the base frequency which, given sufficient cooling, it can do in the steady state.  The TDP-up and TDP-down values provide configurability that gives the OEM options when they go to make a laptop based on the i7-1065G7.  If they’re making a performance laptop like Razer and are willing to put in enough cooling, they can configure it to 25W and get more performance.  On the other hand, if they’re going for battery life, they can put the exact same chip in the laptop but configure it to run as low as 12W.  They can also configure the chip to run at 12W or 15W and then ship software with the computer which will bump it to 25W once Windows boots up.  We’ll talk more about this reconfiguration later on.

Beyond just the numbers on the spec sheet, there are other things which may affect how much power the chip can get.  One of the big ones is cooling.  The law of conservation of energy dictates that energy is never created or destroyed.  In particular, your CPU doesn’t really consume energy; it turns that electrical energy into heat.  For every Watt of electrical power that goes into the CPU, a Watt of heat has to be pumped out by the cooling system.  (Yes, a Watt is also a measure of heat flow.)  If the CPU is using more electrical energy than the cooling system can pump back out, energy gets temporarily stored in the CPU as heat and you see this as the CPU temperature rising.  Eventually, however, the CPU has to back off and let the cooling system catch up or else that built up heat may cause permanent damage to the chip.

Another thing which can affect CPU power is the actual power delivery capabilities of the motherboard itself.  In a desktop, the discrete GPU is typically powered directly by the power supply and it can draw 300W or more without affecting the amount of power available to the CPU.  In a laptop, however, you may have more power limitations.  If you have multiple components requiring significant amounts of power such as a CPU and a discrete GPU, the motherboard may not be able to provide enough power for both of them to run flat-out so it may have to limit CPU power while the discrete GPU is running.  These types of power balancing decisions can happen at a very deep firmware level and may not be visible to software.

The moral of this story is that the TDP listed on the spec sheet for the chip isn’t what matters; what matters is how the chip is configured by the OEM, how much power the motherboard is able to deliver, and how much power the cooling system is able to remove.  Just because two laptops have the same processor with the same part number doesn’t mean you should expect them to get the same performance.  This is unfortunate for laptop buyers but it’s the reality of the world we live in.  There are some things that you, as the user, cannot control such as the physical limitations of the chip or the way the OEM has configured things on your particular laptop.  However, there are some things which you can do from a system configuration perspective and that’s what we’ll talk about next.

If you want to experiment with your system and understand what’s going on with power, there are two tools which are very useful for this: powertop and turbostat.  Both are open-source and should be available through your distro package manager.  I personally prefer the turbostat interface for CPU power investigations but powertop is able to split your power usage up per-process which can be really useful as well.
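If you just want a single number for CPU package power without a full tool, the kernel's powercap interface exposes a running energy counter that you can sample twice.  The following is a minimal sketch under a few assumptions: the package zone lives at /sys/class/powercap/intel-rapl:0 (zone numbering can differ between machines), and you have permission to read energy_uj, which on recent kernels generally means running as root.

```python
import time
from pathlib import Path

# Assumed location of the CPU package zone; check the "name" file to confirm.
RAPL_PACKAGE = Path("/sys/class/powercap/intel-rapl:0")


def read_int(path):
    return int(Path(path).read_text().strip())


def average_watts(energy_start_uj, energy_end_uj, seconds, max_range_uj):
    """Average power from two energy samples, allowing for one counter wrap."""
    delta = energy_end_uj - energy_start_uj
    if delta < 0:
        delta += max_range_uj  # the counter wrapped around between samples
    return delta / 1e6 / seconds


def sample_package_power(interval=1.0, zone=RAPL_PACKAGE):
    """Sleep for `interval` seconds and return the average power in watts."""
    zone = Path(zone)
    max_range = read_int(zone / "max_energy_range_uj")
    start = read_int(zone / "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval)
    end = read_int(zone / "energy_uj")
    return average_watts(start, end, time.monotonic() - t0, max_range)


if __name__ == "__main__":
    try:
        print(f"package power: {sample_package_power():.1f} W")
    except OSError as err:
        print(f"could not read RAPL counters: {err}")
```

Running this while a game or benchmark is active and comparing against the TDP you expect is a cheap way to see whether the package is being held below its budget.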

Update GameMode to at least version 1.5

About two and a half years ago (1.0 was released in May of 2018), Feral Interactive released their GameMode daemon which is able to tweak some of your system settings when a game starts up to get maximal performance.  One of the settings that GameMode tweaks is your CPU performance governor.  By default, GameMode will set it to “performance” when a game is running.  While this seems like a good idea (“performance” is better, right?), it can actually be counterproductive on integrated GPUs and cause you to get worse over-all performance.

Why would the “performance” governor cause worse performance?  First, understand that the names “performance” and “powersave” for CPU governors are a bit misleading.  The powersave governor isn’t just for when you’re running on battery and want to use as little power as possible.  When on the powersave governor, your system will clock all the way up if it needs to and can even turbo if you have a heavy workload.  The difference between the two governors is that the powersave governor tries to give you as much performance as possible while also caring about power; it’s quite well balanced.  Intel typically recommends the powersave governor even in data centers because, even though they have piles of power and cooling available, data centers typically care about their power bill.  The performance governor, on the other hand, doesn’t care about power consumption and only cares about getting the maximum possible performance out of the CPU so it will typically burn significantly more power than needed.

So what does this have to do with GPU performance?  On an integrated GPU, the GPU and CPU typically share a power budget and every Watt of power the CPU is using is a Watt that’s unavailable to the GPU.  In some configurations, the TDP is enough to run both the GPU and CPU flat-out but that’s uncommon.  Most of the time, however, the CPU is capable of using the entire TDP if you clock it high enough.  When running with the performance governor, that extra unnecessary CPU power consumption can eat into the power available to the GPU and cause it to clock down.
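To see which governor your CPUs are actually using while a game is running, you can read it straight out of sysfs.  This is a small sketch that assumes the standard cpufreq layout under /sys/devices/system/cpu; the function name is mine.

```python
from collections import Counter
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def governors(cpu_root=CPU_ROOT):
    """Map each CPU (e.g. 'cpu0') to its current scaling governor."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        try:
            result[path.parent.parent.name] = path.read_text().strip()
        except OSError:
            continue  # CPU went offline or the file is unreadable
    return result


if __name__ == "__main__":
    counts = Counter(governors().values())
    if not counts:
        print("no cpufreq information found")
    for name, count in counts.items():
        print(f"{count} CPU(s) using the {name} governor")
```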

This problem should be mostly fixed as of GameMode version 1.5 which adds an integrated GPU heuristic.  The heuristic detects when the integrated GPU is using significant power and puts the CPU back to using the powersave governor.  In the testing I’ve done, this pretty reliably chooses the powersave governor in the cases where the GPU is likely to be TDP limited.  The heuristic is dynamic so it will still use the performance governor if the CPU power usage way overpowers the GPU power usage such as when compiling shaders at a loading screen.

What do you need to do on your system?  First, check what version of GameMode you have installed on your system (if any).  If it’s version 1.4 or earlier and you intend to play games on an integrated GPU, I recommend either upgrading GameMode or disabling or uninstalling the GameMode daemon.
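If you want to script that check, the comparison itself is simple.  The sketch below takes whatever version string your package manager reports (I believe “gamemoded -v” also prints one, but treat that as an assumption) and compares it against 1.5; the helper names are hypothetical.

```python
import re

# First GameMode release with the integrated GPU heuristic, per the text above.
MIN_VERSION = (1, 5)


def parse_version(text):
    """Extract the first dotted version number, e.g. 'v1.5.1' -> (1, 5, 1)."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_igpu_heuristic(version_text):
    """True if the reported version is at least 1.5."""
    return parse_version(version_text) >= MIN_VERSION


if __name__ == "__main__":
    for reported in ("1.4", "v1.5", "gamemode version: v1.5.1"):
        print(f"{reported!r}: new enough = {has_igpu_heuristic(reported)}")
```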

Use thermald

In “power management 101” I talked about how sometimes OEMs will configure a laptop to 12W or 15W in BIOS and then re-configure it to 25W in software.  This is done via the “Intel Dynamic Platform and Thermal Framework” driver on Windows.  The DPTF driver manages your over-all system thermals and keeps the system within its thermal budget.  This is especially important for fanless or ultra-thin laptops where the cooling may not be sufficient for the system to run flat-out for long periods.  One thing the DPTF driver does is dynamically adjust the TDP of your CPU.  It can adjust it both up if the laptop is running cool and you need the power or down if the laptop is running hot and needs to cool down.  Some OEMs choose to be very conservative with their TDP defaults in BIOS to prevent the laptop from overheating or constantly running hot if the Windows DPTF driver is not available.

On Linux, the equivalent to this is thermald.  When installed and enabled on your system, it reads the same OEM configuration data from ACPI as the Windows DPTF driver and is also able to scale up your package TDP threshold past the BIOS default as per the OEM configuration.  You can also write your own configuration files if you really wish but you do so at your own risk.

Most distros package thermald but it may not be enabled or work quite properly out-of-the-box.  This is because, historically, it has relied on the closed-source dptfxtract utility that's provided by Intel as a binary.  It requires dptfxtract to fetch the OEM provided configuration data from the ACPI tables. Since most distros don't usually ship closed-source software in their main repositories and since thermald doesn't do much without that data, a lot of distros don't bother to ship or enable it by default.  You'll have to turn it on manually.

To fix this, install both thermald and dptfxtract and ensure that thermald is enabled.  On most distros, thermald is packaged normally even if it isn’t enabled by default because it is open-source.  The dptfxtract utility is usually available in your distro’s non-free repositories.  On Ubuntu, dptfxtract is available as a package in multiverse.  For Fedora, dptfxtract is available via RPM Fusion’s non-free repo.  There are also packages for Arch and likely others as well.  If no one packages it for your distro, it’s just one binary so it’s pretty easy to install manually.

Some of this may change going forward, however.  Recently, Matthew Garrett did some work to reverse-engineer the DPTF framework and provide support for fetching the DPTF data from ACPI without the need for the binary blob.  When running with a recent kernel and Matthew's fork of thermald, you should be able to get OEM-configured thermals without the need for the dptfxtract blob at least on some hardware.  Whether or not you get the right configuration will depend on your hardware, your kernel version, your distro, and whether they ship the Intel version of thermald or Matthew's fork.  Even there, your distro may leave it uninstalled or disabled by default.  It's still disabled by default in Fedora 33, for instance.

It should be noted at this point that, if thermald and dptfxtract are doing their job, your laptop is likely to start running much hotter when under heavy load than it did before.  This is because thermald is re-configuring your processor with a higher thermal budget which means it can now run faster but it will also generate more heat and may drain your battery faster.  In theory, thermald should keep your laptop’s thermals within safe limits; just not within the more conservative limits the OEM programmed into BIOS.  If all the additional heat makes you uncomfortable, you can just disable thermald and it should go back to the BIOS defaults.
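One way to see whether anything actually changed is to look at the package power limits the kernel reports before and after enabling thermald.  This sketch assumes the powercap zone for the CPU package is /sys/class/powercap/intel-rapl:0 and that it exposes the usual constraint_N_name and constraint_N_power_limit_uw files; which limits thermald adjusts, and through which interface, may differ on your hardware.

```python
from pathlib import Path

# Assumed location of the CPU package zone; numbering can vary by machine.
RAPL_PACKAGE = Path("/sys/class/powercap/intel-rapl:0")


def power_limits(zone=RAPL_PACKAGE):
    """Return {constraint name: limit in watts} for one powercap zone."""
    zone = Path(zone)
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        prefix = name_file.name[: -len("_name")]
        limit_file = zone / f"{prefix}_power_limit_uw"
        try:
            name = name_file.read_text().strip()
            limits[name] = int(limit_file.read_text()) / 1e6  # uW -> W
        except (OSError, ValueError):
            continue  # constraint is missing or unreadable
    return limits


if __name__ == "__main__":
    found = power_limits()
    if not found:
        print("no power limits found (is the intel_rapl driver loaded?)")
    for name, watts in found.items():
        print(f"{name}: {watts:.1f} W")
```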

Enable NVIDIA’s dynamic power-management

On my laptop (the late 2019 Razer Blade Stealth 13), the BIOS has the CPU configured to 35W out-of-the-box.  (Yes, 35W is higher than TDP-up and I’ve never seen it burn anything close to that much power; I have no idea why it’s configured that way.)  This means that we have no need for DPTF and the cooling is good enough that I don’t really need thermald on it either.  Instead, its power management problems come from the power balancing that the motherboard does between the CPU and the discrete NVIDIA GPU.

If the NVIDIA GPU is powered on at all, the motherboard configures the CPU to the TDP-down value of 12W.  I don’t know exactly how it’s doing this but it’s at a very deep firmware level that seems completely opaque to software.  To make matters worse, it doesn’t just restrict CPU power when the discrete GPU is doing real rendering; it restricts CPU power whenever the GPU is powered on at all.  In the default configuration with the NVIDIA proprietary drivers, that’s all the time.

Fortunately, if you know where to find it, there is a configuration option available in recent drivers for Turing and later GPUs which lets the NVIDIA driver completely power down the discrete GPU when it isn’t in use.  You can find this documented in Chapter 22 of the NVIDIA driver README.  The runtime power management feature is still beta as of the writing of this post and does come with some caveats such as that it doesn’t work if you have audio or USB controllers (for USB-C video) on your GPU.  Fortunately, with many laptops with a hybrid Intel+NVIDIA graphics solution, the discrete GPU exists only for render off-loading and doesn’t have any displays connected to it.  In that case, the audio and USB-C can be disabled and don’t cause any problems.  On my laptop, as soon as I properly enabled runtime power management in the NVIDIA driver, the motherboard stopped throttling my CPU and it started running at the full TDP-up of 25W.
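To check whether the discrete GPU really is powering down, you can look at the kernel's generic PCI runtime power management files for it.  The sketch below is not NVIDIA-specific tooling: it simply walks /sys/bus/pci/devices, picks out devices with NVIDIA's PCI vendor ID (0x10de), and prints their power/control and power/runtime_status values.  With runtime power management working, I'd expect to see “auto” and “suspended” when nothing is using the GPU.

```python
from pathlib import Path

PCI_DEVICES = Path("/sys/bus/pci/devices")
NVIDIA_VENDOR_ID = "0x10de"


def nvidia_runtime_pm(pci_root=PCI_DEVICES):
    """Return {PCI address: (power/control, power/runtime_status)}."""
    root = Path(pci_root)
    report = {}
    if not root.is_dir():
        return report
    for dev in sorted(root.iterdir()):
        try:
            if (dev / "vendor").read_text().strip() != NVIDIA_VENDOR_ID:
                continue
            control = (dev / "power" / "control").read_text().strip()
            status = (dev / "power" / "runtime_status").read_text().strip()
        except OSError:
            continue  # not a PCI device directory or attribute unreadable
        report[dev.name] = (control, status)
    return report


if __name__ == "__main__":
    devices = nvidia_runtime_pm()
    if not devices:
        print("no NVIDIA PCI devices found")
    for address, (control, status) in devices.items():
        print(f"{address}: control={control} runtime_status={status}")
```

If the GPU's audio or USB functions show up in the list as separate devices, they are the ones the caveat above refers to.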

I believe that nouveau has some capabilities for runtime power management.  However, I don’t know for sure how good they are and whether or not they’re able to completely power down the GPU.

Look for other things which might be limiting power

In this blog post, I've covered some of the things which I've personally seen limit GPU power when playing games and running benchmarks.  However, it is by no means an exhaustive list.  If there's one thing that's true about power management, it's that every machine is a bit different.  The biggest challenge with my laptop was the NVIDIA discrete GPU draining power.  On some other laptop, it may be something else.
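A quick way to tell whether something is holding your GPU back is to watch its clock while it's busy, much like I did with gputop at the start of this post.  This sketch assumes the i915 driver's gt_act_freq_mhz and gt_max_freq_mhz files under /sys/class/drm/cardN; newer kernels may expose the same information elsewhere, and the card number depends on your system.

```python
import time
from pathlib import Path

DRM_ROOT = Path("/sys/class/drm")


def read_mhz(path):
    return int(Path(path).read_text().strip())


def gpu_clocks(drm_root=DRM_ROOT):
    """Return {card: (actual MHz, max MHz)} for cards with i915-style files."""
    clocks = {}
    for card in sorted(Path(drm_root).glob("card[0-9]*")):
        actual = card / "gt_act_freq_mhz"
        maximum = card / "gt_max_freq_mhz"
        if actual.is_file() and maximum.is_file():
            clocks[card.name] = (read_mhz(actual), read_mhz(maximum))
    return clocks


if __name__ == "__main__":
    if not gpu_clocks():
        print("no i915-style frequency files found")
    else:
        for _ in range(5):  # a handful of samples, half a second apart
            for card, (actual, maximum) in gpu_clocks().items():
                print(f"{card}: {actual} / {maximum} MHz "
                      f"({100 * actual / maximum:.0f}%)")
            time.sleep(0.5)
```

If the GPU is fully loaded but the actual clock sits well below the maximum, power or thermals are the likely suspects.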

You can also look for background processes which may be using significant CPU cycles.  With a discrete GPU, a modest amount of background CPU work will often not hurt you unless the game is particularly CPU-hungry.  With an integrated GPU, however, it's far more likely that a background task such as a backup or software update will eat into the GPU's power budget.  Just this last week, a friend of mine was playing a game on Proton and discovered that the game launcher itself was burning enough power with the CPU to prevent the GPU from running at full power.  Once he suspended the game launcher, his GPU was able to run at full power.
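Tools like top or powertop will show you this, but if you want something you can drop into a script, the following sketch samples /proc twice and lists the processes that used the most CPU time in between.  It's Linux-only, the function names are mine, and CPU time is only a proxy for power; it won't tell you how many watts a process is actually burning.

```python
import os
import time
from pathlib import Path


def parse_stat(stat_text):
    """Return (comm, cpu_ticks) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    comm = stat_text[left + 1:right]
    fields = stat_text[right + 2:].split()
    # fields[0] is field 3 (state), so utime (14) and stime (15) are 11 and 12.
    return comm, int(fields[11]) + int(fields[12])


def snapshot(proc_root="/proc"):
    """Return {pid: (comm, cpu_ticks)} for every process we can read."""
    procs = {}
    for stat_file in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid = int(stat_file.parent.name)
            procs[pid] = parse_stat(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was not as expected
    return procs


def busiest(interval=1.0, count=5):
    """Top `count` (percent of one core, pid, comm) over `interval` seconds."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    usage = []
    for pid, (comm, ticks) in after.items():
        if pid in before:
            delta = ticks - before[pid][1]
            usage.append((100.0 * delta / ticks_per_second / interval, pid, comm))
    return sorted(usage, reverse=True)[:count]


if __name__ == "__main__":
    for percent, pid, comm in busiest():
        print(f"{percent:6.1f}%  {pid:>7}  {comm}")
```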

Especially with laptops, you're also likely to be affected by the computer's cooling system as was mentioned earlier.  Some laptops such as my Razer are designed with high-end cooling systems that let the laptop run at full power.  Others, particularly the ultra-thin laptops, are far more thermally limited and may never be able to hit the advertised TDP for extended periods of time.

Conclusion

When trying to get the most performance possible out of a laptop, RAM configuration and power management are key.  Unfortunately, due to the issues documented above (and possibly others), the out-of-the-box experience on Linux is not what it should be.  Hopefully, we’ll see this situation improve in the coming years but, for now, this post should give people the tools they need to configure their machines properly and get the full performance out of their hardware.
