2026/02/16

IBL Optimization Study III: Shared Memory and Latency Hiding

Last time, we looked at precomputing irradiance using spherical harmonics, and at how the math can be simplified compared to the expressions commonly used. This time, we're back to looking at performance captures in NSight to try to squeeze more time out of our GGX prefiltering, especially on lower-end hardware.

At the end of the last post, our execution times looked something like this:


           RTX 5080                        RTX 3050 Ti
           GGX filter (ms)   Total (ms)    GGX filter (ms)   Total (ms)
Initial    0.18              0.22          2.17              2.35


Given those numbers, I'm going to focus on optimizing for the 3050 Ti. The first thing I tried was using CopyTextureRegion to copy mip 0 instead of doing the copy manually in the shader. The hope was that the built-in implementation would hit some internal fast path and beat my manual copy. But this introduces a resource state transition barrier between mip 0 and mip 1, and it made no performance difference on either platform, so I'll revert this later.

Partial Unroll


NSight GPU Trace on NVidia 3050Ti

Looking at the perf captures in NSight, though, I noticed a few things. My occupancy could be better: I am getting about 32 warps out of the 48 per SM you can reach on Ampere cards. Memory throughput is actually pretty low, about 23%, so we're not saturating bandwidth. Yet 75% of the time the shader is waiting on texture samples. This suggests we're not doing a good job of hiding our latency.
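
To put the occupancy figure in perspective, here is a minimal C++ sketch of the arithmetic. The 48-warp limit and the 32-warp measurement are the numbers quoted above; the names `kMaxWarpsPerSM` and `occupancy` are made up for this illustration and do not come from any NVIDIA API.

```cpp
// Maximum resident warps per SM on Ampere consumer GPUs (figure quoted in the text).
constexpr int kMaxWarpsPerSM = 48;

// Fraction of an SM's warp slots that are actually occupied.
// Hypothetical helper, for illustration only.
constexpr double occupancy(int activeWarps, int maxWarps = kMaxWarpsPerSM) {
    return static_cast<double>(activeWarps) / maxWarps;
}

// 32 of 48 warps is two thirds of the theoretical maximum.
static_assert(occupancy(32) > 0.66 && occupancy(32) < 0.67,
              "32/48 should be roughly 67% occupancy");
```

So roughly a third of the warp slots sit empty, which leaves fewer warps for the scheduler to switch to while texture fetches are in flight.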

One thing we can do to hide that latency better is partial unrolling. Our shader is basically one huge loop: each iteration reads from a texture, waits for the result, then does some math on it. If we take two samples per iteration instead of one, the compiler has some room to move the texture reads around (for example, issuing both fetches before the math that consumes them) and schedule the code more efficiently. So I extracted the code inside the loop into a separate function, and the loop now looks like this: