2026/02/16

IBL Optimization Study III: Shared memory and Latency Hiding

Last time, we looked at precomputing irradiance using spherical harmonics and how the math can be simplified compared to the expressions commonly used. This time, we're back to looking at performance captures in Nsight to try to squeeze more time out of our GGX prefiltering, especially on lower-end hardware.

At the end of the last post, our execution times looked something like this:

                 RTX 5080                    RTX 3050 Ti
          GGX filter (ms)  Total (ms)   GGX filter (ms)  Total (ms)
Initial        0.18           0.22           2.17           2.35


Given those numbers, I'm going to focus on optimizing for the 3050. The first thing I tried was using CopyTextureRegion to copy mip 0 instead of doing it manually in the shader. The hope was that the built-in copy would hit some internal fast path and beat my manual one. But it introduces a resource state transition barrier between mip 0 and mip 1, and it made no measurable difference on either GPU, so I'll revert it later.

Partial Unroll


Nsight GPU Trace on an NVIDIA RTX 3050 Ti

Looking at the perf captures in Nsight, though, I noticed a few things. My occupancy could be better: I'm getting about 32 warps out of the 48 you can reach on Ampere cards. Memory throughput is actually pretty low, about 23%, so we're not saturating bandwidth. Yet the shader spends about 75% of its time waiting on texture samples. This suggests we're not doing a good job of hiding latency: there isn't enough independent work between each texture read and its first use.
One thing we can do to improve this is partial unrolling. Our shader is basically one big loop that reads from a texture at the start of each iteration, waits for the result, then does some math on it. If, instead of taking one sample per iteration, we take two, the compiler gets some room to move the texture reads around and schedule the code more efficiently. So I extracted the loop body into a separate function, and the loop now looks like this:
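A minimal sketch of what that two-wide loop looks like (the helper and variable names here are illustrative, not the actual shader):

```hlsl
// Sketch of the partial unroll. AccumulateSample() holds the old loop body:
// one texture fetch plus the weighting math that follows it.
float3 color = 0;
float totalWeight = 0;

// Two independent samples per iteration: the compiler can issue both
// texture reads before either result is needed, hiding part of the latency.
[loop]
for (uint i = 0; i < NUM_SAMPLES; i += 2)
{
    AccumulateSample(i + 0, N, roughness, color, totalWeight);
    AccumulateSample(i + 1, N, roughness, color, totalWeight);
}
```

The key point is that the two calls don't depend on each other, so their texture fetches can be in flight at the same time.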

2026/01/20

IBL Optimization Study II: Faster Irradiance

Today, we're picking up right where we left off last post. We are looking at building our irradiance map. As a refresher, the irradiance map is a low-resolution cubemap where each texel corresponds to the cosine-weighted integral of the incoming radiance over the hemisphere centered on that texel's direction. Given that description, a reasonable way to build the irradiance map would be to solve the integral with importance sampling: choose N samples from a cosine-weighted distribution and average them. And that, with N=1024 samples, is the approach chosen by the coding agent for writing our "initial implementation".
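For reference, that estimator looks roughly like this (helper names such as TangentToWorld and Random2D are hypothetical stand-ins, not the actual code). With a pdf of cos(theta)/PI, the cosine weight and the pdf cancel up to a constant factor of PI, which is why simply averaging the samples works:

```hlsl
// Map a uniform 2D random point onto the hemisphere with pdf(w) = cos(theta)/PI.
float3 SampleCosineHemisphere(float2 u)
{
    float r   = sqrt(u.x);
    float phi = 2.0 * PI * u.y;
    return float3(r * cos(phi), r * sin(phi), sqrt(1.0 - u.x));
}

// Irradiance = integral of L(w) * cos(theta) dw over the hemisphere.
// With the cosine pdf, cos(theta)/pdf is the constant PI, so the estimator
// is the average sampled radiance times PI (often folded into the
// diffuse BRDF's 1/PI).
float3 Irradiance(float3 normal)
{
    float3 sum = 0;
    for (uint i = 0; i < N; ++i)
    {
        float3 dir = TangentToWorld(SampleCosineHemisphere(Random2D(i)), normal);
        sum += EnvMap.SampleLevel(LinearSampler, dir, 0).rgb;
    }
    return PI * sum / N;
}
```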

If you recall from the previous post, though, this came with a couple of issues. First, we found significant aliasing artifacts in the cubemap. The following image is a capture of our irradiance cubemap, at its current resolution of 32x32 texels per face, where I've highlighted some of the areas that show these aliasing artifacts.

Initial irradiance cube at 32x32 texels per face. Aliasing artifacts highlighted in yellow.

We could mitigate this by increasing the number of samples, but that would increase the cost. And that is precisely the second issue: computing even this low-resolution irradiance map is already pretty expensive. To fix both issues, we are going to completely rewrite the algorithm and instead implement a version of Ramamoorthi's spherical harmonics irradiance. We're going to go one step further, though, and fully resolve some of the math analytically, not so much for the speed-up, but because it makes the implementation cleaner and simpler to understand.
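As a starting point, here is the classic 9-coefficient evaluation from Ramamoorthi and Hanrahan's paper, before any of the simplifications we'll develop. The constants come from the paper; L[0..8] are the radiance coefficients projected onto the SH basis, in the order L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22:

```hlsl
// Irradiance at normal n from 9 SH radiance coefficients
// (Ramamoorthi & Hanrahan, "An Efficient Representation for
// Irradiance Environment Maps").
static const float c1 = 0.429043, c2 = 0.511664, c3 = 0.743125,
                   c4 = 0.886227, c5 = 0.247708;

float3 IrradianceSH(float3 L[9], float3 n)
{
    return c1 * L[8] * (n.x * n.x - n.y * n.y)
         + c3 * L[6] * n.z * n.z
         + c4 * L[0]
         - c5 * L[6]
         + 2.0 * c1 * (L[4] * n.x * n.y + L[7] * n.x * n.z + L[5] * n.y * n.z)
         + 2.0 * c2 * (L[3] * n.x + L[1] * n.y + L[2] * n.z);
}
```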