2026/02/16

IBL Optimization Study III: Shared Memory and Latency Hiding

Last time, we looked at precomputing irradiance using spherical harmonics and how the math can be simplified compared to the expressions commonly used. This time, we're back to looking at performance captures in Nsight to try to squeeze more time out of our GGX prefiltering, especially on lower-end hardware.

At the end of the last post, our execution times looked something like this:


                    RTX 5080                      RTX 3050 Ti
                    GGX filter (ms)   Total (ms)  GGX filter (ms)   Total (ms)
Initial             0.18              0.22        2.17              2.35


Given those numbers, I'm going to focus on optimizing for the 3050. The first thing I tried was using CopyTextureRegion to copy mip 0 instead of doing it manually in the shader. The hope was that the built-in implementation might hit some internal fast path and beat my manual copy. But it introduces a resource state transition barrier between mip 0 and mip 1, and it made no performance difference on either GPU, so I'll revert it later.

Partial Unroll


Nsight GPU Trace on an NVIDIA 3050 Ti

Looking at the perf captures in Nsight, though, I noticed a few things. My occupancy could be better: I'm getting about 32 warps out of the 48 you can reach on Ampere cards. Memory throughput is actually pretty low, about 23%, so we're not saturating bandwidth. Yet 75% of the time the shader is waiting on texture samples. This suggests we're not doing a good job of hiding our latency.
One thing we can do to improve this is partial unrolling. Our shader is basically a huge loop where we read from a texture at the start, wait for it, then do some math. So if, instead of taking one sample per iteration, we take two, the compiler will have some room to move the texture reads around and schedule the code more efficiently. So I extracted the code inside the loop into a separate function, and the loop now looks like this:
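As a sketch (the function name `PrefilterSample` is my stand-in for the extracted loop body, not necessarily what's in the real shader), the 2x-unrolled loop looks something like this:

```hlsl
// Two independent samples per iteration: the compiler can issue both
// texture fetches before the math that consumes them, hiding latency.
float3 color  = 0;
float  weight = 0;

for (uint i = 0; i < SAMPLE_COUNT; i += 2)
{
    // PrefilterSample: evaluate sample i, fetch the environment map,
    // and accumulate the weighted color (the extracted loop body).
    PrefilterSample(i + 0, N, roughness, color, weight);
    PrefilterSample(i + 1, N, roughness, color, weight);
}
```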


This worked pretty well, reducing our prefiltering time on the 3050 by about 15%, to just shy of two milliseconds (~1.98 ms). Incidentally, it also reduced the register count from 53 to 41, so we have a better chance of improving our occupancy as well, which might help hide our latency even further. I tried unrolling more iterations, but the shader grows quickly and you end up paying for it in instruction fetches and extra registers, so 2x unrolling seems to be the sweet spot.

Sharing the math


However, while we are looking at shader code, I want to take a step back. We haven't really changed much here since our AI agent generated the code, and there might be opportunities to simply do less work. As it stands, the code follows the implementation proposed in Karis's paper fairly closely. For each sample, it does the following:
- Evaluate the Hammersley sequence to get a pair of nicely distributed random numbers
- Use those numbers to importance sample the GGX to get a tangent space half vector
- Transform that vector to world space
- Compute a reflection vector using the world space half vector
- Sample the texture
- Do some math to compute this sample's weight
- Accumulate.
This is perfectly fine, but Karis's paper intended this precomputation to run offline, so it wasn't necessarily optimized for speed, and copying the implementation naively isn't the best idea.
For example, every thread evaluates the same Hammersley sequence for every sample and gets the same results. And our texture reads can't start until we've done that math and sampled the GGX distribution just to know which reflection direction to fetch.
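Put together, the per-sample work in the original loop looks roughly like this (a sketch with assumed helper names in the style of Karis's paper):

```hlsl
for (uint i = 0; i < SAMPLE_COUNT; ++i)
{
    float2 Xi = Hammersley(i, SAMPLE_COUNT);        // low-discrepancy pair
    float3 Ht = ImportanceSampleGGX(Xi, roughness); // tangent-space half vector
    float3 H  = TangentToWorld(Ht, N);              // rotate into world space
    float3 L  = 2.0 * dot(V, H) * H - V;            // reflect V about H

    float NoL = saturate(dot(N, L));
    if (NoL > 0)
    {
        // Only now can the texture read begin: everything above sits
        // on the critical path of the fetch.
        color  += envMap.SampleLevel(linearSampler, L, lod).rgb * NoL;
        weight += NoL;
    }
}
```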

So we will move this math to the start of the shader and distribute it across the group. We have 64 samples, and 64 threads in each group. Each thread will compute one value in the Hammersley sequence and then store it in group shared memory. Then in every iteration of the loop, we will read our sequence value from LDS and immediately sample the GGX. You can see the details in this commit.
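The restructured version looks something like this (a sketch with hypothetical names, assuming a 64-thread group matching the 64 samples):

```hlsl
#define SAMPLE_COUNT 64

groupshared float2 gXi[SAMPLE_COUNT]; // one Hammersley pair per thread

// At the top of the shader, each thread computes one element of the
// sequence and shares it with the rest of the group.
gXi[groupIndex] = Hammersley(groupIndex, SAMPLE_COUNT);
GroupMemoryBarrierWithGroupSync();

for (uint i = 0; i < SAMPLE_COUNT; ++i)
{
    float2 Xi = gXi[i]; // cheap LDS read instead of recomputing
    float3 Ht = ImportanceSampleGGX(Xi, roughness);
    // ... rest of the loop unchanged
}
```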
That gives us another 5 to 7% improvement: down to 1.84ms on the 3050 and 0.17ms on the 5080 (on the 5080 that's close to the noise level, but the improvement is consistent across many runs).

Well, if that works, let's go one step further. We can also precompute the tangent space half vectors. In tangent space they are identical for all threads; they only diverge once each thread transforms them into its own world space frame.

This saves a considerable amount of redundant math. Previously, every thread was recomputing the same 64 GGX samples independently. Now the group computes the sample set collaboratively once and shares it through LDS, and each thread simply reuses those values.
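Extending the same LDS trick one level further (again a sketch, with the same assumed names):

```hlsl
groupshared float3 gHalf[SAMPLE_COUNT]; // tangent-space half vectors

// Each thread now also runs the GGX importance sampling once, so the
// whole tangent-space sample set is built collaboratively by the group.
float2 Xi = Hammersley(groupIndex, SAMPLE_COUNT);
gHalf[groupIndex] = ImportanceSampleGGX(Xi, roughness);
GroupMemoryBarrierWithGroupSync();

for (uint i = 0; i < SAMPLE_COUNT; ++i)
{
    // Only the per-thread world-space transform remains in the loop.
    float3 H = TangentToWorld(gHalf[i], N);
    // ...
}
```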

This brings our execution times down to 0.14ms on 5080 and 1.33ms on 3050!



                    RTX 5080                      RTX 3050 Ti
                    GGX filter (ms)   Total (ms)  GGX filter (ms)   Total (ms)
Initial             0.18              0.22        2.17              2.35
Partial unroll      0.18              0.22        1.98              2.10
Shared Hammersley   0.17              0.25        1.84              1.96
Shared GGX          0.14              0.17        1.33              1.46


We can go a bit further by realizing that even though we need a world space reflection direction, all the GGX math that computes the sampling LOD and the integration weight can be done in tangent space. This not only makes the math simpler (e.g. dot products with the normal become trivial, because the tangent space normal is always {0,0,1}), but also means we can precompute these values into shared memory along with our tangent vectors.
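For instance, with the tangent-space normal fixed at {0,0,1}, the per-sample math collapses considerably (a sketch; `D_GGX` is the usual normal distribution function, and the mip selection follows the common solid-angle heuristic — `cubeRes` is an assumed name for the mip 0 face resolution):

```hlsl
// dot(N, H) is just H.z when N == (0,0,1) in tangent space.
float NoH = Ht.z;
float D   = D_GGX(NoH, roughness);

// With V == N == R (Karis's prefiltering assumption), VoH == NoH and
// the pdf of the reflected direction simplifies to D / 4.
float pdf = D / 4.0;

// Solid-angle based mip selection: none of this depends on the thread,
// so it can be precomputed once per sample and stored in LDS too.
float saTexel  = 4.0 * PI / (6.0 * cubeRes * cubeRes);
float saSample = 1.0 / (SAMPLE_COUNT * pdf);
float lod      = 0.5 * log2(saSample / saTexel);
```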
Refactoring the math to tangent space drops our cost to 0.13ms on 5080 and 1.21ms on 3050. And precomputing the sampling LOD at the start of the shader further drops it to 0.10ms and 0.99ms respectively. We are finally under one millisecond on our low end 3050 laptop GPU!

Nsight Graphics GPU Trace capture on an NVIDIA 3050 Ti



                    RTX 5080                      RTX 3050 Ti
                    GGX filter (ms)   Total (ms)  GGX filter (ms)   Total (ms)
Shared GGX          0.14              0.17        1.33              1.46
Tangent math        0.12              0.15        1.21              1.35
Shared LOD          0.10              0.13        0.99              1.13


Dead ends

At this point, I tried moving some of the math to fp16 precision (to reduce register pressure) and increasing my group size to 8x16 (to allow for better warp scheduling). This increased occupancy to the full 48 warps but made no difference in performance. If anything, it might have made things slightly worse, but it's within the noise between runs.
I also tried moving the computation of vectors and LOD levels off the GPU entirely and doing it just once on the CPU, but it turns out the cost of reading the precomputed values from video memory is higher than just recomputing them in each group.
This is a classic GPU trade-off: recomputing values in registers or shared memory is often cheaper than reading them from global memory, even if you are actually doing more work. Always measure your "optimizations".
The shader is still stalled waiting for memory 90% of the time, so the best optimization left is probably just hiding the cost: running this on the async queue while something like our GBuffer pass runs on the main queue. Or we could take a look at how our samples are actually distributed to get a better sense of how many we really need; I suspect we could get away with a different number of samples per roughness level.

Overview

For now, though, I think 1ms on low end is pretty good, so we're going to leave it here. Overall, the biggest gains came from recognizing that much of the work we were doing per thread was actually identical across the whole group. By moving Hammersley generation, GGX sampling, and LOD computation into shared memory and doing the math in tangent space, we reduced redundant computation and improved latency hiding without increasing bandwidth pressure. Partial loop unrolling helped the compiler schedule texture reads more effectively, but beyond that point the shader became almost entirely memory-bound, with stalls dominated by texture fetch latency rather than arithmetic throughput. At that stage, further micro-optimizations provide diminishing returns, and higher-level strategies like async compute overlap or adaptive sample counts are likely to yield bigger wins.
