2025/12/29

IBL Optimization Study

This is the first post in a series dedicated to Image-Based Lighting, a very common technique used in modern video games to implement indirect lighting. My original idea was to use this problem as an excuse to explore the use of AI coding agents.

At first, the idea looked promising. I was able to blast through the implementation really quickly and get to a working prototype - which is, in itself, a good result. However, on closer inspection the implementation was actually quite bad, and it ran really slowly. It reminded me of the work a very junior engineer might produce, which gave me the idea that a better project could be to take this initial working implementation and guide you through my process of fixing and optimizing it, as one would in a real-world production environment.

The intern is gone, and you're left with his half-finished, promising project that you need to turn into something actually usable.

Without further ado, let's dive right in.

How Image-Based Lighting works

Image-Based Lighting (IBL for short) in modern games usually follows some variation of Brian Karis's split sum approximation technique. The basic idea is to capture your environment in a cubemap, apply some preprocessing to it, and then use it to apply indirect lighting to your scene.

Specular lighting is solved by splitting the lighting integral in two parts: a preintegrated BRDF in the form of a polynomial approximation or a look up table (LUT), and a series of prefiltered incoming radiance maps stored in a cubemap mip chain. The two are then combined with some math at shading time to generate the final per-pixel specular contribution (we'll get into more details later).
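As a concrete sketch of that combination step (CPU-side C++ standing in for shader code; the names are illustrative, not this project's actual shaders): the LUT provides a scale and bias to apply to F0, and the prefiltered cubemap provides the radiance term.

```cpp
#include <cassert>

struct Float3 { float x, y, z; };

// Split-sum combination at shading time (a sketch; names are illustrative).
//   prefiltered: radiance fetched from the prefiltered cubemap, at a mip
//                level chosen from the surface roughness.
//   lutScale/lutBias: the two channels fetched from the preintegrated BRDF
//                LUT at (NdotV, roughness).
//   F0: specular reflectance at normal incidence.
Float3 splitSumSpecular(Float3 prefiltered, float lutScale, float lutBias, Float3 F0)
{
    return { prefiltered.x * (F0.x * lutScale + lutBias),
             prefiltered.y * (F0.y * lutScale + lutBias),
             prefiltered.z * (F0.z * lutScale + lutBias) };
}
```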

Diffuse lighting is similarly achieved by precomputing the irradiance into a low resolution cubemap (convolving the environment map with a normalized cosine weighted function). At shading time, this irradiance is combined with the material's diffuse albedo to generate the final diffuse light contribution (basic implementations will just use this to compute a Lambert diffuse term, while more advanced versions might take into account the microfacet model, energy conservation, and other effects).
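The basic (Lambert-only) combination is then just a multiply. A minimal sketch, assuming the 1/pi normalization was already folded into the convolution (conventions vary between implementations):

```cpp
struct Float3 { float x, y, z; };

// Basic Lambert diffuse from the irradiance map (a sketch). Assumes the
// cosine-weighted convolution already folded in the 1/pi normalization,
// so shading reduces to a multiply with the material's diffuse albedo.
Float3 lambertDiffuse(Float3 irradiance, Float3 albedo)
{
    return { irradiance.x * albedo.x,
             irradiance.y * albedo.y,
             irradiance.z * albedo.z };
}
```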

For now, we're going to focus on the process of generating these prefiltered specular and irradiance maps and the preintegrated BRDF LUT for lighting. There are many variations of the algorithm, but the basic steps are as follows:

  1. Get a cubemap that captures the environment in all directions from a given point. This might be pre-rendered (either in your engine, or an external tool), or rendered dynamically (all at once or over multiple frames), or it might even be in the form of a 360° photograph like these, encoded in equirectangular coordinates.
  2. Generate a cubemap mip chain with simple box-filtering. We'll use it to speed up later steps and improve convergence.
  3. Do some filtering on this cubemap to generate a mip chain of pre-convolved incoming radiance for different roughnesses.
  4. Do some other filtering to convolve the initial cubemap with a cosine weighted function to generate the irradiance cubemap.
  5. Preintegrate your specular BRDF (in our case a simple Cook-Torrance GGX) into a look up table.

Then use all these precomputed maps to evaluate indirect lighting in your scene, possibly in conjunction with other techniques like screen space ambient occlusion, screen space reflections, or even some ray tracing.

Basic flow for image-based lighting


The initial implementation

I actually had a hard time choosing what I mean by "initial implementation" here. The process of implementing a working prototype with an AI agent was very much iterative. In my experience, it's not like you can give a prompt to the agent and it will generate a whole working implementation for you. Instead, the way it works best for me is to iterate with it until I get a very detailed implementation plan, divided into multiple phases, and further into small individual steps. Only when the granularity is fine enough that each step can be implemented without running out of context do I let the AI actually start writing code. And even then, there is a lot of back and forth, rewriting and reviewing.

However, for this project in particular, I refrained from going too deep into fixes before having the end-to-end pipeline at least running and producing something visual. So I am going to consider this commit as the "initial implementation", meaning the AI agent's output. From there on, all the changes are completely handmade, the result of my exploring the code more carefully and investigating performance with Nsight Graphics.

So what does it look like? Well, we have an EnvironmentProbe class that encapsulates all the work, and multiple compute shaders, one for each stage of the pipeline. EnvironmentProbe loads an HDR equirectangular environment map, uploads it to the GPU and dispatches all the compute shaders one after another to generate our IBL cubemaps as follows:

equirectToCubemap reads from the HDR environment map and generates a 512x512 cubemap from it.
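The core of that shader is mapping each cubemap texel's direction to equirectangular UVs. A CPU sketch of the standard mapping (the exact axis convention depends on how the source image was authored; this assumes +Y up and a unit-length direction - the actual shader may differ):

```cpp
#include <cmath>

struct UV { float u, v; };

// Direction -> equirectangular UV (a sketch; assumes a +Y-up convention and
// a unit-length input direction).
UV directionToEquirectUV(float x, float y, float z)
{
    const float kPi = 3.14159265358979f;
    float u = std::atan2(z, x) / (2.0f * kPi) + 0.5f; // longitude -> [0, 1]
    float v = std::asin(y) / kPi + 0.5f;              // latitude  -> [0, 1]
    return { u, v };
}
```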

generateCubemapMips generates the mipmap chain for this cubemap. We'll call this the "scratch" cubemap, since it's only needed as an intermediate product during filtering.
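Each mip generation step here is just a 2x2 box filter: every destination texel averages the four source texels it covers. A single-channel CPU sketch (real code does this per RGB channel, per cubemap face):

```cpp
#include <vector>
#include <cstddef>

// One step of a box-filtered mip chain (a sketch): each destination texel
// averages the 2x2 block of source texels it covers. Single channel for
// brevity; srcSize is assumed to be even.
std::vector<float> boxDownsample(const std::vector<float>& src, std::size_t srcSize)
{
    std::size_t dstSize = srcSize / 2;
    std::vector<float> dst(dstSize * dstSize);
    for (std::size_t y = 0; y < dstSize; ++y)
        for (std::size_t x = 0; x < dstSize; ++x)
        {
            float sum = src[(2 * y) * srcSize + 2 * x]
                      + src[(2 * y) * srcSize + 2 * x + 1]
                      + src[(2 * y + 1) * srcSize + 2 * x]
                      + src[(2 * y + 1) * srcSize + 2 * x + 1];
            dst[y * dstSize + x] = sum * 0.25f; // box-filter average
        }
    return dst;
}
```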

convolveIrradiance performs cosine weighted importance sampling (with 1024 samples per texel) to integrate irradiance into a 64x64 cubemap.
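The cosine-weighted sample generation at the heart of that pass can be sketched on the CPU. With a pdf of cos(theta)/pi, the cosine term of the irradiance integral cancels, so the estimator reduces to averaging the sampled radiance:

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

// Cosine-weighted hemisphere sample (a sketch) from two uniform random
// numbers u1, u2 in [0,1). Returns a direction in tangent space (+Z = normal),
// built by sampling a unit disk and projecting up to the hemisphere.
Float3 cosineSampleHemisphere(float u1, float u2)
{
    const float kPi = 3.14159265358979f;
    float r = std::sqrt(u1);          // radius on the unit disk
    float phi = 2.0f * kPi * u2;      // azimuth
    float x = r * std::cos(phi);
    float y = r * std::sin(phi);
    float z = std::sqrt(1.0f - u1);   // cos(theta); guarantees unit length
    return { x, y, z };
}
```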

prefilterSpecular also uses importance sampling, but in this case the sampling follows Karis's implementation (from the reference in the intro): 512x512 resolution at the top mip, also 1024 samples per texel.
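The key piece of Karis's approach is the GGX-distributed half-vector sampling. A CPU sketch of that mapping (with the a = roughness^2 remapping from the course notes):

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

// GGX half-vector sampling as in Karis's split-sum course notes (a sketch).
// Maps two uniform numbers (xi1, xi2) to a half vector in tangent space,
// distributed according to the GGX normal distribution for the given
// roughness (using the a = roughness^2 remapping).
Float3 importanceSampleGGX(float xi1, float xi2, float roughness)
{
    const float kPi = 3.14159265358979f;
    float a = roughness * roughness;
    float phi = 2.0f * kPi * xi1;
    float cosTheta = std::sqrt((1.0f - xi2) / (1.0f + (a * a - 1.0f) * xi2));
    float sinTheta = std::sqrt(1.0f - cosTheta * cosTheta);
    return { sinTheta * std::cos(phi), sinTheta * std::sin(phi), cosTheta };
}
```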

Finally, generateBRDFLUT generates a 64x64 look up table by importance sampling the GGX BRDF as in Karis's reference.
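For reference, a CPU sketch following the structure of Karis's IntegrateBRDF (this mirrors the course notes, not this project's exact shader): each LUT texel stores a scale and bias applied to F0 at shading time.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Van der Corput radical inverse, used for the Hammersley sample sequence.
static float radicalInverseVdC(std::uint32_t bits)
{
    bits = (bits << 16u) | (bits >> 16u);
    bits = ((bits & 0x55555555u) << 1u) | ((bits & 0xAAAAAAAAu) >> 1u);
    bits = ((bits & 0x33333333u) << 2u) | ((bits & 0xCCCCCCCCu) >> 2u);
    bits = ((bits & 0x0F0F0F0Fu) << 4u) | ((bits & 0xF0F0F0F0u) >> 4u);
    bits = ((bits & 0x00FF00FFu) << 8u) | ((bits & 0xFF00FF00u) >> 8u);
    return float(bits) * 2.3283064365386963e-10f; // / 2^32
}

// Smith geometry term with Karis's k = roughness^2 / 2 remapping for IBL.
static float gSmithIBL(float NoV, float NoL, float roughness)
{
    float k = roughness * roughness / 2.0f;
    float gv = NoV / (NoV * (1.0f - k) + k);
    float gl = NoL / (NoL * (1.0f - k) + k);
    return gv * gl;
}

// Preintegrated BRDF LUT entry (a sketch). Returns the (scale, bias) pair.
std::pair<float, float> integrateBRDF(float NoV, float roughness, int numSamples)
{
    const float kPi = 3.14159265358979f;
    // View vector in tangent space (+Z = normal); isotropic, so phi_v = 0.
    float Vx = std::sqrt(1.0f - NoV * NoV), Vz = NoV;
    float A = 0.0f, B = 0.0f;
    for (int i = 0; i < numSamples; ++i)
    {
        // Hammersley point
        float xi1 = float(i) / float(numSamples);
        float xi2 = radicalInverseVdC(std::uint32_t(i));
        // GGX-distributed half vector (a = roughness^2 remapping)
        float a = roughness * roughness;
        float phi = 2.0f * kPi * xi1;
        float cosTheta = std::sqrt((1.0f - xi2) / (1.0f + (a * a - 1.0f) * xi2));
        float sinTheta = std::sqrt(1.0f - cosTheta * cosTheta);
        float Hx = sinTheta * std::cos(phi);
        float Hy = sinTheta * std::sin(phi);
        float Hz = cosTheta;
        (void)Hy; // only L.z is needed below
        // Reflect V around H to get the light direction's z component
        float VoH = Vx * Hx + Vz * Hz;
        float NoL = 2.0f * VoH * Hz - Vz;
        if (NoL > 0.0f)
        {
            VoH = VoH > 0.0f ? VoH : 0.0f;
            float gVis = gSmithIBL(NoV, NoL, roughness) * VoH / (Hz * NoV);
            float Fc = std::pow(1.0f - VoH, 5.0f); // Schlick Fresnel weight
            A += (1.0f - Fc) * gVis;
            B += Fc * gVis;
        }
    }
    return { A / numSamples, B / numSamples };
}
```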

All the cubemaps are rgba16f textures (so, 64 bits per pixel), which is pretty standard for a quick implementation, but not really the best idea for production (more on that later). Overall, a reasonable implementation, although we can see some visual artifacts:



Performance and other issues

Performance-wise, things look pretty bad. I tested the implementation on two machines with NVIDIA GPUs:

  • A laptop with an NVIDIA GeForce RTX 3050 Ti with 4 GB of VRAM (I believe it is the 50W variant)
  • A desktop machine with an NVIDIA GeForce RTX 5080 with 16 GB of VRAM.

On the 5080, it takes over 9 milliseconds to run all of the above, while the laptop's 3050 Ti takes more than 70ms. Obviously, that's not going to fly for real-time applications. In a real production environment, we'd be lucky to have a 1ms budget for all this. Even on our low-spec hardware, things need to get a lot faster.
Here's a snippet of a GPU Trace capture taken with Nsight Graphics on the 5080:

Initial GPU trace on RTX 5080

Almost all of the time is spent in the specular prefilter pass, which is to be expected: this is the most complex pass, it takes a lot of samples, and it runs at the highest resolution. The good news is that there is a lot of time where the GPU is basically idle. Idle time usually indicates poor parallelization rather than fundamental algorithmic cost, so we might be able to improve things without even making deep changes.

Then there are the artifacts we mentioned above. It turns out both the specular and diffuse cubemaps have some issues. The prefiltered specular cubemap has a lot of noise in mip level 0:


This is probably a result of naively trying to sample the GGX distribution at really low roughness values. In practice, there is no need for this: mip 0 (which is also roughness 0) can just be trivially copied from the scratch cubemap. This is both faster and cleaner than trying to sample the distribution.

Finally, the irradiance map also suffers from aliasing, probably because the high contrast of the HDR map I'm using is too much for the 1024 samples used here. The solution will be more interesting, since we can't fix this one with a trivial copy.


Low hanging fruit

Since most time is spent on specular filtering, and since the main visible artifact is also there, that's where we're going to start fixing things. Adding a special case to our prefilterSpecular shader to directly copy mip 0 reduces the cost to about 5ms on the 5080 and about 34ms on the 3050, and it gets rid of the specular artifact.


That's about a 44% perf saving, and the GPU traces now look like this: still mostly specular work, and much of it really idle time.

RTX 5080 GPU Trace
RTX 3050 Ti GPU Trace


The main issue is that we fail to parallelize our work. Most of the time, the GPU is doing nearly nothing, just waiting for a few threads to do all the work. And here we encounter the first obvious design problem with the AI-generated code: it decided to loop over all cubemap faces inside the compute shader! This is a really terrible idea, since we could just be executing all six faces in parallel instead.

Let's change the code to use the z coordinate of our dispatch to indicate the cubemap face index and remove the loop. Instead of each thread processing one texel for each face, we now process a single texel per thread, for a single face dictated by SV_DispatchThreadID.z. That alone gets us to ~1.8ms on the 5080 and ~19ms on the 3050, with much better GPU utilization overall.
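The CPU side of this change is just the dispatch shape. A sketch, assuming 8x8 thread groups (the project's actual group size may differ):

```cpp
#include <cstdint>

// Dispatch dimensions for the per-face mapping (a sketch). Each thread now
// handles exactly one texel of one face; the face index comes from the z
// component of SV_DispatchThreadID, so the z dimension is 6 groups of 1.
struct Dispatch { std::uint32_t x, y, z; };

Dispatch faceParallelDispatch(std::uint32_t faceSize, std::uint32_t groupSize = 8)
{
    std::uint32_t groups = (faceSize + groupSize - 1) / groupSize; // ceil division
    return { groups, groups, 6 }; // one z slice per cubemap face
}
```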

GPU Trace on RTX 5080

GPU Trace on RTX 3050 Ti


We are still mostly ALU bound (purple in the throughput timelines above), meaning we are mostly busy doing math. Speeding this up would probably require deeper changes to the shader, but I want to stick with easy wins for now.

And while looking at these captures, I noticed that one of the main stall reasons in the shader is "Long Scoreboard", which basically means we're waiting for texture memory. Plus, in the 3050 capture, L2 cache access (light blue) seems to be a big factor too. All this indicates there might be some time to be won by reducing our memory accesses.
An easy way to do this is to make the texture we're reading from (the scratch cubemap) smaller, and we can make it smaller with minimal quality loss by switching it from its current 64 bits per pixel (bpp) format to a 32bpp format like r11g11b10. Using a smaller format means we can fit more texels in both the L1 and L2 caches, which will in turn reduce accesses to L2 and to main memory.
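To put a number on that, a quick footprint calculation for the full scratch chain (6 faces, all mips down to 1x1) shows the format change halves the data the filtering passes read through the caches:

```cpp
#include <cstdint>

// Total memory footprint of a full cubemap mip chain (a sketch):
// 6 faces, square mips from topSize down to 1x1.
std::uint64_t cubemapMipChainBytes(std::uint32_t topSize, std::uint32_t bytesPerTexel)
{
    std::uint64_t total = 0;
    for (std::uint32_t size = topSize; size >= 1; size /= 2)
        total += 6ull * size * size * bytesPerTexel;
    return total;
}
```

For the 512 cubemap, this comes out to roughly 16 MB at 8 bytes per texel (rgba16f) versus roughly 8 MB at 4 bytes per texel (r11g11b10).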
This change brings the specular filtering timings down to 1.7ms on the 5080 and 18ms on the 3050, and cleans up that L2 dependency:
RTX 5080 GPU Trace

RTX 3050 Ti GPU Trace

Barriers


Something else that caught my attention here is that, since our specular prefilter times have been going down, convolveIrradiance is starting to take a noticeable amount of time in the 5080 capture (about 0.28ms). Since 64x64 seems unnecessarily high resolution for this texture, I decided to cut it down to 32x32. It turns out this made absolutely no difference, probably because this pass is also terribly parallelized, so we effectively went from poorly utilizing the GPU to almost not utilizing it at all.

A better way to get rid of this problem is to interleave our irradiance convolution with some other work. Since these two passes are independent, I am going to move them to the end of our cubemap work and run them without any barriers in between (just a combined UAV barrier at the end of both). This saves about 0.1ms of overall processing time on our 5080 and 0.3ms on our 3050.

Speaking of barriers, those little dips we see in the specular filtering above are there because this implementation adds a barrier after computing each mip of specular prefiltering: at the end of each mip, a UAV barrier stalls the GPU until all writes are visible to later passes. This is unnecessary, since the mips are all independent of each other, and later mips won't read the output of previous ones. Replacing them with a single barrier outside the loop gets us roughly another 0.4ms on the 5080 and nearly 4ms on the 3050. This is mostly because, as a mip dispatch starts running out of work, the next dispatch can start early and fill in the rest of the GPU. The final tiny dispatches for small mips might even run entirely concurrently. The whole pass is now a monolithic block with very little idle time.

RTX 5080 GPU Trace


We could go even further by merging the irradiance, LUT and prefiltering stages together by removing the barrier at the end of specular prefiltering, but that would make it harder to investigate other improvements to the smaller passes, so we'll leave it till the end. For now, one more trivial change we can make is to not copy the equirectangular texture into the scratch cubemap every frame. This pass is only needed when we're loading the environment map from a file, and it would be completely skipped for dynamic cubemaps rendered per frame, so we only need to do it once at most. That brings our total execution time to ~1.56ms on the 5080 and 17.5ms on the 3050.

The big picture


There is not a lot more obviously wrong with the prefiltering implementation at this point. The shader code itself follows the reference implementation very directly (except for some heuristics about which mip level to sample from, which we can revisit in the future). It is also heavily ALU bound, so messing with memory and thread dispatch order probably won't help much.

The last thing to do for now is to re-evaluate whether we actually need 1024 samples for each pixel. This was hard-coded by the AI implementation, but I think it's highly unlikely that 1024 is the sweet spot for all three integration shaders. Playing around with different scenes and high-contrast environment maps, I settled on 64 samples. For dynamic cubemaps, which are the real case where you would want to do all this per frame, you usually don't have the sun embedded in the cubemap, so the contrast will generally be lower than that of an HDR environment map. Maybe a good solution could be to keep a high sample count for static loaded maps (maybe even higher than 64), and lower it even further for real time (I think Karis mentioned using about 16 in the original paper). But for now, 64 seems like a good compromise, and it brings the specular execution time to about 0.18ms on the 5080 and 2.17ms on the 3050, and the total execution time to 0.62ms and 3.59ms respectively. Interestingly, on the 5080, specular prefiltering is now faster than convolving irradiance.

RTX 5080 GPU Trace

RTX 3050 Ti GPU Trace


This is quite an improvement over the initial, AI-driven implementation. Both platforms now run easily an order of magnitude faster, with very simple changes.

Next time, we'll take a closer look at irradiance convolution, where we'll want to change the implementation much more deeply, and do some work on finishing our derivations.

