This is the first post in a series dedicated to Image-Based Lighting, a very common technique used in modern video games to implement indirect lighting. My original idea was to use this problem as an excuse to explore the use of AI coding agents.
At first, the idea looked promising. I was able to blast through the implementation really quickly and get to a working prototype - which is, in itself, a good result. However, on closer inspection the implementation was actually quite bad, and it ran really slowly. It reminded me of the work a very junior engineer might produce. That gave me the idea that a better project would be to take this initial working implementation and guide you through my process of fixing and optimizing it, as one would in a real-world production environment.
The intern is gone, and you're left with his half-finished, promising project that you need to turn into something actually usable.
Without further ado, let's dive right in.
How Image-Based Lighting works
Image-Based Lighting (IBL for short) in modern games usually follows some variation of Brian Karis's split sum approximation technique. The basic idea is to use a cubemap to capture your environment, apply some preprocessing to it, and then use it to apply indirect lighting to your scene.
Specular lighting is solved by splitting the lighting integral into two parts: a preintegrated BRDF in the form of a polynomial approximation or a look-up table (LUT), and a series of prefiltered incoming radiance maps stored in a cubemap mip chain. The two are then combined with some math at shading time to generate the final per-pixel specular contribution (we'll get into more details later).
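Written out, the split sum approximation looks roughly like this (this is the form from Karis's course notes; the exact weighting and normalization details vary a bit between implementations):

```latex
\int_{\Omega} L_i(\mathbf{l})\, f(\mathbf{l},\mathbf{v})\,(\mathbf{n}\cdot\mathbf{l})\, d\mathbf{l}
\;\approx\;
\underbrace{\left(\frac{1}{N}\sum_{k=1}^{N} L_i(\mathbf{l}_k)\right)}_{\text{prefiltered radiance (cubemap mips)}}
\cdot
\underbrace{\left(\frac{1}{N}\sum_{k=1}^{N} \frac{f(\mathbf{l}_k,\mathbf{v})\,(\mathbf{n}\cdot\mathbf{l}_k)}{p(\mathbf{l}_k,\mathbf{v})}\right)}_{\text{preintegrated BRDF (LUT)}}
```

For a given BRDF, the second factor only depends on roughness and the view angle, which is why it can be baked into a small 2D look-up table; at shading time it reduces to a scale and a bias applied to the Fresnel reflectance at normal incidence, F0.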
Diffuse lighting is similarly achieved by precomputing the irradiance into a low resolution cubemap (convolving the environment map with a normalized cosine weighted function). At shading time, this irradiance is combined with the material's diffuse albedo to generate the final diffuse light contribution (basic implementations will just use this to compute a Lambert diffuse term, while more advanced versions might take into account the microfacet model, energy conservation, and other effects).
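For completeness, the quantity stored in the irradiance map is the cosine-weighted integral of incoming radiance around each direction, and the Lambert term is then a single multiply at shading time (whether the 1/π factor is folded into the map or applied in the shader is just a convention):

```latex
E(\mathbf{n}) = \int_{\Omega} L_i(\mathbf{l})\,(\mathbf{n}\cdot\mathbf{l})\, d\mathbf{l}
\qquad\qquad
L_{\text{diffuse}}(\mathbf{n}) \approx \frac{\rho}{\pi}\, E(\mathbf{n})
```

where ρ is the material's diffuse albedo.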
For now, we're going to focus on the process of generating these prefiltered specular and irradiance maps and the preintegrated BRDF LUT for lighting. There are many variations of the algorithm, but the basic steps are as follows:
- Get a cubemap that captures the environment in all directions from a given point. This might be pre-rendered (either in your engine, or an external tool), or rendered dynamically (all at once or over multiple frames), or it might even be in the form of a 360° photograph like these, encoded in equirectangular coordinates.
- Generate a cubemap mip chain with simple box-filtering. We'll use it to speed up later steps and improve convergence.
- Do some filtering on this cubemap to generate a mip chain of pre-convolved incoming radiance for different roughnesses.
- Do some other filtering to convolve the initial cubemap with a cosine weighted function to generate the irradiance cubemap.
- Preintegrate your specular BRDF (in our case a simple Cook-Torrance GGX) into a look up table.
Then use all these precomputed maps to evaluate indirect lighting in your scene, possibly in conjunction with other techniques like screen space ambient occlusion, screen space reflections, or even some ray tracing.
| Basic flow for image-based lighting |
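To give a feel for how cheap the runtime side is once the maps exist, here is a minimal sketch of the shading-time evaluation. It's written in HLSL, all the resource and parameter names are made up for illustration, and details like the mip count, the Fresnel handling or where the 1/π lives will differ between engines:

```hlsl
// Minimal IBL evaluation sketch (illustrative names, not any particular engine's API).
TextureCube  gPrefilteredSpecular : register(t0); // GGX-prefiltered radiance, one roughness per mip
TextureCube  gIrradiance          : register(t1); // cosine-convolved irradiance
Texture2D    gBrdfLut             : register(t2); // preintegrated BRDF: x = scale, y = bias on F0
SamplerState gLinearSampler       : register(s0);

static const float gNumSpecularMips = 10.0; // mip count of a 512x512 prefiltered cubemap

float3 EvaluateIBL(float3 N, float3 V, float3 albedo, float3 F0, float roughness)
{
    float  NoV = saturate(dot(N, V));
    float3 R   = reflect(-V, N);

    // Specular: pick the prefiltered mip that matches the roughness, then apply the LUT.
    float  mip         = roughness * (gNumSpecularMips - 1.0);
    float3 prefiltered = gPrefilteredSpecular.SampleLevel(gLinearSampler, R, mip).rgb;
    float2 brdf        = gBrdfLut.SampleLevel(gLinearSampler, float2(NoV, roughness), 0).xy;
    float3 specular    = prefiltered * (F0 * brdf.x + brdf.y);

    // Diffuse: plain Lambert from the irradiance map (assuming the 1/PI is baked into the map).
    float3 diffuse = albedo * gIrradiance.SampleLevel(gLinearSampler, N, 0).rgb;

    return diffuse + specular;
}
```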
The initial implementation
I actually had a hard time deciding what I mean by "initial implementation" here. The process of implementing a working prototype with an AI agent was very much iterative. In my experience, it's not like you can give a prompt to the agent and it will generate a whole working implementation for you. Instead, the way it works best for me is to iterate with it until I get a very detailed implementation plan, divided into multiple phases, and further into small individual steps. Only when the granularity is fine enough that I can implement each step without running out of context do I let the AI actually start writing code. And even then, there is a lot of back and forth, rewriting and reviewing.
However, for this project in particular, I refrained from going too deep into fixes before having the end-to-end pipeline at least running and producing something visual. So I am going to consider this commit as the "Initial implementation", meaning the AI agent's output. From there on, all the changes are completely handmade, the result of my exploring the code more carefully and investigating performance with Nsight Graphics.
So what does it look like? Well, we have an EnvironmentProbe class that encapsulates all the work, and multiple compute shaders, one for each stage of the pipeline. EnvironmentProbe loads an HDR equirectangular environment map, uploads it to the GPU and dispatches all the compute shaders one after another to generate our IBL cubemaps as follows:
- equirectToCubemap reads from the HDR environment map and generates a 512x512 cubemap from it.
- generateCubemapMips generates the mipmap chain for this cubemap. We'll call this the "scratch" cubemap, since it's only needed as an intermediate product during filtering.
- convolveIrradiance performs cosine weighted importance sampling (with 1024 samples per texel) to integrate irradiance into a 64x64 cubemap.
- prefilterSpecular also implements importance sampling, but in this case the sampling follows Karis's implementation (from the reference in the intro); a sketch of what that looks like follows this list. 512x512 resolution at the top mip, also 1024 samples per texel.
- Finally, generateBRDFLUT generates a 64x64 look-up table by importance sampling the GGX BRDF, as in Karis's reference.
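For reference, here is roughly what that Karis-style prefiltering looks like. This is my own HLSL paraphrase of the reference, with illustrative resource names, not the code the agent generated:

```hlsl
// GGX prefiltering, paraphrased from Karis's course notes (illustrative, not the project's code).
static const float PI          = 3.14159265359;
static const uint  NUM_SAMPLES = 1024;

TextureCube  gEnvMap        : register(t0); // the "scratch" cubemap with its mip chain
SamplerState gLinearSampler : register(s0);

float RadicalInverse_VdC(uint bits)
{
    bits = (bits << 16u) | (bits >> 16u);
    bits = ((bits & 0x55555555u) << 1u) | ((bits & 0xAAAAAAAAu) >> 1u);
    bits = ((bits & 0x33333333u) << 2u) | ((bits & 0xCCCCCCCCu) >> 2u);
    bits = ((bits & 0x0F0F0F0Fu) << 4u) | ((bits & 0xF0F0F0F0u) >> 4u);
    bits = ((bits & 0x00FF00FFu) << 8u) | ((bits & 0xFF00FF00u) >> 8u);
    return float(bits) * 2.3283064365386963e-10; // 1 / 2^32
}

float2 Hammersley(uint i, uint n)
{
    return float2(float(i) / float(n), RadicalInverse_VdC(i));
}

// Sample a half vector around N, distributed according to the GGX normal distribution.
float3 ImportanceSampleGGX(float2 xi, float roughness, float3 N)
{
    float a        = roughness * roughness;
    float phi      = 2.0 * PI * xi.x;
    float cosTheta = sqrt((1.0 - xi.y) / (1.0 + (a * a - 1.0) * xi.y));
    float sinTheta = sqrt(1.0 - cosTheta * cosTheta);
    float3 h       = float3(sinTheta * cos(phi), sinTheta * sin(phi), cosTheta);

    // Build a tangent frame around N and transform the sample into it.
    float3 up       = abs(N.z) < 0.999 ? float3(0, 0, 1) : float3(1, 0, 0);
    float3 tangentX = normalize(cross(up, N));
    float3 tangentY = cross(N, tangentX);
    return tangentX * h.x + tangentY * h.y + N * h.z;
}

// Prefilter the environment along reflection direction R for a given roughness.
float3 PrefilterEnvMap(float roughness, float3 R)
{
    float3 N = R, V = R; // Karis's N = V = R simplification
    float3 color = 0.0;
    float  totalWeight = 0.0;

    for (uint i = 0; i < NUM_SAMPLES; ++i)
    {
        float3 H   = ImportanceSampleGGX(Hammersley(i, NUM_SAMPLES), roughness, N);
        float3 L   = 2.0 * dot(V, H) * H - V;
        float  NoL = saturate(dot(N, L));
        if (NoL > 0.0)
        {
            // The reference samples mip 0; the generated code picks a mip with a heuristic here.
            color       += gEnvMap.SampleLevel(gLinearSampler, L, 0).rgb * NoL;
            totalWeight += NoL;
        }
    }
    return color / max(totalWeight, 1e-4);
}
```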
All the cubemaps are rgba16f textures (so, 64 bits per pixel), which is pretty standard for a quick implementation, but not really the best idea for production (more on that later). Overall, a reasonable implementation, although we can see some visual artifacts:
Performance and other issues
Performance-wise, things look pretty bad. I tested the implementation on two machines with NVIDIA GPUs:
- A laptop with an NVIDIA GeForce RTX 3050 Ti with 4 GB of VRAM (I believe it is the 50W variant).
- A desktop machine with an NVIDIA GeForce RTX 5080 with 16 GB of VRAM.
| Initial GPU trace on RTX 5080 |
Almost all of the time is spent in the specular prefilter pass, which is to be expected. This is the most complex pass: it takes a lot of samples and runs at the highest resolution. The good news is that there is a lot of time where the GPU is basically idle. Idle time usually indicates poor parallelization rather than fundamental algorithmic cost, so we might be able to improve things without even making deep changes.
Then there are the artifacts we mentioned above. It turns out both the specular and diffuse cubemaps have some issues. The prefiltered specular cubemap has a lot of noise in mip level 0:
This is probably a result of naively trying to sample the GGX distribution at really low roughness values. In practice, there is no need for this: mip 0 (which is also roughness 0) can just be trivially copied from the scratch cubemap. This would be both faster and cleaner than trying to sample the distribution.
Finally, the irradiance map also suffers from aliasing, probably because the high contrast from the HDR map I'm using is too much for the 1024 samples used here. The solution here will be more interesting, since we can't solve this one with a trivial copy.
Low hanging fruit
Since most of the time is spent on specular filtering, and since the main visible artifact is also there, that's where we're going to start fixing things. Adding a special case to our prefilterSpecular shader to directly copy mip 0 reduces the cost to about 5 ms on the 5080 and about 34 ms on the 3050, and it gets rid of the specular artifact.
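The change itself is tiny. In the shader it boils down to something like this (an HLSL sketch with made-up bindings, not the actual code):

```hlsl
// Sketch of the mip 0 special case in prefilterSpecular (illustrative bindings).
Texture2DArray<float4>   gScratchFaces : register(t0); // scratch cubemap viewed as a 6-slice array
RWTexture2DArray<float4> gOutputFaces  : register(u0); // destination mip, also 6 slices

cbuffer PrefilterParams : register(b0)
{
    uint  gMipLevel;   // which output mip we're writing
    float gRoughness;  // roughness mapped to that mip
};

[numthreads(8, 8, 1)]
void prefilterSpecular(uint3 id : SV_DispatchThreadID)
{
    if (gMipLevel == 0)
    {
        // Roughness 0 is a perfect mirror: the prefiltered result is just the environment,
        // so copy the texels directly instead of importance sampling them.
        // (At this point the shader still walks all six faces per thread; more on that below.)
        for (uint face = 0; face < 6; ++face)
            gOutputFaces[uint3(id.xy, face)] = gScratchFaces.Load(int4(id.xy, face, 0));
        return;
    }

    // ... the usual GGX importance-sampling loop for the remaining mips ...
}
```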
That's about a 44% perf saving, and the GPU traces now look like this: still mostly specular work, and most of it is really idle time.
| RTX 5080 GPU Trace |
| RTX 3050 Ti GPU Trace |
The main issue is that we fail to parallelize our work. Most of the time, the GPU is doing nearly nothing, just waiting for a few threads to do all the work. Here we encounter the first obvious design problem with the AI-generated code: it decided to loop over all the cubemap faces inside the compute shader! This is a really terrible idea, since we could just be executing all six faces in parallel instead.
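One straightforward way to parallelize this (a sketch of the idea, not necessarily how the final code ends up structured) is to take the face index from the Z dimension of the dispatch grid: the host issues a single Dispatch with a thread-group count of 6 in Z, and the shader no longer loops. In HLSL, reusing the PrefilterEnvMap routine from the earlier sketch:

```hlsl
// One thread per output texel per face; all six faces are filtered in parallel.
// Illustrative sketch: names are made up, and PrefilterEnvMap is the routine shown earlier.
float3 PrefilterEnvMap(float roughness, float3 R); // defined in the earlier sketch

RWTexture2DArray<float4> gOutputFaces : register(u0);

cbuffer PrefilterParams : register(b0)
{
    float gRoughness; // roughness for the mip currently being written
};

// Map a texel of a cubemap face to its direction (one common face convention; adjust to your engine's).
float3 CubeTexelToDirection(uint face, uint2 texel, uint faceSize)
{
    float u = 2.0 * (texel.x + 0.5) / faceSize - 1.0;
    float v = 2.0 * (texel.y + 0.5) / faceSize - 1.0;

    float3 dir;
    switch (face)
    {
        case 0:  dir = float3( 1.0,   -v,   -u); break; // +X
        case 1:  dir = float3(-1.0,   -v,    u); break; // -X
        case 2:  dir = float3(   u,  1.0,    v); break; // +Y
        case 3:  dir = float3(   u, -1.0,   -v); break; // -Y
        case 4:  dir = float3(   u,   -v,  1.0); break; // +Z
        default: dir = float3(  -u,   -v, -1.0); break; // -Z
    }
    return normalize(dir);
}

[numthreads(8, 8, 1)]
void prefilterSpecular(uint3 id : SV_DispatchThreadID)
{
    uint2 texel = id.xy; // texel within one face
    uint  face  = id.z;  // 0..5: the host dispatches ThreadGroupCountZ = 6

    uint width, height, slices;
    gOutputFaces.GetDimensions(width, height, slices);

    float3 dir = CubeTexelToDirection(face, texel, width);
    gOutputFaces[uint3(texel, face)] = float4(PrefilterEnvMap(gRoughness, dir), 1.0);
}
```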
| GPU Trace on RTX 5080 |
| GPU Trace on RTX 3050 Ti |
We are still mostly ALU bound (purple in the throughput timelines above), meaning we are mostly busy doing math. Speeding this up would probably require deeper changes in the shader, but I want to stick with easy wins for now.
| RTX 3050 Ti GPU Trace |
Barriers
A better way to get rid of this idle time is to try to interleave our irradiance convolution with some other work. Since the irradiance convolution and the BRDF LUT generation are independent, I am going to move them both to the end of our cubemap work and run them without any barriers in between (just a combined UAV barrier at the end of both). This saves about 0.1 ms of overall processing time on our 5080 and 0.3 ms on our 3050.
| RTX 5080 GPU Trace |
We could go even further by merging the irradiance, LUT and prefiltering stages all together by removing the barrier at the end of specular prefiltering, but that would make it harder to investigate other improvements to the smaller passes, so we'll leave that till the end. For now, one more trivial change we can make is to not copy the equirectangular texture into the scratch cubemap every frame. This pass is only needed when we're loading the environment map from a file (it would be completely skipped for dynamic cubemaps rendered per frame), so we only need to do it once, at most. That brings our total execution time to ~1.56 ms on the 5080 and ~17.5 ms on the 3050.
The big picture
There is not a lot more that's obviously wrong in the prefiltering implementation at this point. The shader code itself follows the reference implementation very directly (except for some heuristics about which mip level to sample from, which we can revisit in the future). It is also heavily ALU bound, so messing with memory and thread dispatch order probably won't help much.
| RTX 5080 GPU Trace |
| RTX 3050 Ti GPU Trace |
This is quite an improvement over the initial, AI-driven implementation. Both platforms easily run an order of magnitude faster, with very simple changes.
Next time, we'll take a closer look at irradiance convolution, where we'll want to change the implementation much more deeply, and do some work on finishing our derivations.