Today, we're picking up right where we left off last post. We are looking at building our irradiance map. As a refresher, the irradiance map is a low resolution cubemap where each texel corresponds to the cosine weighted integral of the incoming radiance across the hemisphere centered on that texel's direction. Given that description, a reasonable way to build the irradiance map would be to solve the integral with importance sampling: choose N samples from a cosine weighted distribution, and average them. And that, with N=1024 samples, is the approach chosen by the coding agent for our "initial implementation".
If you recall from the previous post, though, this came with a couple of issues. First, we found significant aliasing artifacts in the cubemap. The following image is a capture of our irradiance cubemap, at its current resolution of 32x32 texels per face, where I've highlighted some of the areas that show these aliasing artifacts.
| Initial irradiance cube at 32x32 texels per face. Aliasing artifacts highlighted in yellow. |
We could mitigate this by increasing the number of samples, but that would also increase the cost. And that is precisely the second issue: computing this low resolution irradiance map is already pretty expensive. To fix both issues, we are going to completely rewrite the algorithm, and instead implement a version of Ramamoorthi's spherical harmonics irradiance. However, we're going to go one step further and fully resolve some of the math analytically, not so much for the speedup, but because it will make the implementation cleaner and simpler to understand.
Improving the baseline
Before we rewrite everything, we're going to do some quick clean up on the initial implementation to have a fair reference to compare against.
The first change is to stop generating our BRDF look up table every frame. This look up table has no dependency whatsoever on the contents of the cubemap itself, so there's no need to generate it more than once. This also helps with profiling: since there is no barrier between convolving irradiance and generating the LUT, both passes execute simultaneously, which made it harder to measure the cost of irradiance on its own.
Second, the initial implementation of irradiance makes the same mistake the specular filter made regarding cubemap faces: instead of launching separate threads to process texels on each of the cubemap's faces, it loops over all 6 faces within each thread of the dispatch. So once again, I'm removing this loop and instead using the dispatch's z coordinate to select the face.
Finally, for the sake of completeness, I'm changing the cubemap's texture format to the 32 bits per pixel (bpp) format r11g11b10. This really doesn't have any effect on performance or memory for this pass, but there's no reason to use a 64 bpp format here, and every time we read from this texture during the rest of the frame we'd be paying unnecessary cost.
This table shows the effect of all those changes on performance, on both our low end (RTX 3050 Ti) and high end (RTX 5080) GPUs.
| | RTX 5080 Irradiance (ms) | RTX 5080 Total (ms) | RTX 3050 Ti Irradiance (ms) | RTX 3050 Ti Total (ms) |
| --- | --- | --- | --- | --- |
| Initial | 0.28 | 0.62 | 0.9 | 4.54 |
| No BRDF LUT | 0.29 | 0.51 | 0.9 | 3.65 |
| No face loop | 0.05 | 0.27 | 0.16 | 2.85 |
| R11g11b10 | 0.05 | 0.27 | 0.16 | 2.85 |
Enter spherical harmonics
In their 2001 paper, Ramamoorthi and Hanrahan make the observation that irradiance is a very smooth function of the incoming direction, and as such it tends to be well represented by low degree spherical harmonics (SH). They then go on to precompute the first few numerical SH coefficients, which a lot of people just copy-paste into their implementations blindly. As an example, take a look at Unreal Engine's code and search for one of these magic constants, e.g. 0.282095f. As of today, I find it hardcoded 84 times across the engine. There are 5 different such coefficients, so that's a lot of magic numbers with little to no explanation of where they come from. We can do better.
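Those constants aren't arbitrary, by the way: each one is the closed form normalization factor of a real SH basis function. A quick Python check (my own sketch, not engine code):

```python
import math

# The five magic SH constants for bands 0-2, and the closed forms
# they evaluate. The first one, 1/(2*sqrt(pi)), is the infamous 0.282095f.
magic = {
    0.282095: 0.5 * math.sqrt(1.0 / math.pi),    # band 0
    0.488603: 0.5 * math.sqrt(3.0 / math.pi),    # band 1: y, z, x terms
    1.092548: 0.5 * math.sqrt(15.0 / math.pi),   # band 2: xy, yz, xz terms
    0.315392: 0.25 * math.sqrt(5.0 / math.pi),   # band 2: 3z^2 - 1 term
    0.546274: 0.25 * math.sqrt(15.0 / math.pi),  # band 2: x^2 - y^2 term
}

for hardcoded, exact in magic.items():
    assert abs(hardcoded - exact) < 1e-6
```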
After providing the magic constants for everyone, the paper details a two step algorithm for efficiently computing your irradiance: "prefiltering" your incoming radiance into SH, then "rendering" this SH to extract irradiance per direction. If you get your math right, you get the cosine weighted convolution for free, because that's a nice property of SH. It is also great for performance: instead of each texel of irradiance needing its own set of samples to solve its own integral from scratch, we can put all our threads to work together on computing a single SH function, and then just evaluate (or "render") this SH once per output texel.
The step I think is usually missed here is that if you look at the math for both steps together, you'll realize that a lot of it simplifies away. The magic constants disappear, and you are left with a much more intuitive final result. Let's see how it works.
First, let’s take a look at what we actually want to compute. Irradiance (E) is defined as the cosine weighted integral of incoming light (L) over the hemisphere around the normal direction (n):
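Written out (reconstructing the equation, with the clamp written as max(0, ·)):

$$E(n) = \frac{1}{\pi} \int_{\Omega(n)} L(\omega)\, \max(0,\, n \cdot \omega)\; d\omega$$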
Note that the 1/π factor is important for energy conservation, although it was omitted in the paper. Like we said, we can follow our two step process to first prefilter (or project) this function into SH coefficients, and then reconstruct an approximation of irradiance from those coefficients. The projection step is an integral we will need to solve numerically, because it involves the light incoming from all directions:
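Projecting E itself onto the SH basis Y gives (my reconstruction, in the same notation):

$$E_{lm} = \int_{S^2} E(n)\, Y_{lm}(n)\; dn = \frac{1}{\pi} \int_{S^2} Y_{lm}(n) \int_{\Omega(n)} L(\omega)\, \max(0,\, n \cdot \omega)\; d\omega \; dn$$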
That expression seems very computationally expensive. It is a nested integral, first over the hemisphere and then over the whole sphere, so it would seem we need to do numerical integration over 4 dimensions. However, as long as the convolving function has radial symmetry (which our clamped cosine does), a nice property of SH allows us to express the convolution of two functions as the product of the functions' SH projections, scaled by a constant factor per band. So we define our convolution function A(ω) as follows:
Where I am using <> to express the dot product clamped to [0,1]. And now we can express our irradiance approximation as:
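A sketch of that expression in my notation, with L_lm the SH projection of the incoming radiance and Â_l the per band scale factor from the convolution property:

$$E(n) \approx \frac{1}{\pi} \sum_{l=0}^{2} \sum_{m=-l}^{l} \hat{A}_l\, L_{lm}\, Y_{lm}(n), \qquad \hat{A}_l = \sqrt{\frac{4\pi}{2l+1}} \int_{S^2} A(\omega)\, Y_{l0}(\omega)\; d\omega$$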
We have now moved from a 4d integral to the product of two independent 2d integrals, one of which has a closed form solution. This is much more efficient. To solve this properly, we just need one more step: defining our SH basis coefficients.
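As a sanity check on that closed form, here's a small numerical integration (my own Python sketch; the per band factors for the clamped cosine famously come out to π, 2π/3 and π/4 in Ramamoorthi's paper):

```python
import numpy as np

# Zonal SH bases Y_l0 as functions of cos(theta)
Y_l0 = [
    lambda c: 0.282095 * np.ones_like(c),
    lambda c: 0.488603 * c,
    lambda c: 0.315392 * (3.0 * c * c - 1.0),
]

# Integrate the clamped cosine against each zonal basis over the sphere.
# The clamp means the integrand vanishes beyond theta = pi/2.
theta = np.linspace(0.0, np.pi / 2.0, 200001)
dtheta = theta[1] - theta[0]
c, s = np.cos(theta), np.sin(theta)

A_hat = []
for l, Y in enumerate(Y_l0):
    A_l = np.sum(2.0 * np.pi * c * Y(c) * s) * dtheta      # ∫ cosθ · Y_l0 dω
    A_hat.append(np.sqrt(4.0 * np.pi / (2 * l + 1)) * A_l)  # per band scale

print(A_hat)  # ≈ [pi, 2*pi/3, pi/4]
```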
SH Basis
Unlike Ramamoorthi, Sloan's famous SH tricks paper gives us a polynomial form for our SH basis as a function of direction, ω = (x, y, z). For bands 0, 1 and 2, they look like this:
That's a lot of numbers, but bear with me. I am going to define each of these bases as the product of a constant part Qlm, which doesn't depend on the direction, and a polynomial part Plm(ω), which does. Incidentally, Q corresponds to the precomputed constants from Ramamoorthi.
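To make the split concrete, here's how I'd lay out Q and P in Python (a sketch; I'm ignoring sign conventions, which vary between sources and cancel later anyway once Q gets squared):

```python
import numpy as np

# Constant parts Q_lm for the 9 bases of bands 0-2
Q = np.array([
    0.282095,                      # l=0
    0.488603, 0.488603, 0.488603,  # l=1
    1.092548, 1.092548, 0.315392, 1.092548, 0.546274,  # l=2
])

def P(w):
    """Polynomial parts P_lm(w) for unit directions w of shape (..., 3)."""
    x, y, z = w[..., 0], w[..., 1], w[..., 2]
    return np.stack([np.ones_like(x), y, z, x,
                     x * y, y * z, 3.0 * z * z - 1.0,
                     x * z, x * x - y * y], axis=-1)

# Sanity check: Y = Q * P should be an orthonormal basis over the sphere.
# Midpoint quadrature on a lat-long grid is plenty accurate for this.
N = 200
theta = (np.arange(N) + 0.5) * np.pi / N
phi = (np.arange(2 * N) + 0.5) * np.pi / N
t, p = np.meshgrid(theta, phi, indexing="ij")
w = np.stack([np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)], axis=-1)
Y = Q * P(w)                                   # (N, 2N, 9)
wgt = (np.sin(t) * (np.pi / N) ** 2)[..., None]  # sinθ dθ dφ
gram = np.einsum("tpi,tpj->ij", Y, Y * wgt)    # ∫ Y_i Y_j dω
assert np.allclose(gram, np.eye(9), atol=1e-3)
```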
Solving Coefficients Analytically
We can now express A as follows (see Appendix A for derivation details):
Neat. The integral is solved, and the A coefficients are now simply scaled versions of our SH basis constants. We can similarly express our L projection as
Plugging A and L back into our convolution expression for E, we get
We can solve this integral stochastically, which can be expressed like this
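Sketching the estimator (uniform directions ω_i over the sphere):

$$L_{lm} \approx \frac{4\pi}{N} \sum_{i=1}^{N} L(\omega_i)\, Y_{lm}(\omega_i)$$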
Where the 4π factor is the inverse of the pdf of sampling uniformly over the surface of the unit sphere (the pdf being 1/4π). With that, our final irradiance reconstruction becomes:
Once again, we can split the constant part of that equation into a new coefficient named B, and solve B for each of our bands
These resolve neatly for each band to the following (go to Appendix B for detailed derivations of all coefficients):
Not only have we expressed all of our coefficients without a single irrational number; all of our coefficients can now be represented exactly in float32 precision. Notice as well that the coefficient for band 0 is just 1, which means the band 0 SH is exactly the average of all incoming light, as we would expect for energy conservation. An interesting result here is that the only "magic numbers" in bands 0 and 1 are 1 and 2, respectively. This means that for working with 1st degree SH, as is common in many denoisers, lightmaps and irradiance probe implementations, you really don't need any irrational constants. Band 0 is the average, and band 1 is a simple dot product with your direction, scaled by 2. SH1 is that simple.
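We can sanity check both claims numerically. I'm assuming the combination B_lm = 4·Â_l·Q_lm² from my own reading of the derivation, so treat the formula (not the resulting values) as a sketch:

```python
import math
import struct

PI = math.pi
A_hat = [PI, 2.0 * PI / 3.0, PI / 4.0]  # per band convolution factors

# Squared SH constants Q_lm^2, written as (band, exact fraction of 1/pi)
Q2 = {
    "band0":      (0, 1.0 / (4.0 * PI)),
    "band1":      (1, 3.0 / (4.0 * PI)),
    "band2_xy":   (2, 15.0 / (4.0 * PI)),
    "band2_z2":   (2, 5.0 / (16.0 * PI)),
    "band2_x2y2": (2, 15.0 / (16.0 * PI)),
}

# B_lm = 4 * A_hat_l * Q_lm^2: the pi's cancel and only small
# dyadic rationals survive
expected = {"band0": 1.0, "band1": 2.0,
            "band2_xy": 15.0 / 4.0, "band2_z2": 5.0 / 16.0,
            "band2_x2y2": 15.0 / 16.0}

for name, (l, q2) in Q2.items():
    b = 4.0 * A_hat[l] * q2
    assert abs(b - expected[name]) < 1e-12
    # dyadic rationals round-trip exactly through float32
    f32 = struct.unpack("f", struct.pack("f", expected[name]))[0]
    assert f32 == expected[name]
```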
Putting B back into our Irradiance expression:
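In full (my reconstruction of the final equation, using the Q·P split from the SH basis section):

$$E(n) \approx \frac{1}{N} \sum_{i=1}^{N} L(\omega_i) \sum_{l,m} B_{lm}\, P_{lm}(\omega_i)\, P_{lm}(n)$$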
Irradiance Compute Shader
Implementing that equation is the job of our compute shader. Everything after the 1/N sum is our numerical integral, which we solve during the projection step. Blm are the constants we just calculated, and finally multiplying by Plm is our reconstruction step.
We will run our shader in a single dispatch of 6 groups (one per face) of 1024 threads each, to keep the same number of samples as the baseline for now.
Each group will write to a separate face of the cubemap, but they all compute their individual SH projection from the same input directions across the whole sphere, so the faces are consistent with each other. For that, we define a struct to store our SH coefficients in groupshared memory
Notice how each coefficient is a float3, since each color channel gets its own set of SH coefficients. Then we define our projection and reconstruction functions
Our projection function needs no hardcoded constants, other than the polynomial for the sixth coefficient, and our reconstruction function (evaluateSH2) uses our neatly computed B coefficients. We are ready to implement the shader's main function.
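The actual shader is HLSL, but the math is small enough to sketch on the CPU. Here's an illustrative Python version of the same projection/reconstruction pair (the names, B layout and sampling scheme here are my own, not the shader's). Feeding it constant radiance returns that same constant, which is the energy conservation property we derived for B:

```python
import numpy as np

# B coefficients per SH2 slot, matching the polynomial layout
# P(w) = [1, y, z, x, xy, yz, 3z^2-1, xz, x^2-y^2]
B = np.array([1.0, 2.0, 2.0, 2.0,
              15.0 / 4.0, 15.0 / 4.0, 5.0 / 16.0, 15.0 / 4.0, 15.0 / 16.0])

def P(w):
    """Polynomial SH basis parts for unit directions w of shape (..., 3)."""
    x, y, z = w[..., 0], w[..., 1], w[..., 2]
    return np.stack([np.ones_like(x), y, z, x,
                     x * y, y * z, 3.0 * z * z - 1.0,
                     x * z, x * x - y * y], axis=-1)

def project_sh(dirs, radiance):
    # "Prefilter": the 1/N sum of radiance-weighted basis polynomials
    return np.mean(radiance[:, None] * P(dirs), axis=0)

def evaluate_sh2(sh, n):
    # "Render": reconstruct irradiance for output direction n
    return float(np.dot(B * sh, P(n)))

def fib_sphere(n):
    """Fibonacci spiral: a cheap, well distributed uniform sample set."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)

dirs = fib_sphere(1024)
sh = project_sh(dirs, np.full(1024, 0.75))  # constant incoming radiance
print(evaluate_sh2(sh, np.array([0.0, 0.0, 1.0])))  # ≈ 0.75
```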
Lines 225 to 227 sample incoming radiance from mip 5 of the scratch cubemap. Mip 5 is 16x16, for a total of 256 texels per face; times 6 faces, that's 1536 texels, on the order of our 1024 samples, which gives us good coverage of the whole cubemap.
Line 229 runs our projection step.
Lines 232 to 245 use wave intrinsics and shared memory to average our SH across all threads. This is the key to the new shader's performance: each thread in the group contributes a single sample to the shared integral, and they all run in parallel at once.
Finally, line 253 runs our reconstruction per texel of the output cubemap, and line 255 writes the final irradiance to each texel of the group's face.
This final version of the shader runs at the following performance on our two test platforms
| | RTX 5080 Irradiance (ms) | RTX 5080 Total (ms) | RTX 3050 Ti Irradiance (ms) | RTX 3050 Ti Total (ms) |
| --- | --- | --- | --- | --- |
| SH Irradiance | 0.01 | 0.22 | 0.05 | 2.35 |
| NSight Graphics GPU Trace on RTX 5080 |
At 0.01 ms this is near the edge of the precision I can measure, and it is even faster than mipmap generation, so we'll stop here for now and move on to other bottlenecks. In the next post, we'll take a look at speeding up mipmap generation and maybe creating scalable quality presets for different platforms.
Finally, here's the SH derived cubemap for that same environment map
Appendix A: Derivation of Coefficients A
Band 1:
Appendix B: Derivation of Coefficients B
This neatly reflects that our SH band 0 is just the average of all incoming radiance, as we expect from energy conservation.
Notice that since Q1m is squared in that expression, the sign goes away and the constant becomes the same for all three coefficients in this band.
This is a really interesting result, because it means that for any work we do with 1st degree SH (as is common, for example, in diffuse irradiance probes and diffuse lighting denoisers) the only "magic" numbers we need are 1 and 2. Finally, for band 2: for m=-2,-1,1, the sign of Q goes away with the square and all three expressions resolve to the same constant.
And finally m=2 for the last coefficient of band 2: