2026/02/16

IBL Optimization Study III: Shared Memory and Latency Hiding

Last time, we looked at precomputing irradiance using spherical harmonics, and at how the math can be simplified compared to the expressions commonly used. This time, we're back to studying performance captures in Nsight to try to shave more time off our GGX prefiltering, especially on lower-end hardware.

At the end of the last post, our execution times looked something like this:


         RTX 5080                       RTX 3050 Ti
         GGX filter (ms)  Total (ms)    GGX filter (ms)  Total (ms)
Initial  0.18             0.22          2.17             2.35


Given those numbers, I'm going to focus on optimizing for the 3050 Ti. The first thing I tried was to use CopyTextureRegion to copy mip 0, instead of doing it manually in the shader. The hope was that the built-in implementation would hit some internal fast path and be faster than my manual copy. But this introduces a resource state transition barrier between mip 0 and mip 1, and it made no measurable performance difference on either GPU, so I'll revert it later.
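For reference, the experiment looked roughly like this (a sketch only: cmdList, the cube resources and the mip counts are illustrative stand-ins, and D3D12CalcSubresource comes from the d3dx12 helpers):

// Hypothetical sketch of copying mip 0 of each cubemap face with
// CopyTextureRegion. environmentCube, prefilteredCube, srcMipCount and
// dstMipCount are placeholders, not the actual codebase.
for (UINT face = 0; face < 6; ++face)
{
    D3D12_TEXTURE_COPY_LOCATION dst = {};
    dst.pResource        = prefilteredCube;
    dst.Type             = D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX;
    dst.SubresourceIndex = D3D12CalcSubresource(0, face, 0, dstMipCount, 6);

    D3D12_TEXTURE_COPY_LOCATION src = {};
    src.pResource        = environmentCube;
    src.Type             = D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX;
    src.SubresourceIndex = D3D12CalcSubresource(0, face, 0, srcMipCount, 6);

    cmdList->CopyTextureRegion(&dst, 0, 0, 0, &src, nullptr);
}
// Mip 0 must be in COPY_DEST state here, while mips 1+ get written as UAVs
// by the filter shader. That mismatch is the extra transition barrier
// mentioned above.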

Partial Unroll


Nsight GPU Trace on an NVIDIA RTX 3050 Ti

Looking at the perf captures in Nsight, though, I noticed a few things. My occupancy could be better: I'm getting about 32 warps out of the 48 you can reach on Ampere cards. Memory throughput is actually pretty low, about 23%, so we're not saturating bandwidth. Yet the shader spends 75% of its time waiting on texture samples. This suggests we're not doing a good job of hiding our latency.
One thing we can do to improve this is partial unrolling. Our shader is basically a huge loop where we read from a texture at the start, wait for it, then do some math. So if, instead of taking one sample per iteration, we take two, the compiler gets some room to move the texture reads around and schedule the code more efficiently. I extracted the code inside the loop into a separate function, and the loop now looks something like this:
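// (HLSL sketch: accumulateSample and kNumSamples stand in for the
// shader's actual names)
for(uint i = 0; i < kNumSamples; i += 2)
{
 // Two independent samples per iteration let the compiler issue both
 // texture fetches up front, overlapping their latency with the math
 // that consumes them
 accumulateSample(i + 0, N, roughness, color, totalWeight);
 accumulateSample(i + 1, N, roughness, color, totalWeight);
}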

2026/01/20

IBL Optimization Study II: Faster Irradiance

Today, we're picking up right where we left off last post. We're looking at building our irradiance map. As a refresher, the irradiance map is a low-resolution cubemap where each texel corresponds to the cosine-weighted integral of the incoming radiance across the hemisphere centered on that texel's direction. Given that description, a reasonable way to build the irradiance map would be to solve the integral with importance sampling: choose N samples from a cosine-weighted distribution, and average them. And that, with N=1024 samples, is the approach chosen by the coding agent when writing our "initial implementation".
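In code, that estimator boils down to something like the following CPU-side C++ sketch (sampleRadiance() and the small math helpers are stand-ins for the project's actual cubemap fetch and vector types):

#include <cmath>
#include <random>

struct Vec3 { float x, y, z; };
static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x+b.x, a.y+b.y, a.z+b.z}; }
static Vec3 operator*(Vec3 a, float s) { return {a.x*s, a.y*s, a.z*s}; }

// Placeholder: the real version samples the source cubemap along dir
static Vec3 sampleRadiance(Vec3 /*dir*/) { return {1.f, 1.f, 1.f}; }

// Build an orthonormal basis around n and rotate the local-space sample into it
static Vec3 toWorld(Vec3 local, Vec3 n)
{
 Vec3 up = std::fabs(n.z) < 0.999f ? Vec3{0.f,0.f,1.f} : Vec3{1.f,0.f,0.f};
 Vec3 t = { up.y*n.z - up.z*n.y, up.z*n.x - up.x*n.z, up.x*n.y - up.y*n.x };
 float invLen = 1.f / std::sqrt(t.x*t.x + t.y*t.y + t.z*t.z);
 t = t * invLen;
 Vec3 b = { n.y*t.z - n.z*t.y, n.z*t.x - n.x*t.z, n.x*t.y - n.y*t.x };
 return t*local.x + b*local.y + n*local.z;
}

Vec3 irradiance(Vec3 n, int numSamples = 1024)
{
 std::mt19937 rng(1234);
 std::uniform_real_distribution<float> u01(0.f, 1.f);
 Vec3 sum{0.f, 0.f, 0.f};
 for(int i = 0; i < numSamples; ++i)
 {
  // Cosine-weighted direction: pdf = cos(theta)/pi cancels the cosine
  // weighting in the integrand, so averaging the samples gives the
  // cosine-convolved radiance (the irradiance up to a factor of pi)
  float phi = 2.f * 3.14159265f * u01(rng);
  float v = u01(rng);
  float sinTheta = std::sqrt(v);
  float cosTheta = std::sqrt(1.f - v);
  Vec3 local{ sinTheta*std::cos(phi), sinTheta*std::sin(phi), cosTheta };
  sum = sum + sampleRadiance(toWorld(local, n));
 }
 return sum * (1.f / numSamples);
}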

If you recall from the previous post, though, this came with a couple of issues. First, we found significant aliasing artifacts in the cubemap. The following image is a capture of our irradiance cubemap, at its current resolution of 32x32 texels per face, where I've highlighted some of the areas that show these aliasing artifacts.

Initial irradiance cube at 32x32 texels per face. Aliasing artifacts highlighted in yellow.

We could mitigate this by increasing the number of samples, but that would increase the cost. And that is precisely the second issue: computing this low-resolution irradiance map is already pretty expensive. To fix both issues, we are going to completely rewrite the algorithm and instead implement a version of Ramamoorthi's spherical harmonics irradiance. However, we're going to go one step further and fully resolve some of the math analytically, not so much for the speedup, but because it will make the implementation cleaner and simpler to understand.
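As a preview, the core result we'll be building on (from Ramamoorthi and Hanrahan's paper on irradiance environment maps) approximates irradiance with a spherical harmonic expansion truncated at order 2:

$$E(\mathbf{n}) \approx \sum_{l=0}^{2} \sum_{m=-l}^{l} \hat{A}_l \, L_{lm} \, Y_{lm}(\mathbf{n}), \qquad \hat{A}_0 = \pi, \quad \hat{A}_1 = \frac{2\pi}{3}, \quad \hat{A}_2 = \frac{\pi}{4}$$

where the $L_{lm}$ are the SH coefficients of the environment's radiance and the $Y_{lm}$ are the SH basis functions. Only nine coefficients survive the truncation, which is what makes the approach so cheap.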

2025/12/29

IBL Optimization Study

This is the first post in a series dedicated to Image-Based Lighting, a very common technique used in modern videogames to implement indirect lighting. My original idea was to use this problem as an excuse to explore the use of AI coding agents.

At first, the idea looked promising. I was able to blast through the implementation really quickly and get to a working prototype, which is, in itself, a good result. However, on closer inspection the implementation was actually quite bad, and it ran really slowly. It reminded me of the work a very junior engineer might produce. That gave me the idea that a better project could be to take this initial working implementation and guide you through my process of fixing and optimizing it, as one would in a real-world production environment.

The intern is gone, and you're left with his half-finished, promising project that you need to turn into something actually usable.

Without further ado, let's dive right in.

2018/08/10

The Other Pathtracer 5: Optimizing Triangle-Ray Intersections

In the last post of this series, we spent some time optimizing AABB-ray intersections. In this post, we will do the same for triangles, and in the process, we'll find one more optimization for AABBs.

So, the code I'm using now is what I wrote for this previous post. It was supposed to be simple to understand, and overall it gets the job done. But we can do better. Let's start with some profiling and establish a performance baseline.




2018/06/21

The Other Pathtracer 4: Optimizing AABB-Ray Intersection

This post is about optimizing the AABB tree that we're using as our main acceleration structure.
I will use the Polly scene from the previous post for the tests, but I've increased the output resolution to get slightly more meaningful results.



Initial performance:
Scene: Project Polly
Resolution: Full HD (1920x1080)
Primary rays per pixel: 64
Results: 200 s, ~660k rays/s
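As a sanity check: 1920 x 1080 x 64 ≈ 132.7M primary rays over 200 s works out to roughly 663k rays/s, which matches the counter above.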

Taking a first look at the profiler, there's an obvious offender.

2018/06/17

The Other Pathtracer 3: Complex Scenes



Going from one triangle to many triangles is a trivial thing to do, at least if you don't care about performance at all. Just add a vector of triangles, and test them all.

// Enclosing signature is illustrative; the snippet lives inside the scene's
// hit test, which receives the ray and the valid [tMin, tMax] segment
bool hit(const Ray& r, float tMin, float tMax, HitRecord& collision) const
{
 float t = tMax;
 // Bruteforce approach: test every triangle, keep the closest hit
 bool hit_anything = false;
 HitRecord tmp_hit;
 for(auto& tri : mTris)
 {
  // Shrinking t as we go rejects hits farther than the closest one so far
  if(tri.hit(r,tMin,t,tmp_hit))
  {
   collision = tmp_hit;
   t = tmp_hit.t;
   hit_anything = true;
  }
 }
 return hit_anything;
}

That's all the code you need to render a bunch of triangles. However, that's not very exciting unless you can use those triangles to render something interesting.

2018/06/07

The Other Pathtracer 2: The Triangle

Following on the idea of my last post, today we're tackling a very important matter: intersecting triangles. Or, more specifically, intersecting one triangle. This is not covered in many ray tracing tutorials, and that's a shame. Triangle intersection is the cornerstone of a lot of interesting functionality (like loading full meshes), and it's actually a very simple thing to do.
All the relevant code is in this commit.

The Algorithm


There are several possible algorithms for intersecting triangles, with varying degrees of complexity and performance. For example, see this Wikipedia article or, if you have access to the GDC Vault, this talk by Earl Hammon (which has a ton of valuable material).
However, since my goal here is to get a working implementation quickly and easily, I will explain the algorithm that I find most intuitive. In later posts, we will be revisiting it for performance improvements, and even then, it will be useful to have a solid baseline to benchmark against.

So the idea is to do the intersection in two parts: first, we find whether our ray segment intersects the triangle's plane, and if it does, we then check whether the intersection point lies inside the triangle.

Part one is basic geometry, and can be decomposed into two steps as well: find the plane defined by the three vertices of the triangle, then intersect that plane with our ray.

// Two edges are enough to define the triangle's plane
auto edge0 = v[1]-v[0];
auto edge1 = v[2]-v[1];
// The plane is its normal plus the distance from the origin along it
auto normal = normalize(cross(edge0,edge1));
auto planeOffset = dot(v[0],normal);

In regular production code, we should handle the case of a degenerate triangle, where cross(edge0,edge1) can't be normalized, but for now we will behave ourselves and just not make weird triangles. We can already see a possible optimization path: since none of the above depends on the ray, we could cache the plane definition instead of recomputing it for every ray. Not now, though.
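For the second step, intersecting that plane with the ray, the math falls out of substituting the ray equation into the plane equation. Here's a sketch in the same style (r.origin, r.direction and the tMin/tMax segment bounds are assumptions about the surrounding code):

// Substituting p = origin + t*direction into dot(p, normal) = planeOffset
// and solving for t
auto denom = dot(r.direction, normal);
// A denominator near zero means the ray is parallel to the plane
if(abs(denom) < 1e-6f)
 return false;
auto t = (planeOffset - dot(r.origin, normal)) / denom;
// Only accept hits within the valid ray segment
if(t < tMin || t > tMax)
 return false;
auto hitPoint = r.origin + t*r.direction;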