The other pathtracer 3: Complex scenes

Going from 1 triangle to many triangles is a trivial thing to do. At least if you don't care about performance at all. Just add a vector of triangles, and test them all.

float t = tMax;
// Brute-force approach: test every triangle, keep the closest hit
bool hit_anything = false;
HitRecord tmp_hit;
for (auto& tri : mTris)
{
    if (tri.hit(r, tMin, t, tmp_hit))
    {
        collision = tmp_hit;
        t = tmp_hit.t; // Shrink the interval, so farther hits get rejected early
        hit_anything = true;
    }
}

That's all the code you need to render a bunch of triangles. However, that's not much use unless you have some interesting triangles to render.

At this point we need a way to load complex scenes. While there are many model formats that can serve that purpose, I've chosen gltf 2.0 for several reasons:
- It's an open format, with a clear and simple specification.
- It's mostly text based, using json, which makes it very easy to read. That comes in handy if you're debugging something.
- There are several header-only libraries that are easy to integrate, and will save me the hassle of loading and parsing the json. I'm going with this one, just because I've used it before and it works very well.
Full load code is in this file, and it just loads the interesting chunks of gltf into my renderer's custom objects.

The scene structure on my pathtracer has the following elements:
- The scene: which is the top level 'bunch of things'. A vector of MeshInstance (spheres, meshes) with utility code and a camera
- MeshInstance: Since the same mesh can be used many times in a single scene, and I don't want to duplicate them in memory, this is a wrapper of mesh + matrix. I collapse the hierarchy of matrices on load time, which simplifies rendering. If in the future I decide to animate things, I will have to rethink that, but for static scenes it's ok.
- MultiMesh: Several triangle meshes attached to the same node (usually with different materials)
- TriangleMesh: A vector of triangles, intersected in a loop with every tested ray.
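
As a rough sketch, that hierarchy could look something like this (all names here are illustrative, not my renderer's actual code):

```cpp
#include <memory>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v0, v1, v2; };

struct TriangleMesh {
    std::vector<Triangle> tris;           // intersected in a loop per ray
};

struct MultiMesh {
    std::vector<TriangleMesh> meshes;     // same node, different materials
};

struct Matrix44 { float m[16]; };

struct MeshInstance {
    std::shared_ptr<MultiMesh> mesh;      // shared, so meshes aren't duplicated
    Matrix44 transform;                   // hierarchy collapsed at load time
};

struct Scene {
    std::vector<MeshInstance> instances;  // plus a camera and utility code
};
```

The shared pointer is what lets two instances reuse the same triangle data with different transforms.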

The problem with this approach is that performance is awful. Very, very, very bad. I'm testing the raytracer with two sample gltf scenes: The damaged helmet (15k triangles, 1 mesh) and the project polly scene (122k triangles, 8 meshes), and the numbers are:
Rendering at 200x200 pixels, 16 rays per pixel (Numbers only count primary rays)
- Damaged Helmet -> 4 min, ~2600 Rays/s
- Polly -> 58 min, 183 Rays/s

My machine isn't the fastest one out there (Core i5-7600K @ 3.8 GHz, 16 GB RAM), but 58 min for a 200x200 image of a not so complex scene seems a little excessive.

Initial Profiling

The first obvious problem is that we are not using any kind of acceleration structure, but a quick profiling also reveals something else:

When you start doing graphics, one of the first things you learn is not to send your vertices to the GPU duplicated for every triangle. Instead, you store the vertices once in an array and have each triangle reference them by index. I used the same approach in the raytracer, and it turns out that rebuilding each triangle from its indices on every intersection test was costing me about 65% of the render time.
It may not have such a big impact after we add some acceleration structures, but it seems worth exploring, even if we revisit the change later.
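The fix amounts to expanding the indexed vertex data into self-contained triangles once, at load time. A minimal sketch of the idea (types and names are hypothetical, not my renderer's actual API):

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Indexed layout, as you'd upload to a GPU: every vertex stored once,
// triangles refer to vertices by index.
struct IndexedMesh {
    std::vector<Vec3> vertices;
    std::vector<std::array<std::uint32_t, 3>> indices;
};

// Self-contained triangle with the vertex positions baked in.
struct Triangle { Vec3 v0, v1, v2; };

// Expand the indices once at load time, so the per-ray intersection loop
// never has to chase the index buffer again.
std::vector<Triangle> bakeTriangles(const IndexedMesh& mesh) {
    std::vector<Triangle> tris;
    tris.reserve(mesh.indices.size());
    for (const auto& idx : mesh.indices)
        tris.push_back({ mesh.vertices[idx[0]],
                         mesh.vertices[idx[1]],
                         mesh.vertices[idx[2]] });
    return tris;
}
```

This trades memory (positions get duplicated across triangles) for not paying the indirection on every ray, which is the right trade here.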

Also, we are testing every triangle for every pixel of the image, even when there's nothing anywhere near the ray. That is another huge waste. Adding an axis-aligned bounding box (AABB) to the MeshInstance class should speed up those empty pixels immensely. So let's do just that.
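The standard way to test a ray against an AABB is the slab method: clip the ray's parametric interval against the box's three pairs of planes, and report a hit only if the interval survives all three. A sketch, with hypothetical types (not necessarily how my renderer spells it):

```cpp
#include <algorithm>
#include <cfloat>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 o, d; }; // direction doesn't need to be normalized

struct AABB {
    Vec3 min, max;

    bool hit(const Ray& r, float tMin, float tMax) const {
        const float ro[3] = { r.o.x, r.o.y, r.o.z };
        const float rd[3] = { r.d.x, r.d.y, r.d.z };
        const float lo[3] = { min.x, min.y, min.z };
        const float hi[3] = { max.x, max.y, max.z };
        for (int a = 0; a < 3; ++a) {
            // Clip the ray interval against the two planes of this axis
            float invD = 1.0f / rd[a];
            float t0 = (lo[a] - ro[a]) * invD;
            float t1 = (hi[a] - ro[a]) * invD;
            if (invD < 0.0f) std::swap(t0, t1);
            tMin = std::max(tMin, t0);
            tMax = std::min(tMax, t1);
            if (tMax <= tMin) return false; // interval emptied: no hit
        }
        return true;
    }
};
```

Note that axis-parallel rays (a zero direction component) still work, because IEEE infinities fall out of the division and propagate through the min/max correctly.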

You can see shaded in red the pixels where rays intersected the AABBs. All the pixels in black are now much faster. As a result, Damaged helmet goes down to 2 min 13s, 4800 Rays/s. I didn't feel like wasting another hour profiling Polly just yet.

Also, constructing the triangles at load time, and storing them inside TriangleMesh was a huge improvement, as we saw before. The new numbers are:

- Damaged Helmet -> 44s, 14300 Rays/s
- Polly -> 7m 22s, 1447 Rays/s

This is better. Still, 7 minutes for that crappy image is a lot, but it's definitely better than an hour, and we haven't even added a BVH yet.
At this point, both scenes spend 96% of the time intersecting triangles, so we have two options:
- Microoptimize the triangle intersection code
- Add an acceleration structure and test waaay less triangles.
I'm going with the latter, which seems more promising. Testing 122k triangles per ray doesn't seem like a very good idea.

Acceleration Structure

One of the simplest acceleration structures I can think of, is a binary tree of AABBs. I have already added AABBs for simple culling above, so sticking them in a tree should be pretty easy. You can see the code in this commit.

A few things are worth mentioning in that code. First, the tree is not well balanced, and the way I classify triangles is far from optimal. I just sort triangles by their centroid along a given axis, and send half of them to each branch of the node. There are smarter sorting strategies using the Surface area heuristic, but I wanted to have the simplest system working first.
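The median split can be sketched roughly like this (names, the leaf size, and the round-robin axis choice are all assumptions for illustration, not my exact code):

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v0, v1, v2; };

static float axisValue(const Vec3& v, int axis) {
    return axis == 0 ? v.x : (axis == 1 ? v.y : v.z);
}

// Sort key: the triangle's centroid along one axis.
static float centroidAxis(const Triangle& t, int axis) {
    return (axisValue(t.v0, axis) + axisValue(t.v1, axis)
          + axisValue(t.v2, axis)) / 3.0f;
}

struct BVHNode {
    std::unique_ptr<BVHNode> left, right; // empty on leaf nodes
    std::vector<Triangle> tris;           // leaf payload
    // AABB bounds;                       // omitted in this sketch

    BVHNode(std::vector<Triangle> in, int axis, std::size_t leafSize = 4) {
        if (in.size() <= leafSize) {      // few enough triangles: make a leaf
            tris = std::move(in);
            return;
        }
        // Median split: sort by centroid along the chosen axis...
        std::sort(in.begin(), in.end(),
                  [axis](const Triangle& a, const Triangle& b) {
                      return centroidAxis(a, axis) < centroidAxis(b, axis);
                  });
        // ...and hand half of the triangles to each child. These copies of
        // full vectors are part of what makes construction slow.
        std::vector<Triangle> loHalf(in.begin(), in.begin() + in.size() / 2);
        std::vector<Triangle> hiHalf(in.begin() + in.size() / 2, in.end());
        int next = (axis + 1) % 3;        // round-robin axis, kept simple
        left  = std::make_unique<BVHNode>(std::move(loHalf), next, leafSize);
        right = std::make_unique<BVHNode>(std::move(hiHalf), next, leafSize);
    }
};
```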

Also, I'm doing a lot of copies of full vectors during construction, which makes the Polly scene take about a full second to load (actually, I thought it would be a lot worse), but one second is ok with me for now. Another problem is that my triangles now end up scattered all over memory, which is probably not good for my memory access patterns :). But we will walk all those paths in the future.

So, did it work? Here are the numbers after using the AABB tree inside TriangleMesh:
- Damaged Helmet -> 0.5s, 1.2 MRays/s
- Polly -> 1.69s, 378 kRays/s

Wow, that absolutely is an improvement. Polly took one hour to render at the beginning of this post, and is now under two seconds.

What's next?

There's still a lot of room for optimization in the AABB tree. We could also accelerate the AABB intersection tests using SIMD. And I think we'll do one of those things.


The header image at the top of the post took 40 seconds to render at full HD resolution, 64 rays per pixel.
