2018/06/06

The other pathtracer: Basic job system

Inspired by the Daily Pathtracer, by Aras Pranckevičius, I decided to also write a path-tracer, mainly to better follow along his posts (seriously, they're very good). I also noticed there are some other interesting aspects of a ray tracer that he hasn't mentioned (so far) and that may be worth talking about, so this post is my first attempt at doing just that. I want to touch issues like triangle intersection, texture blending or BVH optimization, but let me know in the comments if you're interested in specific parts. As of now, there are:
- Part one: Job system (this post)
- Pat two: The triangle
- Part three: Complex scenes
- Part four: AABB-Ray optimization

Spoiler: This will not be a daily series, but we will do some cool stuff with triangle meshes, and loading gltf scenes.





If you want to follow the code, everything is in this GitHub repo, but take a couple things into account. When I started the project, I didn't intend to write about it, so the initial commits are very dirty and disorganized. The code will not even compile out of the box (although the fixes are simple things like setting the proper include directories in your machine). Recent commits, much more recent than this post's, are way cleaner, should compile directly, and have less hard-coded stuff.

Part One: A simple job system.


Ray tracing is always a very intensive activity, and as you play with it, multithreading quickly becomes a necessity. For example, at the very beginning of my raytracer, barely after chapter 7 of "Raytracing in one weekend", this image:

took about 48 seconds to render in my machine. It's 400x200 pixels, 100 rays per pixel, for an average performance of about 166k rays per second. It's more than 400 spheres, with no acceleration structures and all kind of suboptimal stuff, so it's not that bad, or is it? Let's do some profiling and see what happens.

Well, as expected we're spending most of our time hitting spheres, so no surprise there.
The interesting part is what we're not doing. We're not using most of our cpu!


Naive multithreading.


The simplest thing to do, to add multithreading to our ray tracer is dividing the image, and feeding each part to a thread. For example, divide the image vertically by the number of threads. This way, each thread will access a completely different region of the image and there will be no conflicts.
Launch all threads at once, join them all, and you're done. Let's see the performance of this approach:



About 14 seconds, or 570k rays per second, quite a boost. We're making a better use of the cpu during the first part, but about second 6, usage drops, and by the end it drops even more. What happens here is that the top part of the image runs much faster (none of the rays hits anything) than the lower part, so the first thread will exit pretty soon, while the last one will take longer. We can probably find a better scheme for how to split the work more evenly, but eventually this approach won't scale to more complex images, and will not generalize well to different types of scenes.

Enter the job system


Instead, we can create many tasks, many more than threads we have, and let the threads choose which tasks they consume. This will ensure none of the thread starves while there's still work to do, but introduces a little complexity, so let's walk through it.

Let's start by dividing our image in small tiles. The most time a thread will be stall is the time it takes the other threads to finish their last tile, so the smallest the tile, the better. However, it pays of in cache coherency, and reduced overhead, to not switch tile after every pixel or ray, so you can play with tile sizes and see how the balance affects performance. We'll settle for tiles of 20x20 pixels, which makes an array of 20 by 10 tiles. I use a rectangle structure to describe tiles, and pass that information to the threads, and prepare all tiles ahead in a vector before starting the actual render.

The interesting part starts with synchronization. Since threads now are going to decide which tiles they consume, you need a way to prevent them from overlapping, or leaving gaps in between.
I think the simplest approach for this is to use an atomic counter. The counter points to the next task to be taken, and every time a thread wants more work, it reads the counter and leaves it incremented for the next thread. Something like

// Allocate threads to consume
std::atomic<size_t> tileCounter = 0;
const int nThreads = 16;
std::vector<std::thread> ts(nThreads);
...
// Run jos
for(int i = 0; i < nThreads; ++i)
{
 ts[i] = std::thread(threadRoutine, cam, world, size, outputBuffer.data(), tiles, &tileCounter);
 if(!ts[i].joinable())
 {
  return -1;
 }
}
for(int i = 0; i < nThreads; ++i)
 ts[i].join();

Appart from the new counter, this part of the code is exactly as simple as the naive multithreaded version. Now the new thread routine:
void threadRoutine(
 const Camera& cam,
 const Hitable& world,
 Rect imgSize,
 Vec3f* outputBuffer,
 const std::vector<Rect>& tiles,
 std::atomic<size_t>* tileCounter)
{
 size_t selfCounter = (*tileCounter)++;
 while(selfCounter < tiles.size()) // Valid job
 {
  auto& tile = tiles[selfCounter];
  traceImageSegment(cam, world, tile, imgSize.x1, imgSize.y1, outputBuffer);
  selfCounter = (*tileCounter)++;
 }
}

As simple as it gets. The important part is that we read and increment the counter in a single operation, since doing it in two steps would create race conditions. That is the only synchronization point between threads, which guarantees they will not spend a lot of time waiting on concurrent access. Once we update 'selfCounter', all work is local and perfectly safe.

Once thing worth mentioning is that all threads write to outputBuffer at the same time, but since tiles don't overlap, neither do the actual regions of memory accessed by the threads. In theory, if your data is not perfectly aligned, two threads could try to write at the same time to the same line of cache, and if this happened a lot, it could hurt your performance, but I'd say the probability of that is very low, and the time spent writing to the output image is small compared to the time tracing rays, so the effect would be tiny anyway.

So, what's the performance gain of all this? Let's see:



We can clearly see that cpu usage is now consistent during the whole run. We will never get 100% cpu usage because the OS has more things to do than just running our ray tracer, but this is already very good. The numbers are 9.5 seconds for 8 million rays, or 842k rays per second. We have 5x better performance than the single-threaded version, and more than 30% better than basic multithreading, not bad for such small change.

Next time


Next time, I think we will take a look at loading more interesting scenes :)

No comments:

Post a Comment