The other Pathtracer 4: Optimizing AABB-Ray intersection

This post is about optimizing the AABB tree that we're using as our main acceleration structure.
I will use the Polly scene from the previous post for the tests, but I have increased the output resolution to get slightly more meaningful results.

Initial performance:
Scene: Project Polly
Resolution: Full HD, 1920x1080
Primary rays per pixel: 64
Results: 200 s, ~660k Rays/s

Taking a first look at the profiler, there's an obvious fail:


The other pathtracer 3: Complex scenes

Going from 1 triangle to many triangles is a trivial thing to do. At least if you don't care about performance at all. Just add a vector of triangles, and test them all.

float t = tMax;
// Bruteforce approach: test every triangle and keep the closest hit
bool hit_anything = false;
HitRecord tmp_hit;
for(auto& tri : mTris)
{
  if(tri.hit(r, tMin, t, tmp_hit)) // exact hit() signature may differ in your code
  {
    collision = tmp_hit;
    t = tmp_hit.t;
    hit_anything = true;
  }
}

That's all the code you need to render a bunch of triangles. However, that's not very exciting unless you can use those triangles to render an actual scene.


The other Pathtracer 2: The triangle

Following on the idea of my last post, today we're building on a very important matter: intersecting triangles. Or, more specifically, intersecting one triangle. This is not covered in many ray tracing tutorials, and that's a shame. Triangle intersection is the cornerstone of a lot of interesting functionality (like loading full meshes), and it's actually a very simple thing to do.
All the relevant code is in this commit.

The Algorithm

There are several possible algorithms for intersecting triangles, with varying degrees of complexity and performance. For example, see this Wikipedia article or, if you have access to the GDC Vault, this talk by Earl Hammon (which has a ton of valuable material).
However, since my goal here is to get a working implementation quickly and easily, I will explain the algorithm that I find most intuitive. In later posts, we will revisit it for performance improvements, and even then it will be useful to have a solid baseline to benchmark against.

So the idea is to do intersection in two parts: First, we find whether our ray segment intersects the triangle plane, and if it does, then we see if the intersection point lies inside the triangle.

Part one is basic geometry, and can be decomposed in two parts as well: Find the plane defined by the three vertices of the triangle, then intersect the plane with our ray.

auto edge0 = v[1]-v[0];
auto edge1 = v[2]-v[1];
auto normal = normalize(cross(edge0,edge1));
auto planeOffset = dot(v[0],normal);

In regular production code, we should handle the case of a degenerate triangle, where cross(edge0,edge1) can't be normalized, but for now we will behave ourselves and just not make weird triangles. We can already see a possible optimization path: since none of the above depends on the ray, we could cache the plane definition instead of recomputing it for every ray. But not now.


The other pathtracer: Basic job system

Inspired by the Daily Pathtracer, by Aras Pranckevičius, I decided to also write a path tracer, mainly to better follow along with his posts (seriously, they're very good). I also noticed there are some other interesting aspects of a ray tracer that he hasn't mentioned (so far) that may be worth talking about, so this post is my first attempt at doing just that. I want to touch on topics like triangle intersection, texture blending, and BVH optimization, but let me know in the comments if you're interested in specific parts. As of now, there are:
- Part one: Job system (this post)
- Part two: The triangle
- Part three: Complex scenes
- Part four: AABB-Ray optimization

Spoiler: This will not be a daily series, but we will do some cool stuff with triangle meshes and loading glTF scenes.


C vs C++, Part II, Beautiful & efficient

In the first part of this series, we used C++ features (operator overloading and templates) to eliminate all the defines and macros necessary for using a Pin. This way, we achieved the same performance as C code, slightly increased readability, and hugely increased code safety. The result is a library that looks like this:

template<uint16_t address_>
struct Register {
 void operator=(uint8_t _r) {
  *reinterpret_cast<volatile uint8_t*>(address_) = _r;
 }
 operator uint8_t() const {
  return *reinterpret_cast<volatile uint8_t*>(address_);
 }
 operator volatile uint8_t&() {
  return *reinterpret_cast<volatile uint8_t*>(address_);
 }

 template<uint8_t bit_>
 void setBit() { *reinterpret_cast<volatile uint8_t*>(address_) |= (1 << bit_); }
 template<uint8_t bit_>
 void clearBit() { *reinterpret_cast<volatile uint8_t*>(address_) &= ~(1 << bit_); }
};

Register<0x24> DDRB;
Register<0x25> PORTB;

constexpr uint8_t DDB5 = 5;
constexpr uint8_t PORTB5 = 5;


C vs C++, performance on AVR

The aim of this post is to fight the widespread belief that C++ is too slow a language for embedded environments. The belief goes that microcontrollers should still be programmed in C, or even in assembler. You probably don't agree with me right now. The idea of C being much more efficient than C++ is so extended that it almost seems like sacrilege to debate it. That's why I'm about to make a series of comparisons between both languages, throwing in some real and objective numbers (code size, execution time, etc.). Once we prove that C++ can compete with good old C, we'll see it's actually a better alternative. For that, besides performance metrics, I will compare things like safety, code readability, and portability.


Data Oriented Design vs Object Oriented Programming

I've been raised in the culture of Object Oriented Programming. I've always been told about the benefits of encapsulation, cohesion, locality, etc. There are very good reasons why a lot of smart people deeply support OOP. Designing good OOP architectures pays off. It saves a lot of time debugging errors, makes code easy to read and understand, and lets you focus on one part of the problem at a time.

But what if it's all wrong? In the last few years I've read about a concept known as Data Oriented Design, which many claim is a different paradigm, one promising huge performance improvements that will make you question why you ever used OOP in the first place. Kind of a big claim, and big claims require good proof. So, when I came across this talk by Mike Acton, I did the only thing I could do: I wrote a test.

The idea is simple: Have a bunch of squares defined by their radius and compute their areas. This is where a traditional OOP beginner tutorial would say "Make a class for Square ...".

class NiceSquare {
 float radius;
 float color[3];
public:
 NiceSquare() : radius(3.f) {}
 void computeArea(float& area) { area = radius*radius; }
};

However, following the principles of DoD, we realise that our data is not a square, but a bunch of squares, so...

struct BunchOfSquares {
 float* radius;
 float* color;
};

There is a good reason for that color member. We will use it later to control the packing factor of our data. But we just sacrificed encapsulation for no good reason. If computing a square's area is something the square can do itself, then computing a bunch of areas should be something a bunch of squares can do itself too. What if we took this DoD approach to the problem, but implemented it with OOP?

class BunchOfSquares {
 float* radius;
 float* color;
public:
 BunchOfSquares()
  : radius(new float[N_SQUARES])
  , color(new float[3*N_SQUARES]) // 3 floats per square, matching NiceSquare
 {
  for(unsigned i = 0; i < N_SQUARES; ++i) radius[i] = 3.f;
 }

 ~BunchOfSquares() {
  delete[] radius;
  delete[] color;
 }

 void computeAreas(float* area) {
  for (unsigned i = 0; i < N_SQUARES; ++i) {
   float rad = radius[i];
   area[i] = rad*rad;
  }
 }
};
Much better now. Notice we didn't really sacrifice any Object-Orientation here. We just realised which objects really belong to our problem. And that's actually the key. Most of the time, when we do OOP, we tend to design our classes to fit our mental model of day to day life. WRONG! You are not solving day to day life, you are solving a specific problem! The question of performance still remains, though, so let's measure it:

duration<double> oldBenchmark() {
  NiceSquare* squares = new NiceSquare[N_SQUARES];
  float* areas = new float[N_SQUARES];
  auto begin = high_resolution_clock::now();

  // The timed work: one per-object call per square
  for (auto i = 0; i < N_SQUARES; ++i)
    squares[i].computeArea(areas[i]);

  duration<double> timing = high_resolution_clock::now() - begin;
  delete[] areas;
  delete[] squares;

  return timing;
}

duration<double> dodBenchmark() {
 BunchOfSquares squares;
 float* areas = new float[N_SQUARES];
 auto begin = high_resolution_clock::now();

 // The timed work: one call over the whole bunch
 squares.computeAreas(areas);

 duration<double> timing = high_resolution_clock::now() - begin;
 delete[] areas;

 return timing;
}

int main(int, const char**) {
 ofstream log("log.txt");

 for(int i = 0; i < 100; ++i) {
  double oldTiming = real_millis(oldBenchmark()).count();
  double dodTiming = real_millis(dodBenchmark()).count();
  log << oldTiming << ", " << dodTiming << endl;
 }

 return 0;
}

If you pay attention, you will see the benchmarks have been written the old-fashioned way. It would be better to realise that I don't want just one timing, and that I won't perform just one benchmark, and to write the test to run a bunch of benchmarks and store the results in a bunch of timing records. But for now, we'll stick to this format because it will be easier to read for people not used to DoD, and because I like the irony of it.

Back to the test, a quick run shows this:

The improvement is obvious: even in a dumb example like this, DoD is about 40% faster. Cool. But can we do better? Theory says that the big performance improvements of DoD come from not wasting cache space. The better we use our caches, the faster the test will run. That's what the color member is there for. It represents the more realistic scenario where classes have more than one member. By controlling the size of color, we control how sparse the radiuses are in memory. That way, completely removing the color should make both paradigms perform almost identically, right?
Definitely right. And if we move the other way around and increase color from 3 to 63 floats ...

That's absolutely a win. We get an almost 85% improvement; the DoD code is running more than 6x faster now. And it's still Object Oriented! We've lost none of the benefits of OOP!

In conclusion, Data Oriented Design doesn't mean throwing away all you know about OOP and good programming practices. It is a reminder to solve the problems we do have, instead of the problems we are comfortable thinking of. Even though its performance gain is very tightly coupled to low-level hardware, DoD principles tell us that our code is really messed up at a very high level. The moment you forget what data you are dealing with, you're already going the wrong way. Know your problem, know your data. Then you can apply whichever programming paradigm you think fits better. And if you decide to go for OOP, remember there's no rule saying an "object" in your code has to match any object in your day to day life. So just choose the right objects for your data.