So, the code I'm using now is what I wrote for this previous post. It was meant to be simple to understand, and overall it gets the job done. But we can do better. Let's start with some profiling to establish a baseline performance.
The running parameters are the following:
- Scene: Project Polly
- Resolution: 800 x 600
- Primary rays per pixel: 64
And the results:
- Average running time: 24.9s
- ~1.23 million primary rays / s
Overall, triangle intersection code is taking about 24% of execution time now, so it seems like a reasonable optimization target. And looking at the code, there's a lot we can do.
By looking at the screenshot above, one can see that we spend about the same time inside the if as we do outside. This makes sense: by the time we get to triangle intersection, we're already in a leaf of the tree, so the ray is very likely to intersect our triangle, and even more likely to intersect the triangle's plane. So this if isn't saving us much work, and it's introducing extra branching. We can do better.
But before that, there's another obvious thing. We're computing the interpolation factors every time we detect an intersection, even though the intersection may later be discarded, and even when we're not using them. Worse, that alone is taking about 3% of the running time.
Erase it and run again:
- Average time: 24.1s
- ~1.27 million r/s
3% faster than the previous version, which matches the expectation nicely.
A new algorithm
Now back to that first branch. To get rid of it, we're going back to the drawing board. In the old algorithm, we first checked the intersection of the ray with the triangle's plane, and then figured out whether the intersection point was inside or outside the triangle. In the new algorithm, we use the fact that the ray origin, together with each of the triangle's edges, defines a new plane. And if we are consistent in how we define the edges (all three using the same winding direction in the triangle), a ray that intersects the triangle will always be on the same side of all three planes. This is better explained with images:
Figure 1: Ray origin O defines a new plane with triangle edge v2-v0
Figure 2: A side view of the previous diagram. The green ray intersects the triangle; the red one doesn't.
Being on the same side of all three planes means that the dot product of the ray direction with each plane normal will always have the same sign. In fact, if we choose the edges carefully, we can get it to always be positive for front-facing triangles, and negative for back-facing ones. And we can take advantage of this to improve the robustness of secondary rays. When we start a secondary ray, the origin is very close to the triangle's plane, and small floating point errors can put it behind the triangle. By intersecting only front-facing triangles, we skip that plane altogether, so the problem disappears and we don't need to add a small bias.
So in code, it looks like this:
bool hit(const math::Ray& r, float tMin, float tMax, HitRecord& collision) const
{
    auto p0 = r.at(tMin);
    auto p1 = r.at(tMax);
    // One plane per edge, defined by the ray origin and that edge
    auto n0 = cross(p0 - v[0], edge0);
    auto n1 = cross(p0 - v[1], edge1);
    auto n2 = cross(p0 - v[2], edge2);
    // The ray can only hit the triangle if it's on the same side of all three planes
    if ((dot(n0, r.direction()) > 0.f) &&
        (dot(n1, r.direction()) > 0.f) &&
        (dot(n2, r.direction()) > 0.f))
    {
        // Interpolate the plane offsets at both ends of the ray to find t
        auto offset0 = dot(p0, mNormal);
        auto offset1 = dot(p1, mNormal);
        float t = tMin + (tMax - tMin) * (mPlaneOffset - offset0) / (offset1 - offset0);
        if (t >= tMin && t < tMax)
        {
            auto p = r.at(t);
            collision.t = t;
            collision.p = p;
            collision.normal = mNormal;
            return true;
        }
    }
    return false;
}
We have all three edges precomputed there, one more than before. Overall, the code looks nicer, and we start to see some repetition that might benefit from vectorization. So, how does it run?
- Avg time: 23.8s
- ~1.29 million r/s
Not a huge improvement, but it shows consistently, with all runs under 24 seconds. The next thing we can do is simplify the computation of t. First we get rid of tMin by enforcing all rays to start at tMin = 0. That simplifies the math and removes one argument from the function. Then, by assuming the ray direction is normalized, the computation of t further simplifies to:
float t = dot(mNormal, v[0]-p0)/dot(r.direction(), mNormal);
And we already have v[0] - p0 from before. Notice that the code above assumes the ray direction to be normalized, and this can be problematic when you have meshes with scaled transforms, because you need to re-scale t accordingly when you transform the collision info back into world coordinates. Otherwise, entire objects can disappear:
Figure 3: The display is missing because it has a scaled transform, and we didn't fix t accordingly, so the floor gets falsely intersected at a smaller t.
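To make the fix-up concrete, here's a minimal sketch (not the renderer's actual code; the Vec3 type and function names are made up for illustration) of the t correction described above. When a mesh has a scaled transform, the world-space ray direction transformed into the mesh's local space is no longer unit length; if we renormalize it before intersecting, the resulting local t is measured in the wrong units, and must be divided by the pre-normalization length to be comparable with t values from other objects:

```cpp
#include <cmath>

// Hypothetical minimal vector type for the example.
struct Vec3 { float x, y, z; };

float length(const Vec3& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// tLocal was measured along the *normalized* local direction;
// localDir is the transformed direction *before* normalization.
// Renormalizing rescaled the ray parameter, so undo that scale
// when converting the hit back to world coordinates.
float toWorldT(float tLocal, const Vec3& localDir)
{
    return tLocal / length(localDir);
}
```

For an object scaled up by 2, the local direction comes out at half length, so a local hit at t = 1 is really at t = 2 in world units; skipping this step is exactly how the floor ends up winning the depth test in the figure above.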
With that in mind, we realise we no longer need offset1, offset0, mPlaneOffset or p1, so we gain back 4 bytes of storage per triangle and the code becomes:
bool hit(const math::Ray& r, float tMax, HitRecord& collision) const
{
    auto p0 = r.origin();
    auto h0 = v[0] - p0;
    auto n0 = cross(h0, edge0);
    auto n1 = cross(v[1] - p0, edge1);
    auto n2 = cross(v[2] - p0, edge2);
    if ((dot(n0, r.direction()) < 0.f) &&
        (dot(n1, r.direction()) < 0.f) &&
        (dot(n2, r.direction()) < 0.f))
    {
        // Simplified t: no tMin, and the plane offset comes from v[0] directly
        float t = dot(mNormal, h0) / dot(r.direction(), mNormal);
        if (t >= 0.0 && t < tMax)
        {
            auto p = r.at(t);
            collision.t = t;
            collision.p = p;
            collision.normal = mNormal;
            return true;
        }
    }
    return false;
}
Which is... slower. It turns out that renormalizing the ray in different parts of the code, plus rescaling t accordingly, costs us a bit more than what we gain from the simplification. So let's stick with unnormalized directions, but keep tMin = 0. That is the fastest version so far, with an average running time of 23.5 seconds, and all runs under 23.7.
Bonus: Even faster AABBs
Now that we don't use tMin in triangle intersections, we can go one step further and remove it from the whole AABB tree. The only thing we need to do for that is remove tMin from the AABB intersection test and use 0 instead. Again, the improvement isn't huge, about 1%, but it is consistent, with an average time of 23.3 seconds and all runs under 23.5.
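The post doesn't show the AABB code, but the change amounts to starting the slab-test interval at 0. A minimal sketch, with vectors reduced to plain float[3] and the usual slab test assumed as the starting point:

```cpp
#include <algorithm>

// Hypothetical simplified slab test: the ray interval now always
// starts at 0, so the caller no longer passes tMin.
// invDir holds the per-axis reciprocals of the ray direction.
bool hitAABB(const float boxMin[3], const float boxMax[3],
             const float origin[3], const float invDir[3], float tMax)
{
    float t0 = 0.f;   // was: float t0 = tMin;
    float t1 = tMax;
    for (int i = 0; i < 3; ++i)
    {
        // Entry and exit distances for this pair of slabs
        float tNear = (boxMin[i] - origin[i]) * invDir[i];
        float tFar  = (boxMax[i] - origin[i]) * invDir[i];
        if (tNear > tFar)
            std::swap(tNear, tFar);
        // Shrink the running interval; empty interval means a miss
        t0 = std::max(t0, tNear);
        t1 = std::min(t1, tFar);
        if (t0 > t1)
            return false;
    }
    return true;
}
```

Clamping the interval start at 0 also rejects boxes entirely behind the ray for free, which the explicit tMin version had to handle anyway.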
The SIMD Triangle
At this point, the profiler shows that triangle intersection is taking 16% of total run time, which means the theoretical limit for how much performance we can gain with 3-way vectorization is about 10% of total running time, giving our running times a lower bound of about 21 seconds. However, we most likely won't reach that figure, since we can't parallelize everything inside that function.
There are 4 things we can trivially vectorize in that code: computing the vertex-to-origin vectors (v[i] - p0), the cross products, the dot products, and the comparisons against 0.
bool hit(const math::Ray& r, float tMax, HitRecord& collision) const
{
    auto p0 = r.origin();
    auto p = replicate(p0);
    // Pack the three vertices and edges into SIMD lanes (the last lane is padding)
    auto packedV = packHorizontal(v[0], v[1], v[2], v[2]);
    auto edge = packHorizontal(edge0, edge1, edge2, edge2);
    // Three cross products at once
    auto a = cross(edge, packedV - p);
    auto dir = replicate(r.direction());
    auto zero = math::float4(0.f);
    // Three dot products and three sign tests at once
    if ((dot(a, dir) <= zero).none())
    {
        auto h0 = v[0] - p0;
        float t = dot(mNormal, h0) / dot(r.direction(), mNormal);
        if (t >= 0.0 && t < tMax)
        {
            auto p = r.at(t);
            collision.t = t;
            collision.p = p;
            collision.normal = mNormal;
            return true;
        }
    }
    return false;
}
Naively doing that results in the above code, which gives us another incremental boost.
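The helpers replicate and packHorizontal aren't defined in the post. The following is a scalar emulation of the layout they presumably produce, useful for seeing what the SIMD code computes; the real versions would wrap SSE registers, and these exact definitions are my guess, not the author's implementation:

```cpp
// Hypothetical scalar stand-ins for the post's SIMD helpers. A
// "packed" vector holds the x, y and z components of up to four
// ordinary vectors, one per lane, so one cross/dot runs all three
// edge tests at once.
struct Vec3 { float x, y, z; };
struct float4 { float v[4]; };
struct PackedVec3 { float4 x, y, z; };

// Gather four Vec3s into lane-parallel (SoA) form.
PackedVec3 packHorizontal(Vec3 a, Vec3 b, Vec3 c, Vec3 d)
{
    return { { a.x, b.x, c.x, d.x },
             { a.y, b.y, c.y, d.y },
             { a.z, b.z, c.z, d.z } };
}

// Broadcast one Vec3 across all four lanes.
PackedVec3 replicate(Vec3 a) { return packHorizontal(a, a, a, a); }

// Lane-wise subtraction, as used in packedV - p.
PackedVec3 operator-(const PackedVec3& a, const PackedVec3& b)
{
    PackedVec3 r;
    for (int i = 0; i < 4; ++i)
    {
        r.x.v[i] = a.x.v[i] - b.x.v[i];
        r.y.v[i] = a.y.v[i] - b.y.v[i];
        r.z.v[i] = a.z.v[i] - b.z.v[i];
    }
    return r;
}

// Lane-wise cross product: four cross products in one call.
PackedVec3 cross(const PackedVec3& a, const PackedVec3& b)
{
    PackedVec3 r;
    for (int i = 0; i < 4; ++i)
    {
        r.x.v[i] = a.y.v[i] * b.z.v[i] - a.z.v[i] * b.y.v[i];
        r.y.v[i] = a.z.v[i] * b.x.v[i] - a.x.v[i] * b.z.v[i];
        r.z.v[i] = a.x.v[i] * b.y.v[i] - a.y.v[i] * b.x.v[i];
    }
    return r;
}

// Lane-wise dot product: four dot products in one call.
float4 dot(const PackedVec3& a, const PackedVec3& b)
{
    float4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.x.v[i] * b.x.v[i] + a.y.v[i] * b.y.v[i] + a.z.v[i] * b.z.v[i];
    return r;
}
```

With this layout, lane i of dot(cross(edge, packedV - p), dir) is exactly the scalar sign test for edge i, which is why the triangle only needs one comparison mask at the end.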
We can do a little better by pre-packing the vertices and edges in the constructor, which gives us the final version of this code:
bool hit(const math::Ray& r, float tMax, HitRecord& collision) const
{
    auto p0 = r.origin();
    auto p = replicate(p0);
    // packedV and edge are now precomputed in the constructor
    auto a = cross(edge, packedV - p);
    auto dir = replicate(r.direction());
    auto zero = math::float4(0.f);
    if ((dot(a, dir) <= zero).none())
    {
        auto h0 = v[0] - p0;
        float t = dot(mNormal, h0) / dot(r.direction(), mNormal);
        if (t >= 0.0 && t < tMax)
        {
            auto p = r.at(t);
            collision.t = t;
            collision.p = p;
            collision.normal = mNormal;
            return true;
        }
    }
    return false;
}
This runs at an average of 21.6 seconds, with all runs under 22 seconds, and triangle code is now 11% of our running time. That's a nice improvement of 7% over the non-SIMD version, and an overall improvement of more than 13% since the beginning of the post, not too far from the theoretical limit of 21 seconds. I also tried pre-replicating the ray components, but it didn't make any meaningful difference, probably because it means we either have to extract the x component anyway, or carry an extra parameter around. Anyway, 13% faster is quite a boost for optimizing a single function :)
Next
In future posts, if we want to keep gaining performance, we'll probably move on to algorithmic optimizations, or review our memory layout. Or maybe we'll take a different direction and start working on a nice PBR material system.