Maximiliano Levi

Porting CPU C++ code to CUDA

What changed when I moved parts of my raytracer from regular C++ code to CUDA kernels.

Recently I spent some time moving parts of my raytracer from regular C++ code to CUDA so the GPU could handle the expensive ray intersection work.

At a high level the goal sounds simple: take the code that shoots rays and checks for collisions and run it on the GPU instead of the CPU.

In practice, the interesting part is that CUDA is not just "faster C++". Some patterns that feel very natural on the CPU become awkward or expensive once the code has to run inside a kernel.

This post is a quick overview of what had to change and why.

The original CPU structure

The raytracer already had a pretty standard structure. A Scene contains geometry, a Camera casts rays, and those rays are tested against the scene to compute the color of each pixel.

The expensive part is collision detection. For any non-trivial scene, testing every ray against every triangle would be far too slow, so the engine uses a BVH (bounding volume hierarchy). That lets us discard large parts of the scene early by first testing the ray against bounding boxes.

On the CPU this kind of code is very comfortable to write recursively. A simplified version looks like this:

bool Hit(Node* node, const Ray& ray) {
    if (!node->box.Hit(ray)) return false;

    if (node->is_leaf) {
        return HitTriangles(node, ray);
    }

    return Hit(node->left, ray) || Hit(node->right, ray);
}

That is easy to read and easy to extend. Unfortunately it is not a great fit for GPU execution.

CUDA is a different environment

When I first approached the port I made the mistake of thinking mostly in terms of syntax. CUDA lets you write code that looks a lot like C++, so it is tempting to assume you can move whole classes over with just a few annotations.

That is not really how it goes.

The important difference is that CUDA code runs in a separate execution environment with its own memory and its own constraints. Data has to be copied to the GPU, kernels need simple memory layouts, and some features that are convenient on the CPU are much less convenient on the GPU.

For this project, the two main friction points were:

  1. Recursive traversal
  2. Object-oriented data structures with pointers

Those two things show up everywhere in a CPU raytracer.

Why recursion becomes a problem

Recursive tree traversal is the most natural way to walk a BVH, but it is not something I wanted to rely on in CUDA kernels.

NVIDIA does support recursion in device code in some cases, but it comes with practical limitations and is generally something you want to avoid in hot code like ray traversal. It is harder to reason about stack usage, and it pushes the implementation away from the simple data-parallel model that GPUs like best. The CUDA programming guide also calls out recursion-related limits and device stack-size considerations.

So instead of trying to preserve the recursive CPU traversal exactly, I rewrote it as an iterative depth-first search using an explicit stack.

This is the key idea behind the GPU BVH implementation in gpu_bvh.cu.

Flattening the BVH

Another important change was the representation of the tree itself.

On the CPU, a BVH node can naturally point to its left and right child using regular pointers. That is flexible, but it is not an ideal layout to copy to the GPU. Pointer-heavy structures are awkward to serialize, difficult to move between host and device, and usually not great for memory locality.

The solution I used was:

  1. Build the BVH on the CPU
  2. Traverse that CPU BVH once
  3. Generate a flat array of GPU nodes and a flat array of GPU triangles
  4. Copy those arrays to GPU memory

That is exactly what the TraverseBvh and FromBvh functions are doing in the repository.

Instead of storing child pointers, each GPUBvhNode stores integer indices for its children. Leaf nodes store indices to triangles, while internal nodes store indices to other nodes. This makes the structure much easier to upload and traverse on the GPU.
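To make the index scheme concrete, here is a minimal host-side sketch of the flattening step. This is not the repository's actual TraverseBvh or FromBvh code; the struct layouts and field names (Node, GPUBvhNode, first_tri, and so on) are hypothetical, and the real types carry more data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-side node: children are pointers.
struct Node {
    float box_min[3] = {}, box_max[3] = {};
    Node* left = nullptr;
    Node* right = nullptr;
    bool is_leaf = false;
    int first_tri = 0, tri_count = 0;  // valid when is_leaf
};

// Hypothetical flat GPU node: children and triangles are indices.
struct GPUBvhNode {
    float box_min[3] = {}, box_max[3] = {};
    int left = -1, right = -1;         // indices into the node array
    int first_tri = 0, tri_count = 0;  // indices into the triangle array
    bool is_leaf = false;
};

// Depth-first flattening: reserve a slot for this node, recurse into
// the children, then patch the children's indices back in.
// Returns the index of the emitted node.
int Flatten(const Node* node, std::vector<GPUBvhNode>& out) {
    int index = static_cast<int>(out.size());
    out.push_back({});  // placeholder, filled in below

    GPUBvhNode flat{};
    for (int i = 0; i < 3; ++i) {
        flat.box_min[i] = node->box_min[i];
        flat.box_max[i] = node->box_max[i];
    }
    flat.is_leaf = node->is_leaf;
    if (node->is_leaf) {
        flat.first_tri = node->first_tri;
        flat.tri_count = node->tri_count;
    } else {
        flat.left = Flatten(node->left, out);
        flat.right = Flatten(node->right, out);
    }
    out[index] = flat;
    return index;
}
```

After this pass, `out.data()` is a single contiguous buffer that can be copied to device memory in one transfer, which is exactly why the pointer tree is worth giving up.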

In other words, the CPU version owns a tree of objects, while the GPU version owns a few compact arrays.

Replacing recursion with an explicit stack

Once the BVH is flattened, traversal becomes an iterative loop.

The nice thing about depth-first traversal is that the recursive call stack can be replaced by a very small fixed-size array. In the GPU code I used:

const int MAX_STACK_SIZE = 64;

The comment in the source explains the reasoning well: for a binary tree, the height grows roughly with log2(N), so a stack of 64 entries is enough for an absurdly large number of nodes.
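As a quick sanity check on that bound (my arithmetic, not code from the repository): a balanced binary tree over N leaves has about log2(N) levels, and a depth-first traversal only keeps roughly one pending node per level on the stack.

```cpp
#include <cstdint>

// Rough worst-case stack depth for DFS over a balanced binary tree:
// about one pending sibling per level, plus the node being visited.
int StackDepthNeeded(std::uint64_t leaf_count) {
    int depth = 1;
    while (leaf_count > 1) {
        leaf_count = (leaf_count + 1) / 2;  // move one level up the tree
        ++depth;
    }
    return depth;
}
```

A scene with a million triangles needs a stack of only about 21 entries, so 64 leaves an enormous safety margin.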

The traversal then looks roughly like this:

GPUBvhNode stack[MAX_STACK_SIZE];
size_t count = 0;
stack[count++] = root;

while (count > 0) {
    GPUBvhNode node = stack[--count];

    if (!node.Hit(ray, t_min, closest_so_far))
        continue;

    if (!node.is_leaf) {
        // Children are looked up by index in the flat node array.
        stack[count++] = right_child;
        stack[count++] = left_child;
    } else {
        // test leaf triangles
    }
}

This is still a depth-first search, just written in a way that works better inside a CUDA kernel.

The traversal algorithm stayed the same. The main change was representing the BVH as flat arrays and replacing recursive calls with an explicit stack.

Materials and object ownership

The same kind of flattening happened for materials.

On the CPU it is convenient for triangles to reference material objects directly, but on the GPU you want a much simpler representation. In the port, materials are created on the CPU, converted into GPUMaterial values, stored in a flat array and then referenced by index when triangles are uploaded.

There is a small but important detail in the code: materials are deduplicated before being copied. The AddTriangle helper keeps a map from CPU material id to GPU material index so the same material is only uploaded once.

That is a practical detail that shows up often in CUDA code. Once the data is moved into flat arrays, you need to keep track of indices, deduplicate shared data and copy everything to device memory explicitly.
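The dedup bookkeeping is simple enough to sketch. This is not the repository's actual AddTriangle helper, just the map-from-id-to-index idea in isolation; GPUMaterial here is a hypothetical stand-in for the real struct.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical flat GPU material (the real GPUMaterial carries more).
struct GPUMaterial {
    float albedo[3];
};

// Maps a CPU-side material id to its index in the flat GPU material
// array, so each distinct material is stored (and uploaded) only once.
class MaterialTable {
public:
    int GetOrAdd(std::uint64_t cpu_material_id, const GPUMaterial& material) {
        auto it = index_by_id_.find(cpu_material_id);
        if (it != index_by_id_.end()) return it->second;  // already stored
        int index = static_cast<int>(materials_.size());
        materials_.push_back(material);
        index_by_id_.emplace(cpu_material_id, index);
        return index;
    }

    const std::vector<GPUMaterial>& materials() const { return materials_; }

private:
    std::unordered_map<std::uint64_t, int> index_by_id_;
    std::vector<GPUMaterial> materials_;
};
```

Triangles then store the returned index instead of a material pointer, and the `materials()` array is copied to the device in a single transfer.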

What actually stayed the same

One thing I like about this port is that the high-level renderer design did not need to be thrown away.

The scene is still a scene. Rays are still rays. A BVH is still a BVH. The intersection math is the same, the shading model is the same, and the overall rendering pipeline is still recognizable.

The CPU version can afford richer object graphs and recursive logic because that makes the code easier to express. The GPU version benefits from compact arrays, explicit indices and iterative traversal because that makes execution simpler and memory access more predictable.