What's New in V3 Renderer Core


Overhaul of the Integration Kernels

In Octane, the kernels are the heart of the render engine.

Since the beginning of Octane, the integration kernels had one CUDA thread calculate one complete sample. This was changed for various reasons, the main one being that the integration kernels became huge and impossible to optimize. OSL and OpenCL are also difficult to implement this way. To solve the problem, the task of calculating a sample was split into smaller steps, which the CUDA threads then process one by one. As a result, many more kernel calls happen than in the past.
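The split described above can be illustrated with a small sketch. This is not Octane's actual code; the stage names (`trace_step`, `shade_step`) and the per-sample state are assumptions made purely to show the idea of many small launches with state kept alive between them:

```python
# Hypothetical sketch of the step-split approach: instead of one huge
# kernel computing a full sample, each "launch" runs one narrow step,
# and per-sample state must persist between launches (hence the extra
# GPU memory the text mentions).

def trace_step(state):
    # Launch 1: advance the ray one bounce (illustrative only).
    state["depth"] += 1
    return state

def shade_step(state):
    # Launch 2: accumulate shading for the current bounce.
    state["radiance"] += 0.5 ** state["depth"]
    return state

def render_sample_split(num_bounces):
    # State kept alive between launches instead of living in registers
    # of one megakernel.
    state = {"depth": 0, "radiance": 0.0}
    for _ in range(num_bounces):  # many more kernel calls than before
        state = trace_step(state)
        state = shade_step(state)
    return state["radiance"]
```

Each loop iteration stands in for two separate kernel launches, which is why the CPU ends up issuing far more launches than with the old one-thread-per-sample design.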


There are two major consequences of this new approach: Octane needs to keep information for every sample that is calculated in parallel between kernel calls, which requires additional GPU memory. And the CPU is stressed a bit more, since it has to do more work to issue many more kernel launches. To give some control over kernel execution, two options were added to the direct lighting, path tracing, and info channel kernel nodes.


Comparison of VRAM/RAM Usage Capabilities

Here is the comparison between V2 and V3:

Render buffers

Textures

Triangle count: max. 19.6 million (V2) vs. max. 76 million (V3)

It's difficult to quantify the performance impact, but the old system was hard to beat in scenes where the samples of neighboring pixels are very coherent (similar): the CUDA threads did almost the same work and didn't have to wait for each other.


The problem is that in real production scenes, the execution of CUDA threads diverges very quickly, causing threads to wait a long time for other threads to finish their rendering tasks. For these more complex scenes, the new system usually works better, since coherency is increased by the way each step is processed. The kernels can also be optimized more, because the scope of each task is much narrower.
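One way to picture the coherency gain: when work is broken into steps, samples waiting on the same kind of step can be processed together, so the threads in a batch execute identical code. The sketch below is an assumption about the general technique, not Octane's implementation; the step names are hypothetical:

```python
from collections import defaultdict

# Hypothetical illustration: group in-flight samples by the next step
# they need, so each launch processes a batch doing the same work and
# threads within a batch do not diverge.

def group_by_next_step(samples):
    batches = defaultdict(list)
    for s in samples:
        batches[s["next_step"]].append(s)
    return batches

samples = [
    {"id": 0, "next_step": "trace"},
    {"id": 1, "next_step": "shade"},
    {"id": 2, "next_step": "trace"},
]
batches = group_by_next_step(samples)
# Two launches instead of three divergent threads: one "trace" batch
# with samples 0 and 2, one "shade" batch with sample 1.
```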


Moved Film Buffers to the Host and Tiled Rendering

The second major overhaul of the render core was the way render results are stored. Until V3, each GPU had its own film buffer where part of the calculated samples were aggregated. This had various drawbacks: for example, a CUDA error usually meant that the samples calculated by that GPU would be lost. Another problem was that large images meant a large film buffer, especially with render passes enabled.


To solve these issues, the film buffer was moved into host memory. This means that Octane has to deal with the huge amount of data that GPUs produce, especially in multi-GPU setups or when network rendering is used. As a solution, tiled rendering was introduced for all integration kernels except PMC (where tiled rendering is not possible). The tiles are relatively large in comparison to most other renderers.
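Tiled rendering of this kind can be sketched in a few lines. The tile size below is an arbitrary assumption (the text only says Octane's tiles are relatively large), and the function is illustrative rather than Octane's actual scheme:

```python
# Hypothetical sketch: split a film buffer of width x height pixels
# into tile rectangles (x, y, w, h). Edge tiles are clipped so the
# grid exactly covers the image.

def make_tiles(width, height, tile_size=256):
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tiles.append((x, y,
                          min(tile_size, width - x),
                          min(tile_size, height - y)))
    return tiles
```

Each GPU then renders one tile at a time and sends the finished tile to the host, instead of keeping a full-resolution film buffer in VRAM.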


The film buffer in system memory means more memory usage, so make sure that you have enough RAM installed before you increase the resolution. Another consequence is that the CPU has to merge render results from the various sources, like local GPUs or net render slaves, into the film buffers, which requires some computational power. This area has been optimized, but there is obviously an impact on CPU usage. Increasing the max. tile samples option in the kernels reduces the overhead accordingly. Info passes are now rendered in parallel as well.
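The CPU-side merge is essentially a sample-weighted accumulation. The following is a minimal sketch under that assumption (per-pixel, scalar values for brevity); it also shows why a larger max. tile samples value helps, since tiles then arrive carrying more samples each and fewer merges are needed per sample:

```python
# Hypothetical sketch of merging an incoming tile result into the host
# film buffer, weighting each side by its sample count so the running
# average stays correct.

def merge_pixel(acc_value, acc_samples, tile_value, tile_samples):
    total = acc_samples + tile_samples
    merged = (acc_value * acc_samples + tile_value * tile_samples) / total
    return merged, total
```

For example, a film-buffer pixel averaging 1.0 over 2 samples merged with a tile pixel averaging 4.0 over 2 samples yields 2.5 over 4 samples.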


Overhauled Work Distribution in Network Rendering

How render work is distributed to net render slaves and how their results are sent back has been modified to make it work with the new film buffer. The key issue is that transmitting samples to the master is one to two orders of magnitude slower than generating them on the slave. The only way to solve this is to aggregate samples on the slaves and decouple the work distribution from the result transmission. A side benefit is that rendering large resolutions (like stereo GearVR cube maps) no longer throttles slaves.


Of course, caching results on the slaves means that they require more system memory than in the past, and if the tiles rendered by a slave are distributed uniformly, the slave will produce a large number of cached tiles that eventually need to be transmitted to the master. That is, after all samples have been rendered, the master still needs to receive all those cached results from the slaves, which can take quite some time. To solve this problem, an additional option was introduced for the kernel nodes that support tiled rendering:

Minimize net traffic. If enabled, the master distributes only the same tile to the net render slaves until the max samples/pixel has been reached for that tile, and only then distributes the next tile to the slaves. Work done by local GPUs is not affected by this option. This way, a slave can merge all its results into the same cached tile until the master switches to a different tile.
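The scheduling policy described above can be sketched as follows. This is a hypothetical model, not Octane's networking code: tiles, sample budgets, and the amount of work per slave request are all made-up parameters chosen to show how the master keeps handing out one tile until it is finished:

```python
# Hypothetical sketch of the "minimize net traffic" policy: every
# incoming slave request receives the *current* tile until that tile
# has accumulated max_samples, then the master advances to the next
# tile. Because a slave keeps getting the same tile, it can merge all
# its results into one cached tile before transmitting it.

def schedule(tiles, max_samples, samples_per_request, requests):
    current, done, handed_out = 0, 0, []
    for _ in range(requests):
        if current >= len(tiles):
            break                          # all tiles finished
        handed_out.append(tiles[current])  # same tile until complete
        done += samples_per_request
        if done >= max_samples:            # tile done -> next tile
            current, done = current + 1, 0
    return handed_out
```

With two tiles, a budget of 4 samples/pixel, and 2 samples per request, the master hands out tile A twice, then tile B twice, so each slave's cached results for a tile are merged locally before the switch.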