OpenCL

FangQ · ‎03-23-2019

I have an OpenCL code that does Monte Carlo photon transport simulations in a voxelated space (https://github.com/fangq/mcxcl). The code involves simulating a large number of random photon trajectories, each in a thread, and saving some floating-point quantities (related to energy loss of the photon) to the voxel grid that it traverses. Therefore, it has a lot of random memory writing operations.

Recently, I ported another similar code (one that does ray-tracing in a mesh instead of a voxel space) (https://github.com/fangq/mmc) to the GPU using OpenCL. I was hoping that the new mesh-based OpenCL code would have a speed comparable to the voxel-based one. However, I was surprised.

On NVIDIA GPUs, the mesh-based code is slower than the voxel-based code, but not by a big margin (1.5x-2x slower). However, on AMD GPUs (I tested 4 generations: R9 nano, RX480, Vega64 and Vega II), the mesh-based code is significantly slower (>10x slower) than the voxel-based code.

I was very much puzzled by this result, so I did some further analysis. I narrowed it down to global memory writing. In my kernel, there is only one single line responsible for memory writing - if I comment out that line, the speed on Vega64 and VegaII jumps by 20-fold! The rest of the computation is exactly the same.

I did a similar test on the voxel-based code, and also removed the memory writing lines (https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L1017-L1029). I see that both NVIDIA and AMD GPUs got a 2x speed increase.

So, it is clear that memory writing is a big issue in both of my OpenCL simulators, but somehow, the memory latency is significantly worse for the mesh-based simulator on AMD GPUs. I even printed the total number of global-memory writes (atomicadd) for the voxel-based and mesh-based codes: the mesh-based code has 10% fewer atomicadd calls than the voxel-based one, yet it is >10x slower.

Now I am trying to understand why the voxel-based code is less impacted by global memory writing than the mesh-based code. I can see that for the voxel-based code, the workflow looks like

voxel_kernel_thread{
   for each photon{
      move_one_voxel_a_time{
         do_ray_voxel_ray_tracing()
         current_voxel=get_voxel_id(p.x,p.y,p.z);
         if(current_voxel!=last_voxel){
            atomicadd(last_voxel, last_energy);// writing to voxel when moving out
            last_voxel=current_voxel;
         }
      }
   }
}

This way, I write to memory at most once per voxel that a photon traverses.

While in my mesh-based code, I have the following structure

mesh_kernel_thread{
   for each photon{
      move_one_tetrahedron{
         do_ray_tet_ray_tracing()
         for each voxel along the ray{
            current_voxel=get_voxel_id(p.x,p.y,p.z);
            if(current_voxel!=last_voxel){
               atomicadd(last_voxel,last_energy);//writing to voxel when moving out
               last_voxel=current_voxel;
            }
         }
      }
   }
}

So, for the voxel kernel, the sequence of events is like

ray_tracing()
atomicadd();
ray_tracing()
atomicadd();
ray_tracing()
atomicadd();
ray_tracing()
ray_tracing()
atomicadd();
ray_tracing()
atomicadd();
ray_tracing()
atomicadd();
....

while in the mesh kernel, I see a pattern like

ray_tracing()
atomicadd();
atomicadd();
atomicadd();
ray_tracing()
atomicadd();
atomicadd();
atomicadd();
ray_tracing()
atomicadd();
atomicadd();
...

So, for the mesh kernel, the total number of ray_tracing() calls is significantly lower than in the voxel kernel (10-20x fewer), and the number of atomicadd() calls is about 10% lower. However, the atomicadd() calls become more clustered: on average, 9 atomicadd() calls are issued one after another before the next ray_tracing() call.
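To make "clustered" concrete, here is a minimal host-side C sketch that measures the average burst length of consecutive atomicadd() calls in an event trace. The trace encoding ('R' for a ray_tracing() call, 'A' for an atomicadd() call) and the function name mean_atomic_burst are made up for this illustration; they are not part of mcxcl or mmc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a trace string in which 'R' marks a
 * ray_tracing() call and 'A' marks an atomicadd() call, return the mean
 * number of consecutive 'A' events per burst. A burst is a maximal run of
 * one or more 'A' characters. Returns 0.0 if the trace has no 'A'. */
double mean_atomic_burst(const char *trace)
{
    int bursts = 0;  /* number of runs of 'A' seen so far */
    int total = 0;   /* total number of 'A' events in closed runs */
    int run = 0;     /* length of the run currently being scanned */

    assert(trace != NULL);

    for (const char *p = trace; ; ++p) {
        if (*p == 'A') {
            ++run;
        } else {
            /* Any other character (or the terminator) closes the run. */
            if (run > 0) {
                ++bursts;
                total += run;
                run = 0;
            }
            if (*p == '\0') {
                break;
            }
        }
    }
    return bursts ? (double)total / bursts : 0.0;
}
```

Encoding the two short sequences listed above, the voxel pattern "RARARARRARARA" gives a mean burst length of 1, while the mesh pattern "RAAARAAARAA" gives 8/3, or about 2.7. These fragments are only excerpts; the average of 9 quoted above comes from the full simulation.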

This seems to make AMD GPUs unhappy: despite fewer total memory transactions, the clustered memory writes make the AMD GPU about 10x slower than a comparable NVIDIA GPU, and also about 10x slower than the voxel kernel on the same AMD GPU.

So, my question is: do you have any suggestions on how to mitigate the memory latency in this scenario? The output data (a 3D volume) is unfortunately too big to fit into shared memory.

My voxel-based code can be found at https://github.com/fangq/mcxcl. The mesh-based code is still under development, but I am happy to share it offline by email if you think it would help debug the problem.

Thank you in advance for any helpful input.

Need tips to hide memory latency - 20x speed-loss when writing to memory