Hi, I'm working on an OpenCL-based simulation project, and I've been running it on OpenCL implementations from the three major vendors (AMD, Intel, Nvidia). So far, the simulation runs fastest on the AMD implementation (although the AMD hardware is also the most powerful of the three in my small sample). There is, however, a significant stall between iterations of the simulation, when values are read from the device and written to disk. Strangely, this stall appears only on AMD hardware, which reduces the performance advantage I'm seeing with AMD.
For a minimal test case, I've included an augmented example program from the pyopencl distribution. There seems to be a resource-utilization threshold that triggers the stall: when running the attached program, you should see the stall around line 52, and it shrinks when the number of loop iterations is reduced.
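In case it helps anyone reproduce this, here is a rough sketch of how the stall could be timed separately from the kernel work. It is not the attached program: `time_phases`, `compute`, and `read_back` are hypothetical names I'm using for illustration, and the pyopencl calls appear only in comments as an assumption about what each phase would contain.

```python
import time

def time_phases(compute, read_back, iterations):
    """Time the compute phase and the read-back phase of each iteration.

    `compute` and `read_back` are caller-supplied callables (hypothetical
    stand-ins for the real simulation steps). Returns a list of
    (compute_seconds, read_back_seconds) tuples, one per iteration.
    """
    timings = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        # Assumed contents: enqueue the kernel, then call queue.finish()
        # so that pending device work is charged to this phase.
        compute()
        t1 = time.perf_counter()
        # Assumed contents: cl.enqueue_copy(queue, host_array, device_buffer)
        # followed by writing host_array to disk.
        read_back()
        t2 = time.perf_counter()
        timings.append((t1 - t0, t2 - t1))
    return timings
```

Comparing the second number in each tuple across the three implementations should show whether the extra time really falls in the read-back step, or whether it is kernel work that only gets flushed when the read is issued.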
Please let me know if there is a possible fix for this stall in processing, or if there is any more information I could provide. I would very much like to be able to run the simulation I'm working on at full speed via AMD.
1) AMD Radeon R9 285, driver: OpenCL 2.0 AMD-APP (1800.11)
2) Intel HD 5500 Broadwell U-Processor, driver: OpenCL 1.2 beignet 1.1.1
3) Nvidia GTX 650, driver: OpenCL 1.2 CUDA 7.5.18
Thanks for reading!