I've written an OpenCL application that runs much faster on Linux when kernels are called concurrently on R9 290(X) devices, because only one of the three kernels has high register usage. On GCN-based devices other than Hawaii, performance scales very well with CU count and GPU core frequency on both Windows and Linux. This is my problem and my question:
with best regards,
I do not want to publish the device code, but I can provide a link to the LLVM IR binary or to native binaries.
Here is an update to my initial post, with the execution times of the three kernels:
a) 1st kernel (mainly limited by compute resources, but also has high LDS and register usage): ~0.045s
b) 2nd kernel (high LDS usage): ~0.010s
c) 3rd kernel (high LDS usage): ~0.014s
The average execution time for all three kernels together is ~0.046s with two threads per GPU. That is well below the ~0.069s sum of the individual times, so the execution of the 1st kernel has to overlap with that of the 2nd and 3rd kernels. On Windows with an R9 290, the average execution time is the sum of the execution times of the kernels, even with two threads per GPU. On Linux, an R9 290 is more than 25% faster than on Windows, but if the kernels overlapped as they do on an R9 280X, I would expect better performance. Perhaps only the execution of the 1st and 2nd kernels overlaps.
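As a rough check of the overlap argument, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the names `kernel_ms` and `overlap_summary` are made up, and it assumes the three per-kernel times listed above were measured on the same device as the ~0.046s average.

```python
# Per-kernel times from the list above, in milliseconds.
# Integers are used to avoid floating-point rounding in the sums.
kernel_ms = {"kernel_1": 45, "kernel_2": 10, "kernel_3": 14}

# Reported average time for all three kernels with two threads per GPU.
measured_avg_ms = 46


def overlap_summary(kernel_ms, measured_ms):
    """Compare the measured time with a fully serialized execution."""
    serial_ms = sum(kernel_ms.values())    # time if nothing overlaps
    longest_ms = max(kernel_ms.values())   # lower bound with perfect overlap
    return {
        "serial_ms": serial_ms,
        "longest_ms": longest_ms,
        # Time saved compared with running the kernels back to back.
        "hidden_ms": serial_ms - measured_ms,
        # Combined time of everything except the longest kernel.
        "short_kernels_ms": serial_ms - longest_ms,
    }


summary = overlap_summary(kernel_ms, measured_avg_ms)
print(summary)
```

Under these assumptions, about 23 ms of the 24 ms taken by the 2nd and 3rd kernels is hidden behind the 1st kernel, which is why the measured average sits so close to the 1st kernel's time alone. A fully serialized run, as described for Windows with the R9 290, would take roughly the 69 ms sum instead.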