Archives Discussions

nan · ‎02-11-2014

Hi,

I've written an OpenCL application that runs much faster on Linux when kernels are launched concurrently on R9 290(X) devices, because only one of its three kernels has high register usage. On other GCN-based devices (i.e. excluding Hawaii), performance scales very well with CU count and GPU core frequency on both Windows and Linux. Here are my problem and my questions:

Hawaii devices show a significant performance drop on Windows compared to Linux (about one third of the performance is lost), because it seems to be impossible to execute kernels in parallel on the device. Is this caused by different feature sets in the Windows and Linux Catalyst drivers?
If a monitor is attached to the GPU and kernels are launched concurrently on Windows, high CPU usage is observed (on any GCN-based device). The CPU does nothing but launch OpenCL kernels and read/write a few bytes of device memory. Is this a driver bug?

with best regards,

NaN

nan · ‎02-14-2014

Is there any way to get in direct contact with AMD engineers? The OpenCL driver bugs and inconsistencies are quite annoying.

pinform · ‎02-18-2014

Thanks for your question. I have reported this to the team, and will post an update as soon as I get one.

sudarshan · ‎03-02-2014

Hi nan,

Would it be possible for you to share the code you have written? I would like to verify it at my end.

Thanks,

- Sudarshan

amd_support · ‎03-03-2014

Hi Nan,

Could you share the host and device code?

Thanks,

-AMD Support

nan · ‎04-08-2014

Hi,

I do not want to publish the device code, but I can provide a link to the LLVM IR binary or to native binaries.

Here is an update to my initial post:

If multiple threads run on one GPU, they execute the same task on completely different data sets (e.g. buffers).
The high CPU usage bug seen on R9 290(X) does not show up with the Catalyst 14.x beta drivers, but it does show up with, for example, Catalyst 13.12.
Some information about my kernels and execution times on R9 280X:

a) 1st kernel (mainly limited by compute resources, but also has high LDS and register usage): ~0.045s

b) 2nd kernel (high LDS usage): ~0.010s

c) 3rd kernel (high LDS usage): ~0.014s

The average execution time of all three kernels is ~0.046s with two threads per GPU. Since the individual times add up to ~0.069s, the execution of the 1st kernel has to overlap with that of the 2nd and 3rd kernels. On Windows with an R9 290, the average execution time is the sum of the kernel execution times, even with two threads per GPU. On Linux, an R9 290 is more than 25% faster than on Windows, but if the kernels overlapped as they do on an R9 280X, I would expect better performance. Perhaps only the execution of the 1st and 2nd kernels overlaps.
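
The overlap argument above can be checked with a few lines of arithmetic. The following is only a sketch using the R9 280X figures quoted in this post. The variable names are made up for illustration, and it assumes that the quoted average is the effective time per pass through all three kernels.

```python
# Kernel execution times on the R9 280X as quoted in the post (seconds).
kernel_times = {"kernel_1": 0.045, "kernel_2": 0.010, "kernel_3": 0.014}

# Average time for all three kernels with two threads per GPU (from the post).
# Assumption: this is the effective time per pass through all three kernels.
measured_avg = 0.046

# Time one pass would take if the kernels ran strictly one after another.
serial_time = sum(kernel_times.values())

# Time that must have been hidden by running kernels concurrently.
hidden_time = serial_time - measured_avg

# How much slower a device becomes if it serializes the same kernels,
# and the corresponding fraction of performance that is lost.
slowdown_if_serial = serial_time / measured_avg
lost_fraction = 1.0 - measured_avg / serial_time

print(f"serial sum:         {serial_time:.3f} s")
print(f"hidden by overlap:  {hidden_time:.3f} s")
print(f"slowdown if serial: {slowdown_if_serial:.2f}x")
print(f"performance lost:   {lost_fraction:.1%}")
```

With these figures the serial sum is 0.069s, so roughly 0.023s of the 0.024s spent in the 2nd and 3rd kernels is hidden behind the 1st kernel. A device that serializes the kernels instead would lose about one third of its throughput.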

Concurrent OpenCL-kernel execution and Windows/Linux driver differences (Hawaii GPU)