Recently, I ported a CPU code to OpenCL, and it has been debugged and tested (on a GTX 1060).
The computation in this code is iterative. The result of each step is reported as a residual (the difference between the result of the latest iteration step and that of the previous one). The gradual decrease of the residual is called convergence.
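To make the term concrete, here is a minimal sketch in C of such a residual. It assumes the residual is the largest absolute change between two consecutive iterates; the function name max_abs_residual and the choice of norm are assumptions for illustration, not taken from the actual solver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest absolute change between two iterates.
 * Returns NaN as soon as a NaN difference is seen, so a diverged
 * step is not hidden by the "greater than" comparison below. */
double max_abs_residual(const double *prev, const double *curr, size_t n)
{
    assert(prev != NULL && curr != NULL);

    double res = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = curr[i] - prev[i];
        if (d != d) {           /* NaN never compares equal to itself */
            return d;
        }
        if (d < 0.0) {
            d = -d;             /* absolute value without <math.h> */
        }
        if (d > res) {
            res = d;
        }
    }
    return res;
}
```

Printing this value after every step on each device would give one number per iteration to compare across the CPU, the GTX 1060 and the AMD cards.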
My environment is VS2015, configured to use fp64. The kernels are enqueued one after another on the same command queue, and their order is constrained by waiting on cl_event objects.
This works on NVIDIA cards but fails on AMD cards.
The symptoms are:
1. Using CUDA 9.2 (OpenCL 1.2) on the GTX 1060, the results of each iteration step are almost the same as the CPU results (there is only a slight deviation after the 12th decimal place).
2. On AMD cards, the behavior differs between Debug and Release (the only change is switching between Debug and Release in VS):
(a) When the Debug build is run directly, the iteration diverges within a few steps (the result becomes NaN). The values output at each step are incorrect and appear random (they differ from run to run).
(b) The Release build runs without diverging, but the value at each iteration step is quite different from that of the CPU and the GTX 1060.
(c) When debugging, if I step over a function that contains more than one clEnqueueNDRangeKernel() call, the result is wrong; but if I step into that function and execute it line by line, the result is correct.
(d) Changing the AMD driver version, swapping the AMD graphics card (I have two R9 390s and one R9 280X), or adding OpenCL compilation options (such as -cl-std=CL1.2 -cl-opt-disable) has no effect.
(e) Although out-of-order execution is not enabled by default, I suspected that the kernels were not executing in the order given in the code. I registered a callback with clSetEventCallback() to monitor when each one is triggered, but found that the order is correct.
To sum up, the result is correct only when I step through the code line by line, which makes no sense to me.
Just my 2 cents...
I don't know if it helps you, but I had a similar problem.
The AMD OpenCL compiler aligns structs/unions very differently from VS. I had a nightmare before I realized that. Now I test/print the size of every struct from both VS and OpenCL. Sometimes the sizes of structs (especially complex ones) are not the same, and since the compiler is fairly intelligent, it tends to over-optimize the code.
VS/CUDA work nicely together; there are no such problems.
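A minimal sketch of the host-side half of that size check, in plain C. The struct SolverParams and its fields are made up for illustration (nothing here comes from the original code); the idea is to insert padding explicitly so the layout does not depend on what each compiler chooses, and then to report the size and member offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical parameter block shared between host and kernel.
 * The padding is written out by hand so that both compilers see
 * the same layout instead of each choosing its own. */
typedef struct {
    int    n;      /* offset 0, 4 bytes */
    int    pad0;   /* explicit padding, so dt starts at offset 8 */
    double dt;     /* offset 8 */
    double omega;  /* offset 16 */
} SolverParams;

/* Writes a one-line layout summary into buf and returns the number
 * of characters snprintf would have written. */
int describe_solver_params(char *buf, size_t len)
{
    assert(buf != NULL);
    return snprintf(buf, len, "size=%zu n@%zu dt@%zu omega@%zu",
                    sizeof(SolverParams),
                    offsetof(SolverParams, n),
                    offsetof(SolverParams, dt),
                    offsetof(SolverParams, omega));
}
```

The same struct declaration would then go into the .cl source, and a tiny kernel could write sizeof(SolverParams) into a buffer so the two numbers can be compared. If they differ, the kernel is reading the fields from the wrong offsets.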