Hi Himanshu, thank you for answering.
Well, I'm sorry, but I can't really post the code: it's a really big kernel, more than 2000 lines. I won't post a cut-down version for the moment either; my company wouldn't allow it anyway.
When I said the compatibility did not follow, I meant that it compiles fine, it just doesn't produce the right result, whereas NVIDIA cards do.
I can give you the following details:
I used the trick of casting a vector type to a pointer to access a vector component dynamically.
float f = ((float*)&vec)[i];
That kind of code didn't work well on the HD7950 card. I replaced it with this trick:
It fixed a couple of problems. From what I remember, the vector-type-to-pointer trick worked with float4 values that were not inside a structure (???), but maybe that is pure coincidence.
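One pointer-free way to pick a vector component at run time is a plain switch on the index. The sketch below only illustrates that idea in plain C and is not necessarily the replacement used here; `float4_t` and `get_component` are made-up names, with the struct standing in for OpenCL's built-in `float4`.

```c
/* Hypothetical stand-in for OpenCL's built-in float4,
   so that the sketch compiles as plain C. */
typedef struct { float x, y, z, w; } float4_t;

/* Select a component by runtime index without taking the
   vector's address or casting it to a float pointer.
   Any index other than 0, 1 or 2 falls through to w. */
static float get_component(float4_t v, int i)
{
    switch (i) {
    case 0:  return v.x;
    case 1:  return v.y;
    case 2:  return v.z;
    default: return v.w;
    }
}
```

Inside a kernel the same switch could be written against the real `float4` type; whether it avoids the miscompilation described above would have to be checked on the affected driver.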
From there it started to work better, and I finally found that adding volatile to some random float variable made the kernel functional. That's mainly why I think something is broken in the OpenCL-to-AMD IL process.
I never had any problems with small and medium-sized kernels, and I still think AMD OpenCL is very reliable, but there seem to be glitches when kernels become quite complex. The kernel I'm having difficulties with has 4 or 5 nested loops, many breaks and a load of conditional statements.
The kernel also relies on the warp/wavefront lockstep principle... I used 32/64 to define their width, so it should be OK there.
Thanks for sharing the Tips and Tricks page. You can probably check whether you can send the kernel through some private channel. I have added you as a friend on the forum; just check whether private messages allow attachments. I would also suggest starting a new thread if your issue is a different topic. The original thread creator is still active here.
Can your issue be reproduced using the repository code you shared in the beginning? If so, please give some steps to reproduce it. If things were working earlier and are now failing with a new driver, it is a very critical issue for us.
The problem probably also exists in the repository, but I haven't updated it yet since there are still issues and it will crash due to the compiler issue mentioned earlier.
However, I've made a smaller and clearer test case (attached as testcase.rar). Snapshots of the output from runs with drivers 12.10 and 13.4 are also attached.
This version is completely deterministic and hashes a single fixed key (instead of random keys). The program is clearly correct with driver 12.10 and broken with 13.4.
Sorry, but I can't find a way to send you private messages; the option is really well hidden.
Top right, select Create, then Direct message. (I haven't used it myself yet.)
As of driver version 13.9, the program and the test case run correctly if I disable optimizations with "-cl-opt-disable" when calling clBuildProgram (line 69 of testcase.c).
A more GPU-agnostic version I've made will run slowly without optimizations, which will force me to create separate versions for AMD and NVIDIA. But this is manageable, so my thread can be marked as solved.
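The per-vendor split could be as small as choosing the build options string from the device vendor. This is a rough sketch, not the code from testcase.c: the helper name is hypothetical, and it assumes the vendor string (for example from clGetDeviceInfo with CL_DEVICE_VENDOR) contains "Advanced Micro Devices" or "AMD" on AMD hardware.

```c
#include <string.h>

/* Hypothetical helper: returns the options string to pass to
   clBuildProgram. Assumption: the vendor string contains
   "Advanced Micro Devices" or "AMD" on AMD devices. */
static const char *build_options_for_vendor(const char *vendor)
{
    if (vendor != NULL &&
        (strstr(vendor, "Advanced Micro Devices") != NULL ||
         strstr(vendor, "AMD") != NULL)) {
        /* Work around the miscompilation by disabling optimizations. */
        return "-cl-opt-disable";
    }
    /* Other vendors keep the default optimization level. */
    return "";
}
```

The result would then go into the options argument, e.g. clBuildProgram(program, 1, &device, build_options_for_vendor(vendor), NULL, NULL), where program, device and vendor are assumed to have been obtained earlier in the host code.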