Hi. I have started writing an OpenCL program and it behaves strangely.
If I debug it with CodeXL and stop at some breakpoints, it works fine. But without the debugger and breakpoints my output is just a mess, and I have no idea why. It's my first program, so I'm quite sure I'm doing something wrong. I have attached my kernel; if anybody has a suggestion, please share it with me.
In this code:
block0[localIndex] = input[globalIndex];
//IP
if(localIndex < 32) {
L0[localIndex] = block0 [(localIdy * 2) + 57 - (localIdx * 8)];
}
You load data into local memory (block0), then read it back from addresses that other work-items wrote, but you don't synchronize in between. You need a barrier(CLK_LOCAL_MEM_FENCE) where the //IP comment is to make it work.
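To make the race concrete, here is a rough host-side C model of those two steps. It is only a sketch under assumptions that are not in the posted kernel: an 8x8 work-group with localIndex = localIdy * 8 + localIdx, and a hypothetical BATCH of 32 work-items that the hardware happens to run in lock-step. The function names are made up for illustration, and real scheduling is not this simple.

```c
#include <string.h>

#define WG_SIZE 64   /* assumed 8x8 work-group */
#define BATCH   32   /* hypothetical number of work-items running in lock-step */

/* Step 1 of the kernel: each work-item copies one element into local memory. */
static void load_phase(int first, int count, const int *input, int *block0) {
    for (int li = first; li < first + count; ++li)
        block0[li] = input[li];
}

/* Step 2: the first 32 work-items gather from slots written by other work-items. */
static void permute_phase(int first, int count, const int *block0, int *L0) {
    for (int li = first; li < first + count; ++li) {
        int idx = li % 8, idy = li / 8;   /* assumed index layout */
        if (li < 32)
            L0[li] = block0[(idy * 2) + 57 - (idx * 8)];
    }
}

/* Returns how many of the 32 outputs were read before their source was written. */
int count_stale_reads(int use_barrier) {
    int input[WG_SIZE], block0[WG_SIZE], L0[32];
    for (int i = 0; i < WG_SIZE; ++i) input[i] = 100 + i;
    memset(block0, 0, sizeof block0);   /* stands in for uninitialised local memory */
    memset(L0, 0, sizeof L0);

    if (use_barrier) {
        /* Models barrier(CLK_LOCAL_MEM_FENCE): all loads finish before any gather. */
        for (int b = 0; b < WG_SIZE; b += BATCH) load_phase(b, BATCH, input, block0);
        for (int b = 0; b < WG_SIZE; b += BATCH) permute_phase(b, BATCH, block0, L0);
    } else {
        /* No barrier: one batch may run both steps before the next batch starts. */
        for (int b = 0; b < WG_SIZE; b += BATCH) {
            load_phase(b, BATCH, input, block0);
            permute_phase(b, BATCH, block0, L0);
        }
    }

    int stale = 0;
    for (int li = 0; li < 32; ++li) {
        int src = ((li / 8) * 2) + 57 - ((li % 8) * 8);
        if (L0[li] != input[src]) ++stale;
    }
    return stale;
}
```

In this model count_stale_reads(1) gives no wrong values, while count_stale_reads(0) gives wrong values for every gather whose source index belongs to the batch that has not run yet. Stopping at a breakpoint probably lets every work-item finish its load first, which would explain why the output looks fine under the debugger.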
Yes, thank you, I had already found that out. There were plenty of cases where I had to synchronize (and others where I didn't), and now it's working well.
I don't know how badly these synchronizations affect performance; maybe I should write an implementation where I don't have to use them.
It can affect performance. My inclination is to never use a workgroup size that isn't 64 when targeting AMD hardware. Doing that means:
a) you can have more workgroups live (because on recent hardware we can manage a very large number of wavefronts, but only a small number of workgroups, due to the use of barrier resources)
b) the barriers will be optimised away, because they are not needed to synchronise within a single wavefront.
It's a vector architecture, so in many ways you are better off writing code to it as if it's a vector architecture rather than thinking of it as a set of fine-grained threads that synchronize.
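As a quick illustration of the arithmetic behind the 64 recommendation, here is a small C sketch. It assumes a wavefront of 64 work-items, as implied above; the helper names are hypothetical. In the kernel itself, the size can be pinned with the standard OpenCL attribute __attribute__((reqd_work_group_size(64, 1, 1))).

```c
#define WAVEFRONT_SIZE 64   /* assumed wavefront width, as implied in this thread */

/* Number of wavefronts needed to hold one work-group (rounded up). */
int wavefronts_per_group(int work_group_size) {
    return (work_group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* A barrier only has to coordinate hardware when the group spans
   more than one wavefront. */
int barrier_spans_wavefronts(int work_group_size) {
    return wavefronts_per_group(work_group_size) > 1;
}
```

With a work-group of 64 the group fits in one wavefront, so barrier_spans_wavefronts(64) is 0; with 256 it takes four wavefronts and the barrier has real work to do.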