
zoli0726
Journeyman III

Strange memory

Hi. I started writing an OpenCL program and it behaves strangely.

If I debug it with CodeXL and stop at some breakpoints, it works fine. But without debugging and breakpoints, my output is just a mess, and I have no idea why this happens. It's my first program, so I'm quite sure I'm doing something wrong. I've attached my kernel; if anybody has a suggestion, please share it with me.

1 Solution
LeeHowes
Staff

In this code:

block0[localIndex] = input[globalIndex];
//IP
if(localIndex < 32) {
  L0[localIndex] = block0[(localIdy * 2) + 57 - (localIdx * 8)];
}

You write into local memory, then read back from it at different addresses, but you don't synchronize in between. Each work-item reads a slot that another work-item is responsible for writing, so you need a barrier where the //IP comment is to make it work.
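In the kernel itself the fix is a single barrier(CLK_LOCAL_MEM_FENCE); call at the //IP line. To see why it matters, here is a rough host-side sketch in plain C (not OpenCL code) that emulates one possible execution order of a work-group. Everything in it beyond the index expression from the post is an assumption: a work-group of 64 work-items, localIdx = localIndex % 8, localIdy = localIndex / 8, the function names, and the STALE marker standing in for whatever happens to be in uninitialised local memory.

```c
#define WG_SIZE 64   /* assumed work-group size: 8 x 8 work-items */
#define STALE   -1   /* stands in for a local-memory slot nobody has written yet */

/* Index expression from the post. The mapping of localIndex to
   localIdx / localIdy is an assumption; the post does not show it. */
static int source_index(int localIndex)
{
    int localIdx = localIndex % 8;
    int localIdy = localIndex / 8;
    return (localIdy * 2) + 57 - (localIdx * 8);
}

/* No barrier: each emulated work-item stores its own element and then
   immediately loads another work-item's slot, which may still be unwritten.
   Running work-items in ascending order is just one possible schedule. */
void run_without_barrier(const int *input, int *L0)
{
    int block0[WG_SIZE];
    for (int i = 0; i < WG_SIZE; ++i)
        block0[i] = STALE;
    for (int i = 0; i < WG_SIZE; ++i) {
        block0[i] = input[i];
        if (i < 32)
            L0[i] = block0[source_index(i)];
    }
}

/* With a barrier: every store to block0 completes before any load. */
void run_with_barrier(const int *input, int *L0)
{
    int block0[WG_SIZE];
    for (int i = 0; i < WG_SIZE; ++i)
        block0[i] = input[i];
    /* barrier(CLK_LOCAL_MEM_FENCE); would sit here in the real kernel */
    for (int i = 0; i < 32; ++i)
        L0[i] = block0[source_index(i)];
}
```

Calling both functions with the same 64-element input and comparing the two 32-element outputs shows that some entries differ: in the version without the barrier, work-item 0 reads slot 57 before work-item 57 has stored anything there. On real hardware the order is not fixed, which would also be consistent with the debugger run appearing to work.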

3 Replies

Yes, thank you, I had already found that out. There were plenty of cases where I had to synchronize (and others where I didn't), and now it's working well.

I don't know how badly these synchronizations affect performance; maybe I should write an implementation where I don't have to use them.


It can affect performance. My inclination is to never use a workgroup size that isn't 64, the size of one wavefront, when targeting AMD hardware. Doing that means:

a) you can have more workgroups live (because on recent hardware we can manage a very large number of wavefronts, but only a small number of workgroups due to the use of barrier resources)

b) the barriers will be optimised away, because they are not needed to synchronise within a single wavefront.

It's a vector architecture, so in many ways you are better off writing code for it as a vector architecture rather than thinking of it as a set of fine-grained threads that synchronize.
