3 Replies Latest reply on Mar 4, 2013 10:17 AM by LeeHowes

    Strange memory

    zoli0726

      Hi. I started writing an OpenCL program and it behaves strangely.

      If I debug it with CodeXL and stop at some breakpoints, it works fine. But without debugging and breakpoints my output is just a mess, and I have no idea why this happens. It's my first program, so I'm quite sure I'm doing something wrong. I attach my kernel; if anybody has a suggestion, please share it with me.

        • Re: Strange memory
          LeeHowes

          In this code:    

              block0[localIndex] = input[globalIndex];
              //IP
              if (localIndex < 32) {
                  L0[localIndex] = block0[(localIdy * 2) + 57 - (localIdx * 8)];
              }

           

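          You read into local memory, then read out of it at different addresses, but don't synchronize in between. Each work-item reads elements that other work-items wrote, so nothing guarantees those writes have finished. You need a barrier where you have the //IP comment to make it work.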
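To illustrate the ordering problem described above, here is a minimal sketch in plain C that emulates one work-group sequentially. It is not GPU code: the group size of 64, the function names, and the reversed index `63 - i` are assumptions standing in for the thread's own index expression. In the kernel itself, the fix would be a call such as `barrier(CLK_LOCAL_MEM_FENCE);` at the `//IP` comment.

```c
#include <string.h>

#define GROUP_SIZE 64  /* hypothetical work-group size */
#define HALF_SIZE 32   /* work-items that perform the second step */

/* No barrier: emulate one legal schedule in which each work-item runs
   to completion before the next one starts. */
void run_without_barrier(const unsigned char *input, unsigned char *L0) {
    unsigned char block0[GROUP_SIZE];
    memset(block0, 0, sizeof block0);  /* zeros stand in for stale local memory */
    for (int i = 0; i < GROUP_SIZE; ++i) {
        block0[i] = input[i];
        if (i < HALF_SIZE) {
            /* Reads the slot owned by work-item 63 - i, which has not run yet. */
            L0[i] = block0[GROUP_SIZE - 1 - i];
        }
    }
}

/* With barrier: every work-item finishes its store before any load starts. */
void run_with_barrier(const unsigned char *input, unsigned char *L0) {
    unsigned char block0[GROUP_SIZE];
    for (int i = 0; i < GROUP_SIZE; ++i) {
        block0[i] = input[i];
    }
    /* In the kernel, barrier(CLK_LOCAL_MEM_FENCE) would sit here. */
    for (int i = 0; i < HALF_SIZE; ++i) {
        L0[i] = block0[GROUP_SIZE - 1 - i];
    }
}
```

In this emulation the version without the barrier copies only stale values into `L0`, while the two-phase version copies the intended elements. On real hardware the outcome without a barrier depends on scheduling, so it can look correct in some runs and wrong in others.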

            • Re: Strange memory
              zoli0726

              Yes, thank you, I had already figured that out. There were plenty of cases where I had to synchronize (and others where I didn't), and now it's working well.

              I don't know how badly these synchronizations affect performance; maybe I should write an implementation where I don't have to use them.

                • Re: Strange memory
                  LeeHowes

                  It can affect performance. My inclination is to never use a workgroup size that isn't 64 when targeting AMD hardware, so that each workgroup is a single wavefront. Doing that means:

                  a) you can have more workgroups live (because on recent hardware we can manage a very large number of wavefronts, but only a small number of workgroups due to the use of barrier resources)

                  b) the barriers will be optimised away because they are not needed to synchronise within a single wavefront.

                   

                  It's a vector architecture, so in many ways you are better off writing code for it as a vector architecture rather than thinking of it as a set of fine-grained threads that synchronize.
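On the host side, fixing the workgroup size at 64 means the global size has to be a multiple of 64, since OpenCL 1.x requires the global size to be evenly divisible by the local size. A minimal sketch in plain C follows; the helper names are hypothetical, and the kernel would need a bounds check for the padded work-items.

```c
#include <stddef.h>

/* Workgroup size suggested in the thread for AMD hardware. */
#define WORKGROUP_SIZE 64

/* Round n up to the next multiple of WORKGROUP_SIZE (hypothetical helper). */
size_t padded_global_size(size_t n) {
    return (n + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE * WORKGROUP_SIZE;
}

/* Number of workgroups that a launch over n elements would create. */
size_t workgroup_count(size_t n) {
    return padded_global_size(n) / WORKGROUP_SIZE;
}
```

The padded value would be passed as the global size to `clEnqueueNDRangeKernel`, with 64 as the local size. The kernel can also declare the size to the compiler with `__attribute__((reqd_work_group_size(64, 1, 1)))`.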