
    Barrier to simulate SIMT on CPU

    spectral

      Hi,

      I have a small kernel, very simple. If you have the following input

      "1 1 1 1 1 1 1 1 1 1 1 1 1 1 1"

      It should produce a simple scan:

      1 2 3 4 5 6 7 8 ...

      It works on the NVIDIA SDK, but on the AMD one I get wrong results.

      __kernel void kernel__scanIntra(__global uint* input, uint size)
      {
          size_t idx = get_global_id(0);
          const uint lane = get_local_id(0);
          const uint bid  = get_group_id(0);

          if (lane >= 1  && idx < size) input[idx] = input[idx - 1]  + input[idx];
          barrier(CLK_LOCAL_MEM_FENCE);
          if (lane >= 2  && idx < size) input[idx] = input[idx - 2]  + input[idx];
          barrier(CLK_LOCAL_MEM_FENCE);
          if (lane >= 4  && idx < size) input[idx] = input[idx - 4]  + input[idx];
          barrier(CLK_LOCAL_MEM_FENCE);
          if (lane >= 8  && idx < size) input[idx] = input[idx - 8]  + input[idx];
          barrier(CLK_LOCAL_MEM_FENCE);
          if (lane >= 16 && idx < size) input[idx] = input[idx - 16] + input[idx];
          barrier(CLK_LOCAL_MEM_FENCE);
      }

        • Barrier to simulate SIMT on CPU
          himanshu.gautam

          Hi viewon01,

          I would expect the results to be wrong. There seems to be a read-before-write issue: the updated input[idx - 1] is needed to calculate input[idx], while in most cases the kernel will use the old value.

          • Barrier to simulate SIMT on CPU
            maximmoroz

            Why do you use barrier(CLK_LOCAL_MEM_FENCE) when you need to synchronize access to a GLOBAL buffer?
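
            A rough sketch of what that would look like, with the fence flag matched to the __global buffer (the kernel name is just illustrative, and this alone does not remove the read-before-write hazard discussed above):

            __kernel void kernel__scanIntraGlobalFence(__global uint* input, uint size)
            {
                size_t idx = get_global_id(0);
                const uint lane = get_local_id(0);

                /* same steps as the original kernel, but the barriers fence
                   global memory because 'input' lives in __global space */
                if (lane >= 1  && idx < size) input[idx] = input[idx - 1]  + input[idx];
                barrier(CLK_GLOBAL_MEM_FENCE);
                if (lane >= 2  && idx < size) input[idx] = input[idx - 2]  + input[idx];
                barrier(CLK_GLOBAL_MEM_FENCE);
                if (lane >= 4  && idx < size) input[idx] = input[idx - 4]  + input[idx];
                barrier(CLK_GLOBAL_MEM_FENCE);
                if (lane >= 8  && idx < size) input[idx] = input[idx - 8]  + input[idx];
                barrier(CLK_GLOBAL_MEM_FENCE);
                if (lane >= 16 && idx < size) input[idx] = input[idx - 16] + input[idx];
                barrier(CLK_GLOBAL_MEM_FENCE);
            }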

            • Barrier to simulate SIMT on CPU
              MicahVillmow
              rick.weber,
              It actually is important. If the memory fence is not there for the correct memory type, then the compiler is free to move memory instructions across the barrier.
              • Barrier to simulate SIMT on CPU
                MicahVillmow
                OpenCL inherits from C, so unless overridden by the OpenCL spec, anything that is in C99 is also in OpenCL.
                  • Barrier to simulate SIMT on CPU
                    rick.weber

                    I clearly have no idea what I'm talking about. Carry on. I thought volatile was reserved for future use, but it's in section 6.1.4.

                      • Barrier to simulate SIMT on CPU
                        himanshu.gautam

                        viewon01,

                         Okay, here is a somewhat better approach to the problem.

                         Let's say we have an array of 20 elements holding the numbers 1-20 (for convenience), and we need to compute arr[i] += arr[i - 1] for each element.

                         Again for convenience, we divide the array into 2 chunks of 10 elements each and assign each chunk to a separate thread. Thread0 ends up with the total 55 (elements 1-10) and thread1 with 155 (elements 11-20). The values calculated by thread1 are not yet correct, but we can fix them by adding a constant offset, the sum of the first chunk (55), to each of its items later in a separate kernel.

                         I understand this looks easy, but the offset calculation is still serialized. However, the number of offsets to compute is reduced by a factor equal to the number of elements each thread processes in the first step.
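
                         A rough two-kernel sketch of that idea in OpenCL (kernel and buffer names are just illustrative, untested): each work-item scans its own chunk and records the chunk total; once those totals have been turned into exclusive prefix sums (on the host or in a small extra kernel), a second pass adds that offset to every element of its chunk.

                         /* Pass 1: serial inclusive scan of each chunk, one work-item per chunk. */
                         __kernel void scan_chunks(__global uint* data, __global uint* chunkSums, uint chunkSize, uint size)
                         {
                             uint gid   = get_global_id(0);
                             uint start = gid * chunkSize;
                             uint sum   = 0;
                             for (uint i = start; i < start + chunkSize && i < size; ++i) {
                                 sum     += data[i];   /* running total inside the chunk */
                                 data[i]  = sum;       /* inclusive scan of this chunk   */
                             }
                             chunkSums[gid] = sum;     /* e.g. 55 for the first chunk holding 1..10 */
                         }

                         /* Pass 2: add the exclusive scan of the chunk sums to every element. */
                         __kernel void add_offsets(__global uint* data, __global const uint* chunkOffsets, uint chunkSize, uint size)
                         {
                             uint gid   = get_global_id(0);
                             uint start = gid * chunkSize;
                             uint off   = chunkOffsets[gid];   /* 0 for chunk 0, 55 for chunk 1, ... */
                             for (uint i = start; i < start + chunkSize && i < size; ++i)
                                 data[i] += off;
                         }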

                        Hope this might be helpful.



                          • Barrier to simulate SIMT on CPU
                            spectral

                             Thanks Himanshu, that is how most scan algorithms work.

                             Today we have 2 scan algorithms, a general one and a GPU-optimized one. I still have to use the SIMD capabilities of the GPU and minimize memory access serialization!

                              • Barrier to simulate SIMT on CPU
                                LeeHowes

                                 

                                 In the current AMD OpenCL implementation, rick, it's actually the barrier he doesn't need and the fence he does. On the GPU at least. On the CPU the barriers would be necessary (split, of course, because of the RAW dependency; I'm surprised that code works on NVIDIA's implementation... maybe I misunderstand the post).

                                Of course, that's out of the OpenCL spec and non-portable (but the only way to get good performance on a vector scan).

                                On either the CPU or GPU those barriers are going to lead to a slow scan operator, unfortunately. On the GPU you'd find it more efficient to allocate a set of identity values before the set of actual values and then you can take all that code without the conditionals:

                                 

                                 input[idx] = input[idx - 1]  + input[idx];
                                 barrier(CLK_LOCAL_MEM_FENCE);
                                 input[idx] = input[idx - 2]  + input[idx];
                                 barrier(CLK_LOCAL_MEM_FENCE);
                                 input[idx] = input[idx - 4]  + input[idx];
                                 barrier(CLK_LOCAL_MEM_FENCE);
                                 input[idx] = input[idx - 8]  + input[idx];
                                 barrier(CLK_LOCAL_MEM_FENCE);
                                 input[idx] = input[idx - 16] + input[idx];

                                 That way you know you're reading 0s, rather than reading off the beginning of the array. Much more efficient!

                                All of that depends on knowing that you're running on a vector architecture, of course, and that the mapping of work items to the vectors is predictable.
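
                                 As a rough sketch of that padding trick (the buffer layout, names and PAD size here are mine, untested): the buffer gets a block of identity values, zeros for a sum scan, in front of the real data, and every work item indexes past that pad, so the unconditional reads at idx - 1 ... idx - 16 land on zeros instead of running off the front of the array.

                                 /* Host side: allocate PAD extra elements, fill them with 0 and copy the
                                    real data starting at offset PAD.  Bounds checks against the data size
                                    are omitted here, as in the listing above. */
                                 #define PAD 32   /* must be >= the largest offset used below */

                                 __kernel void kernel__scanIntraPadded(__global uint* buf /* PAD zeros, then data */)
                                 {
                                     size_t idx = get_global_id(0) + PAD;   /* skip over the identity pad */

                                     /* same barriers as the listing above; the fence-flag discussion
                                        earlier in the thread still applies */
                                     buf[idx] = buf[idx - 1]  + buf[idx];
                                     barrier(CLK_LOCAL_MEM_FENCE);
                                     buf[idx] = buf[idx - 2]  + buf[idx];
                                     barrier(CLK_LOCAL_MEM_FENCE);
                                     buf[idx] = buf[idx - 4]  + buf[idx];
                                     barrier(CLK_LOCAL_MEM_FENCE);
                                     buf[idx] = buf[idx - 8]  + buf[idx];
                                     barrier(CLK_LOCAL_MEM_FENCE);
                                     buf[idx] = buf[idx - 16] + buf[idx];
                                 }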