6 Replies Latest reply on Jun 10, 2011 3:11 PM by LeeHowes

    Fastest way to copy from global memory to local memory

    spectral

      Hi,

      I need to copy 32 uint from global memory to local memory.

      Maybe I can try vload16, or something else ?

       

      int16 i1_16 = vload16(offset + idx, in); int16 i2_16 = vload16(offset + idx + 64, in); ((int16*)localBuf)[0] = i1_16; ((int16*)localBuf)[1] = i1_16;

        • Fastest way to copy from global memory to local memory
          spectral

          In fact I need to load 32xint and process them, so I do this :

           

          size_t idx = get_local_id(0); if (idx & 15 == 0) ((int16*)localBuf)[0] = vload16(0, in + idx); Notice that "in" is a "int*" This way I try to avoid bank conflict and to serialize too much memory access. But, strangly my algorithm is not improved !!!! And even, I got wrong results !!!

          • Fastest way to copy from global memory to local memory
            maximmoroz

            If copying these 32 integers are the only data you need to get from gobal buffer to local one then the following code might be better:

             

             

            size_t idx = get_local_id(0); if (idx < 32) localBuf[idx] = in[idx];

              • Fastest way to copy from global memory to local memory
                spectral

                Each work-item will retreive one value then you will have bank conflict between :

                banks 0 & 4 & 8 & 12 & 16 & 22 ..

                banks 1 & 5 & 9 & 13 & 17 & 23 ..

                ...

                Right ?

                  • Fastest way to copy from global memory to local memory
                    maximmoroz

                    1. What bank conflicts? When reading from global memory or when writing to local one?

                    2. Try it.

                    3. Read the "AMD Accelerated Parallel Processing, OpenCL Programming Guide" document, as I already suggested you in one of your posts. Paragraph 4.6.2 Channel Conflicts. It says:

                    If every work-item in a work-group references consecutive memory addresses, the entire wavefront accesses one channel. Although this seems slow, it actually is a fast pattern because it is necessary to consider the memory access over the entire device, not just a single wavefront.

                      • Fastest way to copy from global memory to local memory
                        spectral

                        I don't know how it work with ATI... but it is how it works with NVidia cards :-(

                        So, I search for a fast and portable way to transfer the whole set of data in a minimum set of instructions !

                          • Fastest way to copy from global memory to local memory
                            LeeHowes

                            Well the right way to copy n data items using m work items is something like:

                            float4 *data = (float4*)(data);

                            for( address = data+ get_local_id(0)*4; address < base + n; address += get_group_size(0)*4 ) {

                              float4 a = *address;

                              local[somewhere] = a;

                            }

                             

                            You'll get conflicts that way but that hardly matters because you're global bandwidth limited. You could get rid of the *4s and cut it down to float reads and do it without conflicts if you like, but you'll make much worse use of the memory system so I doubt that would be faster. If you do a write to base + get_local_id(0) % wave_size then you won't get any bank conflicts. It's a vector read and a vector write, nice and efficient and the perfect way to use a banked memory system - that's what it's designed for on a vector architecture.

                            Once it's in local memory do what you like with it.

                             

                            This kind of thing is really the same between nvidia and AMD architectures. They're both vector machiens, they work in largely the same way from a high level. It's just AMD has a wider vector and issues multiple vector instructions at the same time VLIW fashion.