5 Replies Latest reply on Jul 29, 2016 6:19 PM by boxerab



      Does anyone here have experience with this call on AMD hardware?

      Any feedback on how to use this?


      I found this blog post Using async_work_group_copy() on 2D data - StreamComputing

      which is a good start.


      Also, in what situations should I consider using the prefetch() call?




        • Re: async_work_group_copy()


          Basically, the async copy and prefetch functions (Async Copy and Prefetch Functions) are used to improve kernel performance by overlapping memory operations with computation.


          Here is where I see these functions being useful:

          Async Copy functions:

          • Independent computational work can be performed while the memory transfer is in progress; there is no need to block execution with barrier calls.
          • There is no need to specify how individual elements are copied by each work-item. You only specify how much data the workgroup needs copied, and the implementation takes care of the rest, which leaves it free to choose the most efficient strategy.
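          To make the first point concrete, here is a minimal sketch of the double-buffered pattern that async_work_group_copy enables. The kernel name, TILE size, the trivial computation, and the assumption that the work-group size equals TILE are all illustrative, not taken from the spec or from AMD's samples:

```c
// Hypothetical OpenCL kernel illustrating double buffering with
// async_work_group_copy(). TILE and the computation are made up for
// illustration; assume get_local_size(0) == TILE.
#define TILE 256

__kernel void process_tiles(__global const float *in,
                            __global float *out,
                            uint numTiles)
{
    __local float tiles[2][TILE];
    uint lid  = get_local_id(0);
    uint base = get_group_id(0) * numTiles * TILE;

    // Start the first transfer; all work-items must issue the same call.
    event_t evt = async_work_group_copy(tiles[0], in + base, TILE, 0);

    for (uint t = 0; t < numTiles; ++t) {
        uint cur = t & 1;

        // Block until the tile we are about to read has arrived.
        wait_group_events(1, &evt);

        // Kick off the next transfer so it can (in principle) overlap
        // with the computation below.
        if (t + 1 < numTiles)
            evt = async_work_group_copy(tiles[cur ^ 1],
                                        in + base + (t + 1) * TILE,
                                        TILE, 0);

        // "Computation" on the current tile while the next copy is in flight.
        out[base + t * TILE + lid] = tiles[cur][lid] * 2.0f;

        // Make sure everyone is done reading tiles[cur] before a later
        // iteration's copy overwrites it.
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
```

          Note that both async_work_group_copy and wait_group_events must be encountered by all work-items in the group with identical arguments.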


          Prefetch Function:

          • The kernel can ask the memory unit to prefetch a chunk of data into the global cache that will be needed in the near future, and keep working on the data already available. When the prefetched data is eventually needed, it is already in cache, which improves access time.
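          A hedged sketch of how prefetch() might be used this way. The kernel name, the indexing, and the one-hint-per-work-group idiom are illustrative assumptions; prefetch() is only a hint with no synchronization guarantees and may be a no-op on some hardware:

```c
// Hypothetical kernel showing prefetch() as a cache hint: request the
// next chunk while summing the current one.
__kernel void sum_chunks(__global const float *in,
                         __global float *out,
                         uint numChunks,
                         uint chunkSize)
{
    uint gid = get_global_id(0);
    float acc = 0.0f;

    for (uint c = 0; c < numChunks; ++c) {
        // One hint per work-group is usually enough; prefetch() is a
        // per-work-item function with no barrier semantics.
        if (get_local_id(0) == 0 && c + 1 < numChunks)
            prefetch(in + (c + 1) * chunkSize, chunkSize);

        // Consume the current chunk (illustrative access pattern).
        acc += in[c * chunkSize + gid];
    }
    out[gid] = acc;
}
```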



          • Re: async_work_group_copy()

            Has anyone gotten superior performance using the work_group functions as compared to well-coded manual methods? I've anecdotally only heard of them being similar (which makes them merely convenient) or even slower (which would be really terrible).

              • Re: async_work_group_copy()

                Hi Dithermaster,

                Glad to hear someone else is also interested in this method.

                I think it would be easy to do some experiments to show whether it is worth it or not.

                I will try this out on my HD7700 and share the results.




                • Re: async_work_group_copy()

                  So, I can confirm that using async_work_group_copy from global to local memory is slightly slower than doing a regular coalesced copy, for an uncached data access pattern.

                  This is for GCN 1.0 on the Cape Verde architecture. My impression is that the method is implemented synchronously, so there is no benefit to making the call, doing some work, and then waiting for the event.


                  For my test, I simply took the AMD GlobalMemoryBandwidth sample project and changed one of the kernels (see below):


                  #define ASYNC_COPY


                  __kernel void read_linear_uncached(__global DATATYPE *input, __global DATATYPE *output)
                  {
                      event_t evt;
                      IDXTYPE gid = get_global_id(0);
                      IDXTYPE index = gid;
                      local DATATYPE scratch[256*2];

                      // initialize both halves of the double buffer
                      scratch[get_local_id(0)] = (DATATYPE)(0.0f);
                      scratch[get_local_id(0) + 256] = (DATATYPE)(0.0f);

                      for (int i = 0; i < 32; ++i) {
                          uint flipBuffer = i & 1; // ping-pong between the two halves of scratch
                  #ifndef ASYNC_COPY
                          // regular coalesced copy: each work-item copies one element
                          scratch[get_local_id(0) + flipBuffer*256] = input[index];
                  #else
                          // asynchronous copy of 256 elements into the current half
                          evt = async_work_group_copy(scratch + flipBuffer*256,
                                                      input + index,
                                                      256, 0);
                  #endif
                          index += OFFSET;

                          // do some work on the other half while the copy is in flight
                          for (int k = 0; k < 100; ++k)
                              scratch[get_local_id(0) + (flipBuffer^1)*256] += pow(scratch[get_local_id(0) + (flipBuffer^1)*256], 2);

                  #ifdef ASYNC_COPY
                          wait_group_events(1, &evt); // waits until the copy has finished
                  #endif
                      }

                      output[gid] = scratch[get_local_id(0)];
                  }