
boxerab
Challenger

async_work_group_copy()

Does anyone here have experience with this call on AMD hardware?

Any feedback on how to use this?

I found this blog post, which is a good start: Using async_work_group_copy() on 2D data - StreamComputing

Also, in what situations should I consider using the prefetch() call?

Thanks,

Aaron

0 Likes
5 Replies
dipak
Big Boss

Hi,

Basically, the async copy and prefetch functions (see Async Copy and Prefetch Functions) are used to improve kernel performance by overlapping memory operations with computational work.

Here is where I see these functions being useful:

Async Copy functions:

  • Independent computational work can be performed while the memory transfer is in progress; there is no need to block execution with barrier calls.
  • You don't need to specify how the individual elements are copied by each work-item; you only specify how much data needs to be copied for the work-group, and the rest is taken care of by the implementation, which may therefore choose a better way to perform the copy (see the sketch below).
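
To make that concrete, here is a minimal, illustrative sketch (the kernel name, tile size, and arguments are assumptions for the example, not from any AMD sample): one work-group stages a tile into local memory with async_work_group_copy(), does some independent work, and then waits on the event before using the staged data.

#define TILE 256   // assumed to equal the work-group size

__kernel void scale_tile(__global const float *src,
                         __global float *dst,
                         float factor)
{
    __local float tile[TILE];

    // Started once per work-group; all work-items must pass identical arguments.
    // Passing 0 as the last argument asks the implementation to create a new event.
    event_t e = async_work_group_copy(tile,
                                      src + get_group_id(0) * TILE,
                                      TILE,
                                      0);

    // Independent work that does not touch 'tile' can overlap with the copy.
    float unrelated = factor * factor;

    // Block until the staged data is safe to read.
    wait_group_events(1, &e);

    dst[get_global_id(0)] = tile[get_local_id(0)] * unrelated;
}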

Prefetch Function:

  • A work-item can ask the memory unit to prefetch a chunk of data into the global cache that it will need in the near future, and continue performing computational work on the data it already has. When the prefetched data is eventually needed, it is already in the cache, which improves access time (see the sketch below).
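
Similarly, here is a hedged sketch of prefetch() (again, the kernel name, block layout, and loop bounds are assumptions for illustration): each work-item hints to the memory unit which element it will read in the next iteration while it accumulates the current one.

__kernel void running_sum(__global const float *data,
                          __global float *out,
                          uint numBlocks)
{
    const uint gid = get_global_id(0);
    const uint stride = get_global_size(0);
    float acc = 0.0f;

    for (uint b = 0; b < numBlocks; ++b) {
        // Hint: the element for the next block will be needed soon.
        if (b + 1 < numBlocks)
            prefetch(data + (b + 1) * stride + gid, 1);

        // Work on the current block while the prefetch (hopefully) warms the cache.
        acc += data[b * stride + gid];
    }

    out[gid] = acc;
}

Whether the prefetch actually helps depends on the hardware and the access pattern, so it is worth measuring rather than assuming a win.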

Regards,

0 Likes

Great, thanks.

0 Likes
Dithermaster
Adept I

Has anyone gotten superior performance using the work_group functions as compared to well-coded manual methods? I've anecdotally only heard of them being similar (which makes them merely convenient) or even slower (which would be really terrible).

0 Likes

Hi Dithermaster,

Glad to hear someone else is also interested in this method.

I think it would be easy to do some experiments to show whether it is worth it or not.

I will try this out on my HD7700 and share the results.

Cheers,

Aaron

0 Likes

So, I can confirm that using async_work_group_copy() from global to local memory is slightly slower than doing a regular coalesced copy, for an uncached data access pattern.

This is for GCN 1.0 on the Cape Verde architecture. My impression is that the method is implemented synchronously, so there is no benefit to making the call, doing some work, and then waiting for the event.

For my test, I simply took the AMD GlobalMemoryBandwidth sample project and changed one of the kernels (see below).

#define ASYNC_COPY

__kernel void read_linear_uncached(__global DATATYPE *input, __global DATATYPE *output)
{
    event_t evt = 0;                    // 0 lets the first async copy create a new event
    IDXTYPE gid = get_global_id(0);
    IDXTYPE index = gid;
    local DATATYPE scratch[256*2];      // double buffer: two 256-element halves
    scratch[get_local_id(0)] = (DATATYPE)(0.0f);
    scratch[get_local_id(0) + 256] = (DATATYPE)(0.0f);

    for (int i = 0; i < 32; ++i) {
        uint flipBuffer = i & 1;        // alternate halves; (i & 2) would index past the buffer

#ifndef ASYNC_COPY
        // manual coalesced copy: each work-item fetches its own element
        scratch[get_local_id(0) + flipBuffer*256] = input[index];
#else
        // all work-items must pass identical arguments, so copy from the
        // work-group's base offset (assumes a work-group size of 256)
        evt = async_work_group_copy(scratch + flipBuffer*256,
                                    input + (index - get_local_id(0)),
                                    256,
                                    evt);
#endif

        index += OFFSET;

        // computational work on the other half of the double buffer
        for (int k = 0; k < 100; ++k)
            scratch[get_local_id(0) + (flipBuffer^1)*256] += pow(scratch[get_local_id(0) + (flipBuffer^1)*256], 2);

#ifdef ASYNC_COPY
        // wait_group_events(1, &evt); // required before reading the copied data; left disabled for this experiment
#endif
    }

    output[gid] = scratch[get_local_id(0)];
}

0 Likes