Does anyone here have experience with async_work_group_copy() on AMD hardware?
Any feedback on how to use this?
I found this blog post, "Using async_work_group_copy() on 2D data" (StreamComputing), which is a good start.
Also, in what situations should I consider using the prefetch() call?
Thanks,
Aaron
Hi,
Basically, the async copy and prefetch functions are used to improve kernel performance by overlapping memory operations with computation.
Here is where I can see these functions being useful:
Async Copy functions:
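As a rough illustration of the async copy functions, here is a minimal sketch that stages one tile of global data into local memory and waits on the event before using it. The kernel name, TILE, and the float element type are assumptions made for the example; it also assumes the work-group size equals TILE and the global size is a multiple of it.

```c
// Sketch only: scale_tile, TILE and the float type are illustrative assumptions.
#define TILE 256  // assumed to equal the work-group size

__kernel void scale_tile(__global const float *in, __global float *out)
{
    __local float tile[TILE];

    // Same value for every work-item in the group.
    const size_t base = get_group_id(0) * TILE;

    // Every work-item in the group must reach this call with identical arguments.
    // Passing 0 as the last argument asks for a new event.
    event_t evt = async_work_group_copy(tile, in + base, TILE, 0);

    // Independent work that does not touch 'tile' could go here.

    // Returns once the copy has completed for the whole group.
    wait_group_events(1, &evt);

    const size_t lid = get_local_id(0);
    out[base + lid] = 2.0f * tile[lid];
}
```

Whether the copy actually overlaps with the independent work depends on the implementation; the specification allows it but does not require it.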
Prefetch Function:
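For prefetch(), a minimal sketch might look like the following. The kernel name, the steps and stride parameters, and the float type are assumptions for the example, and the input buffer is assumed to be large enough for every strided read. prefetch() is only a hint to bring global memory into the global cache, so it never changes results and an implementation is free to ignore it.

```c
// Sketch only: strided_sum, steps and stride are illustrative assumptions.
__kernel void strided_sum(__global const float *in, __global float *out,
                          const uint steps, const uint stride)
{
    const size_t gid = get_global_id(0);
    float acc = 0.0f;

    for (uint i = 0; i < steps; ++i) {
        const size_t idx = gid + (size_t)i * stride;

        // Hint that the element read in the next iteration will be needed soon.
        if (i + 1 < steps)
            prefetch(in + idx + stride, 1);

        acc += in[idx];
    }
    out[gid] = acc;
}
```

It is most likely to help when the access pattern is known in advance but is not picked up well by the cache on its own, for example large strides.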
Regards,
Great, thanks.
Has anyone gotten superior performance using the work_group functions as compared to well-coded manual methods? I've anecdotally only heard of them being similar (which makes them merely convenient) or even slower (which would be really terrible).
Hi Dithermaster,
Glad to hear someone else is also interested in this method.
I think it would be easy to do some experiments to show whether it is worth it or not.
I will try this out on my HD7700 and share the results.
Cheers,
Aaron
So, I can confirm that using async_work_group_copy() from global to local memory is slightly slower than doing a regular coalesced copy, for an uncached access pattern.
This is on GCN 1.0 (Cape Verde). My impression is that the call is implemented synchronously, so there is no benefit to making the call, doing some work, and then waiting for the event.
For my test, I simply took the AMD GlobalMemoryBandwidth sample project and changed one of the kernels (see below):
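#define ASYNC_COPY
#define TILE 256  // assumes the work-group size is 256

__kernel void read_linear_uncached(__global DATATYPE *input, __global DATATYPE *output)
{
    event_t evt;
    IDXTYPE gid = get_global_id(0);
    IDXTYPE lid = get_local_id(0);
    IDXTYPE index = gid;
    __local DATATYPE scratch[TILE * 2];  // two halves used as a ping-pong buffer

    // Initialize both halves so the first compute pass reads defined values.
    scratch[lid] = (DATATYPE)(0.0f);
    scratch[lid + TILE] = (DATATYPE)(0.0f);
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < 32; ++i) {
        // Alternates 0, 1. (i & 2 yields 0 or 2, which indexes past the end of scratch.)
        uint flipBuffer = i & 1;
#ifndef ASYNC_COPY
        scratch[lid + flipBuffer * TILE] = input[index];
#else
        // All work-items must pass identical arguments, so use the group's base address.
        evt = async_work_group_copy(scratch + flipBuffer * TILE,
                                    input + (index - lid),
                                    TILE,
                                    0);
#endif
        index += OFFSET;
        // Compute on the other half while the copy is (potentially) in flight.
        for (int k = 0; k < 100; ++k)
            scratch[lid + (flipBuffer ^ 1) * TILE] += pow(scratch[lid + (flipBuffer ^ 1) * TILE], 2);
#ifdef ASYNC_COPY
        wait_group_events(1, &evt);  // the copy must finish before the next iteration reads this half
#endif
    }
    output[gid] = scratch[lid];
}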