Hello,
I have created a microbenchmark that performs very differently depending on whether I am using the runtime provided with version 2.6 of the SDK (APP Runtime 2.6.x) or the runtime provided with "tested" drivers for version 2.6 of the SDK (Catalyst 11.12 provides APP Runtime 10.0.x).
Please consider the following host code, written with the OpenCL C++ bindings and targeting the GPU on a Fusion A6-3650 APU under Windows 7 x64:
cl_uint nElements = 1024 * 1024 * 16;
cl_uint nCachedRuns = 5;
cl_uint nUncachedRuns = 5;
... // Set up platform, GPU device, kernel, etc...
cl::Buffer buff (context, buff_flags, sizeof (cl_uint) * nElements, NULL, &err);
cl::NDRange gsize = nElements; // Global kernel threads
cl::NDRange lsize = 256; // Local kernel threads
... // Set kernel arguments, run kernel once to warm up
// Run 5 iterations of the kernel after the CPU writes to the buffer
for (cl_uint iteration = 0; iteration < nCachedRuns; ++iteration) {
    // Blocking map for writing; map flags are CL_MAP_*, not CL_MEM_*
    cl_uint *mapped_buff = (cl_uint *)queue.enqueueMapBuffer (buff, CL_TRUE,
        CL_MAP_WRITE, 0, sizeof (cl_uint) * nElements, NULL, NULL, NULL);
    for (cl_uint i = 0; i < nElements; ++i) { // Fill the buffer from the host
        mapped_buff[i] = 0;
    }
    queue.enqueueUnmapMemObject (buff, mapped_buff, NULL, NULL);
    queue.enqueueNDRangeKernel (kernel, cl::NullRange, gsize, lsize, NULL, NULL);
}
// Run 5 iterations of the kernel back to back
for (cl_uint iteration = 0; iteration < nUncachedRuns; ++iteration) {
    queue.enqueueNDRangeKernel (kernel, cl::NullRange, gsize, lsize, NULL, NULL);
}
Each work item in the corresponding kernel just writes a zero to the buffer location given by the work item's global ID.
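For reference, the kernel described above could look roughly like the sketch below. This is a reconstruction from the description, not the exact source: the kernel name `zero_fill`, the argument name `buff`, and the host-side helper `zero_fill_reference` are hypothetical. The helper only mirrors the access pattern on the CPU (one write per global ID) so the expected buffer contents can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// OpenCL C source matching the description: one work item per element,
// each writing a zero at its own global ID.
// The kernel name "zero_fill" is hypothetical.
static const std::string kZeroFillSource =
    "__kernel void zero_fill(__global uint *buff)\n"
    "{\n"
    "    buff[get_global_id(0)] = 0;\n"
    "}\n";

// Host-side reference of the same access pattern (hypothetical helper):
// "work item" gid writes a zero to buff[gid].
void zero_fill_reference(std::vector<std::uint32_t> &buff, std::size_t global_size)
{
    assert(global_size <= buff.size());
    for (std::size_t gid = 0; gid < global_size; ++gid) {
        buff[gid] = 0;
    }
}
```

Since every work item touches exactly one 4-byte element, each launch writes the whole buffer once, with no reads.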
The purpose was to test how kernel execution performs on buffers allocated with various flags. Here is a table that illustrates the unexplained portion of the results:
| Buffer Allocation Flags | Runtime Version | Map/Unmap Data Transfer Mode | Kernel Execution Time |
|---|---|---|---|
| default | 2.6 | Copy/Copy | Same across both loops (fast access) |
| default | 10.0 | Zero Copy/Zero Copy | Kernels in first loop are 4x slower than kernels in second loop |
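For anyone reproducing the comparison, here is a minimal sketch of the arithmetic behind a "first loop vs. second loop" ratio. It assumes each kernel launch yields a start and end device timestamp in nanoseconds, for example from event profiling (`CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` on a queue created with `CL_QUEUE_PROFILING_ENABLE`). The struct and function names are hypothetical, and nothing here touches the OpenCL API itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Start/end device timestamps of one kernel launch, in nanoseconds.
// (Hypothetical container; fill it from whatever timer you use.)
struct KernelTiming {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Duration of a single launch in milliseconds.
double elapsed_ms(const KernelTiming &t)
{
    assert(t.end_ns >= t.start_ns);
    return static_cast<double>(t.end_ns - t.start_ns) / 1.0e6;
}

// Mean duration over all launches of one loop, in milliseconds.
double mean_ms(const std::vector<KernelTiming> &runs)
{
    assert(!runs.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        total += elapsed_ms(runs[i]);
    }
    return total / static_cast<double>(runs.size());
}

// How many times slower the first loop's kernels are than the second loop's.
double slowdown(const std::vector<KernelTiming> &first_loop,
                const std::vector<KernelTiming> &second_loop)
{
    return mean_ms(first_loop) / mean_ms(second_loop);
}
```

Averaging per loop before dividing keeps a single outlier launch from dominating the ratio; comparing individual launches pairwise would be an equally reasonable choice.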
It appears that version 10.0.x of the runtime changes the rules about zero copy buffers on Fusion APUs. It also appears that the most recent version of the AMD APP OpenCL Programming Guide (rev1.3f at the time of this writing) - which was released around the same time as version 2.6 of the SDK and Catalyst 11.12 - has not yet been updated to reflect these changes. I would like to know if anyone can shed some light on what is happening here in version 10.0.x of the runtime. Specifically:
- In what type of memory is this buffer being allocated (device local memory, cacheable system memory, or uncacheable system memory)? Does that three-way distinction still apply under runtime 10.0.x?
- It appears as though zero copy is taking place on maps/unmaps. Since no copying is taking place, how does each device (CPU and GPU) access the memory (write-combining buffers, through the UNB, via the L2 cache, etc.)?
- Why is there a difference between the GPU writes to the buffer in the first loop and the second loop? Is it possible that some of the buffer data is cached in the first loop and because of that the GPU uses the cache coherency protocol to perform writes? Is the new runtime "smart" enough to recognize that in the second loop none of the data resides in the CPU's cache and bypass it?
- It appears that the GPU writes to the buffer differently depending on whether the CPU or GPU last wrote to it. Is this true for CPU writes as well? That is, would the CPU write differently depending on whether the last operation to that buffer was a host map/unmap or a GPU kernel write?
- Are there other undocumented behavioral changes to version 10.0.x of the runtime that are related to zero copy and/or how memory is created/pinned/accessed on Fusion APUs?
Thanks,
Mike