Archives Discussions

mcdelorme · ‎02-01-2012

Hello,

I have created a microbenchmark that performs very differently depending on which runtime I use: the one provided with version 2.6 of the SDK (APP Runtime 2.6.x), or the one provided with the drivers "tested" for version 2.6 of the SDK (Catalyst 11.12, which provides APP Runtime 10.0.x).

Please consider the following host code, written with the OpenCL C++ bindings and targeting the GPU on a Fusion A6-3650 APU under Windows 7 x64:

    cl_uint nElements = 1024 * 1024 * 16;
    cl_uint nCachedRuns = 5;
    cl_uint nUncachedRuns = 5;

    ... // Set up platform, GPU device, kernel, etc...

    cl::Buffer buff (context, buff_flags, sizeof (cl_uint) * nElements, NULL, &err);
    cl::NDRange gsize = nElements; // Global kernel threads
    cl::NDRange lsize = 256; // Local kernel threads

    ... // Set kernel arguments, run kernel once to warm up

    // Run 5 iterations of the kernel after the CPU writes to the buffer
    for (cl_uint iteration = 0; iteration < nCachedRuns; ++iteration) {
        // Blocking map for writing (map flags take CL_MAP_*, not CL_MEM_*)
        cl_uint *mappedBuff = (cl_uint *)queue.enqueueMapBuffer (buff, CL_TRUE,
            CL_MAP_WRITE, 0, sizeof (cl_uint) * nElements, NULL, NULL, NULL);
        for (cl_uint i = 0; i < nElements; ++i) { // Fill the buffer from the host
            mappedBuff[i] = 0;
        }
        queue.enqueueUnmapMemObject (buff, mappedBuff, NULL, NULL);
        queue.enqueueNDRangeKernel (kernel, cl::NullRange, gsize, lsize, NULL, NULL);
    }

    // Run 5 iterations of the kernel back to back
    for (cl_uint iteration = 0; iteration < nUncachedRuns; ++iteration) {
        queue.enqueueNDRangeKernel (kernel, cl::NullRange, gsize, lsize, NULL, NULL);
    }

Each work item in the corresponding kernel just writes a zero to the buffer location given by the work item's global ID.
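For reference, a kernel matching that description could look roughly like the sketch below. The kernel name write_zero, the argument name buff, and the host-side helper writeZeroReference are assumptions made for illustration; the original kernel source is not shown in this thread. The helper mirrors the same access pattern on the host so the buffer contents can be checked after mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// OpenCL C source for a kernel like the one described above.
// The kernel and argument names are hypothetical.
static const char *kWriteZeroSource = R"CLC(
__kernel void write_zero(__global uint *buff)
{
    // One work item per element: the global ID selects the element to clear.
    buff[get_global_id(0)] = 0;
}
)CLC";

// Host-side reference of the same access pattern. Each loop index plays the
// role of one work item's global ID.
void writeZeroReference(std::vector<std::uint32_t> &buff, std::size_t globalSize)
{
    // The kernel has no bounds check, so the global size must fit the buffer.
    assert(globalSize <= buff.size());
    for (std::size_t gid = 0; gid < globalSize; ++gid) {
        buff[gid] = 0;
    }
}
```

With gsize equal to nElements, as in the host code above, every element of the buffer is written exactly once per kernel launch.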

The purpose was to test how execution performed on buffers allocated with various flags. Here is a table that illustrates the unexplained portion of the results:

Buffer Allocation Flags | Runtime Version | Map/Unmap Data Transfer Mode | Kernel Execution Time
default | 2.6 | Copy/Copy | Same across both loops (fast access)
default | 10.0 | Zero Copy/Zero Copy | Kernels in first loop are 4x slower than kernels in second loop
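The post does not say how the kernel times were measured. One common approach is OpenCL event profiling, which reports start and end timestamps in nanoseconds for each kernel. Assuming such timestamps have already been collected for both loops, the comparison in the last column could be computed along these lines. The KernelTiming struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Start/end timestamps for one kernel launch, in nanoseconds
// (the unit used by OpenCL profiling counters).
struct KernelTiming {
    std::uint64_t startNs;
    std::uint64_t endNs;
};

// Mean kernel duration in milliseconds over a set of launches.
double meanKernelMs(const std::vector<KernelTiming> &runs)
{
    assert(!runs.empty());
    double totalNs = 0.0;
    for (const KernelTiming &t : runs) {
        assert(t.endNs >= t.startNs);
        totalNs += static_cast<double>(t.endNs - t.startNs);
    }
    return totalNs / static_cast<double>(runs.size()) / 1.0e6;
}

// Ratio of mean kernel time in the first loop to that in the second loop.
// A value near 1 means both loops behave the same; a value near 4 would
// correspond to the 10.0 row of the table.
double slowdown(const std::vector<KernelTiming> &firstLoop,
                const std::vector<KernelTiming> &secondLoop)
{
    return meanKernelMs(firstLoop) / meanKernelMs(secondLoop);
}
```

Averaging over the five launches in each loop keeps a single outlier from dominating the ratio.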

It appears that version 10.0.x of the runtime changes the rules about zero copy buffers on Fusion APUs. It also appears that the most recent version of the AMD APP OpenCL Programming Guide (rev1.3f at the time of this writing) - which was released around the same time as version 2.6 of the SDK and Catalyst 11.12 - has not yet been updated to reflect these changes. I would like to know if anyone can shed some light on what is happening here in version 10.0.x of the runtime. Specifically:

1. What type of memory is this buffer being allocated in (device local memory, cacheable system memory, or uncacheable system memory)? Does this paradigm still apply here?
2. It appears as though zero copy is taking place on maps/unmaps. How does each device (CPU and GPU) access the memory, given that no copying takes place (write-combining buffers, through the UNB, via the L2 cache, etc.)?
3. Why is there a difference between the GPU writes to the buffer in the first loop and those in the second loop? Is it possible that some of the buffer data is cached by the CPU in the first loop, so that the GPU has to use the cache coherency protocol to perform its writes? Is the new runtime "smart" enough to recognize that none of the data resides in the CPU's cache in the second loop, and to bypass the cache?
4. It appears that the GPU writes to the buffer differently depending on whether the CPU or the GPU last wrote to it. Is this true for CPU writes as well? That is, would the CPU write differently depending on whether the last operation on that buffer was a host map/unmap or a GPU kernel write?
5. Are there other undocumented behavioral changes in version 10.0.x of the runtime that relate to zero copy and/or to how memory is created, pinned, or accessed on Fusion APUs?

Thanks,

Mike

jeff_golds · ‎02-02-2012

You don't state how your buffers are created. Without the creation flags it's impossible to answer your questions.

mcdelorme · ‎02-02-2012

Hi Jeff,

Thanks for your response. I'm sorry if I wasn't clear in my posting - the first column of the table I presented is called "Buffer Allocation Flags". This was meant to explain how the buffers were created for each of the questionable results. As per section 5.2.1 of the OpenCL specification (Creating Buffer Objects), the default flags should translate to CL_MEM_READ_WRITE (and I'm assuming AMD is adhering to this in their implementation of the SDK).
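To make the "default" case concrete, here is a minimal sketch of the defaulting rule as I read it from the specification: when none of the three access qualifiers is set, the buffer is treated as CL_MEM_READ_WRITE. The constants below are local stand-ins that mirror the bit values in CL/cl.h so the snippet compiles without the OpenCL headers. The k-prefixed names and the withDefaultAccess helper are hypothetical, and this says nothing about where the AMD runtime actually places the buffer.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins for the access-qualifier bits defined in CL/cl.h.
using MemFlags = std::uint64_t;               // cl_mem_flags is a 64-bit bitfield
constexpr MemFlags kMemReadWrite = 1u << 0;   // CL_MEM_READ_WRITE
constexpr MemFlags kMemWriteOnly = 1u << 1;   // CL_MEM_WRITE_ONLY
constexpr MemFlags kMemReadOnly  = 1u << 2;   // CL_MEM_READ_ONLY

// Returns the flags with the default access qualifier made explicit.
MemFlags withDefaultAccess(MemFlags flags)
{
    const MemFlags accessMask = kMemReadWrite | kMemWriteOnly | kMemReadOnly;
    const MemFlags access = flags & accessMask;
    // The access qualifiers are mutually exclusive: at most one bit may be set.
    assert((access & (access - 1)) == 0);
    return access == 0 ? (flags | kMemReadWrite) : flags;
}
```

Under this reading, passing 0 for buff_flags and passing CL_MEM_READ_WRITE explicitly should create equivalent buffers.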

Any light you could shed on this would be greatly appreciated.

Thanks in advance for your help!

--

Mike


AMD APP SDK Runtime 2.6.x vs 10.0.x & Fusion APU Zero Copy Behaviour