11 Replies Latest reply on Aug 10, 2011 10:47 AM by maximmoroz

    Is double buffering working for AMD GPUs?

    maximmoroz

      I cannot make double buffering work on my Cayman.

      AMD APP SDK 2.5, Catalyst 11.7, Win7 Ultimate 64bit SP1.

      I made a simple sample project illustrating the problem. The code for the single cpp file is attached.

      Here is the screenshot from the profiler:


      Here is the output the sample application produces:

      MAP: In host queue: 30409ns, In device queue: 769531ns, Execution time: 0ns, UNMAP: In host queue: 7137ns, In device queue: 559805ns, Execution time: 1279000ns
      MAP: In host queue: 22031ns, In device queue: 76642ns, Execution time: 0ns, UNMAP: In host queue: 4344ns, In device queue: 10036475ns, Execution time: 685444ns
      MAP: In host queue: 20790ns, In device queue: 66092ns, Execution time: 0ns, UNMAP: In host queue: 30409ns, In device queue: 7196744ns, Execution time: 703666ns
      MAP: In host queue: 21720ns, In device queue: 59577ns, Execution time: 0ns, UNMAP: In host queue: 39407ns, In device queue: 4989752ns, Execution time: 450333ns
      MAP: In host queue: 20479ns, In device queue: 63300ns, Execution time: 0ns, UNMAP: In host queue: 29168ns, In device queue: 4360131ns, Execution time: 437333ns
      MAP: In host queue: 21410ns, In device queue: 56474ns, Execution time: 0ns, UNMAP: In host queue: 38787ns, In device queue: 4340302ns, Execution time: 442889ns
      MAP: In host queue: 20480ns, In device queue: 63300ns, Execution time: 0ns, UNMAP: In host queue: 28858ns, In device queue: 4342435ns, Execution time: 424000ns
      MAP: In host queue: 20790ns, In device queue: 56473ns, Execution time: 0ns, UNMAP: In host queue: 26065ns, In device queue: 4305923ns, Execution time: 427000ns
      MAP: In host queue: 21410ns, In device queue: 55853ns, Execution time: 0ns, UNMAP: In host queue: 38167ns, In device queue: 4299522ns, Execution time: 425333ns
      MAP: In host queue: 33511ns, In device queue: 27617ns, Execution time: 0ns, UNMAP: In host queue: 28547ns, In device queue: 4351932ns, Execution time: 403333ns

      Notice how long the unmap operation sits in the device queue: it is simply waiting there for the other command queue to finish executing kernels. The profiler screenshot shows the same problem (compare the Submit time and the Start time).

      What am I doing wrong? How am I supposed to implement double buffering? This is VERY important functionality; it simply should work.

#include "stdafx.h"

#define __CL_ENABLE_EXCEPTIONS

#include <iostream>
#include <CL/cl.hpp>
#include <string>
#include <vector>

using namespace std;

cl::Context GetContext(
    const string& platform_vendor,
    cl_device_type native_device_type = CL_DEVICE_TYPE_DEFAULT)
{
    vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (vector<cl::Platform>::iterator p = platforms.begin(); p != platforms.end(); p++)
    {
        if ((*p).getInfo<CL_PLATFORM_VENDOR>() == platform_vendor)
        {
            cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(*p)(), 0 };
            return cl::Context(native_device_type, cps);
        }
    }
    throw string("Platform not found");
}

cl::Device GetDevice(cl::Context context)
{
    return context.getInfo<CL_CONTEXT_DEVICES>()[0];
}

cl::CommandQueue GetCommandQueue(cl::Context context, cl::Device device)
{
    return cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE);
}

string workload_kernel_source_code =
    "__kernel void workload(const __global float * input, __global float * output) \n\
    { \n\
        size_t index = get_global_id(0); \n\
        float input_value = input[index]; \n\
        for(int i = 0; i < 100; i++) \n\
        { \n\
            input_value = native_divide(1.0F, input_value + 1.0F); \n\
        } \n\
        output[index] = input_value; \n\
    } \n\
    ";

struct profiling_info
{
    unsigned long queue_time;
    unsigned long submit_time;
    unsigned long start_time;
    unsigned long end_time;
};

struct map_unmap_profiling_info
{
    profiling_info map_profiling_info;
    profiling_info unmap_profiling_info;
};

int _tmain(int argc, _TCHAR* argv[])
{
    try
    {
        string platform_vendor = "Advanced Micro Devices, Inc.";
        int element_count_to_process = 10000 * 64; // multiple of wavefront size
        int iteration_count = 10;
        vector<map_unmap_profiling_info> profiling_info_list;

        cl::Context context = GetContext(platform_vendor);
        cl::Device device = GetDevice(context);
        cl::CommandQueue transfer_input_data_command_queue = GetCommandQueue(context, device);
        cl::CommandQueue execute_workload_command_queue = GetCommandQueue(context, device);

        // 2 input buffers
        cl::Buffer input_buffer[2] = {
            cl::Buffer(context, CL_MEM_READ_ONLY, element_count_to_process * sizeof(cl_float)),
            cl::Buffer(context, CL_MEM_READ_ONLY, element_count_to_process * sizeof(cl_float))
        };
        // 1 output buffer
        cl::Buffer output_buffer(context, CL_MEM_READ_WRITE, element_count_to_process * sizeof(cl_float));

        cl::Program::Sources sources(1, make_pair(workload_kernel_source_code.c_str(), workload_kernel_source_code.length()));
        cl::Program program(context, sources);
        program.build(vector<cl::Device>(1, device));
        cl::Kernel workload_kernel(program, "workload");
        workload_kernel.setArg(1, output_buffer);

        int current_slot = 0;
        cl::Event map_input_buffer_event;
        cl::Event unmap_input_buffer_event;

        // Initial input buffer fill
        {
            cl_float * input_buffer_mapped = (cl_float *)transfer_input_data_command_queue.enqueueMapBuffer(
                input_buffer[current_slot], CL_TRUE, CL_MAP_WRITE, 0,
                element_count_to_process * sizeof(cl_float), 0, &map_input_buffer_event);
            for(int i = 0; i < element_count_to_process; i++)
                input_buffer_mapped[i] = 2.0F;
            transfer_input_data_command_queue.enqueueUnmapMemObject(
                input_buffer[current_slot], input_buffer_mapped, 0, &unmap_input_buffer_event);
            transfer_input_data_command_queue.flush();
        }

        for(int k = 0; k < iteration_count; k++)
        {
            // Make sure the input data are copied to the device
            transfer_input_data_command_queue.finish();

            map_unmap_profiling_info new_profiling_info;
            new_profiling_info.map_profiling_info.queue_time = map_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
            new_profiling_info.map_profiling_info.submit_time = map_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>();
            new_profiling_info.map_profiling_info.start_time = map_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
            new_profiling_info.map_profiling_info.end_time = map_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
            new_profiling_info.unmap_profiling_info.queue_time = unmap_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
            new_profiling_info.unmap_profiling_info.submit_time = unmap_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>();
            new_profiling_info.unmap_profiling_info.start_time = unmap_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
            new_profiling_info.unmap_profiling_info.end_time = unmap_input_buffer_event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
            profiling_info_list.push_back(new_profiling_info);

            // Enqueue actual workload execution
            workload_kernel.setArg(0, input_buffer[current_slot]);
            for(int j = 0; j < 10; j++)
            {
                execute_workload_command_queue.enqueueNDRangeKernel(
                    workload_kernel, cl::NDRange(0), cl::NDRange(element_count_to_process), cl::NullRange);
                execute_workload_command_queue.flush();
            }
            execute_workload_command_queue.flush();

            // Prefill ANOTHER input buffer for the processing in the next cycle
            if (k < iteration_count - 1)
            {
                cl_float * input_buffer_mapped = (cl_float *)transfer_input_data_command_queue.enqueueMapBuffer(
                    input_buffer[1 - current_slot], CL_TRUE, CL_MAP_WRITE, 0,
                    element_count_to_process * sizeof(cl_float), 0, &map_input_buffer_event);
                for(int i = 0; i < element_count_to_process; i++)
                    input_buffer_mapped[i] = 2.0F;
                transfer_input_data_command_queue.enqueueUnmapMemObject(
                    input_buffer[1 - current_slot], input_buffer_mapped, 0, &unmap_input_buffer_event);
                transfer_input_data_command_queue.flush();
            }

            // Make sure the chunk of data is processed
            execute_workload_command_queue.finish();

            current_slot = 1 - current_slot;
        }

        for (vector<map_unmap_profiling_info>::iterator p = profiling_info_list.begin(); p != profiling_info_list.end(); p++)
        {
            cout << "MAP: In host queue: " << (p->map_profiling_info.submit_time - p->map_profiling_info.queue_time) << "ns, "
                << "In device queue: " << (p->map_profiling_info.start_time - p->map_profiling_info.submit_time) << "ns, "
                << "Execution time: " << (p->map_profiling_info.end_time - p->map_profiling_info.start_time) << "ns, "
                << "UNMAP: In host queue: " << (p->unmap_profiling_info.submit_time - p->unmap_profiling_info.queue_time) << "ns, "
                << "In device queue: " << (p->unmap_profiling_info.start_time - p->unmap_profiling_info.submit_time) << "ns, "
                << "Execution time: " << (p->unmap_profiling_info.end_time - p->unmap_profiling_info.start_time) << "ns"
                << endl;
        }
    }
    catch (cl::Error e)
    {
        cout << "Exception at " << e.what() << ", error code: " << e.err() << endl;
    }
    catch (string e)
    {
        cout << "Exception: " << e << endl;
    }

    return 0;
}

        • Is double buffering working for AMD GPUs?
          maximmoroz

          And here is the zipped folder with profiler results.

          • Is double buffering working for AMD GPUs?
            maximmoroz

            I have also checked TransferOverlapped project from AMD APP SDK samples.

            Well, it is obviously not able to overlap host<->device memory copy operations with kernel execution, as it is using a single in-order command queue.

            And the profiler indeed shows that it is not able to:


            Yes, it is possible to overlap OpenCL workloads (kernel runs and memory transfers) with host CPU workloads, but that is not a big deal.

            So the question remains: how can one overlap host-device memory copy operations with kernel execution on AMD GPUs?

              • Is double buffering working for AMD GPUs?
                maximmoroz

                Okay, I managed to get it working.

                I used the advice I like to give a lot here: I read the programming guide again. It directly says that one should use the CL_MEM_USE_PERSISTENT_MEM_AMD flag when creating buffers.

                I used it and it works. And now I understand that the profiler screenshot of the TransferOverlap sample I attached earlier actually shows overlapping data transfers and GPU compute. Yep, just with a single queue.

                Now I have 2 questions:

                1) How can I programmatically determine that I am able to create buffers with this option specified? For the time being I compare the platform vendor name with "Advanced Micro Devices, Inc.".

                2) Is this the long-term approach: implementing overlapping of memory transfers and kernel execution by providing the user with zero-copy buffers (zero map/unmap time)? Or can we expect a different approach, that is, generic write/read/map/unmap running simultaneously with kernel execution through several command queues or an out-of-order queue?

                  • Is double buffering working for AMD GPUs?
                    genaganna


                    Originally posted by: maximmoroz
                    Okay, I managed to get it working.

                    I used the advice I like to give a lot here: I read the programming guide again. It directly says that one should use the CL_MEM_USE_PERSISTENT_MEM_AMD flag when creating buffers.

                    It is not recommended to use the CL_MEM_USE_PERSISTENT_MEM_AMD flag for write-only or read-write buffers.

                    Now I have 2 questions:

                    1) How can I programmatically determine that I am able to create buffers with this option specified? For the time being I compare the platform vendor name with "Advanced Micro Devices, Inc.".

                    What you are doing now is the best option.

                    2) Is this the long-term approach: implementing overlapping of memory transfers and kernel execution by providing the user with zero-copy buffers (zero map/unmap time)? Or can we expect a different approach, that is, generic write/read/map/unmap running simultaneously with kernel execution through several command queues or an out-of-order queue?

                    Zero-copy buffers are better when data transfers are involved. There is no direct relationship between zero-copy buffers and overlapping.

                    Overlapping works better with zero-copy buffers than with generic buffers.

                      • Is double buffering working for AMD GPUs?
                        maximmoroz

                        > It is not recommended to use the CL_MEM_USE_PERSISTENT_MEM_AMD flag for write-only or read-write buffers.

                        Why? My understanding is that it is the CL_MAP_WRITE flag that really matters when mapping/unmapping the buffer. Why should the driver care whether kernels are able to write to the buffer or not, if I am specifying that I don't care about the current content of the buffer?

                        > Zero-copy buffers are better when data transfers are involved. There is no direct relationship between zero-copy buffers and overlapping.

                        Okay, but it is the overlapping that is important to me, not the zero-copy buffers. Zero-copy buffers "accidentally" enable overlapping between data transfers and kernel execution by letting host code write (or read) directly to global device memory. I am interested in more generic overlapping, where kernels execute while data is transferred at the same time. If that data transfer is "zero copy" then great, but it should also work with "non-zero copy" transfers.

                        If we are limited to in-order queues, then we might use 2 (or more) command queues: one for kernel execution, another for memory transfers. What could be easier for the end user? But it doesn't work, *sigh*.

                        I don't want to be rude, but the current solution with these flags (*ALLOC_HOST*, *PERSISTENT*) is ugly. You know what is probably more ugly? NVIDIA's implementation of overlapping memory transfers and kernel execution. The problem is that it is completely different from AMD's.

                        P.S. One more note: the sample project from the AMD APP SDK is called TransferOverlap, not ZeroCopyBuffers. I guess that implies overlapping transfers are recognized as important functionality at AMD.

                          • Is double buffering working for AMD GPUs?
                            nou

                            CL_MEM_USE_PERSISTENT_MEM_AMD is defined in the header file under cl_amd_device_memory_flags, so maybe in the future they will add this extension to the extensions list.

                            • Is double buffering working for AMD GPUs?
                              genaganna


                              Originally posted by: maximmoroz
                              > It is not recommended to use the CL_MEM_USE_PERSISTENT_MEM_AMD flag for write-only or read-write buffers.

                              Why? My understanding is that it is the CL_MAP_WRITE flag that really matters when mapping/unmapping the buffer. Why should the driver care whether kernels are able to write to the buffer or not, if I am specifying that I don't care about the current content of the buffer?

                              > Zero-copy buffers are better when data transfers are involved. There is no direct relationship between zero-copy buffers and overlapping.

                              Okay, but it is the overlapping that is important to me, not the zero-copy buffers. Zero-copy buffers "accidentally" enable overlapping between data transfers and kernel execution by letting host code write (or read) directly to global device memory. I am interested in more generic overlapping, where kernels execute while data is transferred at the same time. If that data transfer is "zero copy" then great, but it should also work with "non-zero copy" transfers.

                              If we are limited to in-order queues, then we might use 2 (or more) command queues: one for kernel execution, another for memory transfers. What could be easier for the end user? But it doesn't work, *sigh*.

                              I don't want to be rude, but the current solution with these flags (*ALLOC_HOST*, *PERSISTENT*) is ugly. You know what is probably more ugly? NVIDIA's implementation of overlapping memory transfers and kernel execution. The problem is that it is completely different from AMD's.

                              I am not sure how you reached this conclusion. I see overlapping with non-zero-copy buffers as well. You can also experiment with MonteCarloAsian (you need to make all input buffers read-only and all output buffers write-only).

                              P.S. One more note: the sample project from the AMD APP SDK is called TransferOverlap, not ZeroCopyBuffers. I guess that implies overlapping transfers are recognized as important functionality at AMD.

                              TransferOverlap has an option to change the buffer type. Please see the help message of TransferOverlap and try different buffer types.

                                • Is double buffering working for AMD GPUs?
                                  maximmoroz

                                  > TransferOverlap has an option to change the buffer type. Please see the help message of TransferOverlap and try different buffer types.

                                  Here is the profiler output with the input buffers created with the CL_MEM_READ_ONLY flag only (CL_MEM_USE_PERSISTENT_MEM_AMD is NOT specified):


                                  As you can see, memory transfers and kernel executions are NOT overlapped.

                                  AMD APP SDK 2.5, Catalyst 11.7, Windows 7 64bit SP1, AMD Radeon HD 6950, Asus P8P67 system board.