4 Replies Latest reply on Aug 17, 2010 5:15 AM by philips

    Device fission performance

      why so slow?


      I tried using device fission to speed up my program, but instead it takes a big performance hit.


      My raycasting algorithm runs at 19 FPS on a machine whose two 4-core CPUs appear as a single OpenCL device (8 physical cores plus Hyper-Threading). Every core is at 90-95% load.

      Since the cores basically work on random work-items (rays, in this case), the caches are not used efficiently.

      The goal was to split the CPUs into single cores and have each core work on one column of rays. For testing purposes, though, I started with two sub-devices.



      I split the CPUs into two sub-devices (CL_DEVICE_PARTITION_EQUALLY_EXT, 8, ...).

      One sub-device works on the first half of all work-groups and the other on the rest (each renders half the image).

      To do this, my kernel takes an int offset parameter. Every frame I set all arguments with kernel.setArg and launch the kernel on the first sub-device; then I change only the offset argument and launch it on the second sub-device.
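      For reference, the splitting scheme I mean can be sketched as a small helper (hypothetical code, not taken from my program): divide the work-items across the sub-devices and hand each one an (offset, count) pair, where the offset is what the kernel's int parameter receives.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: divide `total` work-items across `parts`
// sub-devices, returning an (offset, count) pair per sub-device.
std::vector<std::pair<int, int>> split_range(int total, int parts) {
    std::vector<std::pair<int, int>> ranges;
    int base = total / parts, rem = total % parts, offset = 0;
    for (int i = 0; i < parts; ++i) {
        int count = base + (i < rem ? 1 : 0);  // spread the remainder
        ranges.push_back({offset, count});
        offset += count;
    }
    return ranges;
}
```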


      Doing this, I only get 7 FPS, and the cores are only at about 25% load.


      If I split the device into eight sub-devices, I only get 2 FPS and 7% load.



      Now I'm wondering why that is.


      Any ideas?











        • Device fission performance

          It looks like you are running the sub-devices one after the other.

          Please post your code here or send it to streamdeveloper@amd.com

            • Device fission performance

              I tried to cut it down as far as possible.

              (1) First I create the regular device and command queue.

              (2) Then the sub-devices, stored in the vector s_fissionDevices. For every sub-device I create a queue.

              (3) I create the cl::Program from the same context. I add the sub-devices to the s_devices vector, so that vector contains the parent device as well as all sub-devices. Then I build the program and create the kernel from it.

              (4) I set all the parameters for the kernel.

              (5) For every sub-device I set the offset parameter for its work-items, then enqueue the launch on that sub-device's queue. When all launches are lined up, finish() is called on every queue.


              and that's basically it.





              printf("OpenCL - looking for CPU device\n");

              // ( 1 ) REGULAR DEVICE
              err = platforms[i].getDevices(CL_DEVICE_TYPE_CPU, &s_devices);
              checkError("OpenCL - cl::Platform::getDevices()", err);
              s_context = cl::Context(s_devices, NULL, NULL, NULL, &err);
              checkError("OpenCL - cl::Context()", err);
              s_device = s_devices[0];
              printf("OpenCL - found CPU Device: %s\n", s_device.getInfo<CL_DEVICE_NAME>().c_str());
              s_commandQueue = cl::CommandQueue(s_context, s_device, CL_QUEUE_PROFILING_ENABLE, &err);
              checkError("OpenCL - cl::CommandQueue()", err);

              // ( 2 ) DEVICE FISSION DEVICES
              cl_device_partition_property_ext extProps[] = {
                  CL_DEVICE_PARTITION_EQUALLY_EXT, 8, CL_PROPERTIES_LIST_END_EXT, 0 };
              err = s_device.createSubDevices(extProps, &s_fissionDevices);
              checkError("OpenCL - clCreateSubDevicesEXT()", err);
              printf("OpenCL - partitioned CPU Device\n");
              for (int i = 0; i < s_fissionDevices.size(); i++)
              {
                  s_fissionQueues.push_back(cl::CommandQueue(s_context, s_fissionDevices[i], CL_QUEUE_PROFILING_ENABLE, &err));
                  checkError("OpenCL - cl::CommandQueue() - fissionQueue", err);
                  s_fissionEvents.push_back(cl::Event());
              }

              // ...

              // ( 3 ) PROGRAM AND KERNELS
              m_program = cl::Program(s_context, sources, &err);
              checkError("OpenCL - cl::Program() GPU", err);
              for (int i = 0; i < s_fissionQueues.size(); i++)
                  s_devices.push_back(s_fissionDevices[i]);
              err = m_program.build(s_devices, m_buildOptions.getPtr());
              printf("OpenCL - finished compiling\n");
              if (err != CL_SUCCESS)
              {
                  std::string buildLog = m_program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(s_device);
                  printf("%s", buildLog.c_str());
                  fail("OpenCL - failed to compile program");
              }
              std::vector<cl::Kernel> vecKernels;
              cl::Kernel k = cl::Kernel(m_program, "kernelName", &err);
              checkError("OpenCL - cl::Kernel() GPU", err);
              vecKernels.push_back(k);
              err = m_program.createKernels(&vecKernels);
              checkError("m_gpuProgram.createKernels()", err);

              // ...

              // ( 4 ) KERNEL PREPARATION
              cl::Kernel& kernel = module->getKernel("kernelName");
              // ...
              kernel.setArg(....)

              // ( 5 ) KERNEL LAUNCH
              // ... cl::NDRange local and global
              for (int i = 0; i < s_fissionQueues.size(); i++)
              {
                  kernel.setArg(10, i); // OFFSET
                  err = s_fissionQueues[i].enqueueNDRangeKernel(kernel, cl::NullRange, global, local, NULL, &s_fissionEvents[i]);
                  checkError("ClModule::launchKernel() - enqueueNDRangeKernel", err);
              }
              for (int i = 0; i < s_fissionQueues.size(); i++)
              {
                  err = s_fissionQueues[i].finish();
                  checkError("cl::CommandQueue::finish()", err);
              }

                • Device fission performance

                  for (int i = 0; i < s_fissionQueues.size(); i++)
                  {
                      err = s_fissionQueues[i].finish();
                      checkError("cl::CommandQueue::finish()", err);
                  }

                  By calling finish() in a loop, you serialize execution on the sub-devices: each finish() first submits that queue's pending commands and then blocks until they complete, so the next queue's work is not submitted until the previous one has finished.

                  You should use Event::waitForEvents(s_fissionEvents) instead: it first issues the commands identified by the events in s_fissionEvents to all sub-devices, then waits for all of those commands to complete.
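                  The difference can be illustrated without OpenCL at all. In this sketch (plain C++ with std::async; fake_kernel and the 100 ms workload are made up for illustration), a deferred task only starts when it is waited on, much like commands on an unflushed queue that are first submitted by finish(), while eagerly launched tasks overlap:

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Toy stand-in for one sub-device's kernel launch: just sleeps 100 ms.
static void fake_kernel(int /*offset*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

// Mimics finish() in a loop: each deferred task only runs when it is
// waited on, so the four launches execute one after another.
double run_serial_ms() {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < 4; ++i)
        tasks.push_back(std::async(std::launch::deferred, fake_kernel, i));
    for (auto& f : tasks) f.get();  // runs each task here, serially
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

// Mimics waitForEvents: all tasks are issued up front and run
// concurrently; afterwards we wait for all of them.
double run_parallel_ms() {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < 4; ++i)
        tasks.push_back(std::async(std::launch::async, fake_kernel, i));
    for (auto& f : tasks) f.get();  // tasks already running; just wait
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}
```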

                  Laurent Morichetti
                  Advanced Micro Devices Inc.

                    • Device fission performance

                      Thank you. It now uses all cores simultaneously. However, the CPU load is still very low.

                      With two sub-devices I now get 10 FPS instead of 7, at around 35% CPU load.

                      With eight sub-devices I get 8 FPS instead of 2, at around 30% CPU load.



                      For these numbers I use a "persistent thread" mode: one work-group stays on a core and keeps fetching new work until everything is done. Normally this is faster than having every work-group run only a single job.
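                      The idea can be sketched with plain C++ threads (hypothetical code, not my actual kernel): a fixed pool of workers, one per core, keeps fetching the next ray index from a shared counter until all rays are traced.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// "Persistent thread" sketch: `num_workers` long-lived workers pull
// ray indices from a shared atomic counter until the work runs out.
std::vector<int> render(int num_rays, int num_workers) {
    std::atomic<int> next{0};
    std::vector<int> traced(num_rays, 0);
    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                int ray = next.fetch_add(1);  // grab new work
                if (ray >= num_rays) break;   // all rays handled
                traced[ray] = 1;              // "trace" this ray
            }
        });
    }
    for (auto& t : pool) t.join();
    return traced;
}
```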


                      Now I tried switching that mode off.

                      I get around 13-14 FPS for both two and eight sub-devices, at around 60% CPU load.





                      EDIT: the reason I only get 13-14 FPS might be the scene itself, because there is less work to do in some regions of the image. I'll have to run some more tests on that.

                      The bigger issue is that the persistent thread mode is slower...