8 Replies Latest reply on Aug 13, 2010 12:05 PM by genaganna

    device_fission samples?

    philips
      questions after reading the documentation

      hi.

       

      Do you know of any samples where the device fission extension is used?

       

      For better cache usage in a raycaster I am trying to have every core of the cpu work on a column of the picture instead of working on random work-groups.

      There are a couple of things the documentation left me unsure about.

       

      - Can I have both 8 command queues for the cores of the CPU as well as a command queue for the parent device (the entire CPU)?

      So I could use the individual cores for the raycasting, but then use the entire CPU for post-processing

      - Can I build the program with only one build call using all subdevices and the parent device as parameter?

      - If I use only one program can I use only one instance of the kernels as well?

      Could all cores then run that same kernel with different arguments at the same time?

      So when I want to start the kernel on all cores, I would first set the common kernel arguments. Then, for every core, I would set the core-specific arguments, enqueue the kernel, and move on to the next core (change a kernel argument, enqueue, and so on)?

       

      - Can you somehow use the device fission extension with the OpenCL 1.1 C++ bindings and the current StreamSDK?

      If I define USE_CL_DEVICE_FISSION it won't compile because the cl_ext.h is still in revision 10424 instead of 11702

       

      Thanks for reading.

       

       

       

        • device_fission samples?
          omkaranathan

           

          Do you know of any samples where the device fission extension is used?

          Currently, no. You can expect a sample in one of the upcoming releases.

          - Can I have both 8 command queues for the cores of the CPU as well as a command queue for the parent device (the entire CPU)?

          So I could use the individual cores for the raycasting, but then use the entire CPU for post-processing

          Yes, you can. Attaching example OpenCL code using the extension.

          On a machine that has a CPU device with 8 cores (real or hyper-threaded), the given code will give you 8 sub-devices.

          - Can I build the program with only one build call using all subdevices and the parent device as parameter?

          The intended behavior is that a single program can be built against the parent device, and kernel objects constructed from the resulting program can be run against any of the sub-device queues.

          - If I use only one program can I use only one instance of the kernels as well?

          Could all cores then run that same kernel with different arguments at the same time?

          You can share the same kernel object as long as you set the arguments, call enqueueNDRangeKernel, then set the next arguments and call enqueueNDRangeKernel again, all from within the same thread. However, kernel objects are not thread safe: if you call the same kernel from different threads, you may see races and possible data corruption. The workaround is simple: create a separate kernel object for each queue that is dispatched from a different thread.
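The single-threaded pattern described above can be modeled without OpenCL. The sketch below is only an illustration: `KernelModel`, `QueueModel`, and `dispatchFromOneThread` are hypothetical stand-ins, not part of the OpenCL C++ bindings. It assumes, as the answer implies, that each enqueue uses the argument values set at the time of the call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel object: holds the most recently set argument.
struct KernelModel {
    int columnArg = -1;
    void setArg(int value) { columnArg = value; }
};

// Hypothetical stand-in for a command queue: records the argument seen at enqueue time.
struct QueueModel {
    std::vector<int> launchedColumns;
    void enqueue(const KernelModel& kernel) {
        assert(kernel.columnArg >= 0);                // argument must be set first
        launchedColumns.push_back(kernel.columnArg);  // snapshot taken at enqueue
    }
};

// One thread, one shared kernel: change the argument, enqueue, move to the next queue.
std::vector<QueueModel> dispatchFromOneThread(std::size_t queueCount) {
    KernelModel sharedKernel;
    std::vector<QueueModel> queues(queueCount);
    for (std::size_t i = 0; i < queueCount; ++i) {
        sharedKernel.setArg(static_cast<int>(i));  // per-queue argument
        queues[i].enqueue(sharedKernel);
    }
    return queues;
}
```

With several host threads, the set-then-enqueue pair on a shared object could interleave, which is why the answer recommends one kernel object per dispatching thread.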

          - Can you somehow use the device fission extension with the OpenCL 1.1 C++ bindings and the current StreamSDK?

          If I define USE_CL_DEVICE_FISSION it won't compile because the cl_ext.h is still in revision 10424 instead of 11702

          You can expect OpenCL 1.1 support in the upcoming release.

          // Requires USE_CL_DEVICE_FISSION to be defined before including <CL/cl.hpp>;
          // also needs <iostream>, <string>, and <vector>.
          cl::Context context;
          cl::CommandQueue parentQueue;
          std::vector<cl::CommandQueue> subQueues;
          std::vector<cl::Platform> platforms;
          std::vector<cl::Device> subDevices;

          cl::Platform::get(&platforms);
          if (platforms.size() == 0) {
              std::cout << "Platform size 0\n";
              return -1;
          }

          cl_context_properties properties[] = {
              CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};

          // Currently only the CPU device supports fission
          context = cl::Context(CL_DEVICE_TYPE_CPU, properties);
          std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

          // Check that device fission is supported; just use the first device
          if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find("cl_ext_device_fission") == std::string::npos) {
              std::cout << "Required that device support cl_ext_device_fission" << std::endl;
              return -1;
          }

          // Partition into sub-devices with one compute unit each
          cl_device_partition_property_ext subDeviceProperties[] = {
              CL_DEVICE_PARTITION_EQUALLY_EXT, 1, CL_PROPERTIES_LIST_END_EXT, 0};
          devices[0].createSubDevices(subDeviceProperties, &subDevices);
          if (subDevices.empty()) {
              std::cout << "Failed to allocate subdevices" << std::endl;
              return -1;
          }
          std::cout << "Number of sub-devices " << subDevices.size() << std::endl;

          // Create command queue for the parent device
          parentQueue = cl::CommandQueue(context, devices[0]);

          // Create command queues for the sub-devices
          for (std::vector<cl::Device>::iterator i = subDevices.begin(); i != subDevices.end(); ++i) {
              subQueues.push_back(cl::CommandQueue(context, *i));
          }

            • device_fission samples?
              philips

              Thank you for your answer and example.

                • device_fission samples?
                  philips

                  new question.

                   

                  The PC I plan to use device fission on has two 4-core CPUs and I want to make the most of the caches.

                  If I use CL_DEVICE_PARTITION_EQUALLY_EXT, will the first 4 returned devices belong to only one of the CPUs, or is the order of the 8 devices independent of the CPUs?

                  Of course I could split it in two 4-core devices, but I would rather have eight to also use the smaller caches.

                    

                    • device_fission samples?
                      genaganna

                       


                      Could you please run the CLInfo sample and paste the output here?

                      If OpenCL reports one device with 8 compute units, CL_DEVICE_PARTITION_EQUALLY_EXT with one compute unit per sub-device will create 8 sub-devices.
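The arithmetic behind that count can be sketched as follows. This is a minimal sketch based on my reading of the cl_ext_device_fission specification, under which equal partitioning creates as many sub-devices of the requested size as fit and leaves any remaining compute units unused. `expectedSubDeviceCount` is a hypothetical helper, not an OpenCL call.

```cpp
#include <cassert>

// Expected number of sub-devices for equal partitioning:
// as many groups of `unitsPerSubDevice` as fit into `computeUnits`.
// Leftover compute units are assumed to stay unused.
unsigned expectedSubDeviceCount(unsigned computeUnits, unsigned unitsPerSubDevice) {
    assert(unitsPerSubDevice > 0);
    return computeUnits / unitsPerSubDevice;
}
```

For the machine discussed here, 8 compute units with 1 unit per sub-device gives 8 sub-devices, and 4 units per sub-device gives 2. The count actually returned by createSubDevices should still be checked at run time.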

                        • device_fission samples?
                          philips

                          Unfortunately I won't have access to the machine before Monday or Tuesday, so I can't run CLInfo. But yes, it should create 8 sub-devices.

                           

                          I want to use the machine for a raycasting algorithm. One CPU should render one half of the picture, the other CPU the rest, so as to ideally use the Level 3 caches. However I also want to make good use of the Level 2 caches. So one core should render a column of the picture.

                          Therefore I need 8 sub-devices, but also need to know which CPU a sub-device belongs to.

                          How do I make this happen?
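The work split described above can be sketched independently of OpenCL. The code below is only an illustration of the index mapping: `ColumnOwner` and `assignColumns` are hypothetical names, and it assumes that sub-devices 0 to 3 sit on the first CPU and 4 to 7 on the second, which nothing in this thread confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ColumnOwner {
    std::size_t cpu;        // physical CPU that renders the column
    std::size_t subDevice;  // index into the list of sub-device queues
};

// Each CPU gets one contiguous band of columns (for L3 locality);
// within a band, columns are dealt round-robin to that CPU's cores.
// ASSUMPTION: sub-device indices are grouped by CPU, coresPerCpu at a time.
std::vector<ColumnOwner> assignColumns(std::size_t width,
                                       std::size_t cpuCount,
                                       std::size_t coresPerCpu) {
    assert(cpuCount > 0 && coresPerCpu > 0);
    std::vector<ColumnOwner> owners(width);
    for (std::size_t x = 0; x < width; ++x) {
        std::size_t cpu = x * cpuCount / width;  // band index
        // First column of this band: ceil(cpu * width / cpuCount)
        std::size_t bandStart = (cpu * width + cpuCount - 1) / cpuCount;
        std::size_t core = (x - bandStart) % coresPerCpu;
        owners[x] = {cpu, cpu * coresPerCpu + core};
    }
    return owners;
}
```

Calling `assignColumns(width, 2, 4)` gives the left half of the picture to the first CPU and the right half to the second. Whether the sub-device order really follows the physical CPUs is exactly the open question of this post.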

                            • device_fission samples?
                              genaganna

                               


                              I don't think there is a way if OpenCL reports one device with 8 compute units.

                               

                                • device_fission samples?
                                  philips

                                  Can you make a sub-device of a sub-device?

                                  If that were possible, I could first make two sub-devices via CL_AFFINITY_DOMAIN_L3_CACHE_EXT and then subdivide those into their cores.

                                    • device_fission samples?
                                      genaganna

                                       

                                      Originally posted by: philips Can you make a sub-device of a sub-device?

                                      Yes, it is possible to create sub-devices from sub-devices.

                                       

                                      If that were possible, I could first make two sub-devices via CL_AFFINITY_DOMAIN_L3_CACHE_EXT and then subdivide those into their cores.

                                       

                                      CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT is not supported yet.