philips
Journeyman III

device_fission samples?

questions after reading the documentation

Hi.

 

Do you know of any samples where the device fission extension is used?

 

For better cache usage in a raycaster, I am trying to have each core of the CPU work on one column of the picture instead of on arbitrary work-groups.

There are a couple of things the documentation left me unsure about.

 

- Can I have both 8 command queues for the cores of the CPU as well as a command queue for the parent device (the entire CPU)?

So I could use the individual cores for the raycasting, but then use the entire CPU for post-processing

- Can I build the program with only one build call using all subdevices and the parent device as parameter?

- If I use only one program can I use only one instance of the kernels as well?

And could they all run the same kernel with different arguments, at the same time?

So when I want to start the kernel on all cores, I would first set the common kernel arguments, and then for every core just set the core-specific arguments, enqueue the kernel, and move on to the next core (change a kernel argument, enqueue, ...)?

 

- Can you somehow use the device fission extension with the OpenCL 1.1 C++ bindings and the current StreamSDK?

If I define USE_CL_DEVICE_FISSION, it won't compile because cl_ext.h is still at revision 10424 instead of 11702.

 

Thanks for reading.

 

 

 

omkaranathan
Adept I

Do you know of any samples where the device fission extension is used?

Currently, no. You can expect a sample in one of the upcoming releases.

- Can I have both 8 command queues for the cores of the CPU as well as a command queue for the parent device (the entire CPU)?

So I could use the individual cores for the raycasting, but then use the entire CPU for post-processing

Yes, you can. Attaching example OpenCL code using the extension.

On a machine that has a CPU device with 8 cores (real or hyper-threaded), the given code will give you 8 sub-devices.

- Can I build the program with only one build call using all subdevices and the parent device as parameter?

The intended behavior is that a single program can be built against the parent device, and kernel objects constructed from the resulting program can be run against any of the sub-device queues.

- If I use only one program can I use only one instance of the kernels as well?

And could they all run the same kernel with different arguments, at the same time?

You can share the same kernel object as long as you set the arguments, call enqueueNDRangeKernel, then set the next arguments and call enqueueNDRangeKernel again, all from within the same thread. However, a kernel object is not thread safe: if you call the same kernel from different threads, you may see races and possibly data corruption. The threading issue is easy to work around by creating a separate kernel object for each queue that will be dispatched from a different thread.

- Can you somehow use the device fission extension with the OpenCL 1.1 C++ bindings and the current StreamSDK?

If I define USE_CL_DEVICE_FISSION, it won't compile because cl_ext.h is still at revision 10424 instead of 11702.

You can expect OpenCL 1.1 support in the upcoming release.

    #define USE_CL_DEVICE_FISSION 1
    #include <CL/cl.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        cl::Context context;
        cl::CommandQueue parentQueue;
        std::vector<cl::CommandQueue> subQueues;
        std::vector<cl::Platform> platforms;
        std::vector<cl::Device> subDevices;

        cl::Platform::get(&platforms);
        if (platforms.size() == 0) {
            std::cout << "Platform size 0\n";
            return -1;
        }

        cl_context_properties properties[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};

        // Currently, only the CPU device supports fission
        context = cl::Context(CL_DEVICE_TYPE_CPU, properties);
        std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        // Check that device fission is supported; just use the first device
        if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find("cl_ext_device_fission")
                == std::string::npos) {
            std::cout << "Required that device support cl_ext_device_fission" << std::endl;
            return -1;
        }

        // Partition into sub-devices of 1 compute unit each
        cl_device_partition_property_ext subDeviceProperties[] = {
            CL_DEVICE_PARTITION_EQUALLY_EXT, 1, CL_PROPERTIES_LIST_END_EXT, 0};
        devices[0].createSubDevices(subDeviceProperties, &subDevices);
        if (subDevices.empty()) {
            std::cout << "Failed to allocate subdevices" << std::endl;
            return -1;
        }
        std::cout << "Number of sub-devices " << subDevices.size() << std::endl;

        // Create command queue for the parent device
        parentQueue = cl::CommandQueue(context, devices[0]);

        // Create command queues for the sub-devices
        for (std::vector<cl::Device>::iterator i = subDevices.begin();
             i != subDevices.end(); ++i) {
            subQueues.push_back(cl::CommandQueue(context, *i));
        }
        return 0;
    }
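The column-per-core scheme from the question needs a per-queue argument that tells each sub-device which columns to render. The following is only a sketch of that index arithmetic, with no OpenCL calls in it. `columnRange` and `allColumnRanges` are hypothetical helper names (not part of any SDK), and the sketch assumes each sub-device gets one contiguous block of columns, with any remainder spread over the first few blocks.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper, not an OpenCL API: half-open column range [first, second)
// for sub-device `index` out of `count`. Leftover columns (width % count) are
// handed out one each to the lowest-numbered sub-devices.
std::pair<int, int> columnRange(int width, int count, int index) {
    assert(count > 0 && index >= 0 && index < count);
    const int base = width / count;
    const int extra = width % count;
    const int begin = index * base + (index < extra ? index : extra);
    const int end = begin + base + (index < extra ? 1 : 0);
    return std::make_pair(begin, end);
}

// Ranges for all sub-devices, in the same order as the sub-queues.
std::vector<std::pair<int, int> > allColumnRanges(int width, int count) {
    std::vector<std::pair<int, int> > ranges;
    for (int i = 0; i < count; ++i) {
        ranges.push_back(columnRange(width, count, i));
    }
    return ranges;
}
```

With one range per sub-queue, the two values could be set as the queue-specific kernel arguments just before each enqueue, all from a single thread, which is the pattern the reply describes as safe for a shared kernel object.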


Thank you for your answer and example.


New question.

 

The PC I plan to use device fission on has two 4-core CPUs and I want to make the most of the caches.

If I use CL_DEVICE_PARTITION_EQUALLY_EXT, will the first 4 returned devices all belong to one of the CPUs, or is the order of the 8 devices independent of the CPUs?

Of course I could split it into two 4-core devices, but I would rather have eight so that I can also use the smaller caches.

  


Originally posted by: philips
New question.

The PC I plan to use device fission on has two 4-core CPUs and I want to make the most of the caches.

If I use CL_DEVICE_PARTITION_EQUALLY_EXT, will the first 4 returned devices all belong to one of the CPUs, or is the order of the 8 devices independent of the CPUs?

Of course I could split it into two 4-core devices, but I would rather have eight so that I can also use the smaller caches.

Could you please run the CLInfo sample and paste the output here?

If OpenCL reports one device with 8 compute units, 8 sub-devices will be created with CL_DEVICE_PARTITION_EQUALLY_EXT.


Unfortunately I won't have access to the machine before Monday or Tuesday, so I can't run CLInfo yet. But yes, it should create 8 sub-devices.

 

I want to use the machine for a raycasting algorithm. One CPU should render one half of the picture, the other CPU the rest, so as to ideally use the Level 3 caches. However I also want to make good use of the Level 2 caches. So one core should render a column of the picture.

Therefore I need 8 sub-devices, but also need to know which CPU a sub-device belongs to.

How do I make this happen?


Originally posted by: philips
Unfortunately I won't have access to the machine before Monday or Tuesday, so I can't run CLInfo yet. But yes, it should create 8 sub-devices.

I want to use the machine for a raycasting algorithm. One CPU should render one half of the picture, the other CPU the rest, so as to ideally use the Level 3 caches. However I also want to make good use of the Level 2 caches. So one core should render a column of the picture.

Therefore I need 8 sub-devices, but also need to know which CPU a sub-device belongs to.

How do I make this happen?

I don't think there is a way if OpenCL reports one device with 8 compute units.

 


Can you make a sub-device of a sub-device?

If that were possible I could first make two sub-devices via AFFINITY_DOMAIN_L3_CACHE and then subdivide those into their cores.


Originally posted by: philips
Can you make a sub-device of a sub-device?

Yes, it is possible to create sub-devices from sub-devices.

If that were possible I could first make two sub-devices via AFFINITY_DOMAIN_L3_CACHE and then subdivide those into their cores.

CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT is not supported yet.
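Without affinity-domain partitioning, the remaining route discussed here is to nest equal partitions: split the parent into two 4-unit sub-devices, then split each of those into single-unit sub-devices. The sketch below is only a bookkeeping model in plain C++, not OpenCL code. `Units`, `Leaf`, `partitionEqually` and `twoLevelPartition` are hypothetical names, and it models the rule that compute units left over after an equal split go unused. It shows that nesting at least records which first-level sub-device each leaf came from. Whether a first-level group coincides with a physical CPU is an assumption that nothing in this thread confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not an OpenCL API: a "device" is the list of
// compute-unit indices it owns.
typedef std::vector<int> Units;

// Models an equal partition: groups of `unitsPerGroup` compute units each.
// Units that do not fill a whole group are left unused.
std::vector<Units> partitionEqually(const Units& parent, int unitsPerGroup) {
    assert(unitsPerGroup > 0);
    const std::size_t n = static_cast<std::size_t>(unitsPerGroup);
    std::vector<Units> groups;
    for (std::size_t i = 0; i + n <= parent.size(); i += n) {
        groups.push_back(Units(parent.begin() + i, parent.begin() + i + n));
    }
    return groups;
}

// A single-unit sub-device plus the index of the first-level group it came from.
struct Leaf {
    int parentGroup;
    int unit;
};

// Two nested equal partitions: first into groups of `unitsPerParent`,
// then each group into single-unit leaves.
std::vector<Leaf> twoLevelPartition(int totalUnits, int unitsPerParent) {
    Units all(totalUnits);
    for (int i = 0; i < totalUnits; ++i) all[i] = i;

    std::vector<Leaf> leaves;
    const std::vector<Units> parents = partitionEqually(all, unitsPerParent);
    for (std::size_t p = 0; p < parents.size(); ++p) {
        const std::vector<Units> children = partitionEqually(parents[p], 1);
        for (std::size_t c = 0; c < children.size(); ++c) {
            Leaf leaf = {static_cast<int>(p), children[c].front()};
            leaves.push_back(leaf);
        }
    }
    return leaves;
}
```

For 8 compute units and groups of 4, the model yields 8 leaves, with the first four tagged as group 0 and the last four as group 1. In a real program the same bookkeeping would come from keeping each first-level sub-device alongside the sub-devices created from it.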
