Adept I

Device Fission Core Utilization

Seems like sub-devices start one after another

Hello Everyone,

I am currently studying the performance of the device fission extension when partitioning kernels across CPU cores. I have written a simple microbenchmark that runs many small-ish matrix multiplications in many different queues. I am testing my example on an Intel Xeon CPU E5520 (16 CPU cores).

However, the problem I see is that while each command queue does use a subsection of the device, the sub-queues run serially, one after another.

I tried the different partitions shown in the code snippet. With only 1 core per sub-device, the CPU utilization reported by "top" remains at 100%. In the second partition case, however, the utilization moves from 200% to 300% to 400% (2, 3, 4 cores out of 16) over time, which means that the queues are executing one after another.
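As a rough sanity check on what "top" could show at best for the two partition property lists, here is a minimal sketch of the arithmetic. It assumes the runtime reports 16 compute units for the root device, and the helper names are hypothetical (they are not part of the extension). Under cl_ext_device_fission, CL_DEVICE_PARTITION_EQUALLY_EXT creates as many sub-devices of the given size as fit, and CL_DEVICE_PARTITION_BY_COUNTS_EXT creates one sub-device per listed count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of sub-devices produced by
   CL_DEVICE_PARTITION_EQUALLY_EXT when each sub-device gets
   units_per_sub compute units. Leftover compute units stay unused. */
unsigned equal_partition_count(unsigned total_units, unsigned units_per_sub)
{
    assert(units_per_sub > 0);
    return total_units / units_per_sub;
}

/* Hypothetical helper: compute units covered by a
   CL_DEVICE_PARTITION_BY_COUNTS_EXT list. One sub-device is created per
   entry, so the sum is an upper bound on the number of busy cores. */
unsigned counts_partition_units(const unsigned *counts, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += counts[i];
    return sum;
}
```

With 16 compute units, the first property list gives 16 single-core sub-devices. The second gives 3 sub-devices covering 2 + 3 + 4 = 9 cores, so even with all queues running concurrently "top" would peak around 900%. Readings of 200%, 300% and 400% in sequence match one sub-device being busy at a time.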


I would expect the full CPU to be better utilized, since I enqueue ~300 to 500 kernels of at least 256*256 multiplication per queue. The only sync in this benchmark is at the end, where the code waits. I can see that all the kernels get enqueued and that the host waits at the end.

I populate a class "topology" with sub-queues and so on. I understand it's a mix of C and C++, but I don't expect that to matter?

Does anyone have a toy example of device fission that utilizes all the compute units of the device with multiple sub-queues? I have attached my code that creates the command queues. I can send a test case if required. I am using SDK 2.4.


Thank You.


    cl_context_properties cps[3] = {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)(platforms[platform_touse]),
        0
    };
    cl_context_properties *cprops = cps;

    topo->root_context = clCreateContextFromType(cprops, (cl_device_type)dtype, NULL, NULL, &status);
    if(cl_errChk(status, "creating Root Context")) exit(1);

    bool extcheck = check_for_extensions("cl_ext_device_fission", device_touse, topo->devices);
    if(!extcheck)
    {
        printf("Device Fission Not Supported");
        exit(1);
    }

    // Initialize required partition property - I have tried both of the property sets below.
    cl_device_partition_property_ext partitionPrty[3] =
        { CL_DEVICE_PARTITION_EQUALLY_EXT, 1, CL_PROPERTIES_LIST_END_EXT };
    // Alternative property set (only one of the two is compiled in at a time):
    // cl_device_partition_property_ext partitionPrty[6] =
    //     { CL_DEVICE_PARTITION_BY_COUNTS_EXT, 2, 3, 4,
    //       CL_PARTITION_BY_COUNTS_LIST_END_EXT, CL_PROPERTIES_LIST_END_EXT };

    // Initialize clCreateSubDevicesEXT function pointer
    INIT_CL_EXT_FCN_PTR(clCreateSubDevicesEXT);

    // Get number of sub-devices
    printf("no of subdevices %d\n", topo->numSubDevices);
    topo->root_device = topo->devices[device_touse];
    status = pfn_clCreateSubDevicesEXT(topo->devices[device_touse], partitionPrty, 0, NULL, &topo->numSubDevices);
    if(cl_errChk(status, "clCreateSubDevicesEXT failed.")) exit(1);
    printf("no of subdevices %d\n", topo->numSubDevices);

    topo->subDevices = (cl_device_id *)malloc((topo->numSubDevices) * sizeof(cl_device_id));
    topo->subQueue = (cl_command_queue *)malloc((topo->numSubDevices) * sizeof(cl_command_queue));
    if(NULL == (topo->subDevices))
        printf("Failed to allocate memory(subDevices)");

    status = pfn_clCreateSubDevicesEXT(topo->devices[device_touse], partitionPrty, topo->numSubDevices, topo->subDevices, NULL);
    if(cl_errChk(status, "clCreateSubDevicesEXT failed.")) exit(1);

    // Create context for sub-devices
    topo->subContext = clCreateContext(cps, topo->numSubDevices, topo->subDevices, NULL, NULL, &status);
    if(cl_errChk(status, "clCreateContext for subdevices failed.")) exit(1);
    printf("Creating contexts for subdevices\n");

    // One command queue per sub-device
    for(int i=0; i<(topo->numSubDevices); i++)
    {
        printf("Init Sub-queue \t %d\n", i);
        topo->subQueue[i] = clCreateCommandQueue(topo->subContext, topo->subDevices[i], 0, &status);
        cl_errChk(status, "clCreateCommandQueue for subdevices failed.");
    }

3 Replies
Adept I

Hello, I don't mean to be a pain, but does anyone have information about this? It fell off the stack over the weekend.

To restate the problem in short: when one creates multiple sub-devices and a command queue for each of them, the resource utilization seen in "top" or the task manager shows that the queues don't get distributed over the cores.

I do generate a number of problems which run for a significant amount of time.

I was hoping someone had an example that does use a full CPU with multiple command queues.

My core loop is below. We wait only after all the problems are enqueued.

    for(unsigned int k = 0; k < N; k++)
    {
        // This function just returns the number of the next queue in a simple round robin fashion.
        int sub_queue_id = topo->schedule_kernel();
        computekernel(k, topo->subQueue[sub_queue_id]);
    }
    printf("Waiting for Kernels to be done \n");
    for(int i=0; i< (topo->numSubDevices); i++)
        clFinish(topo->subQueue[i]);
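For readers without the "topology" class at hand, the round-robin choice of queue mentioned in the comment could look roughly like the sketch below. This is only an assumption about what schedule_kernel() does, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for topo->schedule_kernel(): hands out queue
   indices 0, 1, ..., num_queues - 1 and then wraps around to 0. */
typedef struct {
    unsigned num_queues; /* number of sub-queues to cycle through */
    unsigned next;       /* index handed out by the next call */
} round_robin;

unsigned rr_next(round_robin *rr)
{
    assert(rr->num_queues > 0);
    unsigned id = rr->next;
    rr->next = (rr->next + 1) % rr->num_queues;
    return id;
}
```

With this scheme every sub-queue receives about N / numSubDevices kernels, so an idle sub-device is not explained by the distribution of work.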


Do you call clFlush() before clFinish()? Execution on the CPU is "lazy": it doesn't start until you call clWaitForEvents/clFinish/clFlush or a blocking read/write.

So when you have clFinish in a loop, it calls clFinish on the first queue, which starts execution and blocks until that queue has finished, then does the same on the second, and so on. This is why you see serialization.
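The effect can be illustrated with a small toy model. It does not use OpenCL at all. It only assumes what is described above (a queue starts executing when it is first flushed or finished, and finish blocks until the queue drains), plus the assumption that sub-devices do not compete for cores. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one lazily started in-order queue. */
typedef struct {
    double work;     /* seconds of enqueued kernel time */
    int    started;  /* has execution begun? */
    double start_at; /* model time at which execution began */
} lazy_queue;

/* Models clFlush: submit the work and return immediately. */
static void model_flush(lazy_queue *q, double now)
{
    if (!q->started) {
        q->started = 1;
        q->start_at = now;
    }
}

/* Models clFinish: submit if needed, then block until the queue drains.
   Returns the model time at which the call returns. */
static double model_finish(lazy_queue *q, double now)
{
    model_flush(q, now);
    double done = q->start_at + q->work;
    return done > now ? done : now;
}

/* Finish-only loop: each queue starts only when its finish is reached. */
double makespan_finish_only(lazy_queue *qs, size_t n)
{
    double now = 0.0;
    for (size_t i = 0; i < n; i++)
        now = model_finish(&qs[i], now);
    return now;
}

/* Flush every queue first, then wait on each one. */
double makespan_flush_first(lazy_queue *qs, size_t n)
{
    double now = 0.0;
    for (size_t i = 0; i < n; i++)
        model_flush(&qs[i], now);
    for (size_t i = 0; i < n; i++)
        now = model_finish(&qs[i], now);
    return now;
}
```

For three queues holding 2, 3 and 4 seconds of work, the finish-only loop takes 2 + 3 + 4 = 9 seconds in this model, while flushing first takes max(2, 3, 4) = 4 seconds. In real host code the two loops would call clFlush and clFinish on each sub-queue.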


Thanks Nou, that was spot on.

Adding the snippet below before the finish loop saturates my entire system. I had just assumed that enqueuing 1000+ kernels on each queue would have started execution on the device, however lazy it may be.


    for(int i=0; i< (topo->numSubDevices); i++)
        clFlush(topo->subQueue[i]);

Ideally, the deviceFission SDK example should discuss this. It is strange that the SDK example has a finish inside the loop that does the enqNDRange on the sub-queues.