perhaad

Device Fission Core Utilization

Discussion created by perhaad on May 13, 2011
Latest reply on May 17, 2011 by perhaad
Seems like sub-devices start one after another

Hello Everyone,

I am presently studying the performance of the device fission extension and of partitioning kernels over CPU cores. I have written a simple microbenchmark that runs many small-ish matrix multiplications across many different queues. I am testing my example on an Intel Xeon CPU E5520 (16 CPU cores).

However, the problem I see is that while each command queue does use a subsection of the device, the sub-queues run serially.

I tried the different partitions shown in the code snippet. With only 1 core per sub-device, the CPU utilization reported by "top" stays at 100%. In the 2nd partition case, the utilization moves from 200% to 300% to 400% over time (2, 3, then 4 cores out of 16), which suggests that the queues are executing one after another.
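As a rough sanity check on those numbers, the sketch below computes what "top" would report if every sub-device of a partition were busy at once. It is only an illustration: the `SKETCH_*` tags and both function names are hypothetical stand-ins, not the constants or API from `cl_ext.h`, and it assumes each busy core shows up as 100%.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in tags for this sketch only; NOT the real values from cl_ext.h. */
enum {
    SKETCH_LIST_END            = 0, /* role of CL_PROPERTIES_LIST_END_EXT       */
    SKETCH_PARTITION_EQUALLY   = 1, /* role of CL_DEVICE_PARTITION_EQUALLY_EXT   */
    SKETCH_PARTITION_BY_COUNTS = 2  /* role of CL_DEVICE_PARTITION_BY_COUNTS_EXT */
};

/* Number of cores covered by all sub-devices of a partition scheme.
 * Returns 0 for an unrecognised or infeasible scheme. */
unsigned cores_covered(const long *props, unsigned total_cores)
{
    assert(props != NULL);

    if (props[0] == SKETCH_PARTITION_EQUALLY) {
        long per_sub = props[1];
        if (per_sub <= 0 || (unsigned long)per_sub > total_cores)
            return 0;
        /* As many equal sub-devices as fit; leftover cores stay unused. */
        return (total_cores / (unsigned)per_sub) * (unsigned)per_sub;
    }

    if (props[0] == SKETCH_PARTITION_BY_COUNTS) {
        unsigned sum = 0;
        /* The counts list runs until the first zero terminator. */
        for (size_t i = 1; props[i] > 0; ++i)
            sum += (unsigned)props[i];
        return sum <= total_cores ? sum : 0;
    }

    return 0;
}

/* Utilization "top" would show with every covered core busy (100% each). */
unsigned expected_top_percent(const long *props, unsigned total_cores)
{
    return 100u * cores_covered(props, total_cores);
}
```

Under these assumptions, on 16 cores the one-core-per-sub-device partition would peak at 1600% and the {2, 3, 4} partition at 900%, compared with the observed 100% and 200% to 400%.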

 

I would expect the full CPU to be better utilized, since I enqueue roughly 300 to 500 kernels per queue, each at least a 256*256 multiplication. The only synchronization in this benchmark is at the end, where the host waits. I can see that all the kernels get enqueued before the host reaches that wait.

I populate a class "topology" with the sub-queues and so on. I understand it's a mix of C and C++, but I don't expect that to matter?

Does anyone have a toy example of device fission that utilizes all the compute units of the device with multiple sub-queues? I have attached the code that creates my command queues. I can send a test case if required. I am using SDK 2.4.

 

Thank You.

Perhaad

cl_context_properties cps[3] = {
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)(platforms[platform_touse]),
    0
};
cl_context_properties *cprops = cps;

topo->root_context = clCreateContextFromType(cprops, (cl_device_type)dtype, NULL, NULL, &status);
if (cl_errChk(status, "creating Root Context")) exit(1);

bool extcheck = check_for_extensions("cl_ext_device_fission", device_touse, topo->devices);
if (!extcheck) {
    printf("Device Fission Not Supported");
    exit(1);
}

// Initialize required partition property - I have tried both the different property sets.
// Set 1: equal partitions of 1 compute unit each.
cl_device_partition_property_ext partitionPrty[3] = {
    CL_DEVICE_PARTITION_EQUALLY_EXT, 1,
    CL_PROPERTIES_LIST_END_EXT
};
// Set 2: sub-devices of 2, 3 and 4 compute units (swap in for set 1 to test).
// cl_device_partition_property_ext partitionPrty[6] = {
//     CL_DEVICE_PARTITION_BY_COUNTS_EXT, 2, 3, 4,
//     CL_PARTITION_BY_COUNTS_LIST_END_EXT,
//     CL_PROPERTIES_LIST_END_EXT
// };

// Initialize clCreateSubDevicesEXT function pointer
INIT_CL_EXT_FCN_PTR(clCreateSubDevicesEXT);
printf("no of subdevices %d\n", topo->numSubDevices);
topo->root_device = topo->devices[device_touse];

// Get number of sub-devices
status = pfn_clCreateSubDevicesEXT(topo->devices[device_touse], partitionPrty, 0, NULL, &topo->numSubDevices);
if (cl_errChk(status, "clCreateSubDevicesEXT failed.")) exit(1);
printf("no of subdevices %d\n", topo->numSubDevices);

topo->subDevices = (cl_device_id *)malloc((topo->numSubDevices) * sizeof(cl_device_id));
topo->subQueue = (cl_command_queue *)malloc((topo->numSubDevices) * sizeof(cl_command_queue));
if (NULL == (topo->subDevices))
    printf("Failed to allocate memory(subDevices)");

// Second call fills in the sub-device IDs
status = pfn_clCreateSubDevicesEXT(topo->devices[device_touse], partitionPrty, topo->numSubDevices, topo->subDevices, NULL);
if (cl_errChk(status, "clCreateSubDevicesEXT failed.")) exit(1);

// Create context for sub-devices
topo->subContext = clCreateContext(cps, topo->numSubDevices, topo->subDevices, NULL, NULL, &status);
if (cl_errChk(status, "clCreateContext for subdevices failed.")) exit(1);
printf("Creating contexts for subdevices\n");

// One in-order command queue per sub-device (properties bitfield is 0, not NULL)
for (int i = 0; i < (topo->numSubDevices); i++) {
    printf("Init Sub-queue \t %d\n", i);
    topo->subQueue[i] = clCreateCommandQueue(topo->subContext, topo->subDevices[i], 0, &status);
    cl_errChk(status, "clCreateCommandQueue for subdevices failed.");
}
