Device fission performance

Discussion created by philips on Aug 16, 2010
Latest reply on Aug 17, 2010 by philips
why so slow?


I tried using device fission to speed up my program, but instead it takes a big performance hit.


My raycasting algorithm runs at 19 FPS on a machine with two 4-core CPUs exposed as a single OpenCL device (8 physical cores plus Hyper-Threading). Every core is at 90 to 95% load.

Since the cores basically work on random work-items (in this case rays), the caches are not used efficiently.

The goal was to split the CPUs into single cores and have each core work on a column of rays. For testing purposes, though, I started with two sub-devices.



I split the CPUs into two sub-devices (CL_DEVICE_PARTITION_EQUALLY_EXT, 8, ...)

One sub-device works on the first half of all work-groups and the second sub-device on the rest, so each renders half the image.

To do this, my kernel has an int parameter for the offset. Every frame I use kernel.setArg for all parameters and launch the kernel on the first sub-device. Then I use kernel.setArg to change only the offset parameter and launch it on the second sub-device.
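
A minimal sketch of the offset arithmetic behind this split, in plain C++ with no OpenCL calls. The names `GroupRange` and `splitWorkGroups` are hypothetical and not part of my actual program; the sketch only assumes that each sub-device gets one contiguous block of work-groups and that the block's starting index is what gets passed as the kernel's offset argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: the contiguous block of work-groups
// assigned to one sub-device.
struct GroupRange {
    std::size_t offset; // first work-group index (the kernel's offset argument)
    std::size_t count;  // number of work-groups in this block
};

// Splits totalGroups work-groups into numSubDevices contiguous blocks.
// If the division is not exact, the first (totalGroups % numSubDevices)
// blocks each get one extra work-group.
std::vector<GroupRange> splitWorkGroups(std::size_t totalGroups,
                                        std::size_t numSubDevices) {
    assert(numSubDevices > 0);
    std::vector<GroupRange> ranges;
    const std::size_t base = totalGroups / numSubDevices;
    const std::size_t extra = totalGroups % numSubDevices;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < numSubDevices; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges.push_back({offset, count});
        offset += count;
    }
    return ranges;
}
```

With two sub-devices, `splitWorkGroups(totalGroups, 2)` yields the two halves described above: the first entry's offset goes to the launch on the first sub-device, the second entry's offset to the launch on the second.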


Doing this, I get only 7 FPS, and the cores are at only about 25% load.


If I split the device into 8 sub-devices, I get only 2 FPS and 7% load.



Now I am wondering why that is.


Any ideas?