I tried using device fission to speed up my program, but it causes a big performance hit instead.
My raycasting algorithm runs at 19 FPS on a machine with two 4-core CPUs exposed as a single OpenCL device (8 real cores + Hyper-Threading). Every core is at 90 to 95% load.
Since the cores basically work on random work-items (in this case rays), the caches are not used efficiently.
The goal was to split the CPUs up into single cores and have each core work on a column of rays. For testing purposes, though, I started with two sub-devices.
I split the CPUs into two sub-devices (CL_DEVICE_PARTITION_EQUALLY_EXT, 8, ...).
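For reference, the arithmetic behind that partition, as a minimal sketch. It assumes the device reports 16 compute units (8 cores with Hyper-Threading), which I have not stated above, and that equal partitioning drops any compute units left over when the division is not exact. subDeviceCount is a hypothetical helper, not an OpenCL call.

```cpp
#include <cstddef>

// Hypothetical helper: how many sub-devices an equal partition yields when
// each sub-device gets unitsPerSubDevice compute units. Leftover compute
// units (the remainder of the division) are assumed to stay unused.
std::size_t subDeviceCount(std::size_t computeUnits, std::size_t unitsPerSubDevice) {
    return unitsPerSubDevice == 0 ? 0 : computeUnits / unitsPerSubDevice;
}
```

Under that assumption, a value of 8 gives two sub-devices of 8 compute units each, and a value of 2 gives eight sub-devices of 2 compute units each.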
One sub-device works on the first half of all work-groups and the second sub-device on the rest, so each renders half the image.
To do this, my kernel has an int parameter for the offset. Every frame I call kernel.setArg for all parameters and launch the kernel on the first sub-device. Then I call kernel.setArg again to change only the offset parameter and launch the kernel on the second sub-device.
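The way the offsets are derived can be sketched roughly like this. It is a simplified, host-side illustration only, not my actual code: GroupRange and splitWorkGroups are hypothetical names, and it assumes the offset is counted in work-groups and that the ranges are contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: the range of work-groups handed to one sub-device.
struct GroupRange {
    std::size_t offset; // index of the first work-group (the value set as the kernel's offset argument)
    std::size_t count;  // number of work-groups launched on this sub-device
};

// Split totalGroups work-groups into `parts` contiguous ranges, one per sub-device.
// Any remainder is spread over the first ranges, so sizes differ by at most one.
std::vector<GroupRange> splitWorkGroups(std::size_t totalGroups, std::size_t parts) {
    assert(parts > 0);
    std::vector<GroupRange> ranges;
    ranges.reserve(parts);
    const std::size_t base = totalGroups / parts;
    const std::size_t extra = totalGroups % parts;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges.push_back({offset, count});
        offset += count;
    }
    return ranges;
}
```

For each range, the host sets the offset argument and enqueues `count` work-groups on the queue that belongs to that sub-device. With two sub-devices, that is one launch per half of the image.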
Doing this, I only get 7 FPS and the cores are only at about 25% load.
If I split the device into 8 sub-devices, I only get 2 FPS and 7% load.
Now I am wondering why that is.