My raytracer supports hybrid GPU-CPU execution.
When running with only the CPU, I can vary the number of threads using the CPU_MAX_COMPUTE_UNITS variable. I see that performance is scalableish using up to 32 Interlagos 6272 cores. At 16 cores, I see a 10x speedup, at 32 I see 19x.
The machine I'm running this on has 32 6272 cores and 3 Radeon 7970s. On the 7970s, performance scales virtually linearly with the number of GPUs.
When I hybridize the resources, I see modest performance improvement when using 16 cores + the 3 GPUs over just the 3 GPUs. However, the moment I increase CPU_MAX_COMPUTE_UNITS to 17 or higher, performance drops like a colorful euphemism and the overall algorithm is significantly slower than using just the 3 GPUs.
I'm trying to account for why this might be the case. It appears for every CPU core, APP spawns 2 + 2 threads and another 2 for every GPU. Thus, with 3 GPUs and 32 cores, I'm seeing 73 threads. I suspect I'm getting contention issues that delay the GPUs from executing. How would I go about checking this? Also, why are there 2 threads per core instead of 1? I can understand having a few extra for all the cores to do asynchronous copies and such, but 2 per core sounds excessive.