I thought I would try out some of the new OpenCL 2.0 workgroup functions.
Comparing the performance of work_group_scan_inclusive_add against my home-grown prefix scan, I found that
work_group_scan_inclusive_add led to less work-item divergence, but used up 10 more VGPRs.
My own scan, using local memory, led to more divergence but no increase in VGPR usage.
Overall, work_group_scan_inclusive_add was faster. But is there a way for this built-in
to reuse existing registers instead of increasing register pressure?
Thanks,
Aaron
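
For readers following along: the home-grown kernel is not shown in this thread, so the following is only a rough sketch of what a local-memory prefix scan of this kind often looks like, assuming a Hillis-Steele scheme. It is written as a host-side C emulation so it can be run without a device. WG_SIZE, scan_reference and scan_local_memory are hypothetical names made up for this illustration. The reference function shows the result each work-item would get from work_group_scan_inclusive_add in a 1D work-group (the sum of the values from work-items 0 through its own local id). The branch on lid >= offset is one plausible source of the extra divergence mentioned above, although a compiler may turn it into a select.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WG_SIZE 8 /* hypothetical work-group size, chosen only for this sketch */

/* Reference result: the value work-item lid would receive from
 * work_group_scan_inclusive_add, i.e. the sum of inputs 0..lid. */
static void scan_reference(const int *in, int *out, size_t n)
{
    int running = 0;
    for (size_t lid = 0; lid < n; ++lid) {
        running += in[lid];
        out[lid] = running;
    }
}

/* Host-side emulation of a Hillis-Steele inclusive scan over "local memory".
 * Each pass of the inner loop stands in for every work-item executing one
 * step between two barrier(CLK_LOCAL_MEM_FENCE) calls in a real kernel. */
static void scan_local_memory(const int *in, int *out, size_t n)
{
    int buf_a[WG_SIZE], buf_b[WG_SIZE];
    int *src = buf_a, *dst = buf_b;

    assert(n <= WG_SIZE);
    memcpy(src, in, n * sizeof(int));

    for (size_t offset = 1; offset < n; offset *= 2) {
        for (size_t lid = 0; lid < n; ++lid) {
            /* Work-items with lid < offset only copy their value; the
             * others also load a neighbour and add it. */
            if (lid >= offset)
                dst[lid] = src[lid] + src[lid - offset];
            else
                dst[lid] = src[lid];
        }
        /* Double-buffer swap so the next step reads this step's output. */
        int *tmp = src;
        src = dst;
        dst = tmp;
    }
    memcpy(out, src, n * sizeof(int));
}
```

In a real kernel the double buffer would live in __local memory and the built-in version would replace the whole loop with a single call; where the built-in keeps its intermediate values is up to the compiler, which is what the question above is about.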
Hi Aaron,
If you are asking about performance or optimization hints that tell the compiler how to control register usage, there is no such flag at the moment.
Regards,
Thanks. My question is more: is it possible to use existing registers for this built-in function? It seems to allocate its own set of registers.
I don't think it's possible. Still, I'll check with the compiler team.
Regards,
Thanks for checking.
At this point, there is no control over how many registers this built-in function uses, nor over which ones. Optimizing register usage is a never-ending task for the compiler team, and I hope it will get better over time.
Regards,