Questions on performance of work-item built-in functions. I'm talking about these:
get_work_dim | Number of dimensions in use
get_global_size | Number of global work items
get_global_id | Global work item ID value
get_local_size | Number of local work items
get_local_id | Local work item ID
get_num_groups | Number of work groups
get_group_id | Work group ID
Are these real function calls, or more like constants? I suspect the latter. I might typically write something like this for a starting location:
int offset = get_group_id(0)*myitemsize*get_local_size(0);
and initialize other similar offset variables for other things, also using the same built-ins. Or perhaps something like:
int x = get_global_id(0);
If I never change x or offset, is there a performance benefit to defining these variables in the first place? How are registers impacted? I was thinking of changing the above to something like this for readability, but I don't know the impact:
#define x get_global_id(0)
gpgpucoder wrote:
I might typically write something like this for a starting location:
int offset = get_group_id(0)*myitemsize*get_local_size(0);
and initialize other similar offset variables for other things, also using the same built-ins. Or perhaps something like:
int x = get_global_id(0);
Leaving aside that you won't measure any performance difference either way, as long as you only call each of them once...
It is my understanding that get_global_id(d) is in fact a register populated by the dispatcher before the kernel runs its first instruction. Feel free to rename it as you like; the compiler will happily detect this and remove the redundancy. I would advise against calling it x, though in some cases it is acceptable.
I haven't investigated the others but I'd expect them to come from registers as well.
Be sure to remember that int x = get_global_id(0) will most likely differ from get_group_id(0)*get_local_size(0)+get_local_id(0) because of the index offset at dispatch (I often get lazy about this one).
I think get_group_id(0) contains the base index too (the lower bound of the kernel domain). In the disasm, get_group_id(0) uses one more instruction than get_global_id(0); I guess that extra instruction divides the group-size-aligned lower bound by the group size.
Only get_local_id(0) is able to work directly from registers; all the others have to read extra data from memory (which contains all the info on the kernel domain).
In pure asm it is possible to calculate a 24-bit global id in one cycle with MAD24, but it is only 0-based.
Anyway, if the kernel is only a few instructions long, this is a big penalty; but for a kernel thousands of times larger, these extra reads and calculations don't matter.
Thank you for the insight. I didn't make it sufficiently clear that in some places I currently call those functions more than once, to initialize various indexes and offsets. So I wanted the best insight on how to streamline my usage so as not to impact registers too much. In brief experiments on some very simplified code, I've seen no difference between using the function and a variable. I will review the generated IR when I get a moment.