Hi everybody,
I'm developing a benchmark to estimate the completion time of integrated and discrete GPUs as a function of the number of operations executed per byte transferred.
The kernel is very simple and "useless". Simply put, each thread reads the same constant argument and adds this value to an accumulator variable a certain number of times.
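To make the description concrete, here is roughly what each work-item does, written as a plain C function rather than the actual OpenCL kernel. This is only a sketch: the name work_item, the float type and the iterations parameter are illustrative assumptions, not the real benchmark code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the body executed by one work-item.
 * Assumption: the accumulator is a float; the real kernel may use
 * another type. `value` plays the role of the constant argument that
 * every thread reads, `iterations` the number of additions performed. */
static float work_item(float value, size_t iterations)
{
    float acc = 0.0f;
    for (size_t i = 0; i < iterations; ++i) {
        acc += value;   /* one add per iteration, no extra memory traffic */
    }
    return acc;         /* the real kernel would store this to global memory */
}
```

Raising iterations increases the number of operations per byte transferred, which is the quantity the benchmark varies.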
What surprised me a little is how the kernel occupancy changes as I vary the global size and the work-group size.
In particular, I set the global size to 256K and the work-group size to 64. On the 7970 the occupancy is 100%. On the A8-3850 (Llano) the occupancy is 25%. If I double the work-group size (to 128), the occupancy of the integrated GPU becomes 50%.
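For reference, this is how many work-groups those two launch configurations produce. It is just the arithmetic, assuming "256K" means 262144 work-items and that the global size is an exact multiple of the work-group size; work_group_count is an illustrative helper, not part of any API.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-groups in a 1-D launch.
 * Assumption: global_size is an exact multiple of local_size. */
static size_t work_group_count(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

With a global size of 262144 this gives 4096 work-groups at a work-group size of 64 and 2048 at 128, so both devices get far more work-groups than they have compute units.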
Can you help me understand why this happens?