Hi everybody.

I'm developing a benchmark to estimate the completion time of integrated and discrete GPUs based on the number of operations executed per byte transferred.

The kernel is very simple and deliberately "useless": each thread reads the same constant argument and adds that value to an accumulator variable a certain number of times.

What surprises me a little is the kernel occupancy I observe when varying the global size and the work-group size.

In particular, I set the global size to 256K and the work-group size to 64. On the 7970 the occupancy is 100%. On the A8-3850 (Llano) the occupancy is 25%. If I double the work-group size (to 128), the occupancy of the integrated GPU becomes 50%.

Can you help me to understand why it is so?

Thank you!

Evergreen GPUs, like the one in Llano, were (I think) limited to a small number of work-groups per core: 8, maybe. SI is limited either by the number of wavefronts (40 per core) or by the number of barriers (8 per core). Note that barriers == work-groups whenever a group contains more than one wavefront, but a group with only a single wavefront uses 0 barriers, so in that case the barrier limit becomes irrelevant and you are wavefront-limited.

So my guess is that your kernel uses few resources per wave, and with 64 work-items per group you have one wave per group, giving you only 8 waves per core on Llano, which isn't enough to cover latency. Double the group size and you double that to 16, while on SI you always had a lot more waves in flight, so it didn't matter there.