I have the following CL code, running on my A6-3420M APU.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void sample1(__global int *out)
{
    __local char8 cur_data[16];   // 16 * sizeof(char8) = 128 bytes of LDS
    int localid = get_local_id(0);
    int localid_by_8 = (localid >> 3) << 1;
    char8 result = 0;
    result += cur_data[localid_by_8];
    result += cur_data[localid_by_8 + 1];
    out[get_global_id(0)] = result.s0 + result.s1 + result.s2 + result.s3
                          + result.s4 + result.s5 + result.s6 + result.s7;
}
Global work size = (4992, 43)
Local work size = (64,1)
The above code compiles to 20 VLIW instructions. When profiled, the kernel takes around 300 microseconds.
Average number of VLIW stream-core cycles per work-item = (total time consumed * number of stream cores * GPU clock frequency) / global work size
For the A6-3420M APU, the number of stream cores is 64 and the GPU clock frequency is 400 MHz.
In this case,
Average number of VLIW stream-core cycles per work-item = (300 µs * 64 * 400 MHz) / (4992 * 43) ≈ 35.78 cycles
So the 20 VLIW instructions are taking around 35 cycles, which means about 15 cycles are wasted in stalls. But when I profile with the AMD APP Profiler, it shows FetchUnitStalled = 0, WriteUnitStalled = 0, PathUtilization = 100 and LDSBankConflict = 0, which suggests there are no stalls in memory access.
Also, the kernel uses only two GPRs, and its local memory requirement is just 128 bytes. So I would assume the GPU can run the maximum number of wavefronts in parallel to hide any latency. Yet ALUBusy is only 64%.
Why is ALUBusy so low? Where do the 15 stall cycles come from?
I have attached the CL code (sample1.cl), the ISA disassembly (sample1.isa) and the profiler CSV file (sample1.csv) for reference.
Thanks and regards,