
Understanding Profiler numbers

Question asked by j_karthic on Mar 21, 2012

Hi,

 

I have the following CL code, running on my A6-3420M APU.

 

__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void sample1(__global int *out)
{
   __local char8 cur_data[16];
   int localid = get_local_id(0);
   int localid_by_8 = (localid >> 3) << 1;
   char8 result = 0;

   result += cur_data[localid_by_8];
   result += cur_data[localid_by_8 + 1];

   out[get_global_id(0)] = result.s0 + result.s1 + result.s2 + result.s3
      + result.s4 + result.s5 + result.s6 + result.s7;
}
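To make the local-memory access pattern of the kernel easier to see, here is a minimal host-side Python sketch that mirrors the expression `localid_by_8 = (localid >> 3) << 1`. It is not part of the attached files, and the helper name `lds_indices` is made up for illustration.

```python
def lds_indices(local_id):
    """Return the two cur_data indices one work-item reads (mirrors the kernel)."""
    base = (local_id >> 3) << 1  # same expression as localid_by_8 in the kernel
    return base, base + 1


if __name__ == "__main__":
    # Print the index pair for the first work-item of each group of 8.
    for lid in range(0, 64, 8):
        print(lid, lds_indices(lid))
```

Under this mapping, every run of 8 consecutive work-items reads the same pair of `char8` elements, and the 64 work-items of a work-group together touch all 16 elements of `cur_data`.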

 

Other details:

Global work size = (4992, 43)

Local work size = (64,1)
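For reference, the launch geometry implied by these sizes can be worked out with a short Python sketch. The function name `launch_geometry` is hypothetical, and the remark about wavefronts assumes a wavefront size of 64, which is not stated in the post.

```python
GLOBAL_SIZE = (4992, 43)  # from the post
LOCAL_SIZE = (64, 1)      # from the post


def launch_geometry(global_size, local_size):
    """Return (total work-items, total work-groups) for an NDRange launch."""
    work_items = 1
    work_groups = 1
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError("global size must be a multiple of local size")
        work_items *= g
        work_groups *= g // l
    return work_items, work_groups


if __name__ == "__main__":
    print(launch_geometry(GLOBAL_SIZE, LOCAL_SIZE))
```

This gives 214656 work-items in 3354 work-groups. If the wavefront size is 64 (an assumption), each work-group is exactly one wavefront.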

 

The kernel above compiles to 20 VLIW instructions. When profiled, it takes around 300 microseconds.

 

Average number of VLIW stream-core cycles per workitem = (Total time consumed * Number of stream cores * GPU Clock frequency) / Global work size

For the A6-3420M APU, number of stream cores = 64 and GPU clock frequency = 400 MHz.

In this case,

Average number of VLIW stream-core cycles per workitem = (300 µs * 64 * 400 MHz) / (4992 * 43) ≈ 35.78 cycles
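The same estimate, written out as a small Python sketch so the units are explicit. The function name `cycles_per_work_item` is made up for illustration; the inputs are the figures quoted in the post.

```python
def cycles_per_work_item(time_s, stream_cores, clock_hz, global_size):
    """Average stream-core cycles available per work-item.

    (kernel time * number of stream cores * clock frequency) / total work-items
    """
    work_items = 1
    for dim in global_size:
        work_items *= dim
    return time_s * stream_cores * clock_hz / work_items


if __name__ == "__main__":
    # 300 microseconds, 64 stream cores, 400 MHz, global size (4992, 43)
    print(round(cycles_per_work_item(300e-6, 64, 400e6, (4992, 43)), 2))
```

With the numbers from the post this comes out at roughly 35.78 cycles per work-item.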

 

So the 20 VLIW instructions take around 35 cycles, which means about 15 cycles are lost to stalls. But the AMD APP Profiler reports FetchUnitStalled = 0, WriteUnitStalled = 0, PathUtilization = 100 and LDSBankConflict = 0, which suggests there are no stalls in memory access.
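The stall estimate can be expressed as a short Python sketch. It assumes, as the reasoning in this post does, that each VLIW instruction would ideally issue in one cycle per work-item; the function name `stall_budget` is hypothetical.

```python
def stall_budget(cycles_per_item, vliw_instructions):
    """Return (unexplained cycles, fraction of cycles spent issuing instructions).

    Assumes one VLIW instruction per cycle in the ideal case.
    """
    stall_cycles = cycles_per_item - vliw_instructions
    issue_fraction = vliw_instructions / cycles_per_item
    return stall_cycles, issue_fraction


if __name__ == "__main__":
    stall, fraction = stall_budget(35.78, 20)
    print(round(stall, 2), round(fraction, 3))
```

Under that assumption, about 15.8 of the roughly 35.8 cycles are unaccounted for, and the instructions occupy about 56% of the available cycles.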

 

Also, the kernel uses only two GPRs, and its local memory requirement is just 128 bytes. So I would assume the GPU can run the maximum number of wavefronts in parallel to hide any latencies. But ALUBusy is still just 64%.

 

Why is ALUBusy so low? Where do the 15 stall cycles come from?

 

I have attached the CL code (sample1.cl), the ISA disassembly (sample1.isa) and the profiler CSV file (sample1.csv) for reference.

 

Thanks and regards,

Karthick
