0 Replies Latest reply on Mar 21, 2012 9:09 AM by j_karthic

    Understanding Profiler numbers

    j_karthic

      Hi,

       

      I have the following CL code, running on my A6-3420M APU.

       

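      __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
      void sample1(__global int *out)
      {
         // 16 x char8 = 128 bytes of local memory.
         // Note: cur_data is never written before it is read in this test kernel.
         __local char8 cur_data[16];

         int localid = get_local_id(0);
         // Each group of 8 work-items shares one even index: 0, 2, ..., 14.
         int localid_by_8 = (localid >> 3) << 1;

         char8 result = 0;
         result += cur_data[localid_by_8];
         result += cur_data[localid_by_8 + 1];

         // Sum the eight char lanes into a single int per work-item.
         out[get_global_id(0)] = result.s0 + result.s1 + result.s2 + result.s3
                               + result.s4 + result.s5 + result.s6 + result.s7;
      }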
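      To make the local-memory access pattern of the kernel easier to see, here is a minimal Python sketch that emulates only the index arithmetic for one 64-item work-group. The helper name `lds_indices` is hypothetical and not part of the kernel; the sketch says nothing about how the hardware actually services these reads.

```python
# Emulates the index arithmetic of sample1 for a single work-group.
WORK_GROUP_SIZE = 64  # from reqd_work_group_size(64, 1, 1)


def lds_indices(local_id):
    """Return the two cur_data indices read by one work-item (hypothetical helper)."""
    base = (local_id >> 3) << 1  # same expression as localid_by_8 in the kernel
    return base, base + 1


mapping = {lid: lds_indices(lid) for lid in range(WORK_GROUP_SIZE)}
distinct = sorted({i for pair in mapping.values() for i in pair})

# Work-items 0..7 share one pair, 8..15 the next pair, and so on.
print(mapping[0], mapping[7], mapping[8], mapping[63])
print(distinct)
```

      Every run of 8 consecutive work-items reads the same pair of `char8` elements, and the 8 runs together touch all 16 elements of `cur_data`.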

       

      Other details:

      Global work size = (4992, 43)

      Local work size = (64,1)
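      For reference, the launch dimensions above work out as follows. This is only a small Python sketch of the arithmetic; the variable names are illustrative.

```python
# Launch geometry taken from the post.
global_size = (4992, 43)
local_size = (64, 1)

# Total work-items in the NDRange.
work_items = global_size[0] * global_size[1]

# Number of work-groups; both dimensions divide evenly.
work_groups = (global_size[0] // local_size[0]) * (global_size[1] // local_size[1])

print(work_items, work_groups)
```

      So the launch has 214656 work-items in 3354 work-groups of 64 work-items each.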

       

      The above code compiles to 20 VLIW instructions. When profiled, the kernel takes around 300 microseconds.

       

      Average number of VLIW stream-core cycles per workitem = (Total time consumed * Number of stream cores * GPU Clock frequency) / Global work size

      For the A6-3420M APU, number of stream cores = 64 and GPU clock frequency = 400 MHz.

      In this case,

      Average number of VLIW stream-core cycles per workitem = (300 µs * 64 * 400 MHz) / (4992 * 43) ≈ 35.78 cycles
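      The estimate above can be reproduced with a few lines of Python. This is just a sketch of the formula as stated in the post; the function name `cycles_per_work_item` is hypothetical, and the inputs (300 µs, 64 stream cores, 400 MHz) are the figures quoted above.

```python
def cycles_per_work_item(time_s, stream_cores, clock_hz, work_items):
    """Average stream-core cycles available per work-item during the run."""
    return time_s * stream_cores * clock_hz / work_items


cycles = cycles_per_work_item(
    time_s=300e-6,         # measured kernel time, about 300 microseconds
    stream_cores=64,       # VLIW stream cores, as stated in the post
    clock_hz=400e6,        # 400 MHz GPU clock
    work_items=4992 * 43,  # global work size
)
print(round(cycles, 2))
```

      The result is roughly 35.8 cycles per work-item. Since the 300 µs figure is itself approximate, only the first couple of digits are meaningful.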

       

      So the 20 VLIW instructions take around 35 cycles, which suggests that about 15 cycles are lost to stalls. However, the AMD APP Profiler reports FetchUnitStalled = 0, WriteUnitStalled = 0, PathUtilization = 100 and LDSBankConflict = 0, which suggests there are no stalls in memory access.

       

      Also, the kernel uses only two GPRs, and its local memory requirement is just 128 bytes. So I would assume the GPU can run the maximum number of wavefronts in parallel to hide any latencies. Still, ALUBusy is just 64%.
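      As a sanity check on the numbers quoted so far, the following Python sketch recomputes the local memory footprint, the implied stall cycles, and the ratio of issued VLIW instructions to estimated cycles. It only restates the arithmetic from the post. The variable names are illustrative, and the ratio is not claimed to be the same quantity the profiler reports as ALUBusy.

```python
# Figures taken from the post.
VLIW_INSTRUCTIONS = 20
TIME_S = 300e-6
STREAM_CORES = 64
CLOCK_HZ = 400e6
WORK_ITEMS = 4992 * 43

# __local char8 cur_data[16]: 16 elements of 8 one-byte lanes each.
lds_bytes = 16 * 8

# Estimated cycles per work-item, then what is left after the 20 instructions.
cycles = TIME_S * STREAM_CORES * CLOCK_HZ / WORK_ITEMS
stall_cycles = cycles - VLIW_INSTRUCTIONS

# Fraction of the estimated cycles spent issuing VLIW instructions.
issue_ratio = VLIW_INSTRUCTIONS / cycles

print(lds_bytes, round(stall_cycles, 2), round(issue_ratio, 3))
```

      By this estimate about 56% of the cycles issue an instruction, which is in the same range as, but not equal to, the 64% reported for ALUBusy.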

       

      Why is ALUBusy so low? Where do the 15 stall cycles come from?

       

      I have attached the CL code (sample1.cl), the ISA disassembly (sample1.isa) and the profiler CSV file (sample1.csv) for reference.

       

      Thanks and regards,

      Karthick