4 Replies Latest reply on Jan 16, 2013 12:31 AM by buqchucker

    Bugs in clAmdBlas/gemm tuning runs?

    buqchucker

      Hello everybody,

       

       

      In order to achieve the best possible performance, I tried to tune the sgemm

      routines in clAmdBlas-1.8.291 on Linux (Fedora 17, kernel 3.6.10-2.fc17, x86_64)

      for an AMD FirePro W7000 card.

       

      It seems that the tuning run tries to build new OpenCL kernels

      with different sets of parameters.

       

      All these builds fail with errors due to wrong types and missing variables:

       

      AN INTERNAL KERNEL BUILD ERROR OCCURRED!

      device name = Pitcairn
      error = -11
      memory pattern = Cached global memory based block gemm, computing kernel generator
      Subproblem dimensions:
      dims[0].itemY = 16, dims[0].itemX = 128, dims[0].y = 16, dims[0].x = 128, dims[0].bwidth = 4; ;
      dims[1].itemY = 8, dims[1].itemX = 4, dims[1].y = 8, dims[1].x = 4, dims[1].bwidth = 4; ;
      Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 32, pgran->wgSize[1] = 2, pgran->wfSize = 64
      Kernel extra flags: 3145728

       

      Source:

       

      typedef union GPtr {

      __global float *f;

      __global float2 *f2v;

      __global float4 *f4v;

      __global float8 *f8v;

      __global float16 *f16v;

      } GPtr;


      typedef union LPtr {

      __local float *f;

      __local float2 *f2v;

      __local float4 *f4v;

      __local float8 *f8v;

      __local float16 *f16v;

      } LPtr;


      typedef union PPtr {

      float *f;

      float2 *f2v;

      float4 *f4v;

      float8 *f8v;

      float16 *f16v;

      } PPtr;


      __attribute__((reqd_work_group_size(32, 2, 1)))

      void __kernel sgemmBlock(     uint M,     uint N,     uint K,

      const float alpha,     const float beta,

      const __global  *restrict A,     const __global  *restrict B,     __global  *C,

      uint lda,     uint ldb,     uint ldc) {

      float4 a0, a1, a2, a3, a4, a5, a6, a7;

      float4 b0, b1, b2, b3;

      float4 c0, c1, c2, c3, c4, c5, c6, c7;

      uint4 coord = 0u; /* contains coordB, coordA, k */


      ...


      /* ---------------------- */     }

      GPtr uC;


      uC.f = C + coord.y * ldc + coord.x;


      __global  *pC = uC.f0v;

      float4 tempC0, tempC1, tempC2, tempC3, tempC4, tempC5, tempC6, tempC7;

      tempC0 = pC[0];


      ...


      }


      --------------------------------------------------------


      Build log:

      "/tmp/OCLZMGC08.cl", line 33: warning: explicit type is missing ("int" assumed)
          const __global  *restrict A,

      "/tmp/OCLZMGC08.cl", line 34: warning: explicit type is missing ("int" assumed)
          const __global  *restrict B,

      "/tmp/OCLZMGC08.cl", line 35: warning: explicit type is missing ("int" assumed)
          __global  *C,

      "/tmp/OCLZMGC08.cl", line 167: warning: explicit type is missing ("int" assumed)
          __global  *pC = uC.f0v;

      "/tmp/OCLZMGC08.cl", line 167: error: union "GPtr" has no field "f0v"
          __global  *pC = uC.f0v;

      "/tmp/OCLZMGC08.cl", line 219: error: a value of type "float4" cannot be assigned to an entity of type "int"
          pC[0] = tempC0;
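
      For reference, the "error = -11" in the message above corresponds to CL_BUILD_PROGRAM_FAILURE in the OpenCL headers. Below is a minimal sketch of a lookup helper for a few common codes. The function name cl_error_name is my own (hypothetical, not part of clAmdBlas or OpenCL), and the numeric values are hard-coded from CL/cl.h so that the snippet compiles without the SDK:

```c
#include <stddef.h>

/* Hypothetical helper: maps a few OpenCL status codes to their names.
 * The numeric values are those defined in CL/cl.h; they are written out
 * here so the sketch builds without the OpenCL headers installed. */
const char *cl_error_name(int err)
{
    switch (err) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    default:  return "UNKNOWN_CL_ERROR"; /* code not covered by this sketch */
    }
}
```

      Calling cl_error_name(-11) returns "CL_BUILD_PROGRAM_FAILURE", i.e. the OpenCL compiler rejected the generated source, which matches the build log.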

       

       

      It is easy to see that this generated code is not correct: the element type

      is missing from the pointer declarations (so the compiler assumes "int"), and

      the union GPtr has no member f0v, which leads to the errors above.
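
      For comparison, this is roughly what I would expect the failing lines to look like. It is only a sketch under the assumption that float4 is the intended element type (tempC0 is a float4 and is assigned from pC[0]); I have not checked the generator itself. To keep it compilable as plain C on the host, __global is replaced by an empty GLOBAL macro and float4 by a small struct, and the helper name load_c is made up:

```c
/* Host-side stand-ins so the fragment compiles as plain C.
 * In OpenCL C, __global and float4 are built in. */
#define GLOBAL /* stands in for __global */
typedef struct { float x, y, z, w; } float4;

/* Reduced version of the GPtr union from the generated source. */
typedef union GPtr {
    GLOBAL float  *f;
    GLOBAL float4 *f4v;
} GPtr;

/* Hypothetical helper mirroring the failing lines: computes the address
 * of C at (row, col) and reads one float4 from it. */
float4 load_c(GLOBAL float *C, unsigned row, unsigned col, unsigned ldc)
{
    GPtr uC;

    uC.f = C + row * ldc + col;

    /* Assumption: explicit float4 type and the existing member f4v,
     * instead of the missing type and the non-existent member f0v. */
    GLOBAL float4 *pC = uC.f4v;

    return pC[0];
}
```

      With an explicit element type and an existing union member, both the "int assumed" warnings and the two errors from the build log would go away for these lines.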

       

       

      This has been tested on a Pitcairn card (W7000) with driver 9.003.3-121120a-151130C

      and AMDAPP-SDK v2.8, as well as on APUs, e.g. A10-4600M & Radeon HD 7660G (AMDAPP-SDK v2.8,

      driver 9.002-120928m-148573C-ATI).

       

       

      Is this a known issue?

       

       

      Regards,

       

      buqchucker