Archives Discussions

buqchucker · ‎01-06-2013

Hello everybody,

In order to achieve the best possible performance, I tried to tune the sgemm routines in clAmdBlas-1.8.291 on Linux (Fedora 17, kernel 3.6.10-2.fc17, x86_64) for an AMD FirePro W7000 card.

It seems that the tuning routine tries to build new OpenCL kernels with different sets of parameters.

All of these builds fail with errors caused by missing types and a missing union member:

AN INTERNAL KERNEL BUILD ERROR OCCURRED!

device name = Pitcairn

error = -11

memory pattern = Cached global memory based block gemm, computing kernel generator

Subproblem dimensions: dims[0].itemY = 16, dims[0].itemX = 128, dims[0].y = 16, dims[0].x = 128, dims[0].bwidth = 4; ; dims[1].itemY = 8, dims[1].itemX = 4, dims[1].y = 8, dims[1].x = 4, dims[1].bwidth = 4; ;

Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 32, pgran->wgSize[1] = 2, pgran->wfSize = 64

Kernel extra flags: 3145728
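For readers decoding the message above: in the OpenCL headers, -11 is the value of CL_BUILD_PROGRAM_FAILURE, meaning the OpenCL compiler rejected the kernel source. The following is a minimal lookup sketch in C. The helper name cl_status_name is hypothetical and not part of clAmdBlas or OpenCL; only the numeric values are taken from CL/cl.h, and only a few build-related codes are covered.

```c
/* Hypothetical helper (not part of OpenCL or clAmdBlas): returns the
   symbolic name of a few build-related OpenCL status codes.
   The numeric values are those defined in CL/cl.h. */
const char *cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -43: return "CL_INVALID_BUILD_OPTIONS";
    case -44: return "CL_INVALID_PROGRAM";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    default:  return "(not covered by this sketch)";
    }
}
```

Calling cl_status_name(-11) yields the string CL_BUILD_PROGRAM_FAILURE, which matches the build errors reported for every generated kernel.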

Source:


<pre>
<code>
typedef union GPtr {
    __global float *f;
    __global float2 *f2v;
    __global float4 *f4v;
    __global float8 *f8v;
    __global float16 *f16v;
} GPtr;

typedef union LPtr {
    __local float *f;
    __local float2 *f2v;
    __local float4 *f4v;
    __local float8 *f8v;
    __local float16 *f16v;
} LPtr;

typedef union PPtr {
    float *f;
    float2 *f2v;
    float4 *f4v;
    float8 *f8v;
    float16 *f16v;
} PPtr;

__attribute__((reqd_work_group_size(32, 2, 1)))
void __kernel sgemmBlock(
    uint M,
    uint N,
    uint K,
    const float alpha,
    const float beta,
    const __global  *restrict A,
    const __global  *restrict B,
    __global  *C,
    uint lda,
    uint ldb,
    uint ldc) {
    float4 a0, a1, a2, a3, a4, a5, a6, a7;
    float4 b0, b1, b2, b3;
    float4 c0, c1, c2, c3, c4, c5, c6, c7;
    uint4 coord = 0u; /* contains coordB, coordA, k */

    ...

    /* ---------------------- */
    }
    GPtr uC;

    uC.f = C + coord.y * ldc + coord.x;

    __global  *pC = uC.f0v;
    float4 tempC0, tempC1, tempC2, tempC3, tempC4, tempC5, tempC6, tempC7;
    tempC0 = pC[0];

    ...

}

--------------------------------------------------------

Build log:
"/tmp/OCLZMGC08.cl", line 33: warning: explicit type is missing ("int" assumed)
      const __global  *restrict A,
"/tmp/OCLZMGC08.cl", line 34: warning: explicit type is missing ("int" assumed)
      const __global  *restrict B,
"/tmp/OCLZMGC08.cl", line 35: warning: explicit type is missing ("int" assumed)
      __global  *C,
"/tmp/OCLZMGC08.cl", line 167: warning: explicit type is missing ("int" assumed)
      __global  *pC = uC.f0v;
"/tmp/OCLZMGC08.cl", line 167: error: union "GPtr" has no field "f0v"
      __global  *pC = uC.f0v;
"/tmp/OCLZMGC08.cl", line 219: error: a value of type "float4" cannot be assigned to an entity of type "int"
      pC[0] = tempC0;
</code>
</pre>

It is quickly apparent that this generated code is not correct: the union member f0v does not exist, and the pointer declarations lack a type, which leads to the resulting errors.

This has been tested on a Pitcairn card (W7000) with driver 9.003.3-121120a-151130C and AMDAPP-SDK v2.8, as well as on APUs, e.g. A10-4600M & Radeon HD 7660G (AMDAPP-SDK v2.8, driver 9.002-120928m-148573C-ATI).
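To illustrate the failing line: since tempC0 is a float4 and GPtr does have a member f4v of type __global float4 *, the generator presumably meant to emit something like "__global float4 *pC = uC.f4v;". That is only an inference from the build log, not confirmed by AMD. The sketch below emulates that access pattern in plain C on the host. The GLOBAL macro, the float4 struct, and the function load_c_tile are stand-ins invented for this illustration; in OpenCL C, __global and float4 are built in.

```c
/* Host-side stand-ins so the sketch compiles as plain C.
   In OpenCL C, __global and float4 are built in. */
#define GLOBAL /* stands in for __global */
typedef struct { float s[4]; } float4;

/* Same shape as the GPtr union in the generated source (trimmed). */
typedef union GPtr {
    GLOBAL float  *f;
    GLOBAL float4 *f4v;
} GPtr;

/* Hypothetical helper mirroring the generated statements
     uC.f = C + coord.y * ldc + coord.x;
     pC = uC.<member>;
     tempC0 = pC[0];
   with the type and member name filled in as assumed above. */
float4 load_c_tile(float *C, unsigned ldc, unsigned x, unsigned y)
{
    GPtr uC;
    uC.f = C + y * ldc + x;
    /* Assumed intent; the generator emitted "uC.f0v" with no type. */
    GLOBAL float4 *pC = uC.f4v;
    return pC[0];
}
```

With the type and member name in place, the later assignments between pC[0] and tempC0 are float4 to float4, so the "cannot be assigned to an entity of type int" error would no longer arise. The untyped A, B, and C kernel arguments would still need their types filled in by the generator.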

Is this a known issue?

Regards,

buqchucker

binying · ‎01-07-2013

Would you mind uploading a minimum test case?

developer · ‎01-08-2013

clAmdBlas comes with a tuner program called clAmdBlasTune. You just need to run this program for the SGEMM routine to reproduce the bug (in the configuration listed by buqchucker).

buqchucker · ‎01-08-2013

Yes indeed, I used clAmdBlasTune to do this.

binying was right, though: I did not state that explicitly.

buqchucker · ‎01-16-2013

bing


Bugs in clAmdBlas/gemm tuning runs?