Hello everybody,
in order to achieve the best possible performance, I tried to tune the sgemm
routines in clAmdBlas-1.8.291 on Linux (Fedora 17, kernel 3.6.10-2.fc17, x86_64)
for an AMD FirePro W7000 card.
During tuning, the routine appears to build new OpenCL kernels
with different sets of parameters.
All of these builds fail with errors caused by missing type specifiers and an undefined union member:
AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = Pitcairn
error = -11
memory pattern = Cached global memory based block gemm, computing kernel generator
Subproblem dimensions:
dims[0].itemY = 16, dims[0].itemX = 128, dims[0].y = 16, dims[0].x = 128, dims[0].bwidth = 4;
dims[1].itemY = 8, dims[1].itemX = 4, dims[1].y = 8, dims[1].x = 4, dims[1].bwidth = 4;
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 32, pgran->wgSize[1] = 2, pgran->wfSize = 64
Kernel extra flags: 3145728
Source:
<pre>
<code>
typedef union GPtr {
    __global float *f;
    __global float2 *f2v;
    __global float4 *f4v;
    __global float8 *f8v;
    __global float16 *f16v;
} GPtr;

typedef union LPtr {
    __local float *f;
    __local float2 *f2v;
    __local float4 *f4v;
    __local float8 *f8v;
    __local float16 *f16v;
} LPtr;

typedef union PPtr {
    float *f;
    float2 *f2v;
    float4 *f4v;
    float8 *f8v;
    float16 *f16v;
} PPtr;

__attribute__((reqd_work_group_size(32, 2, 1)))
void __kernel sgemmBlock( uint M, uint N, uint K,
    const float alpha, const float beta,
    const __global *restrict A, const __global *restrict B, __global *C,
    uint lda, uint ldb, uint ldc) {
    float4 a0, a1, a2, a3, a4, a5, a6, a7;
    float4 b0, b1, b2, b3;
    float4 c0, c1, c2, c3, c4, c5, c6, c7;
    uint4 coord = 0u; /* contains coordB, coordA, k */

    ...

    /* ---------------------- */ }
    GPtr uC;

    uC.f = C + coord.y * ldc + coord.x;

    __global *pC = uC.f0v;
    float4 tempC0, tempC1, tempC2, tempC3, tempC4, tempC5, tempC6, tempC7;
    tempC0 = pC[0];

    ...

}

--------------------------------------------------------

Build log:
"/tmp/OCLZMGC08.cl", line 33: warning: explicit type is missing ("int" assumed)
    const __global *restrict A,
"/tmp/OCLZMGC08.cl", line 34: warning: explicit type is missing ("int" assumed)
    const __global *restrict B,
"/tmp/OCLZMGC08.cl", line 35: warning: explicit type is missing ("int" assumed)
    __global *C,
"/tmp/OCLZMGC08.cl", line 167: warning: explicit type is missing ("int" assumed)
    __global *pC = uC.f0v;
"/tmp/OCLZMGC08.cl", line 167: error: union "GPtr" has no field "f0v"
    __global *pC = uC.f0v;
"/tmp/OCLZMGC08.cl", line 219: error: a value of type "float4" cannot be assigned to an entity of type "int"
    pC[0] = tempC0;
</code>
</pre>
A quick look shows that the generated code is not correct: the union member
f0v does not exist, and the pointer declarations lack a type, which causes the errors above.
This has been tested on a Pitcairn card (W7000) with driver 9.003.3-121120a-151130C and
AMDAPP-SDK v2.8, as well as on APUs, e.g. an A10-4600M with Radeon HD 7660G (AMDAPP-SDK v2.8,
driver 9.002-120928m-148573C-ATI).
Is this a known issue?
Regards,
buqchucker
Would you mind uploading a minimum test case?
clAmdBlas comes with a tuner program called clAmdBlasTune. You just need to run this program for the SGEMM routine to reproduce the bug (in the configuration listed by buqchucker).
Yes, indeed, I used clAmdBlasTune to do this.
binying was right, though; I did not state it explicitly.
bing