I have code that produces correct results when I test it on my Radeon HD7870.
It also compiles for GPUs in KernelAnalyser2.
The kernel myTask is used as a task (clEnqueueTask).
The code computes y=L*R*x with L and R being sparse matrices.
The code is not efficient, but that is not the point here.
I use AMD APP 2.8 and CAT 13.1 on Windows 7 Ultimate 64bit with i7-3820.
When I select CPU as device, the compiler crashes inside clBuildProgram (in amdocl64.dll).
KernelAnalyser2 also crashes.
I have made some observations with KernelAnalyser2:
1. The code compiles if I comment out myKernel.
2. The code compiles if I uncomment the line i = get_global_id(0).
3. The code compiles if I remove the second for loop (for(i = 0; i < rows; i++)),
although the results are then incorrect.
The same observations apply when I compile with my own software.
Could anyone try to reproduce my findings with KernelAnalyser2?
The code is as follows:
__kernel void myTask(__global float *R, __global int *RIdx, __global int *RPtr,
                     __global float *L, __global int *LIdx, __global int *LPtr,
                     __global float *x, __global float *y, int rows, int cols)
{
    __local float z[1024];
    int i = 0;
    //i = get_global_id(0);
    for(; i < cols; i++)
    {
        float acc = 0.0f;
        for (int j = RPtr[i]; j < RPtr[i+1]; j++)
        {
            acc += R[j] * x[RIdx[j]];
        }
        z[i] = acc;
    }
    for(i = 0; i < rows; i++)
    {
        float acc = 0.0f;
        for (int j = LPtr[i]; j < LPtr[i+1]; j++)
        {
            acc += L[j] * z[LIdx[j]];
        }
        y[i] = acc;
    }
}
__kernel void myKernel(__global float *a, __global float *b, __global float *c)
{
    int i = get_global_id(0);
    a[i] = b[i] + c[i];
}
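
For anyone who wants to check the results on the host, the same computation can be sketched in plain C as below. This is only a rough reference, assuming R and L are stored in CSR form (values, column indices, row pointers with one extra entry), which is what the indexing in myTask implies. The function names csr_spmv and lrx_reference and the caller-supplied scratch buffer z are my own illustrative choices and are not part of the kernel code above.

```c
/* y = A * x for a sparse matrix A in CSR form.
   val: non-zero values, idx: column index of each value,
   ptr: row i occupies entries ptr[i] .. ptr[i+1]-1 (n_rows + 1 entries). */
void csr_spmv(const float *val, const int *idx, const int *ptr,
              int n_rows, const float *x, float *y)
{
    for (int i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        for (int j = ptr[i]; j < ptr[i + 1]; j++)
            acc += val[j] * x[idx[j]];
        y[i] = acc;
    }
}

/* y = L * (R * x), mirroring the two loops of myTask.
   R has `cols` rows and L has `rows` rows, as in the kernel.
   z is scratch space with at least `cols` elements (hypothetical
   parameter; the kernel uses a fixed __local array instead). */
void lrx_reference(const float *R, const int *RIdx, const int *RPtr,
                   const float *L, const int *LIdx, const int *LPtr,
                   const float *x, float *y, float *z,
                   int rows, int cols)
{
    csr_spmv(R, RIdx, RPtr, cols, x, z);  /* z = R * x */
    csr_spmv(L, LIdx, LPtr, rows, z, y);  /* y = L * z */
}
```

Comparing y from the device against lrx_reference on a small input should show whether a given build still computes the right result.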