Hi guys,
I've been trying to write a kernel which multiplies a row vector into a CSR format stored matrix, but it gives a different result each time it runs. The GPU is AMD Radeon™ HD 7970 Graphics.
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
__kernel void MatMulTF(__global float *A, __global int *row, __global int *col, __global float *Rs, volatile __global float *As) {
int gid=get_global_id(0);
int j=row[gid];
int a1=row[gid+1];
float Rsgid=Rs[gid];
float rp;
int colj;
volatile __global unsigned int *pAs;
union {
unsigned int intVal;
float floatVal;
} newVal, prevVal;
for (;j<a1; j++) {
rp=A
colj=col
pAs=(volatile __global unsigned int *) (&As[colj]);
do {
prevVal.floatVal = As[colj];
newVal.floatVal = prevVal.floatVal + rp;
} while (atomic_cmpxchg(pAs, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}
}
Can anybody help me? thanks.
Solved! Go to Solution.
Check the SrSum value it is a 9 digit number for me. compared to which 32 is negligible.
Seriously? No answer?
Should not the while loop condition be flipped? i.e. while (atomic_cmpxchg(pAs, prevVal.intVal, newVal.intVal) == prevVal.intVal);
Also, I fear your code might suffer from multiple problems - Non-coalesced memory accesses + Excessive atomics.
Hope you are running on the latest generation of AMD cards - which handle atomics better. (like 7970)
Thanks,
But I think the code is right despite its several drawbacks which are not my concern for now.
I like to loop while the old value read by atomic_cmpxchg is not equal to previously read value used to calculate the sum. (Reference: here).
The odd thing is, the code gives the right values for first iterations, but then, it goes wrong. In each run, it gives a different answer! I re-wrote it in several ways, but the problem remains. I thought I have the same problem as: Possible bug with atom_cmpxchg, but I was wrong.
PS: I have a 7970.
Any Ideas? The problem is related to atomic_cmpxchg. I'm sure of it.
The while() condition that you had used is correct. Sorry about the confusion.
Right now, I am unable to edit my answer... So, I will do that when "edit" starts working...
Can you post your code here?
I remember a thread on image-processing where they were seeing wrong results.....Sounds similar.
Can you post a zip attachment? That will be useful.
Thanks for your attention,
The code is very long and complicated and about 5000 lines...which part do you need? I can email it to you (It's somehow confidential). However, as I mentioned only this kernel have problem, since when I replace it with a serial code, it works correctly. (this kernel multiplies a row vector into a Compressed Sparse Row matrix)
Would you please place the link to that thread about image-processing?
Thank you in advance.
If this kernel alone is the problem -- Can you make a small repro case (standalone code that can reproduce) with this kernel alone?
It might be of immense help in debugging this problem.
Here is the link on image processing:
The sample is a numerical case, and has iterations. The problem won't be seen in the one iteration. to build a repro case, you'll need the whole program. I can mail it to you, but I can not put it in the forum.
I think the problem is a cache-flush related problem in the atomic operation, since the condition in the while(atomic_cmpxchg(pAs, prevVal.intVal, newVal.intVal) != prevVal.intVal) seems to be always false (loop is always executed only one time).
Interestingly, Timchist - who started the thread I have referred above - has come back saying Catalyst driver 13.6 has fixed the issue.
Can you try your example with 13.6 - if thats not a huge ask.