
Possible problem with atomic_cmpxchg?

Hi guys,

I've been trying to write a kernel that multiplies a row vector by a matrix stored in CSR format, but it gives a different result each time it runs. The GPU is an AMD Radeon™ HD 7970.

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

__kernel void MatMulTF(__global float *A, __global int *row, __global int *col, __global float *Rs, volatile __global float *As) {
    int gid=get_global_id(0);
    int j=row[gid];      // first nonzero of this row
    int a1=row[gid+1];   // one past the last nonzero of this row
    float Rsgid=Rs[gid];
    float rp;
    int colj;
    volatile __global unsigned int *pAs;
    union {
        unsigned int intVal;
        float floatVal;
    } newVal, prevVal;

    for (;j<a1; j++) {
        // NOTE: the two assignments below were lost from the post; they are
        // reconstructed as an assumption from the CSR description above.
        colj=col[j];
        rp=A[j]*Rsgid;
        pAs=(volatile __global unsigned int *) (&As[colj]);
        // Float atomic add emulated with a compare-and-swap loop on the bit pattern.
        do {
            prevVal.floatVal = As[colj];
            newVal.floatVal = prevVal.floatVal + rp;
        } while (atomic_cmpxchg(pAs, prevVal.intVal, newVal.intVal) != prevVal.intVal);
    }
}


Can anybody help me? Thanks.
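For comparison, the serial computation that the kernel is meant to reproduce might look like the sketch below. It assumes each nonzero contributes `x[i] * vals[j]` to output column `col_idx[j]`, which is the usual CSR scatter for a row vector times a matrix; the function and parameter names are illustrative and do not come from the original program (`vals`, `row_ptr`, `col_idx`, `x`, `y` play the roles of `A`, `row`, `col`, `Rs`, `As`).

```c
#include <stddef.h>

/* Serial reference sketch: y += x^T * M, with M stored in CSR form.
 * All names here are hypothetical stand-ins for the kernel arguments.
 * The caller is expected to zero y before the call. */
void csr_rowvec_mul(size_t n_rows, const float *vals, const int *row_ptr,
                    const int *col_idx, const float *x, float *y)
{
    for (size_t i = 0; i < n_rows; ++i) {
        /* Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1. */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            /* Each nonzero scatters its product into its output column. */
            y[col_idx[j]] += x[i] * vals[j];
        }
    }
}
```

Because the serial loop always adds contributions in the same order, it gives the same rounded result on every run, which makes it a convenient baseline for the GPU output.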


15 Replies

Seriously? No answer?


Shouldn't the while loop condition be flipped, i.e. while (atomic_cmpxchg(pAs, prevVal.intVal, newVal.intVal) == prevVal.intVal);


Also, I fear your code might suffer from two performance problems: non-coalesced memory accesses and excessive atomics.

I hope you are running on the latest generation of AMD cards (like the 7970), which handle atomics better.


But I think the code is correct despite its drawbacks, which are not my concern for now.

I want to keep looping while the old value returned by atomic_cmpxchg is not equal to the previously read value that was used to calculate the sum. (Reference: here).
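The retry logic described here can be sketched on the host side with C11 atomics. This is only an analogue of the OpenCL kernel, not the same API: OpenCL's `atomic_cmpxchg` returns the old value, while C11's `atomic_compare_exchange_weak` returns a success flag and writes the current contents back into the expected value on failure. The function names below are made up for the illustration.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (memcpy avoids aliasing issues). */
uint32_t float_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
float bits_to_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Adds delta to the float whose bit pattern lives in *target.
 * Returns the value that was stored before the add. */
float atomic_add_float(_Atomic uint32_t *target, float delta)
{
    uint32_t old_bits = atomic_load(target);
    uint32_t new_bits;
    do {
        /* Recompute the sum from whatever value we last observed. */
        new_bits = float_to_bits(bits_to_float(old_bits) + delta);
        /* If another thread changed *target in the meantime, the exchange
         * fails, old_bits is refreshed, and the loop retries. */
    } while (!atomic_compare_exchange_weak(target, &old_bits, new_bits));
    return bits_to_float(old_bits);
}
```

In both versions the loop repeats only when the stored value changed between the read and the exchange, so a loop that usually runs once is the expected behavior when there is little contention on a given element.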

The odd thing is that the code gives the right values for the first iterations, but then it goes wrong. In each run, it gives a different answer! I rewrote it in several ways, but the problem remains. I thought I had the same problem as in the thread "Possible bug with atom_cmpxchg", but I was wrong.

PS: I have a 7970.

Any ideas? The problem is related to atomic_cmpxchg. I'm sure of it.


The while() condition that you had used is correct. Sorry about the confusion.

Right now, I am unable to edit my answer... So, I will do that when "edit" starts working...

Can you post your code here?

I remember a thread on image processing where they were seeing wrong results. It sounds similar.

Can you post a zip attachment? That will be useful.


Thanks for your attention,

The code is very long and complicated, about 5000 lines. Which part do you need? I can email it to you (it's somewhat confidential). However, as I mentioned, only this kernel has a problem: when I replace it with serial code, everything works correctly. (This kernel multiplies a row vector by a Compressed Sparse Row matrix.)

Would you please post the link to that thread about image processing?

Thank you in advance.


If this kernel alone is the problem, can you make a small repro case (standalone code that reproduces the issue) with this kernel alone?

It might be of immense help in debugging this problem.

Here is the link on image processing:

AMD 79xx GPUs skip kernel execution for certain indices

The sample is a numerical case and is iterative. The problem won't show up in a single iteration, so to build a repro case you'll need the whole program. I can mail it to you, but I cannot put it on the forum.

I think the problem is cache-flush related in the atomic operation, since the condition in while(atomic_cmpxchg(pAs, prevVal.intVal, newVal.intVal) != prevVal.intVal) seems to be always false (the loop body is always executed only once).


Interestingly, Timchist, who started the thread I referred to above, has come back saying Catalyst driver 13.6 has fixed the issue.

Can you try your example with 13.6, if that's not a huge ask?



The problem still persists.

I'm trying to build a repro case, if it is possible.


I've built a small repro case based on my matrices (in MS Visual Studio 2008, on a Radeon HD 7970).

As you'll see, the code breaks at different values of the variable j. It is expected to print "Success", but that never happens, at least on my PC.

Thanks for helping me.



It is a problem with the way you are comparing the results. Floats cannot be compared for equality like ints; you need an error threshold that depends on the magnitude of the stored numbers. Your code seems to be working fine.

Just try printf("iteration: %d Current Sum:%f, Older sum:%f\n", j, rSum, SrSum); and see how close the results actually are.
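A threshold that scales with the magnitude of the numbers could look like the sketch below. The helper name `nearly_equal` and the `rel_tol` parameter are assumptions for illustration; the right tolerance depends on how many additions went into each sum.

```c
#include <stdbool.h>

/* Local absolute value so the sketch does not depend on libm. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* True when a and b differ by at most rel_tol times the larger magnitude.
 * rel_tol is a hypothetical, caller-chosen relative tolerance. */
bool nearly_equal(float a, float b, float rel_tol)
{
    float abs_a = abs_f(a);
    float abs_b = abs_f(b);
    float scale = abs_a > abs_b ? abs_a : abs_b;
    return abs_f(a - b) <= rel_tol * scale;
}
```

In the repro case this would replace an exact test such as `SrSum != rSum` with something like `!nearly_equal(SrSum, rSum, tol)`, where `tol` is a small multiple of the float machine epsilon.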


I do not think the comparison method is the problem. It MUST give the same results. But I've found a new issue. Change the if to:

if (SrSum!=rSum) {
  printf("%d\t%f\n", j, SrSum-rSum);
  //TerM("Error Found!");
}

On my machine, different values of j are printed, and the value of (SrSum-rSum) is always -32.000000 or 32.000000!

The difference is very large; especially in iterative solvers, it can turn a good algorithm into one that does not work.


Check the SrSum value: it is a 9-digit number for me, compared to which 32 is negligible.
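The figure 32 is not arbitrary. A single-precision float has a 24-bit significand, so for values between 2^28 (about 268 million) and 2^29 (about 537 million) the gap between neighboring representable floats is exactly 32. The sketch below measures that gap by stepping to the next bit pattern; the value 300000000 is only an illustrative 9-digit number, not one taken from the repro case.

```c
#include <stdint.h>
#include <string.h>

/* Gap between a positive, finite float and the next representable float above it. */
float ulp_above(float x)
{
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits);
    bits += 1u;  /* for positive finite floats, the next bit pattern is the next float */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

For example, `ulp_above(300000000.0f)` evaluates to 32, so two sums in that range that differ by 32 are adjacent floats and differ only in the last bit.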


Yes, it is. I compared (SrSum-rSum)/rSum with the machine epsilon of float, and they are of the same magnitude. It's a numerical error or, for us petroleum engineers, "numerical dispersion".
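One plausible reason the result changes from run to run is that float addition is not associative, and the atomic updates to a given element can complete in a different order on each run. The sketch below adds the same three numbers in two orders; the function name and the values are made up for the illustration.

```c
/* Sums three floats left to right, rounding to float after every addition.
 * The volatile accumulator forces each partial sum to be stored as a float. */
float sum3(float a, float b, float c)
{
    volatile float acc = a;
    acc = acc + b;
    acc = acc + c;
    return acc;
}
```

With these values, `sum3(300000000.0f, 10.0f, 10.0f)` stays at 300000000 because each 10 is less than half the gap of 32 and is rounded away, while `sum3(10.0f, 10.0f, 300000000.0f)` gives 300000032 because the two small terms are combined first. The two orders differ by exactly 32, the same size of difference seen in the repro case.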

Thanks so much.