cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Anon5710
Adept I

atomic operation

Hello,

I have written a kernel whose job is to compare  2 groups of integers to each other, whenever there is a match counter is increased.

To avoid running in the 128 MB buffer limit, i decided to use atomic operations to give me a index, which i use as a reference where to write when counter is high enough.

__kernel void compare( __global int* input, __global int* output, const unsigned int count, __local int * shared , __local int * suspect, __local int * indexes, __global int* outIndex )

{

    int x = get_global_id(0); //get global idx

    int y = get_global_id(1); //get global idy

    int local_idx = get_local_id(0);

    int k,i,j;

    int counter=0;

    int index;

       

    if( x < count && y < count && x != y)

    {

        //store each subgroup (of hashes) in local memory

        //offset each variable by workgroup size bytes

        for(k=0;k<REP*2;k++)

        {

            shared[local_idx+k*64] = input[x+k*count];

            suspect = input[y+k*count];            //does the one exception to bank conflicts also work when writing ?

        }

        barrier(CLK_LOCAL_MEM_FENCE);

        //store outindex global values in local index values

        //has a coalsed acces pattern

        indexes[local_idx] = outIndex;

               

        for(i=0;i<REP*2;i++)

        {

            for(j=0;j<REP*2;j++)

            {

                if(suspect == shared[local_idx+j*64])

                {

                    //printf("Succes : %d = %d -> (%d,%d) \n",suspect,shared[local_idx+j*64],x,y);

                    counter++;

                }

            }

        }

        //output[x+y*count] = counter;

        if(counter >  REP/2)

        {

            index = atomic_add( &indexes[local_idx], 2 );

            output[ x +index*count ] = y;

            output[ x +index*count+count ] = counter;

            //if( x == 65 )

                //printf("(%d,%d,%d)\n",y,counter,index);

        }

        barrier(CLK_LOCAL_MEM_FENCE);

        //restore outindex in global memory;

        outIndex = indexes[local_idx];

        barrier(CLK_LOCAL_MEM_FENCE);

    }

}

Now when this code is run with line 45 & 46 uncommented, it works perfectly i have no problem at all. I get 20 or so positive results.

When line 45&46 are commented out, i get return 1 result, the last one.  So I'm guessing there is a problem with the atomic operation. But honestly, i have no clue how to fix this.

Does anyone know what i'm doing wrong ?

0 Likes
1 Solution
Anon5710
Adept I

Hello again,

I've asked my professor but he couldn't answer this question as-well.

i've attached my complete scourcode below.

(make to compile, ./v1-par-test to execute. )

Hope this helps

View solution in original post

0 Likes
5 Replies
Anon5710
Adept I

Hello again,

I've asked my professor but he couldn't answer this question as-well.

i've attached my complete scourcode below.

(make to compile, ./v1-par-test to execute. )

Hope this helps

0 Likes

Aren't you performing atomics on a non-shared value (addressing local memory with the local index)? I'm not clear what purpose the atomic is serving unless I'm misreading the code.

Ahh, so it wasn't just me.  It just seems to copy a globally unique value to a locally unique cache, do some locally unique 'atomic' ops on it, and then copy it back to the globally unique array.

(and not to mention, actually writing any results to an always globally unique slice anyway).

Atomics must be shared between jobs to be any use.

If you want a tight-packed result you need to use a single global atomic - use atomic counters on amd, global atomics are really slow.  Or if you want one tight-packed-result per work-group, you could have one atomic counter per workgroup, or whatever - but since you need to reserve output space anyway it probably wouldn't really help achieve your goal of reducing memory wastage.

So to quote from the opencl specification:

Global memory is consistent across

work-items in a single work-group at a work-group barrier, but there are no guarantees of

memory consistency between different work-groups executing a kernel.

(this is what the above 2 persons were saying right ? )

So how does using printf actually guarantees  memory consistency between diffrent work group at a global level ?

(because this is what i am experiecing ?)

As a side note..

Am i forced to use a global counter, to get a tighter packed result ? ( i still can use count*count  array but this quickly grows way to .... )

second, i only have seen a global atomics example, can someone point me to a global counter example ?

Regards Anon5710

0 Likes

printf does not guarantee memory consistency. What you are seeing is a by-product of our implementation where we only run a certain number of work-groups at a time when printf is used because of the issue with the printf buffer size not being able to grow dynamically.

0 Likes