
inferrna
Journeyman III

Results differ for GPU and CPU devices depending on the data.

I wrote a simple propagation example, implemented both for the CPU and for OpenCL. The OpenCL results differ from the CPU results depending on the data structure and on which GPU is used. The first device, Devastator (7560D), gives the right result only when all layers have the same size. The second device, Pitcairn (HD 7870), gives a wrong result when all layers have a size equal to WORK_GROUP_SIZE (256 for both). The wrong layer is always the third one (the second of the mutable layers). My question is: is something wrong with my code, or with the fglrx driver, or is this task impossible for OpenCL, so that I must make a separate call for each propagation?

Attached is a test program that shows the difference between the CPU and GPU results.

Compile options: gcc -O0  -std=c99 test_prop.c test_prop_cl.c -o test_prop -lm -lOpenCL -ggdb -L/usr/lib/

This is the first variant of the kernel; it gives the same results as the version in the attachment.

__kernel void test_prop( __global float* values, __global __read_only float* conns, __global unsigned int* sz)
{
    __global float *pvalues = values;   // previous layer's values
    __global float *cvalues = values;   // current layer's values
    __local float sums[256];
    float sum;
    unsigned int i, nr, pnr, cn, clc, k, offc = 0, lcn, szp;

    for(k = 1; k < 3; k++){
        szp = sz[k-1];                  // previous layer's size
        cvalues += szp;
        clc = sz[k]*szp;                // number of connections into layer k
        cn  = select((uint)0, (uint)get_global_id(0), get_global_id(0) < clc);
        lcn = get_local_id(0);
        if(cn < clc){
            nr  = cn / szp;             // Current neuron
            pnr = cn % szp;             // Prev layer's connected neuron
            sums[lcn] = conns[cn+offc]*pvalues[pnr];
            barrier(CLK_LOCAL_MEM_FENCE);
            for(i = 2; i <= szp; i <<= 1){
                sum = select((float)0.0, sums[lcn + i/2], (uint)(clc % i == 0));
                sums[lcn] += sum;
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            cvalues[nr] = sums[lcn];
        }
        sums[lcn] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        offc += clc;
        pvalues = cvalues;
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
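For reference, the per-neuron sum that the reduction loop above tries to compute can be modeled serially in plain C. This is only an illustrative sketch (the function name reduce_neuron is made up here), assuming the previous layer's size szp is a power of two and at most 256:

```c
#include <stddef.h>

/* Serial model of a power-of-two tree reduction over one neuron's
   products: sums[j] = conns[j] * pvalues[j] for j in [0, szp).
   At stride i, every element whose index is a multiple of i absorbs
   the element i/2 positions to its right; the total ends up in sums[0]. */
float reduce_neuron(const float *conns, const float *pvalues, size_t szp)
{
    float sums[256];
    for (size_t j = 0; j < szp; j++)
        sums[j] = conns[j] * pvalues[j];

    for (size_t i = 2; i <= szp; i <<= 1)
        for (size_t j = 0; j < szp; j += i)
            sums[j] += sums[j + i / 2];   /* partner is i/2 to the right */

    return sums[0];   /* the neuron's accumulated input */
}
```

Comparing the kernel against a serial model like this is a quick way to localize which layer diverges.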

0 Likes
16 Replies
himanshu_gautam
Grandmaster

Before I get into the details,

I see a barrier() inside a conditional...

Are you sure all workitems of a workgroup take the same branch?
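The usual pattern is to hoist the barrier out of the divergent branch so that every workitem of the workgroup reaches it, padding out-of-range workitems with a neutral value. A sketch in OpenCL C (names borrowed from your kernel; this only illustrates barrier placement, it is not a drop-in fix):

```
// Every workitem executes both barriers; only the guarded work diverges.
if (cn < clc)
    sums[lcn] = conns[cn + offc] * pvalues[pnr];
else
    sums[lcn] = 0.0f;                 // neutral padding for out-of-range items
barrier(CLK_LOCAL_MEM_FENCE);         // reached unconditionally

for (uint i = 2; i <= szp; i <<= 1) {
    if (cn < clc && lcn % i == 0)
        sums[lcn] += sums[lcn + i/2]; // add the partner i/2 to the right
    barrier(CLK_LOCAL_MEM_FENCE);     // again outside the conditional
}
```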

+

Floating point math is tricky. Small errors will always be there between CPU and GPU executions. The magnitude of the errors depends on the number of operations you do. These are common and cannot be avoided; that is the nature of floating point numbers: the result depends on the order of computation. Since the order differs when you break the work down in parallel, the result changes by a small degree (every time it happens).

Try it with integer-like float data that is small, say less than 5.0f (assuming you are not doing divisions).
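The order dependence is easy to demonstrate on the host. A small self-contained C snippet (illustrative values only):

```c
/* Float addition is not associative: the grouping changes the result.
   With a = 1e8f, b = -1e8f, c = 1.0f:
     sum_ltr(a, b, c) == 1.0f   ((a + b) is exactly 0, then + 1)
     sum_rtl(a, b, c) == 0.0f   (b + c rounds back to -1e8f, the 1 is lost) */
float sum_ltr(float a, float b, float c) { return (a + b) + c; }
float sum_rtl(float a, float b, float c) { return a + (b + c); }
```

A parallel reduction groups the additions differently from a serial loop, so exactly this kind of rounding difference shows up between CPU and GPU runs.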

Also, are you using the -cl-fast-relaxed-math compilation option?

Best,

Bruhaspati

Here is the modified kernel, in which I try to avoid barriers inside a conditional and replace the interleaved writes to global memory with async_work_group_copy. It gives the same results.

test_prop.cl - Pastebin.com

> Are you using the -cl-fast-relaxed-math compilation option?

I have tried changing the options; it made no difference.

> Small errors will be there between CPU and GPU executions.

I have seen those, and they were insignificant. What I get is significant: 16.001 vs -58.54, for example.

Also, an unstable situation is possible (depending on the data structure): sometimes it gives the right result, sometimes not. For the 7870 this happens with layer sizes 256/128/64 (the layer size must be a power of 2 for the reduction to work).

> Try it with integer-like float data that is small, say less than 5.0f.

Interesting advice. I need float data, but I will try this to see the difference.


Barriers, in a literal sense, are barriers to performance as well.

As a parallel programmer, you should not be using so many barriers in your code.

Anyway,

if the errors are insignificant, then it is the regular floating point deviation.

Also, I see that in the new code you are not writing into global memory. You are merely doing some pointer math.

I don't understand what this code is trying to do...

-

Bruhaspati


Here is an illustration of what the kernel does: http://borgu.org/test_prop.pdf

In two words: layer 0 is immutable; its data propagates (with the coefficients stored in conns) to layer 1, and from layer 1 to layer 2. The first propagation, from layer 0 to layer 1, is always right, but the second produces wrong data depending on the structure - you can see it in the video.

> Barriers, in a literal sense are actually barriers to performance as well.

Right now accurate results are more important. Once that is done, I will optimize the code.

> I see that in the new code you are not writing into global memory.

It writes to global memory with async_work_group_copy (and reuses the results in the next iteration).


Thanks. In your video, at time 1:16 (after the clang compilation), pause it.

If you examine the results, they are still wrong (for example, look at 119 and above).

To me, it looks like a race condition (because of async_copy and the FOR loops).

We have seen earlier that a barrier in the middle of a FOR loop causes a race between the upper and lower half.

I will check your code now to see if some such assumptions are broken.

Best,

Bruhaspati


if(pnr == 0 && cn<clc) lvalues[lnr] = sums[lcn];

How many workitems execute the statement above?

I hope only 1. If not, many workitems would be writing to the same location, which is non-deterministic.

And in that case, I believe you should be writing "&lvalues[lnr]" to memory instead of writing "&lvalues[0]" to global memory.

Best,

Bruhaspati


> the results are still wrong (for example, look at 119 and above)

You are right; sorry for the oversight.

> I hope only 1. If not, many workitems would be writing to the same location, which is non-deterministic.

More than one, but there are no conflicts. This is shown in the updated scheme in test_prop_opt.svg.

Each workitem works on one connection (a coefficient, like a synapse between neurons), but at the end of each iteration we need the sums of the products conn*value, i.e. current_neuron_value = sum(prev_neuron_i*current_conn_i).

A simple network: Neural network | TikZ example

> And in that case, I believe you should be writing "&lvalues[lnr]" to memory

> instead of writing "&lvalues[0]" to global memory.

This is also described in test_prop_opt.svg: I write all local sums of each workgroup into global memory at its own offset (gvalues = cvalues + loff, where loff is common to the whole workgroup).


Buddy,

Can you attach your new kernel source file (the pastebin thing) as an attachment here (use the advanced editor)?

I am on a different network and it is all blocked here...

If multiple workitems are executing the store, then the code is already non-deterministic.


Ok, sure. I have also converted the block scheme into a PNG.


I recorded a new screencast, now with integer data. As you can see, the problem does not depend on the data type. The source files are also in the attachment (they differ slightly from the float sources).


Hi

Sorry for the late reply.

The code is not working. Please provide code that compiles on Windows.


I have fixed 2 errors that showed up in Visual Studio.


Hi

Thanks for your posting.. I will test this and get back to you.


Sorry for the late reply.

Still getting compilation errors.

My work environment:

Visual Studio 2010.

AMD APP SDK 2.9 with Catalyst 13.11 beta.


In VS2013 I also saw warnings about calloc/malloc, but after including malloc.h they went away. Which errors do you see?

The attached video shows my new experiment: instead of the last propagation I simply copy the data from layer 1 to a local buffer and from the buffer to layer 2. In straight order it shows from 0 to 6 errors on the Devastator; with reverse order it adds another 63 errors - it seems it accesses global data early, while it is still waiting to be computed and has not yet been filled. The same reason explains why it sometimes gets bad data from the 1st propagation and sometimes not: sometimes the computation has time to finish, sometimes not. So the question is: how do I properly synchronize the computation to guarantee data sharing between work-items? Global barriers do not help in this case.

All sources are in the attachment.
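For what it is worth, in OpenCL barrier() only synchronizes workitems within one workgroup; there is no barrier that spans workgroups, so the portable way to guarantee that one layer is fully written before the next one reads it is to enqueue one kernel per propagation step (commands in an in-order queue run in submission order). The sequencing itself can be stated as a serial C reference over the same flat buffers (propagate is a hypothetical helper written for this thread, matching the values/conns/sz layout of the test program):

```c
#include <stddef.h>

/* Serial reference propagation: values holds all layers back to back,
   conns holds all connection weights, sz[k] is the size of layer k.
   Each iteration of the outer loop corresponds to one kernel enqueue
   on the device. */
void propagate(float *values, const float *conns,
               const unsigned int *sz, unsigned int nlayers)
{
    size_t voff = 0;   /* offset of the previous layer in values */
    size_t coff = 0;   /* offset into conns */

    for (unsigned int k = 1; k < nlayers; k++) {
        const float *prev = values + voff;
        float *cur = values + voff + sz[k - 1];

        for (unsigned int n = 0; n < sz[k]; n++) {        /* each neuron */
            float sum = 0.0f;
            for (unsigned int p = 0; p < sz[k - 1]; p++)  /* each input */
                sum += conns[coff + (size_t)n * sz[k - 1] + p] * prev[p];
            cur[n] = sum;
        }
        coff += (size_t)sz[k] * sz[k - 1];
        voff += sz[k - 1];
    }
}
```

The indexing mirrors the kernel's nr = cn / szp, pnr = cn % szp decomposition, so the two can be compared layer by layer.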

inferrna
Journeyman III

I manually unrolled the main loop and replaced async_work_group_copy with simple copying inside a single thread. The results are the same. This unrolled and more heavily commented version of test_prop.cl is in the attachment.
