Archives Discussions

musale87 · ‎10-24-2014

Hi everyone,

I have ported a cellular automaton from CPU to GPU using OpenCL, and I have the following hardware:

CPU Xeon 8 core

GPU NVIDIA k20c

GPU AMD R9 280x

The port is complete, but there is a problem with the R9 280x card.

The problem is that its results are very different from the CPU's and NVIDIA's (those two are very close).

I have made a simple kernel to show you the different results:


#pragma OPENCL EXTENSION cl_khr_fp64 : enable

#define DIVISOR 1.0002349474678


__kernel void div_kernel( __global double *in){

    in[0]+=pow(DIVISOR,10);

    barrier(CLK_GLOBAL_MEM_FENCE);

}


__kernel void div_kernel2( __global double *in){

    in[0]/=pow(DIVISOR,10);

    barrier(CLK_GLOBAL_MEM_FENCE);

}


__kernel void div_kernel3( __global double *in){

    in[0]=in[0]/DIVISOR;

    barrier(CLK_GLOBAL_MEM_FENCE);

}

And these are the results after 1000 executions:

OpenCL on CPU:

div_kernel = 101002.351960

div_kernel2 = 9544.561027

div_kernel3 = 79063.419905

OpenCL on NVIDIA:

div_kernel = 101002.351960

div_kernel2 = 9544.561027

div_kernel3 = 79063.419905

OpenCL on AMD:

div_kernel = 101007.072489

div_kernel2 = 86.949658

div_kernel3 = 79063.419905

As you can see, the division seems OK, but the pow function has a rounding error which, when combined with the division, gets amplified.

The code is attached in case anyone would like to try these kernels themselves (tested on Linux and Mac).

My question is: why are the results different between NVIDIA and AMD?

Thanks to all in advance

EDIT:

If anyone has an AMD card and would like to help me find a solution, you can simply compile and run the code and then report the results.

Thanks!

PS. I'm using Ubuntu 14.04 x86_64 with Catalyst-14.9.

yurtesen · ‎10-25-2014

Don't you think there would be race conditions when multiple work-items or work-groups write to the same memory location simultaneously? Try using an atomic add in div_kernel and see if it gives the same results on all devices:

atom_add

bilal · ‎10-26-2014

Sorry, I didn't understand your problem correctly. But the built-in pow function has often given me very different results, which is why I avoid it, either by coding it explicitly (e.g. double myPow(arg, arg)) or by computing the power on the CPU and passing it as a kernel argument.

Your attached code gave me the following result. Also, in your code you set both the NDRange global and local sizes to 1, which won't help you get good performance.

Group 1 with 1 devices

Device 0

Platform name: NVIDIA CUDA

Vendor: NVIDIA Corporation

Device name: GeForce GTX 260

Bit Arch: 32

Extension: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64

Execution successful in 18208 millisec

result_k1: 101002.351960

result_k2: 9544.561027

result_k3: 79063.419905

musale87 · ‎10-26-2014

Hi bilal,

This code is just an example to highlight the problem.

I know that an NDRange of size 1 is not a good choice, but I set it to 1 just to have a single work-item.

Your results are correct because you are using an NVIDIA card.

musale87 · ‎10-26-2014

Hi yurtesen,

Help me understand: why is there a race condition if the global_size and the local_size are both 1?

Thanks.

Btw, using this modified kernel gives the same results:


__kernel void div_kernel2( __global double *in) {

    int pos = get_global_id(0);

    in[pos]/=pow(DIVISOR,10);

    barrier(CLK_GLOBAL_MEM_FENCE);

}

yurtesen · ‎10-27-2014

Well, sorry, it was my mistake. I didn't check your code attachment, so I didn't realize you enqueue the kernels 1000 times, etc.

Now I downloaded your kernel and ran it on Spectre, Hawaii, and an AMD A10-7850K, with the AMD SDK and the Intel SDK. The machine was using the linux-amd-catalyst-14.6-beta-v1.0-jul11.zip driver.

I have attached the output from your program. So there is no problem?

musale87 · ‎10-27-2014

Thanks yurtesen!

Your result is good!

I seriously think I have a configuration or hardware problem: maybe the AMD driver conflicts with the NVIDIA one, or maybe there is a problem with the power supply?!

Can you try one more thing, please? Replace pow(DIVISOR,10) with cos(DIVISOR) in the first kernel function and run it.

On my system this is the result on AMD:

result_k1: 99398.981831

but this is the right result:

result_k1: 100540.104589

musale87 · ‎10-26-2014

The solution (suggested by a user of the GPGPU group on LinkedIn) was to write my own pow() function like this:


double my_pow(double x, double a) {

    /* x^a = exp(a * log(x)); valid only for x > 0 */

    return exp(a * log(x));

}

It works very well and the results are the same as the NVIDIA ones.

bilal · ‎10-26-2014

Yes, that's what I pointed out in my reply.

mdriftmeyer · ‎10-27-2014


musale87 wrote:


The solution (suggested by a user of the GPGPU group on LinkedIn) was to write my own pow() function like this:

double my_pow(double x, double a) {
    return exp(a * log(x));
}

It works very well and the results are the same as the NVIDIA ones.

A far cleaner, simpler and more debuggable solution.

yurtesen · ‎10-28-2014

I get 100540.104589 for all the results from all the devices on the same machine as before (I also tried a Tahiti card and got the same result). For your information, my clinfo shows driver version 1526.3 for AMD and 1.2.0.92 for the Intel SDK (and the machine is running Ubuntu 14.04).

I can't tell what the problem on your machine may be; you may have an old version of something, or it may be a driver problem. I can't imagine what kind of power supply problem would affect only certain functions. I prefer to use an all-AMD machine for my stuff because I was fed up with dealing with driver problems etc. (and Kaveri CPUs and motherboards are so cheap that it made sense to build one).

musale87 · ‎10-28-2014

Thank you very much!

I will reinstall the operating system from scratch and try the system without the NVIDIA card.

I hope all goes well.

I'm sorry for this mistaken discussion.


Maths in double precision gives me different results on AMD and NVIDIA