Hi everyone,
I have ported a cellular automaton from CPU to GPU using OpenCL, and I have the following hardware:
CPU Xeon 8 core
GPU NVIDIA k20c
GPU AMD R9 280x
The port is complete, but there is a problem with the R9 280x card.
Its results are very different from the CPU's and the NVIDIA card's (those two agree closely).
I wrote a few simple kernels to show the different results:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define DIVISOR 1.0002349474678

__kernel void div_kernel(__global double *in) {
    in[0] += pow(DIVISOR, 10);
    barrier(CLK_GLOBAL_MEM_FENCE);
}

__kernel void div_kernel2(__global double *in) {
    in[0] /= pow(DIVISOR, 10);
    barrier(CLK_GLOBAL_MEM_FENCE);
}

__kernel void div_kernel3(__global double *in) {
    in[0] = in[0] / DIVISOR;
    barrier(CLK_GLOBAL_MEM_FENCE);
}
These are the results after 1000 executions of each kernel:
OpenCL on CPU:
div_kernel = 101002.351960
div_kernel2 = 9544.561027
div_kernel3 = 79063.419905
OpenCL on NVIDIA:
div_kernel = 101002.351960
div_kernel2 = 9544.561027
div_kernel3 = 79063.419905
OpenCL on AMD:
div_kernel = 101007.072489
div_kernel2 = 86.949658
div_kernel3 = 79063.419905
As you can see, the plain division looks fine, but the pow function seems to have a rounding error on the AMD card, and using it in combination with division amplifies the error.
The code is attached in case anyone wants to try these kernels themselves (tested on Linux and Mac).
My question is: why do the results differ between NVIDIA and AMD?
Thanks to all in advance
EDIT:
If anyone has an AMD card and would like to help me find a solution, please compile and run the code and report the results.
Thanks!
PS. I'm using Ubuntu 14.04 x86_64 with Catalyst-14.9.
Don't you think there would be race conditions when multiple work-items or work-groups write to the same memory location simultaneously? Try using an atomic add in div_kernel and see whether it gives the same results on all devices.
Sorry, I didn't understand your problem correctly. The built-in pow function has often given me very different results, which is why I avoid it, either by coding it explicitly (e.g. double myPow(arg, arg)) or by computing the power on the CPU and passing it in as a kernel argument.
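For an integer exponent such as the 10 used here, "coding it explicitly" could look like the sketch below. It is written as plain C so it can be tried on the host first; pow_uint is a hypothetical name, and it only covers non-negative integer exponents.

```c
/* Hypothetical helper: x raised to a non-negative integer power,
 * computed by repeated squaring (no call to the built-in pow). */
double pow_uint(double x, unsigned int n) {
    double result = 1.0;
    while (n > 0) {
        if (n & 1u) {
            result *= x;   /* this bit of the exponent is set */
        }
        x *= x;            /* square the base for the next bit */
        n >>= 1;
    }
    return result;
}
```

For example, pow_uint(1.0002349474678, 10) gives about 1.00235196, which matches the per-call increment implied by the CPU result. OpenCL C also has a built-in pown(x, n) for integer exponents, though nothing in this thread confirms whether it behaves differently on the affected setup.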
Your attached code gave me the following result. Also, your code sets both the global and local NDRange sizes to 1, which won't give you any performance benefit.
Group 1 with 1 devices
Device 0
Platform name: NVIDIA CUDA
Vendor: NVIDIA Corporation
Device name: GeForce GTX 260
Bit Arch: 32
Extension: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64
Execution successful in 18208 millisec
result_k1: 101002.351960
result_k2: 9544.561027
result_k3: 79063.419905
Hi bilal,
This code is just an example to highlight the problem.
I know that an NDRange of size 1 is not a good choice; I set it to 1 just to have a single work-item.
Your results are correct because you are using an NVIDIA card.
Hi yurtesen,
Can you help me understand why there would be a race condition if global_size and local_size are both 1?
Thanks.
By the way, this modified kernel gives the same results:
__kernel void div_kernel2(__global double *in) {
    int pos = get_global_id(0);
    in[pos] /= pow(DIVISOR, 10);
    barrier(CLK_GLOBAL_MEM_FENCE);
}
Sorry, that was my mistake. I hadn't checked your code attachment, so I didn't realize that you enqueue the kernels 1000 times, etc.
I have now downloaded your kernel and run it on Spectre, Hawaii, and an AMD A10-7850K, with both the AMD SDK and the Intel SDK. The machine was using the linux-amd-catalyst-14.6-beta-v1.0-jul11.zip driver.
I have attached the output from your program. So there is no problem?
Thanks yurtesen!
Your results are good!
I'm seriously starting to think that I have a configuration or hardware problem. Maybe the AMD driver conflicts with the NVIDIA one, or there is a problem with the power supply?!
Could you try one more thing, please? Replace pow(DIVISOR,10) with cos(DIVISOR) in the first kernel function and run it.
On my system this is the result on AMD:
result_k1: 99398.981831
but this is the correct result:
result_k1: 100540.104589
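To see how far off the AMD value is, the per-call contribution can be backed out of the accumulated result. This is a minimal host-side sketch in C, assuming the same starting value of 100000 (inferred from the printed results) and 1000 runs; ref_add_cos and implied_per_call are hypothetical names.

```c
#include <math.h>

#define DIVISOR 1.0002349474678

/* Host stand-in for the first kernel with cos(DIVISOR) in place of pow(DIVISOR,10). */
double ref_add_cos(double start, int runs) {
    double v = start;
    for (int i = 0; i < runs; ++i) {
        v += cos(DIVISOR);
    }
    return v;
}

/* Average amount added per kernel run, recovered from a final result. */
double implied_per_call(double result, double start, int runs) {
    return (result - start) / runs;
}
```

implied_per_call(99398.981831, 100000.0, 1000) is about -0.601, whereas cos(1.0002349474678) is about 0.5401. The AMD figure therefore differs in sign, not just in the last few digits.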
The solution (suggested by a user of the GPGPU group on LinkedIn) was to write my own pow() function like this:
/* x^a = exp(a * ln(x)), valid for x > 0 */
double my_pow(double x, double a) {
    return exp(a * log(x));
}
It works very well and the results are the same as the NVIDIA ones.
Yes, that's what I pointed out in my reply.
musale87 wrote:
The solution (suggested by a user of the GPGPU group on LinkedIn) was to write my own pow() function like this:
/* x^a = exp(a * ln(x)), valid for x > 0 */
double my_pow(double x, double a) {
    return exp(a * log(x));
}
It works very well and the results are the same as the NVIDIA ones.
A far cleaner, simpler and more debuggable solution.
I get 100540.104589 for all the results from all the devices in the same machine as before (I also tried a Tahiti card and got the same result). For your information, my clinfo shows driver version 1526.3 for AMD and 1.2.0.92 for the Intel SDK, and the machine is running Ubuntu 14.04.
I can't tell what the problem with your machine may be. You may have an old version of something, or it may be a driver problem. I can't imagine what kind of power supply problem would affect only certain functions. I prefer to use an all-AMD machine for my work because I was fed up with dealing with driver problems etc. (and Kaveri CPUs and motherboards are so cheap that it made sense to build one).
Thank you very much!
I will reinstall the operating system from scratch and try the system without the NVIDIA card.
I hope that all goes well.
I'm sorry for the misleading discussion.