nbigaouette
Journeyman III

[Solved] Simple hello world on CPU: Bug in AmdStream?

Not a bug at all, just a misunderstanding of OpenCL...

First, I want to thank AMD for providing the AMD Stream SDK. I just started using v2.2, but unfortunately I don't have a GPU at the moment, so I'm running some tests on the CPU.

While trying a simple kernel, I could not get correct results. I then tried on two different machines with dedicated GPUs and got the expected results. The kernel simply adds two arrays and stores the result in the second one.

As soon as I include "b[i]" in the assignment of "b[i]" (or even when using a temporary variable), the new values are always "2*b". The exact same code, on a real GPU, gives the correct addition.

 

 

I'm still new to OpenCL, but I cannot think of anything other than a bug in the Stream package. Is that possible?

I'm developing on Arch Linux x86_64, but I tested on an Ubuntu i686 machine with an NVIDIA GeForce 8400 GS (CUDA SDK) and on a cluster with many Tesla cards.

Thank you for your help.

 

__kernel void Test_OpenCL(__global const float * const a,
                          __global float * const b,
                          const int n)
{
    for (int i = 0 ; i < n ; i++)
    {
        b[i] = a[i] + b[i];
    }
}

cjang
Journeyman III

It works OK for me on a Core 2 Duo CPU running Ubuntu 10.04 x86_64 using SDK v2.2 and Catalyst 10.7b. I am also doing mixed ATI / NVIDIA development. Since each OpenCL runtime can see all GPUs (if the driver modules are installed and active), it is easy to run an ATI-linked binary against the NVIDIA GPU or vice-versa. I've done that, and the code can appear to run, but the results are undefined.
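Because of that, it helps to first check which platforms and devices each runtime actually exposes. Here is a minimal sketch (my own illustration, not part of the original test) that lists them with the standard OpenCL C API:

/* List every OpenCL platform and device visible to the installed runtimes,
   so you can verify which implementation a binary actually picks up. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; p++) {
        char name[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", p, name);

        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint d = 0; d < num_devices; d++) {
            char dev_name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dev_name), dev_name, NULL);
            printf("  Device %u: %s\n", d, dev_name);
        }
    }
    return 0;
}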

Here's the code I used to test your Test_OpenCL kernel. It will print the following:

B[0] = 3
B[1] = 4
B[2] = 5
B[3] = 6

ComputeInit base;
ComputeDevice cDev(base, 0);

const int N = 4;

// array of floats
Membuf<float> A(cDev, N, READ);
Membuf<float> B(cDev, N, READWRITE);

// initialize arrays
for (size_t i = 0; i < A.length(); i++)
    static_cast<float*>(A)[i] = 3;
for (size_t i = 0; i < B.length(); i++)
    static_cast<float*>(B)[i] = i;

// kernel
Kernel foo(cDev, "Test_OpenCL",
    "__kernel void Test_OpenCL(__global const float * const a, __global float * const b, const int n) "
    "{ "
    "    for (int i = 0 ; i < n ; i++) "
    "    { "
    "        b[i] = a[i] + b[i]; "
    "    } "
    "} ");

// set work items and arguments
foo << WorkIndex(1, 1) << A << B << N;

// data transfer from host to device
// execute kernel
// data transfer from device to host
cDev << A << B << foo << FLUSH >> B >> FLUSH;

// print output
for (size_t i = 0; i < B.length(); i++)
    cout << "B[" << i << "] = " << static_cast<float*>(B)[i] << endl;
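If you don't have that wrapper library, here is a rough equivalent in the plain OpenCL C API (a sketch only, assuming the CPU device of the first platform; like WorkIndex(1, 1) above, it runs the loop kernel as a single work-item):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    const int N = 4;
    float A[4] = {3, 3, 3, 3};
    float B[4] = {0, 1, 2, 3};

    const char *src =
        "__kernel void Test_OpenCL(__global const float * const a, "
        "                          __global float * const b, "
        "                          const int n) "
        "{ "
        "    for (int i = 0; i < n; i++) "
        "        b[i] = a[i] + b[i]; "
        "}";

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Copy the host arrays into device buffers. */
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 sizeof(A), A, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                 sizeof(B), B, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "Test_OpenCL", NULL);

    clSetKernelArg(kern, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kern, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kern, 2, sizeof(int), &N);

    /* A single work-item executes the whole loop. */
    size_t global = 1;
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, bufB, CL_TRUE, 0, sizeof(B), B, 0, NULL, NULL);

    /* Should print B[0] = 3 ... B[3] = 6. */
    for (int i = 0; i < N; i++)
        printf("B[%d] = %g\n", i, B[i]);

    clReleaseMemObject(bufA);
    clReleaseMemObject(bufB);
    clReleaseKernel(kern);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}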


In reply to nbigaouette:
I feel you are computing the same values of the same variables in every thread. This might give different results on the CPU and the GPU, as the execution order of threads differs in the two cases.

Try this kernel for matrix addition:

 

__kernel void sum(int width,
                  __global float* X,
                  __global float* Y,
                  __global float* Z)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    Z[j*width+i] = X[j*width+i] + Y[j*width+i];
}
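On the host side, the global work size then has to cover the whole matrix, one work-item per element. A small sketch (launch_sum, queue and kern are placeholders for objects you would already have created and built):

#include <CL/cl.h>

/* Launch the "sum" kernel above with a 2D range so that get_global_id(0)
   and get_global_id(1) visit every (i, j) element exactly once. */
cl_int launch_sum(cl_command_queue queue, cl_kernel kern,
                  size_t width, size_t height)
{
    size_t global[2] = { width, height };
    return clEnqueueNDRangeKernel(queue, kern, 2, NULL, global, NULL,
                                  0, NULL, NULL);
}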


nbigaouette

Thank you for your answer.

You were right: the loop on "i" should be handled by the work-items, not written explicitly. By replacing "for (int i = 0 ; i < n ; i++)" with "unsigned int i = get_global_id(0);" it now works. I was sure I had tried that before...
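For reference, here is what the corrected kernel looks like (a sketch based on the change described above; the bounds check is my own addition, in case the global work size gets rounded up beyond n):

__kernel void Test_OpenCL(__global const float * const a,
                          __global float * const b,
                          const int n)
{
    /* One work-item per element instead of an explicit loop. */
    unsigned int i = get_global_id(0);
    if (i < (unsigned int)n)
        b[i] = a[i] + b[i];
}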

I tried it on two different CPUs and it works as expected.

Sorry for my mistake!
