Can you tell us where the CPU implementation and the OpenCL kernel implementation are in the code above?
Please post a copy of your code (as a zip file) so that we can reproduce the problem here.
Please include the following details as well:
1. Platform: win32 / win64 / lin32 / lin64, or something else? Also the OS version (Win7, Windows Vista, Win8, ...) or, for Linux, your distribution.
2. Version of driver
3. CPU or GPU Target?
4. CPU/GPU details of your hardware
I would suggest attaching the code to the forum post itself using the advanced editor. There is also no need to include a large number of unnecessary VS files. I have done that for you here, though.
Now let's go through the code:
1. How many global threads are you creating in total? It looks like global_threads is equal to stringlength in the code, and the string used has a length of just 1. The commented-out string is no longer than 10 characters either. These are very small numbers for a GPU with hundreds of stream processors. It is generally recommended to launch 4 times as many threads as your GPU has stream processors to achieve good occupancy.
2. The kernel has a complicated-looking while loop which, by the way, does not affect the output of the kernel, so I guess it may just be dummy code. In that case it is hard to say how much of the loop the compiler will optimize away. Better make sure the results of the while loop are actually used in the kernel's output.
3. And yes, assuming nothing is optimized out, you have a big private array of about 2800 ints per work-item, which is horribly slow. The big while loop is run 1000 times by each work-item. Can you explain how you are parallelizing the total work to be done here?
4. You should probably look at some more samples. Hello World is only meant to explain the basics of writing OpenCL programs. Also read the AMD OpenCL Programming Guide to learn what kinds of algorithms can be accelerated on GPUs.
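As a rough illustration of point 1, the global work size can be derived on the host from the number of stream processors. This is only a sketch: the helper name suggested_global_size, its parameters, and the rounding up to a multiple of the work-group size are assumptions made for illustration, not something taken from the attached code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: suggest a global work size for a 1-D NDRange.
 * stream_processors: number of stream processors on the device.
 * threads_per_sp:    oversubscription factor; 4 follows the rule of thumb above.
 * local_size:        work-group size; the result is rounded up to a multiple of it. */
size_t suggested_global_size(size_t stream_processors,
                             size_t threads_per_sp,
                             size_t local_size)
{
    assert(local_size > 0);  /* a work-group needs at least one work-item */

    size_t wanted = stream_processors * threads_per_sp;

    /* Round up so the global size is an exact multiple of the local size. */
    return ((wanted + local_size - 1) / local_size) * local_size;
}
```

For a device with 800 stream processors and a work-group size of 64, suggested_global_size(800, 4, 64) gives 3200, which would then be passed as the global work size to clEnqueueNDRangeKernel. Compare that with the 1 or 10 threads the current code launches.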
HelloWorld1.zip 10.6 KB
int r[2800 + 1];
Your kernel allocates a huge private array.
On top of that, you are not using constant indices to access it.
So the compiler will most likely place it in global memory (or something similar), which will slow it down very badly.
Try using the local data store instead:
local int r[2800 + 1];
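To put numbers on the difference, here is a small host-side C sketch of the memory footprint. It assumes 4-byte ints (as in OpenCL C) and takes the work-group size as a parameter; the helper names are hypothetical. Keep in mind that a local array is shared by all work-items of a work-group, so this only helps if the work-items can share one copy of r, and the array has to fit in the device's local memory (CL_DEVICE_LOCAL_MEM_SIZE as reported by clGetDeviceInfo).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define R_ELEMS (2800 + 1)   /* same element count as `int r[2800 + 1]` */

/* Bytes needed when every work-item holds its own private copy of r. */
size_t private_bytes_per_group(size_t work_items_per_group)
{
    assert(work_items_per_group > 0);
    return work_items_per_group * R_ELEMS * sizeof(int32_t);
}

/* Bytes needed when the whole work-group shares a single local copy of r. */
size_t local_bytes_per_group(void)
{
    return R_ELEMS * sizeof(int32_t);
}

/* Returns 1 if one local copy of r fits into local_mem_size bytes, else 0. */
int fits_in_local_memory(size_t local_mem_size)
{
    return local_bytes_per_group() <= local_mem_size;
}
```

With 64 work-items per work-group (an assumed size), the private copies add up to 717,056 bytes per work-group, against 11,204 bytes for a single local copy.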
Also restrict printf() to a single thread, because the runtime currently has a lot of work interleaving the printf output of all your threads.
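The printf guard can be sketched as follows. This is a plain C emulation on the host, not the actual kernel: work_item_body and run_ndrange are hypothetical names, and gid stands in for the value that get_global_id(0) would return inside a real OpenCL kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Emulates the body of one work-item; returns 1 if this work-item printed. */
int work_item_body(size_t gid, int result)
{
    if (gid == 0) {                      /* only work-item 0 prints */
        printf("result = %d\n", result);
        return 1;
    }
    return 0;
}

/* Emulates a 1-D NDRange by running the body once per work-item and
 * returns how many lines were printed in total. */
int run_ndrange(size_t global_size, int result)
{
    int printed = 0;
    for (size_t gid = 0; gid < global_size; ++gid)
        printed += work_item_body(gid, result);

    assert(printed <= 1);                /* at most one work-item may print */
    return printed;
}
```

In the kernel itself the same idea is a single `if (get_global_id(0) == 0)` around the printf call.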
I also see that this is test code and all the threads calculate the same data, but at least it does so 64x in parallel (or whatever the number of threads in your NDRange is).