1. I don't see the need to launch 1024 work-items to reduce 1024 elements, and then use a condition inside the kernel that immediately disables half the threads. Why not launch only 512 work-items?
2. Use get_global_id(0). The group-id arithmetic may be correct, but it is very confusing (with that 2 inside it).
Just rewriting the small section of the kernel.
Global Size = 512, Local Size = 64

int gid      = get_global_id(0);
int lid      = get_local_id(0);
int grp_id   = get_group_id(0);
int grp_size = get_local_size(0); // note: there is no get_group_size(); get_local_size(0) gives the work-group size

if (gid < 512) // redundant once the global size is exactly 512
// 3 versions for varying access patterns; each work-item reads two elements. Just check once before using, not tested
//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[gid]))) + log(exp(sqrt(a_g_idata[gid + get_global_size(0)])));
//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[2 * gid]))) + log(exp(sqrt(a_g_idata[2 * gid + 1])));
//ocl_test_sdata[lid] = log(exp(sqrt(a_g_idata[(2 * grp_id) * grp_size + lid])))
//                    + log(exp(sqrt(a_g_idata[(2 * grp_id + 1) * grp_size + lid])));
Thanks for the comments, I'll try that. I think I mixed up the local size parameter that we pass to the enqueueNDRangeKernel function and the total number of elements that should be computed. I thought they were the same, but from what you told me they're not.
Another question: were you able to test my code on an AMD GPU to see if the test passed even with printf disabled?
I'd be reassured if my code ran correctly on both NVIDIA and AMD GPUs.
Also, could you please tell me how to post code as a zipped attachment?
Have a nice day.
Your code returns SUCCESS with/without using printf. Here is the output with debug disabled:
Platform name found AMD Accelerated Parallel Processing
--> Choosen Device name: Capeverde
parallel sum 21829.1
Good to know, thanks a lot!
Then maybe the problem was with the AMD GPU I have. I'll try to test on another one later if possible.
I'm now porting a CUDA application to OpenCL and I've run into some problems. I don't know if you're familiar with CUDA, but I'm having difficulty "translating" tex3D and textures into OpenCL. I read about cl::Image, so I think I should use that to pass data to the kernel, but it's not very clear to me.
You are right, look into cl::Image. You can check out some APP SDK samples (although most of them were written without the OpenCL C++ wrapper); SimpleImage and MatrixMulImage, to name a few.
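For the tex3D part specifically, the rough OpenCL equivalent is an image3d_t kernel argument read through a sampler with read_imagef. A sketch of the general pattern (my own example, not code from any particular SDK sample; kernel and buffer names are made up):

```
// OpenCL C kernel sketch: reading a 3D image, roughly analogous to CUDA's tex3D()
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST; // or CLK_FILTER_LINEAR for interpolation

__kernel void sample_volume(__read_only image3d_t vol, __global float *out)
{
    int x = get_global_id(0), y = get_global_id(1), z = get_global_id(2);
    // read_imagef always returns a float4; a single-channel image puts the value in .x
    float4 texel = read_imagef(vol, smp, (int4)(x, y, z, 0));
    out[(z * get_global_size(1) + y) * get_global_size(0) + x] = texel.x;
}
```

The sampler flags take over the role of the CUDA texture reference's addressMode/filterMode settings.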
I have a small question about cl::Image3D. When you enqueueWriteImage it asks for an origin and a region.
If I want to transfer the whole image, then the region should be defined as (width, height, depth), shouldn't it?
region defines the (width, height, depth) in pixels of the 2D or 3D rectangle being read or written. If image is a 2D image object, the depth value given by region must be 1.
From the Khronos C++ wrapper document.
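So yes, for the whole image that looks roughly like this with the C++ wrapper (a sketch; queue, img, width/height/depth and data are assumed to exist already):

```
cl::size_t<3> origin, region;
origin[0] = 0;     origin[1] = 0;      origin[2] = 0;     // start at the first pixel
region[0] = width; region[1] = height; region[2] = depth; // cover the whole volume

// blocking write, with row_pitch/slice_pitch = 0 for a tightly packed host buffer
queue.enqueueWriteImage(img, CL_TRUE, origin, region, 0, 0, data);
```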
Ok then it should be fine, sorry for the bother.
I have (again) another question: in the CUDA code that I'm porting there is a cudaPitchedPtr. I read the specs, and when you create a 3D image you can pass the row_pitch, which should be the equivalent of host_ptr.pitch.
But what about xsize and ysize? They seem related to slice_pitch, but I'm not too sure. Also, I really don't know what to give as host_ptr when I construct the Image3D. I think I should allocate an array the size of the image, which means three dimensions, but it seems like in CUDA they allocate a 3D array. I hope you can help, I'm kind of lost.
h_ptr.pitch = volume_size.width * sizeof(float);
h_ptr.xsize = volume_size.width;
h_ptr.ysize = volume_size.height;

cl::Image3D(context, CL_MEM_READ_ONLY, fmt, width, height, depth,
            row_pitch,   // = width * sizeof(float)
            slice_pitch, // = row_pitch * height
            host_ptr);
I hope I'm not too far off, but some help would be very welcome.
Then I "only" have to allocate memory for the host pointer?
So if I have a 3D image, do I have to allocate memory for a 3D array? Sorry if my question is dumb, but I haven't really understood it yet.