I tried to use 1D images as well but with image_channel_order set to CL_A. Every sample except (int2)(0,0) returned undefined values. I have no issues using 2D and 3D images with CL_RGBA, they work just fine. In my case I just used a buffer instead of a single channel 1D image.
Have you tried to use float4 instead of float for your array if the image_channel_order is CL_R or CL_A? I haven't tried that myself because I thought it doesn't make much sense.
Raistmer,
Why don't you just fill all 4 channels of each image element with valid data, so that every value returned by an image read is useful?
With CL_R option you will always get the other 3 values as garbage.
You can also try to use constant memory (if present on your GPU). Otherwise I think you will have to depend on the implicit caching done by the implementation.
Which device do you use?
Btw, I think the maximum 1D image size is 8192 which may be too little for you.
CUDA supports 2^27 1D elements. Jumbo 1D textures seem to be a problem with ATI cards.
Originally posted by: Raistmer There is no fast local memory available for HD4xxx GPUs, and this array will be accessed randomly because it contains cached values for some function => poor performance is expected if only global memory is used.
If you aren't restricted to OpenCL maybe you could use CAL++ ( http://sourceforge.net/projects/calpp/ ) library. It allows writing kernels directly in C++ and supports LDS ( local memory ) on 4xxx cards.
hazeman,
You don't have a separate LDS memory in 4xxx devices. It is emulated from the global memory space. Anyhow, using CAL you can do better optimizations.
bubu,
I am not sure about this, but can't we use a 2D image instead of a 1D image? Then we can have 2^26 elements in an image.
EDIT: 4xxx do have LDS as discussed later
Yes. Just keep the width of the texture at 2^n and you can calculate the 2D coordinates from the 1D index with bitwise operations, which are pretty fast.
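The index arithmetic described above can be sketched in plain C. This is only an illustration of the mapping, not code from anyone's actual project: the `coord2d` type and the `index_to_coord` helper are hypothetical names, and the image width is assumed to be exactly 2^log2_width texels. In a kernel, the same two operations would produce the int2 coordinate passed to read_imagef.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a 2D texel coordinate, analogous to int2 in a kernel. */
typedef struct {
    uint32_t x; /* column inside the row */
    uint32_t y; /* row */
} coord2d;

/*
 * Map a linear (1D) index to 2D image coordinates.
 * Assumption: the image width is exactly 2^log2_width texels,
 * so the modulo and the division reduce to a mask and a shift.
 */
static coord2d index_to_coord(uint32_t i, uint32_t log2_width)
{
    coord2d c;
    assert(log2_width < 32);             /* shifting by 32 or more is undefined */
    c.x = i & ((1u << log2_width) - 1u); /* i % width */
    c.y = i >> log2_width;               /* i / width */
    return c;
}
```

For example, with a width of 8192 (log2_width = 13), index 24581 maps to column 5 of row 3, since 24581 = 3 * 8192 + 5.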
Originally posted by: himanshu.gautam hazeman,
You don't have a separate LDS memory in 4xxx devices. It is emulated from the global memory space. Anyhow, using CAL you can do better optimizations.
Please check 4xxx cards documentation* before posting false data. 4xxx cards do have LDS and it's accessible in IL. ATI simply didn't make it available in OpenCL.
* ATI Stream Computing: ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview - slide 10
* 4xxx ISA docs
My apologies, the 4xxx series does have a scratchpad LDS. This LDS, however, is not exposed to OpenCL because its restrictions on writing to LDS are not OpenCL compliant.
Raistmer,
You can definitely try to speed up your code using CAL. Just keep in mind the restricting nature of LDS.
Hello;
You can store the elements in a RGBA image and perform simple bitwise operations to retrieve them.
I could go on about the topic, but I've written an article specifically about this, and the source code is available. Here is the link:
http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=115&Itemid=172
Look for section 3.2 Interpreting Image2Ds as regular vectors
Hope that helps
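As a rough illustration of the RGBA packing idea (this is not the code from the linked article), the sketch below shows the index arithmetic in plain C. It assumes four consecutive array elements are stored in the R, G, B and A channels of one texel, and that the image width is 2^log2_width texels. The `texel_ref` type and the `element_to_texel` helper are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which texel to read and which channel to pick. */
typedef struct {
    uint32_t x;       /* texel column */
    uint32_t y;       /* texel row */
    uint32_t channel; /* 0 = R, 1 = G, 2 = B, 3 = A */
} texel_ref;

/*
 * Locate array element i inside an RGBA image used as a flat vector.
 * Assumptions: 4 consecutive elements per texel, and an image width
 * of exactly 2^log2_width texels.
 */
static texel_ref element_to_texel(uint32_t i, uint32_t log2_width)
{
    texel_ref r;
    uint32_t texel;

    assert(log2_width < 32);                 /* shifting by 32 or more is undefined */
    r.channel = i & 3u;                      /* i % 4: channel inside the texel */
    texel = i >> 2;                          /* i / 4: linear texel index */
    r.x = texel & ((1u << log2_width) - 1u); /* texel % width */
    r.y = texel >> log2_width;               /* texel / width */
    return r;
}
```

In a kernel, the float4 returned by the image read at (x, y) would then be indexed by `channel` to pick out the wanted value. Packing four values per texel makes every image read fully useful and multiplies the addressable element count by four.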
Since you're the second person I've seen ask how to do this, I'll add some API and kernel support for it in clUtil (http://code.google.com/p/clutil/). OpenCL does not actually have 1D image support. You have the right idea of emulating it using a 2D image (which gives you a max size of about 67 million elements, i.e. 8192 x 8192, instead of 8k). If I recall correctly, you always use float4 when reading and writing to images; only the appropriate channels will be assigned when you sample.
clUtil now supports 1D images.