Archives Discussions

Raistmer · ‎11-24-2010

need to cache function values

I want to use an image as a cache for function values over some range.

I create the image like this:

float* cache = (float*) malloc(8192 * sizeof(cl_float));

for (int i = 0; i < 8192; ++i) {
    double chisqr = 1.0 + (double) i / 8192.0 * 10.0;
    cache[ i ] = (float)lcgf(0.5*gauss_dof,std::max(chisqr*0.5*gauss_bins,0.5*gauss_dof+1));
}
cl_image_format image_format;
image_format.image_channel_data_type = CL_FLOAT;
image_format.image_channel_order = CL_R;
gpu_gauss_dof_lcgf_cache=clCreateImage2D(context,
    CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,&image_format,8192,1,0,&cache,&err);

Then I use it in the kernel like this:

__constant sampler_t read_sampler = CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;
float calc_GaussFit_score_cached(float chisqr, float null_chisqr,float score_offset,image2d_t gauss_cache,image2d_t null_cache) { // <- gauss_pot_length constant across whole package
    float chisqr_cache = (chisqr - 1.0f) / 10.f; //R: normalized coords clamped to the edge
    float null_chisqr_cache = (null_chisqr - 1.0f) / 10.f;
    return score_offset+
        (read_imagef(gauss_cache, read_sampler, (int2)(chisqr_cache, 1))).x+
        (read_imagef(null_cache, read_sampler, (int2)(chisqr_cache, 1))).x;
}

The app crashes when it reaches the corresponding kernel.

What is wrong? Should I use float4 in the CPU array? I see that read_imagef always returns a float4, but I need only one float value per image element...

sarobi · ‎11-28-2010

I tried to use 1D images as well but with image_channel_order set to CL_A. Every sample except (int2)(0,0) returned undefined values. I have no issues using 2D and 3D images with CL_RGBA, they work just fine. In my case I just used a buffer instead of a single channel 1D image.

Raistmer · ‎11-28-2010

Thanks for the answer.
There is no fast local memory available on HD4xxx GPUs, and this array will be accessed randomly because it contains cached values of a function, so I expect poor performance if plain global memory is used.
To speed up access, I am trying to make use of the texture cache by going through an image.
Another option would be to use the constant memory cache, but then I would lose the "free" linear interpolation between points, which means worse precision.

Any comments from AMD staff? What is wrong with 1D image usage? Are there any examples of how to use images as a cache for function values? I have seen such usage mentioned in the manual, by the way, but without concrete samples.

sarobi · ‎11-28-2010

Have you tried to use float4 instead of float for your array if the image_channel_order is CL_R or CL_A? I haven't tried that myself because I thought it doesn't make much sense.

himanshu_gautam · ‎12-23-2010

Raistmer,

Why don't you just fill all 4 channels of each image element with valid data, so that you can use every component returned by an image read?

With the CL_R option you will always get the other 3 values as garbage.

You can also try to use constant memory (if present on your GPU). Otherwise I think you will have to depend on the implicit caching done by the implementation.

Which device do you use?

bubu · ‎12-23-2010

Btw, I think the maximum 1D image size is 8192, which may be too small for you.

CUDA supports 2^27 1D elements. Jumbo 1D textures seem to be a problem with ATI cards.

hazeman · ‎12-23-2010

Originally posted by: Raistmer There is no fast local memory available on HD4xxx GPUs, and this array will be accessed randomly because it contains cached values of a function, so I expect poor performance if plain global memory is used.

If you aren't restricted to OpenCL, maybe you could use the CAL++ library ( http://sourceforge.net/projects/calpp/ ). It allows writing kernels directly in C++ and supports LDS (local memory) on 4xxx cards.

himanshu_gautam · ‎12-23-2010

hazeman,

You don't have a separate LDS memory in 4xxx devices. It is emulated from the global memory space. Anyhow, using CAL you can do better optimizations.

bubu,

I am not sure about this, but can't we use a 2D image instead of a 1D image? Then we can have 2^26 elements in an image.

EDIT: 4xxx do have LDS as discussed later

nou · ‎12-23-2010

Yes. Just keep the width of the texture at 2^n, and you can calculate the 2D coordinates from the 1D index with bitwise operations, which are pretty fast.
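
The bitwise mapping described above can be sketched as follows in plain C. This is only an illustration under stated assumptions: the width is fixed at 8192 (2^13) to match the limit mentioned earlier in the thread, and the names WIDTH_LOG2, coord2, index_to_coord and coord_to_index are hypothetical. In a kernel, the same mask and shift would produce the int2 coordinate for a point-sampled read.

```c
#include <stdint.h>

/* Assumed image width: 8192 texels = 2^13. Any power of two works. */
enum { WIDTH_LOG2 = 13, WIDTH = 1 << WIDTH_LOG2 };

typedef struct { uint32_t x, y; } coord2;   /* stands in for an int2 coordinate */

/* Split a flat 1D index into (column, row) without division or modulo. */
static coord2 index_to_coord(uint32_t i) {
    coord2 c;
    c.x = i & (WIDTH - 1u);     /* low 13 bits: column within the row */
    c.y = i >> WIDTH_LOG2;      /* remaining high bits: row number */
    return c;
}

/* Inverse mapping, useful on the host when filling the image row by row. */
static uint32_t coord_to_index(coord2 c) {
    return (c.y << WIDTH_LOG2) | c.x;
}
```

One caveat with this layout: consecutive entries at the end of one row and the start of the next are not neighbours in the image, so hardware linear filtering will not interpolate between them. The sketch therefore assumes nearest (point) sampling.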

hazeman · ‎12-23-2010

Originally posted by: himanshu.gautam hazeman,

You don't have a separate LDS memory in 4xxx devices. It is emulated from the global memory space. Anyhow, using CAL you can do better optimizations.

Please check 4xxx cards documentation* before posting false data. 4xxx cards do have LDS and it's accessible in IL. ATI simply didn't make it available in OpenCL.

* ATI Stream Computing: ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview - slide 10

* 4xxx ISA docs

himanshu_gautam · ‎12-26-2010

My apologies, the 4xxx series does have a scratchpad LDS. However, this LDS is not exposed to OpenCL, because the restrictions on writing to it are not OpenCL compliant.

Raistmer,

You can definitely try to speed up your code using CAL. Just keep in mind the restrictive nature of the LDS.

douglas125 · ‎12-27-2010

Hello,

You can store the elements in an RGBA image and perform simple bitwise operations to retrieve them.

I could go on about the topic, but I've written an article specifically about this, and the source code is available. This is the link:

http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=115&Itemid=172

Look for section 3.2 Interpreting Image2Ds as regular vectors

Hope that helps
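
A minimal sketch of the RGBA packing idea in plain C, offered as an illustration rather than as the linked article's method: four consecutive floats share one texel, so flat index i lives in texel i >> 2, channel i & 3. The type texel4 stands in for an OpenCL float4, and all helper names here are hypothetical.

```c
#include <stddef.h>

typedef struct { float s[4]; } texel4;   /* stands in for one RGBA float texel */

/* Which texel holds flat element i, and which of its four channels. */
static size_t   texel_of(size_t i)   { return i >> 2; }
static unsigned channel_of(size_t i) { return (unsigned)(i & 3u); }

/* Number of texels needed for n floats, rounded up. */
static size_t texel_count(size_t n) { return (n + 3u) >> 2; }

/* Host side: pack n floats into RGBA texels, zero-padding the last one. */
static void pack_rgba(const float *src, size_t n, texel4 *dst) {
    size_t texels = texel_count(n);
    for (size_t t = 0; t < texels; ++t) {
        for (unsigned c = 0; c < 4; ++c) {
            size_t i = (t << 2) | c;
            dst[t].s[c] = (i < n) ? src[i] : 0.0f;
        }
    }
}

/* Read side: fetch element i the way a kernel would pick one component
 * out of the float4 returned by an image read. */
static float fetch(const texel4 *img, size_t i) {
    return img[texel_of(i)].s[channel_of(i)];
}
```

Packing this way makes every component of each image read carry useful data, at the cost of giving up hardware interpolation between consecutive entries, since neighbouring values now sit in different channels of the same texel.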

rick_weber · ‎01-06-2011

Since you're the second person I've seen ask how to do this, I'll add some API and kernel support for it in clUtil (http://code.google.com/p/clutil/). OpenCL does not actually have 1D image support. You have the right idea of emulating it using a 2D image (which gives you a maximum size of about 67 million elements, 8192 x 8192, instead of 8k). If I recall correctly, you always use float4 when reading and writing images; only the appropriate channels will be assigned when you sample.

rick_weber · ‎01-14-2011

clUtil now supports 1D images.

Raistmer · ‎01-15-2011

Thanks, all, for the answers.
I'll try to implement your suggestions in the next app versions.

OpenCL was chosen to support the largest possible number of devices, but it looks like performance requirements will push toward code diversity anyway. Many kernels run better in different implementations for NV and ATI, different kernels are required for the HD4xxx and HD5xxx generations, and probably for HD6xxx too... cuFFT is still faster on NV cards than OpenCL implementations, so a CUDA port looks unavoidable... So using CAL++ looks like a possible solution too, especially if it gives access to more HD4xxx hardware features.
