Basic questions and graphic card problems

Discussion created by tomknk on Oct 3, 2010
Latest reply on Oct 4, 2010 by tomknk

Hi all,

I am new to OpenCL and have tried to implement some simple min/max algorithms. Because my Ubuntu installation had problems with the latest drivers, I worked with the CPU part of the Stream SDK 2.2 for two weeks, and my OpenCL code now runs fine on the CPU. I have since found a workaround for the problems with the latest Catalyst driver and got the OpenCL capabilities of my graphics card back. I use an HD 4670, which is only OpenCL 1.0 compliant and has no atomic functions, so I dropped all kernels that relied on atomics. The remaining kernels compile correctly for the GPU, but I have some additional problems.

1) I'm not sure whether I misunderstand how OpenCL processes a workgroup or whether I just implemented things wrong. I thought that every workgroup is cut into pieces of n work-items, one per compute unit (n = number of compute units in the GPU), which are then processed at once, either until the end of the kernel is reached or until a barrier is hit. Then the next chunk follows, and so on, until the whole workgroup (local size) is processed. Then the next workgroup is processed the same way, until all workgroups (global size) have been executed.

Is this correct, or are my expectations wrong?

Since I thought this was right, I implemented a kernel that uses a local buffer of 8 elements (my graphics card has 8 compute units). I assumed that within each chunk of 8 work-items every buffer element is used exactly once, so the buffer is accessed without conflicts in each pass until the whole workgroup is processed. So should the following code work?

sample code:

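#define PROC_NUM 8

// buffer: local scratch with PROC_NUM slots; vector: input values
__kernel void min ( __local int* buffer, __global const int* vector ) {
    size_t lid = get_local_id(0);

    // keep the smaller value in the slot shared by all work-items
    // that have the same lid % PROC_NUM
    if( vector[lid] < buffer[lid%PROC_NUM] )
        buffer[lid%PROC_NUM] = vector[lid];

    barrier(CLK_LOCAL_MEM_FENCE);
}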

buffer[] was initialised with the first element of vector.

As I read in this forum, the execution within a workgroup is implementation dependent, so should I extend the buffer to hold a full warp, which is 64 (AMD) or 32 (NVIDIA) work-items according to some paper I read, or even the full workgroup size? Or is it enough to use the number of compute units as the buffer size? Is the order in which the local IDs are processed specified in the OpenCL spec, or is it wrong to expect that each buffer item is used only once per pass within the workgroup?
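For comparison, here is a minimal sketch of a reduction layout that does not depend on the order in which local IDs are processed. It is plain C that emulates a single workgroup on the host, not OpenCL and not the kernel above. The names group_min, scratch and GROUP_SIZE are assumptions made up for this illustration. The scratch array stands in for a __local buffer with one slot per work-item, and the end of each inner loop marks the place where a kernel would need a barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <limits.h>
#include <stddef.h>

#define GROUP_SIZE 64 /* assumed workgroup size; must be a power of two here */

/* Emulates one workgroup computing the minimum of up to GROUP_SIZE values.
 * Elements beyond GROUP_SIZE are ignored in this sketch. */
int group_min(const int *vector, size_t count)
{
    int scratch[GROUP_SIZE]; /* stands in for a __local buffer */

    /* Step 1: every "work-item" lid writes only its own slot.
     * Unused slots are padded with INT_MAX so they never win. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        scratch[lid] = (lid < count) ? vector[lid] : INT_MAX;
    /* a kernel would place a barrier here */

    /* Step 2: halve the number of active work-items in each pass.
     * Within one pass, reads come from [stride, 2*stride) and writes go
     * to [0, stride), so the order of the lids inside a pass is irrelevant. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            if (scratch[lid + stride] < scratch[lid])
                scratch[lid] = scratch[lid + stride];
        }
        /* a kernel would place a barrier here */
    }

    return scratch[0];
}
```

The point of this layout is that no two work-items ever write the same slot between two barriers, which is the property the 8-element buffer relies on the hardware to provide.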

2) In my first attempt I implemented the above kernel with a global buffer, which doesn't work at all: the buffer elements are never updated, even if I use barrier(CLK_GLOBAL_MEM_FENCE); at the end of the code. Is there any way to use a global buffer so that changed values become visible, or does that only happen between workgroups, or never?

3) CLInfo states that my graphics card has a max workgroup size of 128, but whenever I use a local buffer, even one the size of a single unsigned int, the kernel fails without an error when the workgroup size is larger than 32. Only the short execution time and the global debug buffer I use to read back some debug variables showed that the kernel is not executed at all (in the simplest test I used debug[0] = get_global_id(0); and it always stays at -1, its initialisation value). That's why I can't test the code sample in 1) with a workgroup size of 64, which could be necessary to execute the min kernel correctly.
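One possible cause, not confirmed for this card, is that the limit for a particular kernel is lower than the device-wide maximum. clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE reports the per-kernel limit, and when a limit is exceeded clEnqueueNDRangeKernel returns an error code instead of running the kernel, which goes unnoticed if the return value is not checked. The sketch below is plain C without any OpenCL calls. The helpers pick_local_size and round_up_global are hypothetical names for this illustration, and the two limits are assumed to have been queried beforehand.

```c
#include <stddef.h>

/* Hypothetical helper: clamp the requested local size to the smaller of
 * the device-wide limit and the per-kernel limit. Returns 0 if either
 * limit is 0, meaning the kernel cannot be launched. */
size_t pick_local_size(size_t requested, size_t device_max, size_t kernel_max)
{
    size_t limit = device_max < kernel_max ? device_max : kernel_max;
    return requested < limit ? requested : limit;
}

/* Hypothetical helper: round the global size up to a multiple of the
 * local size, since OpenCL 1.0 requires the global size to be evenly
 * divisible by the local size. The kernel then has to skip the padding
 * work-items itself. Returns 0 if local is 0. */
size_t round_up_global(size_t global, size_t local)
{
    if (local == 0)
        return 0;
    return ((global + local - 1) / local) * local;
}
```

With a device limit of 128 and an assumed per-kernel limit of 32, pick_local_size(64, 128, 32) yields 32, which would match the behaviour described above if the per-kernel limit really is 32.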

So I hope someone with more experience and knowledge can answer some of these questions or point me in the right direction.

Thanks

Tom
