Archives Discussions

Fr4nz · ‎01-14-2010

Hello all. I'm new with OpenCL so it's probable that I'm doing something wrong...anyway, I'm encountering a strange problem when using __local variables pointers, because apparently I'm not able to write them. Consider this simple kernel :

__kernel void

intersect(__global uint *outputCounts, __local uint4 *tmpBuffer) {


    const unsigned int gid = get_global_id(0);

   tmpBuffer[0].x = 1;
   tmpBuffer[0].y = 1;
   tmpBuffer[0].z = 1;
   tmpBuffer[0].w = 1;

    outputCounts[gid] = tmpBuffer[0].x;
}

As expected every value returned by a thread returns 1 in the output array at the right position. But if a replace the last line with:

outputCounts[gid] = tmpBuffer[0].y; (but it could be "z" or "w")

then I get wrong results (1 billion, 2 billions, etc.: nonsense results!). It seems that the components y,z and w aren't initialized (or problems with memory alignments??) ...

Another interesting thing is that if I use the command:

vstore4((uint4) (1),0,tmpBuffer);

then the first uint4 contained in tmpBuffer is initialized correctly. So, it seems that only with vstore4 I can modify __local variables. Why the initialization through vector components used above is wrong? Where's my error? Thank you for any answer.

I'm working under Ubuntu 9.04 32bit and my videocard is a 5770. Below here is posted the host code that executes kernels.

EDIT: After some tests I've found that if you want to modify a __local vector correctly (at least this is the case of "uint4", dunno if this involves other types)you must use the function vstore4 (maybe it has to do with portability, as explained in appendix B of the specs?). Reads are "ok", in the sense that you can access local variables with indices and .xyzws0s1..." without any strange behaviour.

Please provide some explanation...

void GPUInterface::executeGPUIntersect(unsigned int *VD, unsigned int *prefixCkGPU, unsigned int *support, int numItems, int dimRow, int prefixCkSize, int k) { // Codice di errore tornato dalle varie funzioni; cl_int err; // Creazione della coda dei comandi associata al dispositivo GPU-OpenCL; cl::vector<cl::Device> listaDevices = this->contesto->getInfo<CL_CONTEXT_DEVICES>(); cl::CommandQueue queue((*(this->contesto)), listaDevices[0]); // Creazione dell'output buffer relativo ai conteggi dei vari candidati; cl::Buffer outputSupport = cl::Buffer(*(this->contesto), CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(unsigned int)*prefixCkSize, support, &err); // Set kernel args... this->kernel->setArg(0, outputSupport); this->kernel->setArg(1, (size_t) prefixCkSize*4*sizeof(unsigned int), NULL); // Per le variabili locali si DEVE saltare la creazione del buffer, indicandone solo la dimensione da allocare sulla GPU! cl::Event e; int wait = 1; err = queue.enqueueNDRangeKernel(*(this->kernel), cl::NullRange, cl::NDRange(prefixCkSize), cl::NullRange, 0, &e); if(wait == 1) e.wait(); // Recupero dal dispositivo i risultati ottenuti... err = queue.enqueueReadBuffer(outputSupport,CL_TRUE,0,sizeof(unsigned int)*prefixCkSize,support,NULL,NULL); // Fine esecuzione dei kernel; le risorse occupate dalla coda dei comandi vengono liberate; err = queue.finish(); printf("\n%d - %u", prefixCkSize, support[0]); }

genaganna · ‎01-23-2010

Fr4nz,

Could you please try following two ways and let us know what you are getting?

// Method one __kernel void intersect(__global uint *outputCounts, __local uint4 *tmpBuffer) { const unsigned int gid = get_global_id(0); tmpBuffer[0].x = 1; tmpBuffer[0].y = 1; tmpBuffer[0].z = 1; tmpBuffer[0].w = 1; barrier(CLK_LOCAL_MEM_FENCE); outputCounts[gid] = tmpBuffer[0].y; } // Method two __kernel void intersect(__global uint *outputCounts, __local uint4 *tmpBuffer) { const unsigned int gid = get_global_id(0); // Assign only with 0th thread if(get_local_id(0) == 0) { tmpBuffer[0].x = 1; tmpBuffer[0].y = 1; tmpBuffer[0].z = 1; tmpBuffer[0].w = 1; } barrier(CLK_LOCAL_MEM_FENCE); outputCounts[gid] = tmpBuffer[0].y; }

Fr4nz · ‎01-23-2010

Originally posted by: genaganna Fr4nz,

Could you please try following two ways and let us know what you are getting?

Hi genna, I solved my problem many days ago: the problem, as you saw, was that I didn't use a barrier after writing to local memory. These videos teached me very well about how to manage LDS:

http://www.macresearch.org/opencl_episode4

http://www.macresearch.org/opencl_episode5

http://www.macresearch.org/opencl_episode6

As a last thing, may I ask you a question?

Knowing that a bank entry is 32-bit wide, when many threads writes in LDS short/ushort values, which are 16 bit-wide, using an access pattern like "vector[lid + i*localsize]", do we have bank-conflicts for every pair of thread (for eg. threads 0,1 write to bank 0, 1,2 to bank 1, and so on) or do we have a write "mask" that maps ushort values in consecutive banks for consecutive threads, avoiding bank conflicts?

Thank you for your kindness.

Archives Discussions

Strange behaviour with __local variables...