Hello all. I'm new with OpenCL so it's probable that I'm doing something wrong...anyway, I'm encountering a strange problem when using __local variables pointers, because apparently I'm not able to write them. Consider this simple kernel :
__kernel void
intersect(__global uint *outputCounts, __local uint4 *tmpBuffer) {
const unsigned int gid = get_global_id(0); tmpBuffer[0].x = 1; tmpBuffer[0].y = 1; tmpBuffer[0].z = 1; tmpBuffer[0].w = 1;
outputCounts[gid] = tmpBuffer[0].x; } |
As expected every value returned by a thread returns 1 in the output array at the right position. But if a replace the last line with:
outputCounts[gid] = tmpBuffer[0].y; (but it could be "z" or "w") |
then I get wrong results (1 billion, 2 billions, etc.: nonsense results!). It seems that the components y,z and w aren't initialized (or problems with memory alignments??) ...
Another interesting thing is that if I use the command:
vstore4((uint4) (1),0,tmpBuffer);
then the first uint4 contained in tmpBuffer is initialized correctly. So, it seems that only with vstore4 I can modify __local variables. Why the initialization through vector components used above is wrong? Where's my error? Thank you for any answer.
I'm working under Ubuntu 9.04 32bit and my videocard is a 5770. Below here is posted the host code that executes kernels.
EDIT: After some tests I've found that if you want to modify a __local vector correctly (at least this is the case of "uint4", dunno if this involves other types)you must use the function vstore4 (maybe it has to do with portability, as explained in appendix B of the specs?). Reads are "ok", in the sense that you can access local variables with indices and .xyzws0s1..." without any strange behaviour.
Please provide some explanation...
void GPUInterface::executeGPUIntersect(unsigned int *VD, unsigned int *prefixCkGPU, unsigned int *support, int numItems, int dimRow, int prefixCkSize, int k) { // Codice di errore tornato dalle varie funzioni; cl_int err; // Creazione della coda dei comandi associata al dispositivo GPU-OpenCL; cl::vector<cl::Device> listaDevices = this->contesto->getInfo<CL_CONTEXT_DEVICES>(); cl::CommandQueue queue((*(this->contesto)), listaDevices[0]); // Creazione dell'output buffer relativo ai conteggi dei vari candidati; cl::Buffer outputSupport = cl::Buffer(*(this->contesto), CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(unsigned int)*prefixCkSize, support, &err); // Set kernel args... this->kernel->setArg(0, outputSupport); this->kernel->setArg(1, (size_t) prefixCkSize*4*sizeof(unsigned int), NULL); // Per le variabili locali si DEVE saltare la creazione del buffer, indicandone solo la dimensione da allocare sulla GPU! cl::Event e; int wait = 1; err = queue.enqueueNDRangeKernel(*(this->kernel), cl::NullRange, cl::NDRange(prefixCkSize), cl::NullRange, 0, &e); if(wait == 1) e.wait(); // Recupero dal dispositivo i risultati ottenuti... err = queue.enqueueReadBuffer(outputSupport,CL_TRUE,0,sizeof(unsigned int)*prefixCkSize,support,NULL,NULL); // Fine esecuzione dei kernel; le risorse occupate dalla coda dei comandi vengono liberate; err = queue.finish(); printf("\n%d - %u", prefixCkSize, support[0]); }