4 Replies Latest reply on Nov 2, 2009 3:33 PM by MicahVillmow

    memory corruption addressing global past 1442560?

    emuller

      Running the attached kernel with

      local_size = 64

      global_size = 64*20 to use all 20 wavefronts on 4870

      Inside the kernel, I write to rows of an output buffer 64*20 wide in a loop.

      The loop repeats "iter" times.

      Up to iter==1127, everything looks fine in the output. For iter>=1128, it breaks: the results are complete garbage, perhaps due to memory corruption.  On my GTX 260 using the NVIDIA stack, this number can be a factor of 80 higher with no problems, and I'm nowhere near running out of memory.  Is this a user error, a bug, or a hardware limitation?


      __kernel void GenerateOut(uint iter, __global uint4 *seed,
                                __local uint4 *shmem, __global uint4 *rngs)
      {
          uint4 rngRegs[REG_COUNT];
          LoadState(seed, rngRegs, shmem);
          for (uint i = 0; i < iter; i++) {
              rngs[get_global_id(0) + i * get_global_size(0)] = Generate(rngRegs, shmem);
          }
          SaveState(shmem, seed);
      }

        • memory corruption addressing global past 1442560?
          jcpalmer

          64 * 20 = 1280.  Have you queried CL_KERNEL_WORK_GROUP_SIZE to make sure your work-group size doesn't exceed it?  I realize that clEnqueueNDRangeKernel should have returned CL_INVALID_WORK_GROUP_SIZE if that were the case, but this is beta software.  Not a great lead, but something to check off as the problem.
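
          For anyone wanting to run that check, a minimal host-side sketch of the query might look like the following (assuming kernel and device are the cl_kernel and cl_device_id already created elsewhere in the host code, which isn't shown in this thread):

              size_t max_wg = 0;
              cl_int err = clGetKernelWorkGroupInfo(kernel, device,
                                                    CL_KERNEL_WORK_GROUP_SIZE,
                                                    sizeof(max_wg), &max_wg, NULL);
              /* the local size passed to clEnqueueNDRangeKernel
                 (64 in this thread) must not exceed max_wg */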

          What's weird, though, is 1442560.  Isn't that the size of a 3.5" floppy?  Could be significant.

          • memory corruption addressing global past 1442560?
            MicahVillmow
            jcpalmer,
            CL_KERNEL_WORK_GROUP_SIZE refers to the local work-group size, not the global size.

            emuller, does this occur with a simplified kernel?  That is, can you strip the kernel down as far as possible, with no unneeded calculations, to the point where a single-line change causes the error to appear?

            Thanks,
              • memory corruption addressing global past 1442560?
                emuller

                It appears the problem was a user error.  When I change my constant block initialization from this:

                /*

                __constant uint2 Q[32]=
                {29,  5, 24, 14,  5, 28, 23, 24, 14, 19, 26, 13, 11,  0, 31, 17,
                  9, 11,  3, 20,  1,  7, 28, 10,  0,  6,  2, 15, 22,  2, 20,  9,
                 18,  8, 15, 23, 27,  4, 13, 30, 10, 12, 16, 25,  8,  3, 17, 21,
                 25, 26, 12, 27, 19, 31, 30, 18,  7, 22,  6, 16,  4, 29, 21,  1};

                */

                to the following:

                __constant uint Q[2][32]={
                  {29,24,5,23,14,26,11,31,9,3,1,28,0,2,22,20,18,15,27,13,10,16,8,17,25,12,19,30,7,6,4,21},
                  {5,14,28,24,19,13,0,17,11,20,7,10,6,15,2,9,8,23,4,30,12,25,3,21,26,27,31,18,22,16,29,1}
                };

                Everything is fine (when the associated code is set up to use uint).  Sorry for the false alarm.  NVIDIA's stack simply wouldn't compile the first form, which was the first hint.

                If the first approach is incorrect, how should one correctly initialize an array of uint2s or uint4s?  I could not find anything about this in the OpenCL spec.


              • memory corruption addressing global past 1442560?
                MicahVillmow
                The correct approach should be as follows:
                __constant uint2 Q[32]=
                {(uint2)(29, 5), (uint2)(24, 14),..., (uint2)(21, 1)};

                This is how vector constructors work in OpenCL:
                (typeN)(val0,...,valN-1);
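
                Applying that constructor syntax to the full table posted earlier (pairing the two rows of the uint Q[2][32] version element-wise, which matches the original flat list), the initializer would presumably read:

                    __constant uint2 Q[32] = {
                      (uint2)(29,  5), (uint2)(24, 14), (uint2)( 5, 28), (uint2)(23, 24),
                      (uint2)(14, 19), (uint2)(26, 13), (uint2)(11,  0), (uint2)(31, 17),
                      (uint2)( 9, 11), (uint2)( 3, 20), (uint2)( 1,  7), (uint2)(28, 10),
                      (uint2)( 0,  6), (uint2)( 2, 15), (uint2)(22,  2), (uint2)(20,  9),
                      (uint2)(18,  8), (uint2)(15, 23), (uint2)(27,  4), (uint2)(13, 30),
                      (uint2)(10, 12), (uint2)(16, 25), (uint2)( 8,  3), (uint2)(17, 21),
                      (uint2)(25, 26), (uint2)(12, 27), (uint2)(19, 31), (uint2)(30, 18),
                      (uint2)( 7, 22), (uint2)( 6, 16), (uint2)( 4, 29), (uint2)(21,  1)
                    };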