Archives Discussions

timchist · ‎06-24-2010

Simple memset kernel on 4850 does not work for unsigned char type

I have several kernels, all of which fill a memory region with 0. Even though memset0_128 is the most effective one, memory block size can be different, so sometimes it may be required to use memset0_8.

However, when I check the memory after calling memset0_8, I find out, that some bytes are not equal to zero. For example, when memset0_8 is called on a 36864-byte (192x192) region with (256, 1, 1) local size, output byte #7445 always has non-zero value (the actual value is different from run to run however). Moreover, if I run the kernel on larger memory blocks, byte #7445 is also the first non-zero output byte.

memset0_16 is more stable, however, it also fails sometimes. For example, on a 307200 (512x600)-byte array it returns incorrect value in word #1522.

It seems that 32-bit write transactions are used internally when smaller blocks are written, which makes the operation not thread-safe.

I use iMac 27' with ATI 4850 Mobility card.

__kernel void memset0_128(__global uint4* dst) { dst[get_global_id(0)] = to_uint4(0); } __kernel void memset0_32(__global unsigned int* dst) { dst[get_global_id(0)] = 0; } __kernel void memset0_16(__global unsigned short* dst) { dst[get_global_id(0)] = 0; } __kernel void memset0_8(__global unsigned char* dst) { dst[get_global_id(0)] = 0; }

omkaranathan · ‎06-24-2010

Timchist,

Do you have test cases for above issues. Compilable test cases makes it easy for us to reproduce the problem and track down the issue.

timchist · ‎06-24-2010

Nevermind, just found out that cl_khr_byte_addressable_store is not supported on RV770 cards, so memset0_16 and memset0_8 are not supposed to work on 4850.

Thanks for your reply anyway.

Illusio · ‎06-24-2010

I had a somewhat related issue a while back. (Not identical though)

The short answer was that "any writes to a char pointer are currently illegal" and that "new versions of the compiler will complain about this unless the byte_addressable_store extension is enabled.(On a side note, your quoted code actually doesn't compile with the latest SDK because it's missing the byte_addressable_store pragma. Did you just forget to copy that, or are you using an old SDK that doesn't report this error?)

Also, are you really handing OpenCL the dimensions you're talking about in the opening post? According to the OpenCL spec, clEnqueueNDRangeKernel requires global work sizes to be evenly divisible with local work sizes. Which means that a global worksize of 192x192 is illegal to combine with a local work size of (256,1,1). Although this might not be the source of this problem, I think it's a safe bet that it won't hurt to pass NULL instead of an invalid local work size.

Edit: And I suppose, that will teach me to not leave a browser open for ages before answering =P

timchist · ‎06-24-2010

(On a side note, your quoted code actually doesn't compile with the latest SDK because it's missing the byte_addressable_store pragma. Did you just forget to copy that, or are you using an old SDK that doesn't report this error?)

Just forgot to copy this line. I use the SDK 2.1, it does report the error if the pragma line is missing. But what surprises me, is that neither clCreateProgramWithSource, nor clBuildProgram have returned an error when building on 4850, where cl_khr_byte_addressable_store is not supported.

Also, are you really handing OpenCL the dimensions you're talking about in the opening post? According to the OpenCL spec, clEnqueueNDRangeKernel requires global work sizes to be evenly divisible with local work sizes. Which means that a global worksize of 192x192 is illegal to combine with a local work size of (256,1,1).

I used 1D-global worksizes (such as (307200, 1, 1)), which are divisible with (256, 1, 1). I just wanted to explain the origin of magic constants in my example, so I represented them as products.

Although this might not be the source of this problem, I think it's a safe bet that it won't hurt to pass NULL instead of an invalid local work size.

On NVIDIA 8200 mGPU passing NULL as local worksize param value caused choosing (1, 1, 1) as local worksize, which was very inefficient.

Archives Discussions

Memset kernel does not work as expected