Does life get better if you change
float4 I_Ret = float4ToUint(...)
to
uint I_Ret = ...
Ohh... and what does float4ToUint really do? You posted too small a code snippet to really tell us what is going on here.
-- M. Reilly -- not an AMD/ATI employee.... just a fellow developer...
Ohh, you are right, it's actually uint.
I've found that this crash happens if I try something like this:
uint float4ToUint(float4 v) {
    uint ret = (uint)(v.x * 255.f); // convert one channel from [0,1] to 0..255
    return ret;
}
Normally I do it this way:
return ((uint)(rgba.w*255.0f)<<24) | ((uint)(rgba.z*255.0f)<<16) | ((uint)(rgba.y*255.0f)<<8) | (uint)(rgba.x*255.0f);
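For anyone who wants to check the packing on the host, here is a minimal plain C sketch of the same conversion. The names rgba_f32, to_byte and pack_rgba8 are made up for illustration and are not part of any OpenCL API; the clamp is an extra assumption, because the one-liner above expects its input to already be in [0,1].

```c
#include <stdint.h>

/* Host-side stand-in for an OpenCL float4 (hypothetical type name). */
typedef struct { float x, y, z, w; } rgba_f32;

/* Clamp one channel to [0,1] and scale it to 0..255.
 * The clamp is an added assumption; the original one-liner does not clamp. */
static uint32_t to_byte(float c)
{
    if (!(c > 0.0f)) return 0u;   /* negative values and NaN map to 0 */
    if (c > 1.0f) c = 1.0f;
    return (uint32_t)(c * 255.0f);
}

/* Pack as w:z:y:x from the high byte to the low byte, like the line above. */
static uint32_t pack_rgba8(rgba_f32 v)
{
    return (to_byte(v.w) << 24) | (to_byte(v.z) << 16)
         | (to_byte(v.y) << 8)  |  to_byte(v.x);
}
```

The shifts are done on unsigned 32-bit values, so the << 24 cannot overflow into a sign bit.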
But even a simple cast to uint seems to produce this crash, always at some SSE instruction that expects aligned memory.
I've seen the same way of converting floats to uint in a raytracer, and they claimed it works (I have not tested the code). I would like to use images, but images are not supported.
I've tested it on my notebook with the NVIDIA OpenCL implementation and the code works. I'll assume it's a bug in ATI OpenCL (CPU implementation).
The disassembly shows that the pointer is fetched from the stack and used in an SSE or SSE3 instruction that requires 16-byte memory alignment. I still have to calculate the alignment of this pointer's address, but I am sure it is not aligned. This happens in the temporary dynamic link library that is generated by the OpenCL implementation.
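Checking whether an address sits on a 16-byte boundary only needs the low four bits of the pointer. A minimal C sketch (is_aligned16 is a hypothetical helper name; the cast to uintptr_t is implementation-defined by the C standard, but it behaves as expected on the usual x86 targets):

```c
#include <stdint.h>

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

In a debugger the same test is just looking at the last hex digit of the address: it has to be 0.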
A little annoying is that the temp folder (on my Windows machine) is full of compiled DLLs. It would be good if they were deleted when the OpenCL implementation shuts down, so the temp folder does not fill up with hundreds of DLLs that will never be used again.
To be honest, this bug, the lack of image support in ATI OpenCL, and the time that has passed since OpenCL was defined have led me to the decision to switch to CUDA instead. I believe that CUDA is far more stable, because it has been around longer and NVIDIA has more experience with GPU computing languages. If the problems are fixed in a future implementation of OpenCL, I'll look at it again.
Thank you and best regards,
Those DLLs are deleted when you properly release the kernel and program.
Last posting to this topic.
I've found the problem at last. Memory has to be aligned on 16-byte boundaries, for all data types.
I had set the alignment on the host side with __declspec(align(16)) for the data types and on the device side with __attribute__ ((aligned(16))), and allocated the memory on the device side with the flag CL_MEM_COPY_HOST_PTR. But the access to the data was off the 16-byte alignment, so the SSE instruction crashes, even though all data was copied correctly. To make sure that all data is aligned correctly I'll use #pragma pack(push,16). This works on the CPU. So it was all my fault, due to wrong alignment.
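As a rough illustration of how the host-side layout can be checked at compile time, here is a small C11 sketch. The struct sphere_host and its fields are invented for this example and are not the data types from my code; it uses the standard alignas instead of the compiler-specific __declspec(align(16)), assuming a C11 compiler.

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical host-side mirror of a kernel struct (names are illustrative). */
typedef struct {
    alignas(16) float position[4];  /* same size and alignment as a float4 */
    alignas(16) float color[4];
    float radius;                   /* trailing scalar; the struct is padded after it */
} sphere_host;

/* Compile-time checks: every element of an array of sphere_host starts
 * on a 16-byte boundary, and so does each float4-like member. */
_Static_assert(alignof(sphere_host) == 16, "struct must be 16-byte aligned");
_Static_assert(sizeof(sphere_host) % 16 == 0, "array stride must be a multiple of 16");
_Static_assert(offsetof(sphere_host, color) == 16, "color must start on a 16-byte boundary");
```

If one of these checks fails, the layout is wrong before any data is copied to the device, which is easier to track down than a crash inside the generated DLL.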