Firstly, you will need an explicit cast to convert the int* to int4*, using (int4*).
Secondly, the OpenCL spec states (section 6.1.5) that vector data types have to be aligned to their size in bytes. So an int4 has to be aligned to a 16-byte (4-int) boundary. I assume 'value' will not be properly aligned.
Originally posted by: zhuzxy Not sure if this is the proper forum to ask questions. If not, please ignore it.
For AMD OpenCL code, we know vector operations are more efficient. During my calculation, I have an int array, e.g. int value[], and I want to add to four contiguous array elements, like value[0] += 2; value[1] += 3; ...
when I wrote code like the following:
int4 *tmp = &value;
*tmp += (int4)(2,3,4,5);
the final results shows it does not work. Does AMD opencl support such scale ->vector cast and operations?
Try with int4 *tmp = (int4*)&value and see whether it is running or not.
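A minimal sketch of what that suggestion looks like in a kernel (the kernel name and argument are made up for illustration; note that, per the alignment point above, the cast is only legal if value sits on a 16-byte boundary):

```c
// Hypothetical OpenCL kernel sketch: cast the int pointer to int4* before
// the vector add. Only valid if 'value' is 16-byte aligned.
__kernel void add_four(__global int *value)
{
    __global int4 *tmp = (__global int4 *)value;  // explicit cast, keeping the address space qualifier
    *tmp += (int4)(2, 3, 4, 5);                   // adds 2,3,4,5 to value[0..3]
}
```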
The principle should work if you get the casts right, although to some degree I think it's an implementation-specific question. Vectors are single entities as far as the spec is concerned, and one of the reasons for that is to generalise ordering.
However, I feel I should point out that it depends on what you want to do whether this makes sense. For reading from memory and writing back, then certainly vectorising is the right approach. For the arithmetic computations you don't have to do vector ops, you just want to do more ops.
So a work item that does a += b, where a and b are floats, doesn't do much arithmetic work for the compiler to fill ALU slots with. If a and b are vectors, it does. However, it still does if you have four floats and do:
a1 += b; a2 += b; a3 += b; a4 += b;
This is because the hardware (like all high-end GPUs) is a vector architecture at a high level. It is not vector within a work item, but rather VLIW, which means it relies on compiler-scheduled instruction-level parallelism.
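In kernel form, that unrolled scalar version might look like the following sketch (names are hypothetical):

```c
// Hypothetical sketch: four independent scalar adds give the VLIW compiler
// plenty of instruction-level parallelism to pack, without any vector types.
__kernel void add_scalars(__global float *a, float b)
{
    float a1 = a[0], a2 = a[1], a3 = a[2], a4 = a[3];
    a1 += b; a2 += b; a3 += b; a4 += b;   // four independent operations
    a[0] = a1; a[1] = a2; a[2] = a3; a[3] = a4;
}
```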
Could try vload4()?
Or if you know the data is aligned, and always want to use 4 elements, just use an int4 pointer as the kernel argument.
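If alignment cannot be guaranteed, vload4/vstore4 sidestep the problem, since they only require the pointer to be aligned to sizeof(int) rather than sizeof(int4). A sketch of that approach (kernel name is made up):

```c
// Hypothetical sketch using vload4/vstore4, which work on int-aligned
// (not necessarily 16-byte-aligned) pointers.
__kernel void add_four_vload(__global int *value)
{
    int4 tmp = vload4(0, value);   // loads value[0..3] into a vector
    tmp += (int4)(2, 3, 4, 5);
    vstore4(tmp, 0, value);        // writes the result back to value[0..3]
}
```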
"For the AMD OpenCL code, we know vector operates more efficently."
It matters more for CPU code, since that is SIMD. The current GPUs are VLIW, so as Lee said, as long as there is enough independent work to do, it doesn't really matter whether the arguments are vectors as such.
In some cases I've seen it make things worse, since it forces a particular data flow and register allocation which may prevent the compiler from making other optimisations.