Accessing data as a vector is generally faster than accessing the elements individually, and pointer casting can be used for this. However, the main problem with this approach is proper data alignment. The compiler normally does not check alignment when a pointer is cast, so it is the developer's responsibility to ensure correct alignment and endianness. For example, a float4 variable should be aligned to a 16-byte boundary.
Another effective way of doing this is to use the built-in vector load and store functions, vloadn and vstoren. vloadn reads a vector from an address that only needs to be aligned to the element type of the vector, and vstoren performs the corresponding store.
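To make the difference concrete, here is a minimal host-side sketch in plain C (not OpenCL kernel code). The struct float4_t and the helpers load_float4_unaligned and store_float4_unaligned are hypothetical stand-ins for float4, vload4 and vstore4. It assumes that memcpy is a fair model of a load that only needs element alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for OpenCL's float4. */
typedef struct { float s[4]; } float4_t;

/* Models vload4(offset, p): reads the 4 floats starting at p + offset * 4.
   The address only has to be a valid float address, not 16-byte aligned. */
static float4_t load_float4_unaligned(size_t offset, const float *p)
{
    float4_t v;
    assert(p != NULL);
    memcpy(v.s, p + offset * 4, sizeof v.s);
    return v;
}

/* Models vstore4(v, offset, p): writes 4 floats starting at p + offset * 4. */
static void store_float4_unaligned(float4_t v, size_t offset, float *p)
{
    assert(p != NULL);
    memcpy(p + offset * 4, v.s, sizeof v.s);
}
```

For example, with an array float data[12], load_float4_unaligned(0, data + 3) reads data[3] through data[6], even though data + 3 is generally not on a 16-byte boundary. A direct cast of data + 3 to a 16-byte-aligned vector type would not be safe in the same way.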
Another point: if the access pattern is such that multiple work-items in the same work-group read the same data from global memory, it is better to place that data in local memory. Global memory accesses are cached automatically (if the device supports it), but performance then depends on whether the data is available in the cache, and therefore on the access pattern. If the access pattern is predictable, you can use local memory, which can be thought of as a programmer-managed cache. One overhead of this approach is that you have to copy the data from global memory to local memory explicitly. However, in many scenarios where the data accesses overlap heavily, this approach gives better overall performance. It's better to run some experiments and check which is more suitable for your own problem.
OK, so my input array is a series of float2s and I want to read them out as a float16. However, the start of the float16 is arbitrarily located in the array of float2s. I think what you're saying is that the start of the float16 needs to be on an even index in the array of float2s. The problem I have then is that I will need to do another access to global memory to get the exact data that I want (in the case of an odd index). Performance is pretty critical in this application, so I would like to avoid that if possible. If I were to change to using vloadn, do I need to worry about the alignment? (Assuming that I am on a float2 boundary, I should be at a 64-bit boundary.)
Also, is it valid to take a float16 and cast it to a float2 pointer and treat it like an array of 8 elements?
Let me share a problem I faced once in the past. For clarity, I've only shown the relevant portion.
The following kernel code didn't give me consistent results.
__kernel void HelloWorld(__global float4 *in, __global float2 *out)
{
    int index = get_global_id(0);
    __global float *tmp_in = (__global float*)in;
    // Pointer cast: &tmp_in[index] is only guaranteed to be float-aligned
    float16 val = *((__global float16*)(&tmp_in[index]));
    //printf("[%d] = (%f, %f)", index, val.s0, val.sf);
    // ... (rest of the kernel omitted)
}
However, when I replaced the pointer-cast load of val (the float16 line above) with the following, the result was correct.
float16 val = vload16(0, tmp_in + index);
BTW, in a different scenario, the following line of code worked fine.
float16 val = *((global float16*) (&in[index])); // directly use the "in" buffer
Hope this example helps.
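One way to reason about the three variants above is to check the byte offset of each load against the 64-byte alignment of a float16 (OpenCL aligns a type to its size in bytes). The following plain C sketch does that arithmetic on the host. The helper names are hypothetical, and it assumes the buffer itself starts on a 64-byte boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { FLOAT_BYTES = 4, FLOAT4_BYTES = 16, FLOAT16_BYTES = 64 };

/* True if element `index` of an array with the given element size starts
   on an `align_bytes` boundary (buffer base assumed to be aligned). */
static bool index_is_aligned(size_t index, size_t stride_bytes, size_t align_bytes)
{
    assert(align_bytes > 0);
    return (index * stride_bytes) % align_bytes == 0;
}

/* Case 1: (global float16*)(&tmp_in[index]), where index counts floats. */
static bool float_index_ok_for_float16_cast(size_t index)
{
    return index_is_aligned(index, FLOAT_BYTES, FLOAT16_BYTES);
}

/* Case 3: (global float16*)(&in[index]), where index counts float4s. */
static bool float4_index_ok_for_float16_cast(size_t index)
{
    return index_is_aligned(index, FLOAT4_BYTES, FLOAT16_BYTES);
}
```

Under these assumptions the float-indexed cast is only aligned when index is a multiple of 16, and the float4-indexed cast only when index is a multiple of 4. Other indices may still appear to work on some devices, but the behavior is undefined. vload16 avoids the question because it only needs float alignment.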
... If I were to change to using vloadn then do I need to worry about the alignment? (assuming that I am on a float2 boundary then I should be at a 64 bit boundary).
I recently tackled a similar problem. The cool thing about the vloadn built-ins is that they only require element-type alignment. So loading a float2 with vload2 only needs a 32-bit aligned address.
This is clearly specified in section 6.1.5 of the OpenCL 1.2 specification:
The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.12.7. The vector load functions can read a vector from an address aligned to the element type of the vector. The vector store functions can write a vector to an address aligned to the element type of the vector.
It is also noted in the documentation for the built-ins:
The address computed as (p + (offset * n)) must be 8-bit aligned if gentype is char, uchar; 16-bit aligned if gentype is short, ushort; 32-bit aligned if gentype is int, uint, float; 64-bit aligned if gentype is long, ulong.
Note that the vector width n is used to compute the offset but is not mentioned in the alignment requirements.
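The address rule quoted above can be written out as a small plain C sketch. The function names are hypothetical, and addresses are modeled as plain byte counts purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Byte distance from p to the first element read by vloadn(offset, p):
   the vector width n scales the offset. */
static size_t vloadn_byte_offset(size_t offset, size_t n, size_t elem_bytes)
{
    return offset * n * elem_bytes;
}

/* The resulting address only has to be a multiple of the element size;
   n does not appear in the alignment check itself. */
static bool vloadn_alignment_ok(size_t p_byte_addr, size_t offset,
                                size_t n, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return (p_byte_addr + vloadn_byte_offset(offset, n, elem_bytes)) % elem_bytes == 0;
}
```

For the float2 question earlier in the thread: a float16 starting at float2 index 3 sits at byte 24 of the buffer. vloadn_alignment_ok(24, 0, 16, 4) is true, so vload16 on the float pointer is fine there, even though byte 24 is not on a 64-byte boundary.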
But, most importantly, I can tell it works because I recently used it for a "sliding window" problem at byte level, for which I used uchar4.
It was my understanding, however, that vector elements are not guaranteed to be stored in any specific order, which is possibly one reason why operator[] isn't supported on them. I wouldn't cast them and expect the behavior to be portable, although I've seen code doing this and I know it works.