Archives Discussions

cjb80 · ‎11-29-2014

Hello,

I have a kernel where each instance of the kernel must access 16 floats from global memory (in series) and perform some computation on them. Adjacent kernels may or may not need to access the same 16 floats; this depends on several parameters that are inputs. It is difficult (computationally) to compute if/when other kernels are sharing the same memory or not.

My current approach is to copy the 16 float values directly from the global memory to the private memory with the knowledge that this will lead to some redundant memory accesses with other kernels in the work group. The 16 floats are aligned to 8 byte boundaries (the input array is float2's), so for example, I would compute that I need to copy item 99 through 107 in an input array of float2s (yielding 16 floats, or 8 float2's).

My first question is this, can I copy the 16 floats using this method:

float16 private_memory_data = (*(global float16*)&input[computed_index]); // where input is a global float2 *

This computes and runs though it is a little hard for me to verify that this is working or not (for various unrelated reasons). The target platform is a FirePro S9150 (Hawaii), but this code is crashing on an NVidia board that doesn't support byte-addressible memory (that I have to use for development due to company IT policy...). Previously I was copying each float individually from global memory and it was running fairly slowly.

Also, is there any sort of automatic caching in local memory if a bunch of kernels are accessing the same global memory location, or is this something that I would need to code?

Thanks,

Chris

maxdz8 · ‎12-03-2014

cjb80 wrote:

... If I were to change to using vloadn then do I need to worry about the alignment? (assuming that I am on a float2 boundary then I should be at a 64 bit boundary).

Thanks!

I recently approached a similar problem. The cool thing about vloadn intrinsics is that they work on element type alignment. So, vload on a float2 can access on a 32 bit alignment.

This is clearly specified in S 6.1.5 of CL1.2,

The behavior of an unaligned load or store is undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn functions defined in section 6.12.7. The vector

load functions can read a vector from an address aligned to the element type of the vector. The vector store functions can write a vector to an address aligned to the element type of the vector.

It is also noted on intrinsic documentation:

The address computed as (p + (offset * n)) must be 8-bit aligned if gentype is char, uchar; 16-bit aligned if gentype is short, ushort; 32-bit aligned if gentype is int, uint, float; 64-bit aligned if gentype is long, ulong.

Note argument dimensionality is used to compute the offset but is not mentioned in the alignment requirements.

But, most importantly, I can tell it works because I have used it those days for a "sliding window" problem at byte level, for which I used uchar4.

It was my understanding that vector elements however are not guaranteed to be in any specific order, which is possibly a reason for which operator[] isn't supported on them. I wouldn't cast them and expect the behavior to be portable, albeit I've seen code doing this and I know it works.

View solution in original post

dipak · ‎12-01-2014

Hi Chris,

Accessing data as vector is generally faster than accessing the elements individually. Pointer casting can be used in this case. However the main problem in this approach is proper data alignment. Normally the compiler ignores the alignment checking during the pointer casting. Thus, it is developer's responsibility to ensure the alignment and endian-ness. For example, a float4 variable should be aligned to a 16-byte boundary.

Another effective way of doing this is by using vector loading and storing built-in function e.g. vloadn and vstoren. The vloadn reads a vector from an address aligned to the element type of the vector. Whereas vstoren perform the storing.

Another point is, depending on access pattern where multiple work-items within a same work-group accessing the same data from global memory, it is better to place the data in local memory. Though global memory accessing are automatically cached (if supported by the device), but the performance depends on availability of data on cache and thus pattern of the data access. If access pattern is predictable, one can use the local memory which can be thought as programmer manageable cache. One overhead of this approach is one needs to explicitly copy the data to local memory from global memory. However, in many scenario where there is large overlap of data access, this approach gives better overall performance. Its better to do some experiment and check which is more suitable for your own problem.

Regards,

cjb80 · ‎12-01-2014

OK, so my input array is a series of float2's and I want to read them out as a float16. However, the start of the float16 is arbitrarily located in the array of float2s. I think what you're saying is that the start of the float16 needs to be on an even index in the array of float2s. The problem that I have then is that I will need to do another access to global memory to get the exact data that I want (in the case of an odd index). Performance is pretty critical in this application so I would like to avoid that if possible.. If I were to change to using vloadn then do I need to worry about the alignment? (assuming that I am on a float2 boundary then I should be at a 64 bit boundary).

Also, is it valid to take a float16 and cast it to a float2 pointer and treat it like an array of 8 elements?

Thanks!

dipak · ‎12-03-2014

Let me share a problem I faced once in past. For clearness, I've only shown the relevant portion.

Following kernel code didn't give me consistent result.

__kernel void HelloWorld(__global float4 *in, global float2 *out)

{

int index = get_global_id(0);

global float *tmp_in = (global float*)in;

float16 val = *((global float16*)(&tmp_in[index]));

//printf("[%d] = (%f, %f)", index, val.s0, val.sf);

...

}

However, when I changed the above bold line as below, the result was correct.

float16 val = vload16(0, tmp_in + index);

BTW, for some different scenario, following line of code worked fine.

float16 val = *((global float16*) (&in[index])); // directly use the "in" buffer

Hope, this example may help you.