Archives Discussions

bubu · ‎06-01-2010

Should I write this using vstore on a GPU?

__kernel void main ( __global float4 *outb )
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);

    const float4 data = ...;

    outb[y*width+x] = data;
}

or

__kernel void main ( __global float4 *outb )
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);

    const float4 data = ...;

    /* vstore4 takes its offset in float4-sized units: it writes to p + 4*offset */
    vstore4 ( data, y*width+x, (__global float*)outb );
}

???

(so that it would run efficiently on both GPU and CPU).

And what is the #define for detecting that ATI's OpenCL implementation is running on the CPU? #ifdef ATI_OPENCL_CPU ?

thx

omkaranathan · ‎06-03-2010

__kernel void main ( __global float4 *outb )
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);

    const float4 data = ...;

    outb[y*width+x] = data;
}

or

__kernel void main ( __global float4 *outb )
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);

    const float4 data = ...;

    /* vstore4 takes its offset in float4-sized units: it writes to p + 4*offset */
    vstore4 ( data, y*width+x, (__global float*)outb );
}

The second way is better.

And what is the #define for detecting that ATI's OpenCL implementation is running on the CPU? #ifdef ATI_OPENCL_CPU ?

There is no such predefined macro at present.

LeeHowes · ‎06-03-2010

I don't think that's right. I'd say the first way is better, but it depends. The first way will do a vector write, so you will use fewer write instructions. It will only work if the pointer is 16-byte aligned, though. The second way will write a vector using float writes, so will work happily unaligned, but will always use multiple writes. Given that you passed a float4 output pointer I'm going to assume it is aligned and say that I'd go with the first version.

Also, is an ATI_CPU macro really what you want? Other than SIMD width, what purpose do you have for knowing it's on the CPU? Because a CPU isn't a very general term (and of course neither is ATI/AMD) I'm not sure it's likely to be a good macro to have. SIMD width could be, but in either case I think you're better off querying the runtime and specifying compiler macros of your own as necessary.
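As a rough illustration of "querying the runtime and specifying compiler macros of your own", here is a minimal host-side sketch in plain C. It only formats the option string; it assumes the caller has already obtained the two inputs from clGetDeviceInfo (for example CL_DEVICE_TYPE for the CPU flag and CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT as one possible proxy for SIMD width). The helper name make_build_options and the macro names DEVICE_IS_CPU and SIMD_WIDTH are made up for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: formats -D options for an OpenCL program build.
 * Assumption: is_cpu and simd_width were queried by the caller via
 * clGetDeviceInfo; DEVICE_IS_CPU and SIMD_WIDTH are invented macro names.
 * Returns 0 on success, -1 if the buffer is too small or formatting fails. */
int make_build_options(char *buf, size_t size, int is_cpu, unsigned simd_width)
{
    int n = snprintf(buf, size, "-D DEVICE_IS_CPU=%d -D SIMD_WIDTH=%u",
                     is_cpu ? 1 : 0, simd_width);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string would be passed as the options argument of clBuildProgram, after which the kernel source can branch with something like #if DEVICE_IS_CPU.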

Raistmer · ‎06-06-2010

vstore4 will use many separate writes? It is listed as a vector store ...

LeeHowes · ‎06-06-2010

Yes, they allow you to store vector types to a pointer in memory. The catch is that this could be worded more fully as "allow you to store vector types to a pointer in memory, even if the pointer is not aligned to a multiple of the vector size". Note that the spec only requires gentype alignment for the pointer passed to vstoren(gentype *...). Given that, unless the compiler can prove that the address is vector-aligned, it has to emit multiple narrower writes. A float4 pointer, on the other hand, is assumed to be aligned, as I understand it.

I'll quote Micah on this one:

Vload4 and vstore4 allow you to load unaligned data into a vector register. They explicitly do not do a single instruction, whereas loading from a float4 pointer would have to be aligned and do a 128-bit load. They're for the case where you can't do the float4 load rather than a way of forcing it to work.
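To make these semantics concrete, here is a minimal sketch in plain host C (not OpenCL C) that mimics the behaviour the OpenCL specification gives for vstore4(data, offset, p): four floats are written starting at p + 4*offset, and p only needs float alignment. The name emulated_vstore4 is hypothetical, and the four scalar stores model the fallback described above; a real device compiler may do better when it can prove alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain-C model of OpenCL's vstore4(data, offset, p).
 * Assumption: p is only float-aligned, so the store is modelled as
 * four independent scalar writes rather than one 128-bit write. */
void emulated_vstore4(const float data[4], size_t offset, float *p)
{
    assert(p != NULL);
    float *dst = p + 4 * offset;  /* offset counts 4-float groups, not bytes */
    for (int i = 0; i < 4; ++i)
        dst[i] = data[i];
}
```

Under this model, passing the element index y*width+x as the offset addresses the same slot as outb[y*width+x] would through a float4 pointer, and the call still works when p itself is not 16-byte aligned.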

Raistmer · ‎06-07-2010

Oops, very interesting info, thanks.
So a float4* pointer should be used instead of float* with vload4/vstore4 if one has (or can have) properly aligned data and wants to speed up memory accesses. That was not obvious from the documentation.

MicahVillmow · ‎06-07-2010

Raistmer,
The only time you want to use vload/vstore is if the alignment of your data is unknown. If you know that the pointers in question are aligned properly to the size of the vector you want to load, cast the pointer to that vector type and load via array syntax.
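One way to decide which case applies is to check alignment on the host before choosing a kernel variant. The following is a small sketch in plain C under stated assumptions: the helper names are hypothetical, the 16-byte figure assumes a float4 made of four 4-byte floats, and the offset check assumes the underlying buffer itself starts on a 16-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>

enum { FLOAT4_BYTES = 16 };  /* assumes 4 floats of 4 bytes each */

/* Nonzero if an offset, counted in floats, into a buffer that is itself
 * 16-byte aligned lands on a float4 boundary. */
int float_offset_is_float4_aligned(size_t float_offset)
{
    return (float_offset * sizeof(float)) % FLOAT4_BYTES == 0;
}

/* Nonzero if a host pointer is 16-byte aligned. */
int pointer_is_float4_aligned(const void *p)
{
    return ((uintptr_t)p % FLOAT4_BYTES) == 0;
}
```

If the check passes, the float4 pointer with array indexing is the natural choice; if it fails or cannot be known, vload4/vstore4 remain the safe fallback.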

Raistmer · ‎06-07-2010

I made the appropriate change in all my kernels and will report whether it leads to any difference in speed.

Raistmer · ‎06-10-2010

Changing from float* with vload4/vstore4 to float4* with array indexing syntax improved performance a lot.
Test task time dropped from 190 sec to 158 sec.

vstore doubt