Should I write this using vstore on a GPU?
__kernel void main ( __global float4 *outb )
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);
    const float4 data = ...;
    outb[y*width+x] = data;   // plain float4 store through the float4 pointer
}
or
__kernel void main ( __global float4 *outb )
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int width = get_global_size(0);
    const float4 data = ...;
    // vstore4 writes at p + offset*4 floats, so the offset counts float4 elements, not bytes
    vstore4 ( data, y*width+x, (__global float*)outb );
}
(The idea is that the same kernel should run efficiently on both GPU and CPU.)
Also, what is the #define that tells me ATI's OpenCL implementation is running on the CPU? Something like #ifdef ATI_OPENCL_CPU?
Thanks.
The second way is better.
As for a #define that tells you ATI's OpenCL implementation is using the CPU (#ifdef ATI_OPENCL_CPU?): there is no such predefined macro at present.
I don't think that's right. I'd say the first way is better, but it depends. The first way will do a vector write, so you will use fewer write instructions. It will only work if the pointer is 16-byte aligned, though. The second way will write a vector using float writes, so will work happily unaligned, but will always use multiple writes. Given that you passed a float4 output pointer I'm going to assume it is aligned and say that I'd go with the first version.
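To make the difference concrete, here is a minimal sketch in plain host-side C (not OpenCL C) that models where each kind of store lands in memory. The names float4_t, emu_vector_store and emu_vstore4 are hypothetical stand-ins, and memcpy stands in for whatever scalar writes a real compiler might emit, so this illustrates addressing and alignment only, not GPU performance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's float4 (16 bytes). */
typedef struct { float s[4]; } float4_t;

/* Models outb[index] = data: assumes outb is 16-byte aligned. */
static void emu_vector_store(float4_t data, size_t index, float4_t *outb)
{
    assert((uintptr_t)outb % 16 == 0);   /* the aligned-pointer assumption */
    outb[index] = data;
}

/* Models vstore4(data, offset, p): the store lands at p + offset * 4 floats,
   and p only needs float alignment. */
static void emu_vstore4(float4_t data, size_t offset, float *p)
{
    memcpy(p + offset * 4, data.s, sizeof data.s);
}

/* Returns 1 if both stores put the same four floats at vector element 2. */
static int demo_same_result(void)
{
    _Alignas(16) float4_t a[4] = {0};
    _Alignas(16) float b[16] = {0};
    const float4_t data = {{1.0f, 2.0f, 3.0f, 4.0f}};

    emu_vector_store(data, 2, a);
    emu_vstore4(data, 2, b);             /* offset is 2, not 4*2 or 16*2 */
    return memcmp(&a[2], &b[8], sizeof data) == 0;
}

/* Returns 1 if a vstore4-style write works from a base that is only
   float-aligned (4 bytes past a 16-byte boundary). */
static int demo_unaligned(void)
{
    _Alignas(16) float raw[9] = {0};
    const float4_t data = {{1.0f, 2.0f, 3.0f, 4.0f}};

    emu_vstore4(data, 1, raw + 1);       /* writes raw[5] .. raw[8] */
    return raw[4] == 0.0f && raw[5] == 1.0f && raw[8] == 4.0f;
}
```

Both demo functions are meant to be called from your own main. The point of the second one is that the vstore4-style write never needs more than float alignment, which is why a compiler cannot assume a single 16-byte store there.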
Also, is an ATI_CPU macro really what you want? Other than SIMD width, what purpose do you have for knowing it's on the CPU? Because "CPU" isn't a very precise term (and of course neither is ATI/AMD), I'm not sure it would make a good macro. SIMD width could be one, but in either case I think you're better off querying the runtime and defining compiler macros of your own as necessary.
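A rough sketch of that approach, assuming you query the device on the host with clGetDeviceInfo (CL_DEVICE_TYPE and CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT) and pass the resulting string as the options argument of clBuildProgram. The helper make_build_options and the macro names MY_DEVICE_CPU, MY_DEVICE_GPU, MY_DEVICE_OTHER and MY_SIMD_WIDTH are made up for illustration, and the device-type constants are local stand-ins so the snippet compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins (assumption): in real host code include <CL/cl.h> and use
   cl_device_type, CL_DEVICE_TYPE_CPU and CL_DEVICE_TYPE_GPU instead. */
typedef unsigned long long device_type_t;
#define DEV_TYPE_CPU (1ULL << 1)
#define DEV_TYPE_GPU (1ULL << 2)

/* Builds an options string such as "-D MY_DEVICE_CPU=1 -D MY_SIMD_WIDTH=4".
   Returns 0 on success, -1 if the buffer is too small. */
static int make_build_options(device_type_t type, unsigned simd_width,
                              char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    const char *kind = (type & DEV_TYPE_CPU) ? "MY_DEVICE_CPU"
                     : (type & DEV_TYPE_GPU) ? "MY_DEVICE_GPU"
                     : "MY_DEVICE_OTHER";

    int n = snprintf(buf, len, "-D %s=1 -D MY_SIMD_WIDTH=%u", kind, simd_width);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* In real host code the inputs would come from the runtime, roughly:
 *   clGetDeviceInfo(dev, CL_DEVICE_TYPE, sizeof type, &type, NULL);
 *   clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
 *                   sizeof width, &width, NULL);
 *   clBuildProgram(program, 1, &dev, options, NULL, NULL);
 */
```

The kernel source can then use #ifdef MY_DEVICE_CPU or #if MY_SIMD_WIDTH >= 4 to pick a code path, without depending on any vendor-defined macro.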
Yes, vstoren (and vloadn for loads) allow you to store vector types to a pointer in memory. More precisely, they allow you to do so even if the pointer is not aligned to a multiple of the vector type's size. Note that for vstoren(gentypen data, size_t offset, gentype *p) the spec only requires the address to be aligned to gentype, the scalar element type. Given that, unless the compiler can guarantee that the address is vector-aligned, it has to emit multiple unaligned writes. A float4 pointer, on the other hand, is assumed to be aligned, as I understand it.
I'll quote Micah on this one:
Vload4 and vstore4 allow you to load unaligned data into a vector register. They explicitly do not do a single instruction, whereas loading from a float4 pointer would have to be aligned and do a 128-bit load. They're for the case where you can't do the float4 load rather than a way of forcing it to work.