I am trying to improve performance on an x86 multicore CPU using OpenCL.
I rewrote my kernel program to use vector integers instead of scalar integers.
However, this improves performance very little (a few percent).
In the non-OpenCL case, using SSE doubles the performance.
Does the current kernel compiler really generate SSE instructions for vector integer operations?
I suspect that vector operations are emulated with scalar operations, at least for integers.
How can I tell whether SSE instructions are generated or not?
When you use the CPU device, the compiler creates a DLL, which you can find in the tmp directory. For now, clGetProgramInfo() with CL_PROGRAM_BINARIES returns a path to that DLL, so you can copy the DLL, disassemble it, and look for yourself.
nou,
Thank you for the information and very quick response.
This helps me a lot. I will check later.
Hi Jins,
Could you please post your findings when you have them? I had a similar experience with int4s, but I figured it was because of the many moves and stores I had to use to fill the registers.
In fact, using the vector data type was often ~20% slower.
I have seen a significant performance increase on the CPU from vectorization in at least 2 cases. The speedup was nearly 3x-4x in both of those cases. But in one other case I saw only a small improvement of about 15%.
Hi n0thing,
Did the speedup cases use integer or float?
The speedup cases used integer.
Hi n0thing,
Thank you.
I have also verified that SSE instructions can be generated for vector integer operations.
I ran dumpbin on the generated DLL and found SSE instructions (e.g. pmullw) in it. If I do not use vectors, no SSE instructions are found.
Hi
I have found one of the reasons why vector integer operations are slow.
convert_uchar4_sat(short4) is far slower than convert_uchar4(short4).
Why? Does the kernel compiler generate poor SSE code?
convert_uchar4_sat() is a very important function in the multimedia area.
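For reference, the semantic difference between the two conversions can be sketched per lane in plain C. This is only a scalar model based on the OpenCL rules for integer conversions (the _sat form clamps to the destination range, the plain form keeps the low 8 bits). The helper names to_uchar_sat and to_uchar_wrap are made up for illustration, and the sketch says nothing about the code the kernel compiler actually emits.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of convert_uchar4_sat(short4):
   values outside [0, 255] are clamped to the nearest bound. */
static uint8_t to_uchar_sat(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Hypothetical scalar model of one lane of convert_uchar4(short4):
   out-of-range values keep only their low 8 bits (modulo 256). */
static uint8_t to_uchar_wrap(int16_t v)
{
    return (uint8_t)v;
}
```

For example, an input of 300 gives 255 with the saturating form and 44 with the plain form, which is why the saturating form matters for pixel data.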
SSE XMM registers are 16 bytes wide, so it is hard to see why convert_uchar4_sat(short4) is so important.
golgo_13,
Thank you for the comment.
convert_uchar8_sat(short8) is also important.
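As a point of comparison, a saturating short8 to uchar8 conversion can be written by hand with a single SSE2 pack instruction (packuswb). The sketch below assumes an x86 compiler that defines __SSE2__, and falls back to a scalar loop otherwise. short8_to_uchar8_sat is a hypothetical helper name. It shows only what hand-written SSE could look like, not what the OpenCL kernel compiler generates.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: convert 8 signed 16-bit values to 8 unsigned bytes
   with saturation, mirroring the semantics of convert_uchar8_sat(short8). */
static void short8_to_uchar8_sat(const int16_t in[8], uint8_t out[8])
{
#if defined(__SSE2__)
    /* Load 8 shorts (16 bytes) into one XMM register. */
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* packuswb: pack to bytes with unsigned saturation.
       The low 8 bytes of the result come from the first operand. */
    __m128i packed = _mm_packus_epi16(v, v);
    /* Store only the low 64 bits (8 bytes). */
    _mm_storel_epi64((__m128i *)out, packed);
#else
    /* Scalar fallback with the same clamping behaviour. */
    for (int i = 0; i < 8; ++i) {
        int16_t x = in[i];
        out[i] = (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
    }
#endif
}
```

With 16-byte XMM registers, one pack instruction can handle 8 lanes (or 16 when two different inputs are packed), so a short4 conversion uses only part of the register.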