Archives Discussions

jins · ‎11-25-2009

I am trying to improve performance on an x86 multicore CPU using OpenCL.

I rewrote my kernel to use vector integer types instead of scalar integers.

However, this improved performance very little (a few percent).

In the non-OpenCL version, using SSE doubles the performance.

Does the current kernel compiler really generate SSE instructions for vector integer operations?

I suspect that vector operations are emulated with scalar operations, at least for integers.

How can I tell whether SSE instructions are generated or not?

nou · ‎11-25-2009

When you use the CPU device, the compiler creates a DLL, which you can find in the tmp directory. For now, clGetProgramInfo() with CL_PROGRAM_BINARIES returns a path to that DLL, so you can copy it, disassemble it, and look for yourself.

jins · ‎11-25-2009

nou,

Thank you for the information and very quick response.

This helps me a lot. I will check later.

AndreasStahl · ‎11-25-2009

Hi Jins,

Could you please post your findings when you have them? I had a similar experience with int4s, but I figured it was because of the many moves and stores I had to use to fill the registers.

In fact, using the vector data type was often ~20% slower.

n0thing · ‎11-30-2009

I have seen a significant performance increase on the CPU from vectorization in at least 2 cases. The speedup was nearly 3x-4x in both of those cases. But in one other case I only saw a small improvement of about 15%.

jins · ‎11-30-2009

Hi n0thing,

Did the cases with the speedup use integer or float?

n0thing · ‎12-01-2009

The speedup cases used integer.

jins · ‎12-03-2009

Hi n0thing,

Thank you.

I have verified that SSE instructions can be generated for vector integer operations.

I ran dumpbin on the generated DLL and found SSE instructions (e.g. pmullw) in it. If I do not use vector types, no SSE instructions are found.

jins · ‎12-07-2009

Hi

I have found one of the reasons why vector integer operations are slow.

convert_uchar4_sat(short4) is far slower than convert_uchar4(short4).

Why? Does the kernel compiler generate poor SSE code for it?

convert_uchar4_sat() is a very important function in the multimedia area.
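
For reference, the difference between the two conversions can be sketched in plain C. This is a scalar model of the per-element behaviour under the OpenCL conversion rules (the plain form wraps modulo 256, the _sat form clamps to 0..255). The function names are hypothetical, and this is not the code the kernel compiler emits.

```c
#include <stdint.h>

/* Hypothetical scalar model of convert_uchar(short):
   out-of-range values wrap modulo 256, as in a C integer cast. */
uint8_t to_u8_wrap(int16_t v) {
    return (uint8_t)v;
}

/* Hypothetical scalar model of convert_uchar_sat(short):
   out-of-range values are clamped to the range 0..255. */
uint8_t to_u8_sat(int16_t v) {
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Four-element versions, standing in for the short4 -> uchar4 forms. */
void convert4_wrap(const int16_t in[4], uint8_t out[4]) {
    for (int i = 0; i < 4; ++i) out[i] = to_u8_wrap(in[i]);
}

void convert4_sat(const int16_t in[4], uint8_t out[4]) {
    for (int i = 0; i < 4; ++i) out[i] = to_u8_sat(in[i]);
}
```

The clamping is what makes the _sat form useful for pixel data: an overflowing sum such as 300 stays at 255 (white) instead of wrapping around to 44.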

golgo_13 · ‎12-17-2009

SSE XMM registers are 16 bytes wide, so it is hard to see why convert_uchar4_sat(short4) is so important.

jins · ‎12-21-2009

golgo_13,

Thank you for the comment.

convert_uchar8_sat(short8) is also important.
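
As an illustration of why the short8 case maps well onto a 16-byte XMM register, here is a minimal sketch using SSE2 intrinsics in host-side C. It assumes an x86 target with SSE2. The function name is hypothetical, and this shows what hand-written code could look like, not what the OpenCL kernel compiler generates.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: convert 8 signed 16-bit values to 8 unsigned
   8-bit values with saturation, using a single PACKUSWB instruction. */
void short8_to_uchar8_sat(const int16_t in[8], uint8_t out[8]) {
    /* Eight 16-bit lanes fill one 128-bit register exactly. */
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Pack with unsigned saturation; the low 8 bytes hold the result
       for v (the high 8 bytes repeat it because v is passed twice). */
    __m128i packed = _mm_packus_epi16(v, v);

    /* Store only the low 64 bits, i.e. the 8 result bytes. */
    _mm_storel_epi64((__m128i *)out, packed);
}
```

With only four elements (short4), half of the register would be unused, which is the concern raised above about the 4-wide form.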

Using vector integer does not improve the performance for x86 multicore