jins
Journeyman III

Using vector integers does not improve performance on x86 multicore

I am trying to improve performance using OpenCL on an x86 multicore CPU.

I rewrote my kernel to use vector integers instead of scalar integers.

However, this improves performance very little (a few percent).

In the non-OpenCL case, using SSE doubles the performance.

Does the current kernel compiler really generate SSE instructions for vector integer operations?

I suspect that vector operations are emulated with scalar operations, at least for integers.

How can I tell whether SSE instructions are generated or not?

10 Replies
nou
Exemplar

When you use a CPU device, the compiler creates a DLL, which you can find in the tmp directory. For now, clGetProgramInfo() with CL_PROGRAM_BINARIES returns a path to that DLL, so you can copy the DLL, disassemble it, and look for yourself.

jins
Journeyman III

nou,

Thank you for the information and very quick response.

This helps me a lot. I will check later.


Hi Jins,

Could you please post your findings when you have them? I had a similar experience with int4s, but I figured it was because of the many moves and stores I had to use to fill the registers.

In fact, using the vector data type was often ~20% slower.

 


I have seen significant performance increases on the CPU from vectorization in at least 2 cases. The speedup was nearly 3x-4x in both of them. But in another case I only saw a small improvement of about 15%.

 


Hi n0thing,

Did the speedup cases use integer or float?

 


The speedup cases used integer.


Hi n0thing,

Thank you.

I have verified that SSE instructions can be generated for vector integer operations.

I ran dumpbin on the generated DLL and found SSE instructions (e.g. pmullw) in it. If I do not use vectors, no SSE instructions are found.
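For readers unfamiliar with pmullw: it multiplies eight packed 16-bit integers and keeps the low 16 bits of each product, which is what a short8 multiply needs. The sketch below is plain C with SSE2 intrinsics, not the output of the OpenCL kernel compiler. The function names mul_short8_scalar and mul_short8_sse2 are made up for illustration, and the code falls back to the scalar loop when the compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar reference: element-wise 16-bit multiply, keeping the low 16 bits.
 * The narrowing cast is implementation-defined in C for out-of-range values;
 * mainstream compilers wrap modulo 2^16. */
void mul_short8_scalar(const int16_t *a, const int16_t *b, int16_t *out)
{
    assert(a && b && out);
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t)(a[i] * b[i]);
    }
}

/* Hypothetical hand-vectorized version: one multiply for all eight lanes. */
void mul_short8_sse2(const int16_t *a, const int16_t *b, int16_t *out)
{
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 16-byte load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_mullo_epi16(va, vb);              /* maps to PMULLW */
    _mm_storeu_si128((__m128i *)out, vr);
#else
    mul_short8_scalar(a, b, out);                      /* portable fallback */
#endif
}
```

Disassembling a build of mul_short8_sse2 should show a pmullw (or vpmullw when AVX is enabled), which is the same kind of check described above for the generated DLL.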

jins
Journeyman III

Hi

I have found one of the reasons why vector integer operations are slow.

convert_uchar4_sat(short4) is far slower than convert_uchar4(short4).

Why? Does the kernel compiler generate poor SSE code?

convert_uchar4_sat() is a very important function for multimedia applications.
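To make the difference between the two conversions concrete, here is a minimal scalar sketch in plain C of what each one computes per element: the non-saturating form keeps the low 8 bits (wraps modulo 256), while the _sat form clamps to the range 0..255. The function names convert_uchar_wrap and convert_uchar_sat are illustrative only, and this says nothing about the code the kernel compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Like convert_uchar4(short4): out-of-range values wrap modulo 256. */
void convert_uchar_wrap(const int16_t *in, uint8_t *out, size_t n)
{
    assert(in && out);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (uint8_t)in[i];  /* conversion to unsigned is modulo 2^8 */
    }
}

/* Like convert_uchar4_sat(short4): out-of-range values clamp to 0 or 255. */
void convert_uchar_sat(const int16_t *in, uint8_t *out, size_t n)
{
    assert(in && out);
    for (size_t i = 0; i < n; ++i) {
        int16_t v = in[i];
        out[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```

For pixel data the clamped result is usually the wanted one: an overshoot to 300 should become 255 (white), not wrap around to 44.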

 


SSE XMM registers are 16 bytes wide, so it is hard to see why convert_uchar4_sat(short4) is so important.


golgo_13,

Thank you for the comment.

convert_uchar8_sat(short8) is also important.
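A short8 is 8 x 16 bits = 16 bytes, so it fills one XMM register exactly, and SSE2 has a single instruction (packuswb) that narrows signed 16-bit lanes to unsigned 8-bit lanes with saturation. The sketch below shows that mapping in plain C with intrinsics. It is only an assumption about what efficient code for this conversion could look like, not what the kernel compiler generates; the names sat_u8 and convert_uchar8_sat_sse2 are hypothetical, and a scalar fallback is used when __SSE2__ is not defined.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar clamp of one signed 16-bit value to 0..255. */
static uint8_t sat_u8(int16_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Converts 8 shorts to 8 saturated unsigned chars. */
void convert_uchar8_sat_sse2(const int16_t *in, uint8_t *out)
{
    assert(in && out);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);  /* load 8 x int16 */
    __m128i packed = _mm_packus_epi16(v, v);           /* maps to PACKUSWB */
    _mm_storel_epi64((__m128i *)out, packed);          /* store low 8 bytes */
#else
    for (int i = 0; i < 8; ++i) {
        out[i] = sat_u8(in[i]);                        /* portable fallback */
    }
#endif
}
```

With only four elements (short4), half of the register is unused, which may be part of why the 4-wide case is harder to generate good code for.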
