frankas
Journeyman III

Integer multiplication and XOR very slow?

Software emulation in place?

FYI: I tried OpenCL. It looks very promising, but some integer operations (XOR and multiplication) were very slow. Brook+ gave much better performance.

I suspect that OpenCL doesn't use the MULT and XOR instructions directly, but relies on software implementations instead.

0 Likes
4 Replies
genaganna
Journeyman III

Originally posted by: frankas FYI: I tried OpenCL. It looks very promising, but some integer operations (XOR and multiplication) were very slow. Brook+ gave much better performance.

I suspect that OpenCL doesn't use the MULT and XOR instructions directly, but relies on software implementations instead.

Could you please paste both the Brook+ kernel and the OpenCL kernel, and give the input and output data sizes?


0 Likes

Originally posted by: genaganna

Could you please paste both the Brook+ kernel and the OpenCL kernel, and give the input and output data sizes?


After trying more accurate timing, I have to retract my original assessment of the situation. The loops I used for timing would get optimized away in some cases. But I do have a specific piece of code that runs much slower in OpenCL.
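The pitfall mentioned here, timing loops being optimized away, usually happens when the loop result is never used. The following is only a minimal sketch of one way to keep such a loop live; it is not the code from this thread. It is written in plain C so it can be checked on the host, and the names `mix_rounds` and `run_benchmark` and the multiplier constant are assumptions made for illustration. In an OpenCL kernel, the same loop body would be used and the final value written to an output buffer indexed by the work-item id.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark body: a chain of integer multiply + XOR steps.
 * Every iteration depends on the previous value of x and on the runtime
 * input seed, so the compiler cannot fold the loop into a constant. */
uint32_t mix_rounds(uint32_t seed, uint32_t rounds)
{
    uint32_t x = seed;
    for (uint32_t i = 0; i < rounds; ++i) {
        x = (x * 2654435761u) ^ i;  /* integer multiply, then XOR */
    }
    return x;
}

/* Host-side stand-in for a kernel launch: gid plays the role of the
 * work-item id. Storing one result per item keeps the loop observable,
 * so it cannot be removed as dead code. */
void run_benchmark(const uint32_t *in, uint32_t *out, size_t n,
                   uint32_t rounds)
{
    for (size_t gid = 0; gid < n; ++gid) {
        out[gid] = mix_rounds(in[gid], rounds);
    }
}
```

Reading back even a single element of the output after the timed run also doubles as a sanity check that the work was actually done.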

If I could somehow view the compiled code, I should be able to tell the difference by comparing it with the StreamKernelAnalyzer assembly.

How can I view the compiled OpenCL code?

0 Likes

Not sure about viewing compiled code, but FYI, NVIDIA warns against using integer division and modulo operations. They say nothing about multiplication.

Sources: NVIDIA OpenCL Programming Guide & NVIDIA Best Practices Guide.

Apart from the completely blank ATI OpenCL Programming Guide in the SDK and this forum, you are on your own as far as knowing what to do and what not to do when writing ATI-optimized OpenCL code.

Assuming that the people writing the manuals are not the same as the programmers, a little parallel effort to get at least a draft of something out might be a decent idea.

Just a suggestion: if it is not too difficult, try running two versions of your OpenCL code, one integer and one float. That would isolate the integer question and separate it from an overall slowdown compared to Brook+.
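The integer-versus-float comparison suggested above could look roughly like the sketch below. It is an illustration under assumptions, not code from this thread: the function names `chain_int` and `chain_float` and the constants are made up, and plain C is used so the pair can be checked on the host. The point is that both variants share the same loop structure and dependency chain, so a timing gap between them points at the integer operations rather than at the surrounding code.

```c
#include <stdint.h>

/* Integer variant: one multiply and one add per iteration,
 * each step depending on the previous result. */
uint32_t chain_int(uint32_t x, uint32_t rounds)
{
    for (uint32_t i = 0; i < rounds; ++i) {
        x = x * 3u + 1u;
    }
    return x;
}

/* Float variant: identical loop shape, with the integer multiply/add
 * replaced by a float multiply/add. */
float chain_float(float x, uint32_t rounds)
{
    for (uint32_t i = 0; i < rounds; ++i) {
        x = x * 0.5f + 1.0f;
    }
    return x;
}
```

When porting the pair to kernels, the start value should come from an input buffer and the result should be written to an output buffer, otherwise both loops may be optimized away and the comparison says nothing.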

0 Likes

See http://oscarbg.blogspot.com/2009/10/cal-wrapper-for-getting-amd-il-from.htm

A reply for: "http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=120623&enterthread=y"

I have actually done exactly that: a wrapper around ATI CAL.

It works on Windows and Linux. As a note, I tested your kernel and, as Micah said, the code is much better than what you get with your implementation.

Also note that it outputs the device assembly code as well (so it gives the information that an SKA for OpenCL would).

One limitation of my approach compared to yours: yours can theoretically be run on a Mac (using Wine) to get the AMD IL. My implementation can't, as ATI doesn't ship CAL libraries for Mac OS, and AMD's support on Mac OS does not seem to depend on CAL libraries (I can't check this).



0 Likes