4 Replies Latest reply on Oct 23, 2009 3:05 PM by oscarbarenys1

    Integer multiplication and xor very slow ?

    frankas
      Software emulation in place ?

      FYI: I tried OpenCL and it looks very promising, but some integer operations (xor and multiplication) were very slow. Brook+ gave much better performance.

      I suspect that OpenCL doesn't use the MULT and XOR instructions directly, but falls back on software implementations instead.

       

       

        • Integer multiplication and xor very slow ?
          genaganna

           

          Originally posted by: frankas

          FYI: I tried OpenCL and it looks very promising, but some integer operations (xor and multiplication) were very slow. Brook+ gave much better performance.

          I suspect that OpenCL doesn't use the MULT and XOR instructions directly, but falls back on software implementations instead.

           

          Could you please paste both the Brook+ kernel and the OpenCL kernel, and give the input and output data sizes?

           

           

            • Integer multiplication and xor very slow ?
              frankas

               

              Originally posted by: genaganna

               

              Could you please paste both the Brook+ kernel and the OpenCL kernel, and give the input and output data sizes?

               

              After trying more accurate timing, I have to retract my original assessment of the situation. The loops I used for timing were being optimized away in some cases. However, I do have a specific piece of code that runs much slower in OpenCL.
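The pitfall described here, a timing loop that the compiler removes because nothing uses its result, can be illustrated on the host side. The following is only a minimal sketch in plain C, not the poster's kernel: the function names `mix_loop` and `time_mix_loop` and the mixing constant are made up for illustration. The idea is that every iteration feeds the final value, and that value is stored through a `volatile` variable and handed back to the caller, so the loop cannot be dropped as dead code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical workload: a multiply/xor loop in which each iteration
   depends on the previous one, so the result depends on all of them. */
uint32_t mix_loop(uint32_t seed, uint32_t iterations)
{
    uint32_t x = seed;
    for (uint32_t i = 0; i < iterations; ++i) {
        /* Unsigned arithmetic wraps modulo 2^32, which is well defined in C. */
        x = (x * 2654435761u) ^ (x >> 15) ^ i;
    }
    return x;
}

/* Times one call to mix_loop with clock() and returns the elapsed CPU
   time in seconds. The result is written through a volatile variable
   and then to *result_out, so the computation has an observable effect. */
double time_mix_loop(uint32_t seed, uint32_t iterations, uint32_t *result_out)
{
    assert(result_out != NULL);

    volatile uint32_t sink;
    clock_t start = clock();
    sink = mix_loop(seed, iterations);
    clock_t stop = clock();

    *result_out = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The same principle applies to a GPU kernel: have each work-item write a value that depends on the whole loop to an output buffer, and read that buffer back on the host.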

              If I could somehow view the compiled code, I should be able to tell the difference by comparing it with the StreamKernelAnalyzer assembly.

               

              How can I view the compiled OpenCL code?

                • Integer multiplication and xor very slow ?
                  jcpalmer

                  Not sure about viewing compiled code, but FYI, NVIDIA warns against integer division and modulo operations. They say nothing about multiplication.

                  Sources: NVIDIA OpenCL Programming Guide and NVIDIA Best Practices Guide.

                  Apart from the completely blank ATI OpenCL Programming Guide in the SDK and this forum, you are on your own as far as knowing what to do and what not to do when writing ATI-optimized OpenCL code.

                  Assuming that the people writing manuals are not the same as the programmers, a little parallel effort to get just a draft of something might be a decent idea.

                  Just a suggestion: if it is not too difficult, try to run two versions of your OpenCL code, one integer and one float. That would isolate the integer question and separate it from an overall slowdown compared to Brook+.
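One way to set up such a comparison is to write two loop bodies that differ only in the data type and the operations under test. The sketch below uses plain C functions as stand-ins for the two kernel bodies; the names `int_variant` and `float_variant` and all constants are assumptions chosen for illustration, not taken from the original code.

```c
#include <stdint.h>

/* Integer variant: one multiply and one xor per iteration. */
uint32_t int_variant(uint32_t x, uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i) {
        x = (x * 1664525u) ^ 1013904223u;
    }
    return x;
}

/* Float variant: same loop structure, with one multiply and one add
   per iteration. The constants keep the value bounded (it converges
   towards 0.5), so long runs do not overflow to infinity. */
float float_variant(float x, uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i) {
        x = x * 0.5f + 0.25f;
    }
    return x;
}
```

If both variants are run with the same iteration count and the same work size, and only the integer one is slow, the integer operations are the likely cause. If both are slow, the problem is more likely elsewhere.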

                  • Integer multiplication and xor very slow ?
                    oscarbarenys1

                    See http://oscarbg.blogspot.com/2009/10/cal-wrapper-for-getting-amd-il-from.htm

                    A reply for: "http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=120623&enterthread=y"

                    I have actually done exactly that: a wrapper for ATI CAL.

                    It works on Windows and Linux. As a note, I tested your kernel and, as Micah said, it is much better code than what you get with your implementation.

                    Also note that it outputs the device assembly code as well (so it provides the same information an SKA for OpenCL would).

                    One limitation of my approach vs. yours is that yours can theoretically be run on a Mac (using Wine) to get the AMD IL. My implementation can't, as ATI doesn't ship CAL libraries for Mac OS, and the AMD support in Mac OS also doesn't seem to depend on CAL libraries (I can't check this).



                     
