3 Replies Latest reply on Sep 6, 2017 3:02 AM by dipak

    Best practices in OpenCL GPU performance programming

    jpsollie

      Hello everyone,

       

      I have a OpenCL 1.2 kernel which is performing far from optimal on a R9 Fury (a 32 core bulldozer CPU works twice as fast ...)

      I think the problem may be in the amount of bit-shift and and, or and xor on char arrays

      hereby the instruction count of the kernel from oclgrind from a reduced problem:

      Instructions executed for kernel:

              12566568 - add

              10649560 - xor

               8650762 - or

               7774218 - lshr

               7421960 - shl

               5783500 - load private (22183918 bytes)

               4767752 - store private (15466538 bytes)

               4227257 - br

               3858597 - getelementptr

               2277394 - and

               2179249 - icmp

               2129972 - zext

               2072606 - phi

                770050 - load local (3080200 bytes)

                679940 - bitcast

                524288 - call llvm.bswap.i32()

                475145 - trunc

                344066 - store local (1376264 bytes)

                303240 - ret

                163846 - sub

                147464 - sext

                147460 - ptrtoint

                 81922 - call memcpy()

                 65572 - load constant (114724 bytes)

                 57482 - alloca

                 49152 - call 1Update()

                 32768 - call 1Transform()

                 16386 - call memcpyc()

                 16386 - call 1Updatec()

                 16384 - call memcpyzero()

                 16384 - call 1Updatezero()

                 16384 - call 1Final()

                 16384 - call 1Init()

                 16384 - call 1_vector()

                 16384 - call memcpywithIDs()

                 16382 - select

                  8192 - call hmac_sha1_vector()

                   128 - load global (512 bytes)

                    15 - call get_global_id()

                     4 - store global (16 bytes)

       

       

      currently, I can't use the amd APP SDK to further analyze the program, as the running pc has no monitor, and the monitoring pc is an intel/nvidia laptop

      I ordered a mobile firepro card on ebay to be able to execute openCL on that laptop, but in meantime, maybe somebody here already has some interesting thoughts?

      All ideas are appreciated!

       

      What I already thought of:

      instead of putting things like this in my code:

      (ctx->d[128] is a local integer array)

             digest[12] = (unsigned char)(ctx->d[localid]  >> 24);

              digest[13] = (unsigned char)(ctx->d[localid]  >> 16) & 255;

              digest[14] = (unsigned char)(ctx->d[localid]  >> 8) & 255;

              digest[15] = (unsigned char)(ctx->d[localid]) & 255;

      do something like

           typedef union {

                uchar[4] l1;

                uint d;

           } workaround;

      and then use the workaround.l1 instead.

      would this be a good idea? or is this just nonsense?

       

      thanks!