Best practices in OpenCL GPU performance programming

Question asked by jpsollie on Sep 4, 2017
Latest reply on Sep 6, 2017 by dipak

Hello everyone,

 

I have an OpenCL 1.2 kernel that performs far from optimally on an R9 Fury (a 32-core Bulldozer CPU runs it twice as fast ...)

I think the problem may be the number of bit shifts and AND, OR and XOR operations on char arrays.

Here is the instruction count of the kernel, as reported by oclgrind for a reduced problem:

Instructions executed for kernel:
        12566568 - add
        10649560 - xor
         8650762 - or
         7774218 - lshr
         7421960 - shl
         5783500 - load private (22183918 bytes)
         4767752 - store private (15466538 bytes)
         4227257 - br
         3858597 - getelementptr
         2277394 - and
         2179249 - icmp
         2129972 - zext
         2072606 - phi
          770050 - load local (3080200 bytes)
          679940 - bitcast
          524288 - call llvm.bswap.i32()
          475145 - trunc
          344066 - store local (1376264 bytes)
          303240 - ret
          163846 - sub
          147464 - sext
          147460 - ptrtoint
           81922 - call memcpy()
           65572 - load constant (114724 bytes)
           57482 - alloca
           49152 - call 1Update()
           32768 - call 1Transform()
           16386 - call memcpyc()
           16386 - call 1Updatec()
           16384 - call memcpyzero()
           16384 - call 1Updatezero()
           16384 - call 1Final()
           16384 - call 1Init()
           16384 - call 1_vector()
           16384 - call memcpywithIDs()
           16382 - select
            8192 - call hmac_sha1_vector()
             128 - load global (512 bytes)
              15 - call get_global_id()
               4 - store global (16 bytes)

 

 

Currently I can't use the AMD APP SDK to analyze the program further, because the PC that runs it has no monitor, and the PC I monitor it from is an Intel/NVIDIA laptop.

I ordered a mobile FirePro card on eBay so that I can run OpenCL on that laptop, but in the meantime, maybe somebody here already has some interesting thoughts?

All ideas are appreciated!

 

What I already thought of:

Instead of putting things like this in my code (ctx->d[128] is a local integer array):

        digest[12] = (unsigned char)(ctx->d[localid] >> 24);
        digest[13] = (unsigned char)(ctx->d[localid] >> 16) & 255;
        digest[14] = (unsigned char)(ctx->d[localid] >> 8) & 255;
        digest[15] = (unsigned char)(ctx->d[localid]) & 255;

do something like

     typedef union {
          uchar l1[4];   /* on a little-endian device, l1[3] is the most significant byte of d */
          uint d;
     } workaround;

and then read the bytes from workaround.l1 instead.
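To make the comparison concrete, here is a minimal sketch of both variants, written as plain host C rather than kernel code so it can be checked on a CPU. The function names are made up for illustration, and `uint8_t`/`uint32_t` stand in for OpenCL's `uchar`/`uint`. The union variant depends on byte order, so the sketch checks it at run time; in OpenCL C the `__ENDIAN_LITTLE__` macro could serve the same purpose. Whether the union is actually faster on the GPU is not something this sketch can show.

```c
#include <stdint.h>
#include <string.h>

/* Same layout as the proposed union, with C99 fixed-width types. */
typedef union {
    uint8_t  l1[4];
    uint32_t d;
} workaround;

/* Shift version: writes the bytes of v, most significant byte first. */
void store_be_shift(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Returns 1 if the least significant byte is stored first. */
int is_little_endian(void)
{
    workaround w;
    w.d = 1u;
    return w.l1[0] == 1;
}

/* Union version: same output as store_be_shift, but the index order
 * has to be reversed on a little-endian machine. */
void store_be_union(uint8_t *out, uint32_t v)
{
    workaround w;
    w.d = v;
    if (is_little_endian()) {
        out[0] = w.l1[3];
        out[1] = w.l1[2];
        out[2] = w.l1[1];
        out[3] = w.l1[0];
    } else {
        memcpy(out, w.l1, 4);
    }
}
```

Both functions produce the same four bytes, so the union only changes how the bytes are extracted, not the result. The index reversal on little-endian hardware is the detail that is easy to get wrong when replacing the shifts.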

Would this be a good idea, or is it just nonsense?

 

thanks!
