cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

jpsollie
Adept II

Best practices in OpenCL GPU performance programming

Hello everyone,

I have a OpenCL 1.2 kernel which is performing far from optimal on a R9 Fury (a 32 core bulldozer CPU works twice as fast ...)

I think the problem may be in the amount of bit-shift and and, or and xor on char arrays

hereby the instruction count of the kernel from oclgrind from a reduced problem:

Instructions executed for kernel:

        12566568 - add

        10649560 - xor

         8650762 - or

         7774218 - lshr

         7421960 - shl

         5783500 - load private (22183918 bytes)

         4767752 - store private (15466538 bytes)

         4227257 - br

         3858597 - getelementptr

         2277394 - and

         2179249 - icmp

         2129972 - zext

         2072606 - phi

          770050 - load local (3080200 bytes)

          679940 - bitcast

          524288 - call llvm.bswap.i32()

          475145 - trunc

          344066 - store local (1376264 bytes)

          303240 - ret

          163846 - sub

          147464 - sext

          147460 - ptrtoint

           81922 - call memcpy()

           65572 - load constant (114724 bytes)

           57482 - alloca

           49152 - call 1Update()

           32768 - call 1Transform()

           16386 - call memcpyc()

           16386 - call 1Updatec()

           16384 - call memcpyzero()

           16384 - call 1Updatezero()

           16384 - call 1Final()

           16384 - call 1Init()

           16384 - call 1_vector()

           16384 - call memcpywithIDs()

           16382 - select

            8192 - call hmac_sha1_vector()

             128 - load global (512 bytes)

              15 - call get_global_id()

               4 - store global (16 bytes)

currently, I can't use the amd APP SDK to further analyze the program, as the running pc has no monitor, and the monitoring pc is an intel/nvidia laptop

I ordered a mobile firepro card on ebay to be able to execute openCL on that laptop, but in meantime, maybe somebody here already has some interesting thoughts?

All ideas are appreciated!

What I already thought of:

instead of putting things like this in my code:

(ctx->d[128] is a local integer array)

       digest[12] = (unsigned char)(ctx->d[localid]  >> 24);

        digest[13] = (unsigned char)(ctx->d[localid]  >> 16) & 255;

        digest[14] = (unsigned char)(ctx->d[localid]  >> 😎 & 255;

        digest[15] = (unsigned char)(ctx->d[localid]) & 255;

do something like

     typedef union {

          uchar[4] l1;

          uint d;

     } workaround;

and then use the workaround.l1 instead.

would this be a good idea? or is this just nonsense?

thanks!

0 Likes
3 Replies
dipak
Big Boss

Just want to mention one point. As the above workaround code depends on underlying endianness, so it should be used carefully. Otherwise, portability might be an issue.

0 Likes

Hi Dipak,

Yes, I'm fully aware of that, but I guess that when the code is only meant to be executed on AMD devices, endianess will not be a problem, right?

0 Likes

If you're aware of the pitfalls, you can try the union version and compare it with the previous performance. To improve the effective memory bandwidth, it's better to access the "workaround" union as whole (as 32-bit value) from local memory or global memory and place it in register before accessing the values in element-wise manner.

0 Likes