Archives Discussions

jpsollie · ‎09-04-2017

Hello everyone,

I have a OpenCL 1.2 kernel which is performing far from optimal on a R9 Fury (a 32 core bulldozer CPU works twice as fast ...)

I think the problem may be in the amount of bit-shift and and, or and xor on char arrays

hereby the instruction count of the kernel from oclgrind from a reduced problem:

Instructions executed for kernel:

12566568 - add

10649560 - xor

8650762 - or

7774218 - lshr

7421960 - shl

5783500 - load private (22183918 bytes)

4767752 - store private (15466538 bytes)

4227257 - br

3858597 - getelementptr

2277394 - and

2179249 - icmp

2129972 - zext

2072606 - phi

770050 - load local (3080200 bytes)

679940 - bitcast

524288 - call llvm.bswap.i32()

475145 - trunc

344066 - store local (1376264 bytes)

303240 - ret

163846 - sub

147464 - sext

147460 - ptrtoint

81922 - call memcpy()

65572 - load constant (114724 bytes)

57482 - alloca

49152 - call 1Update()

32768 - call 1Transform()

16386 - call memcpyc()

16386 - call 1Updatec()

16384 - call memcpyzero()

16384 - call 1Updatezero()

16384 - call 1Final()

16384 - call 1Init()

16384 - call 1_vector()

16384 - call memcpywithIDs()

16382 - select

8192 - call hmac_sha1_vector()

128 - load global (512 bytes)

15 - call get_global_id()

4 - store global (16 bytes)

currently, I can't use the amd APP SDK to further analyze the program, as the running pc has no monitor, and the monitoring pc is an intel/nvidia laptop

I ordered a mobile firepro card on ebay to be able to execute openCL on that laptop, but in meantime, maybe somebody here already has some interesting thoughts?

All ideas are appreciated!

What I already thought of:

instead of putting things like this in my code:

(ctx->d[128] is a local integer array)

digest[12] = (unsigned char)(ctx->d[localid] >> 24);

digest[13] = (unsigned char)(ctx->d[localid] >> 16) & 255;

digest[14] = (unsigned char)(ctx->d[localid] >> 😎 & 255;

digest[15] = (unsigned char)(ctx->d[localid]) & 255;

do something like

typedef union {

uchar[4] l1;

uint d;

} workaround;

and then use the workaround.l1 instead.

would this be a good idea? or is this just nonsense?

thanks!

dipak · ‎09-05-2017

Just want to mention one point. As the above workaround code depends on underlying endianness, so it should be used carefully. Otherwise, portability might be an issue.

jpsollie · ‎09-05-2017

Hi Dipak,

Yes, I'm fully aware of that, but I guess that when the code is only meant to be executed on AMD devices, endianess will not be a problem, right?

dipak · ‎09-06-2017

If you're aware of the pitfalls, you can try the union version and compare it with the previous performance. To improve the effective memory bandwidth, it's better to access the "workaround" union as whole (as 32-bit value) from local memory or global memory and place it in register before accessing the values in element-wise manner.

Archives Discussions

Best practices in OpenCL GPU performance programming