Hello everyone,
I have an OpenCL 1.2 kernel that performs far from optimally on an R9 Fury (a 32-core Bulldozer CPU runs it twice as fast ...)
I think the problem may be the amount of bit-shifts, ANDs, ORs and XORs on char arrays.
Here is the instruction count of the kernel from Oclgrind, for a reduced problem:
Instructions executed for kernel:
12566568 - add
10649560 - xor
8650762 - or
7774218 - lshr
7421960 - shl
5783500 - load private (22183918 bytes)
4767752 - store private (15466538 bytes)
4227257 - br
3858597 - getelementptr
2277394 - and
2179249 - icmp
2129972 - zext
2072606 - phi
770050 - load local (3080200 bytes)
679940 - bitcast
524288 - call llvm.bswap.i32()
475145 - trunc
344066 - store local (1376264 bytes)
303240 - ret
163846 - sub
147464 - sext
147460 - ptrtoint
81922 - call memcpy()
65572 - load constant (114724 bytes)
57482 - alloca
49152 - call 1Update()
32768 - call 1Transform()
16386 - call memcpyc()
16386 - call 1Updatec()
16384 - call memcpyzero()
16384 - call 1Updatezero()
16384 - call 1Final()
16384 - call 1Init()
16384 - call 1_vector()
16384 - call memcpywithIDs()
16382 - select
8192 - call hmac_sha1_vector()
128 - load global (512 bytes)
15 - call get_global_id()
4 - store global (16 bytes)
Currently I can't use the AMD APP SDK to analyze the program further, as the PC running it has no monitor, and the monitoring PC is an Intel/NVIDIA laptop.
I ordered a mobile FirePro card on eBay to be able to run OpenCL on that laptop, but in the meantime, maybe somebody here already has some interesting thoughts?
All ideas are appreciated!
What I already thought of:
instead of writing things like this in my code:
(ctx->d[128] is a local integer array)
digest[12] = (unsigned char)(ctx->d[localid] >> 24);
digest[13] = (unsigned char)(ctx->d[localid] >> 16) & 255;
digest[14] = (unsigned char)(ctx->d[localid] >> 8) & 255;
digest[15] = (unsigned char)(ctx->d[localid]) & 255;
do something like
typedef union {
    uchar l1[4];   /* array brackets go after the member name */
    uint d;
} workaround;
and then use workaround.l1 instead.
Would this be a good idea, or is it just nonsense?
thanks!
Just want to mention one point: since the workaround code above depends on the underlying endianness, it should be used carefully. Otherwise, portability might be an issue.
Hi Dipak,
Yes, I'm fully aware of that, but I guess that when the code is only meant to run on AMD devices, endianness will not be a problem, right?
If you're aware of the pitfalls, you can try the union version and compare it against the previous performance. To improve the effective memory bandwidth, it's better to read the "workaround" union as a whole (as a 32-bit value) from local or global memory and place it in a register before accessing its bytes element-wise.
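That suggestion could look roughly like this, sketched in plain C (in the kernel, the word would come from __local or __global memory; store_digest_word and its parameters are illustrative names, not from the original code). It assumes little-endian and reverses the union's byte index so the digest keeps the big-endian order of the shift-based version:

```c
#include <stdint.h>

/* Same hypothetical union as before. */
typedef union {
    uint8_t  l1[4];
    uint32_t d;
} workaround;

/* Load the 32-bit word ONCE into a register-resident union, then
   peel bytes out of the private copy, instead of issuing four
   separate sub-word reads against local/global memory. */
static void store_digest_word(uint32_t word, uint8_t digest[4]) {
    workaround w;
    w.d = word;              /* single 32-bit load into a register */
    digest[0] = w.l1[3];     /* == (word >> 24)        on little-endian */
    digest[1] = w.l1[2];     /* == (word >> 16) & 255                   */
    digest[2] = w.l1[1];     /* == (word >>  8) & 255                   */
    digest[3] = w.l1[0];     /* ==  word        & 255                   */
}
```

The point is that the compiler sees one word-sized access to memory and four register-only byte moves, rather than four shift-and-mask chains each touching memory.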