Archives Discussions

BarsMonster · ‎01-05-2009

I've just gotten my MD5 brute-forcer (called BarsWF 🙂) to work on an AMD card using Brook. (The CUDA and SSE2 cores are already done and run at around 100% of theoretical performance.)

But if my kernel is "single-threaded" (i.e. it processes 1 key), I only get around 20% of theoretical performance.

So to get 60% of theoretical performance I had to MANUALLY copy the code 5 times inside the kernel so that it works on 5 keys at the same time and uses the VLIW units more effectively.
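For readers who want to see what this manual interleaving looks like, here is a minimal sketch in plain C rather than Brook. The mixing round is a toy stand-in, not real MD5, and the names `rotl32`, `mix_one`, `mix_five` and `KEYS_PER_THREAD` are made up for illustration. The point is only that the five chains never read each other's state, so a VLIW scheduler has independent instructions to pack together.

```c
#include <stdint.h>

/* Rotate left; n must be in 1..31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Toy mixing round (NOT real MD5). Every step needs the previous value
   of `a`, so one key is a single long dependency chain. */
uint32_t mix_one(uint32_t key)
{
    uint32_t a = 0x67452301u;
    for (uint32_t i = 0; i < 16u; ++i) {
        a = rotl32(a + (key ^ i), 7) + 0x5a827999u;
    }
    return a;
}

#define KEYS_PER_THREAD 5

/* Same round, hand-copied for 5 keys. The five statements in the loop
   body are independent of each other, which is what lets a VLIW
   compiler issue them side by side. */
void mix_five(const uint32_t key[KEYS_PER_THREAD],
              uint32_t out[KEYS_PER_THREAD])
{
    uint32_t a0 = 0x67452301u, a1 = a0, a2 = a0, a3 = a0, a4 = a0;
    for (uint32_t i = 0; i < 16u; ++i) {
        a0 = rotl32(a0 + (key[0] ^ i), 7) + 0x5a827999u;
        a1 = rotl32(a1 + (key[1] ^ i), 7) + 0x5a827999u;
        a2 = rotl32(a2 + (key[2] ^ i), 7) + 0x5a827999u;
        a3 = rotl32(a3 + (key[3] ^ i), 7) + 0x5a827999u;
        a4 = rotl32(a4 + (key[4] ^ i), 7) + 0x5a827999u;
    }
    out[0] = a0; out[1] = a1; out[2] = a2; out[3] = a3; out[4] = a4;
}
```

The cost is the one described in the post: five times the live registers and five times the instructions per thread, in exchange for more work that can be issued in parallel.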

Why can't brcc or the CAL runtime do that on its own?

Can we have a hint for it, like "process 10 logical threads in 1 physical thread"? If the CAL backend sees this hint, it would use 10 times more registers and copy all instructions 10 times.

Or is this already supposed to work, and I am doing something wrong?

gaurav_garg · ‎01-05-2009

The CAL compiler always tries to exploit ILP by scheduling sets of instructions onto the VLIW units. But it requires the instructions in each set to be independent of each other.

The compiler doesn't merge multiple logical threads into a single physical thread, but you can expose more ILP to it with different techniques, such as using streams of vectorized datatypes.

udeepta · ‎01-05-2009

If you are using float, try using float4. In my experience, the compiler tries to and actually manages to do a good job with float streams, but not always.
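To make the float4 suggestion concrete, here is a rough sketch in plain C of the same idea applied to an integer workload: pack four keys into one 4-wide element and write every operation component-wise. The `u32x4` struct and its helpers are hypothetical stand-ins for a vector stream type, not Brook or CAL API, and the mixing round is a toy, not real MD5.

```c
#include <stdint.h>

/* Hypothetical 4-wide element, in the spirit of float4. */
typedef struct { uint32_t x, y, z, w; } u32x4;

static uint32_t rotl_u32(uint32_t v, unsigned n)   /* n in 1..31 */
{
    return (v << n) | (v >> (32u - n));
}

static u32x4 splat4(uint32_t v)
{
    u32x4 r = { v, v, v, v };
    return r;
}

static u32x4 add4(u32x4 a, u32x4 b)
{
    u32x4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

static u32x4 xor4(u32x4 a, u32x4 b)
{
    u32x4 r = { a.x ^ b.x, a.y ^ b.y, a.z ^ b.z, a.w ^ b.w };
    return r;
}

static u32x4 rotl4(u32x4 a, unsigned n)
{
    u32x4 r = { rotl_u32(a.x, n), rotl_u32(a.y, n),
                rotl_u32(a.z, n), rotl_u32(a.w, n) };
    return r;
}

/* Scalar reference: one key, one dependency chain. */
uint32_t mix_scalar(uint32_t key)
{
    uint32_t a = 0x67452301u;
    for (uint32_t i = 0; i < 16u; ++i) {
        a = rotl_u32(a + (key ^ i), 7) + 0x5a827999u;
    }
    return a;
}

/* Vector form: four keys per element. Each component is its own chain,
   so every step offers four independent operations. */
u32x4 mix_vec4(u32x4 keys)
{
    u32x4 a = splat4(0x67452301u);
    for (uint32_t i = 0; i < 16u; ++i) {
        u32x4 t = add4(a, xor4(keys, splat4(i)));
        a = add4(rotl4(t, 7), splat4(0x5a827999u));
    }
    return a;
}
```

Written this way the kernel source stays a single copy of the algorithm, and the four lanes give the compiler independent work without hand-duplicated code.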

BarsMonster · ‎01-05-2009

Well, float4 will load just 4 ALU units, while there are 5 of them in hardware, right? :-S

Update: My previous report about VLIW instructions not being 100% loaded was not correct; I was looking at HD2900 code 🙂

gaurav_garg · ‎01-05-2009

It seems something is really wrong with Kernel Analyzer. Do you see the same behavior with brcc?

======

Some of these are known issues where the CAL compiler is not able to optimize the IL generated by brcc. Please post such test cases; they will be very helpful for fixing these bugs.

BarsMonster · ‎01-05-2009

.


Kernels "combining"