

BarsMonster
Journeyman III

Kernels "combining"

I've just managed to get my MD5 brute-forcer (called BarsWF 🙂) working on an AMD card using Brook. (The CUDA and SSE2 cores are already done and run at around 100% of theoretical performance.)

But if my kernel is "single-threaded" (i.e. it processes 1 key), I get only around 20% of theoretical performance.

So to get 60% of theoretical performance I had to MANUALLY copy the code 5 times inside the kernel to work on 5 keys at the same time, so that it uses the VLIW units more effectively.
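The 20% figure is consistent with simple slot counting: if a VLIW bundle has 5 ALU slots and every instruction depends on the previous one, only 1 slot in 5 can be filled. Below is a minimal sketch in plain Python (not Brook+ code), assuming 5 slots per bundle and a purely serial dependency chain per key. The names `chain`, `pack` and `utilization` are hypothetical and exist only for this illustration. The model ignores every other hardware constraint, so it gives an ideal upper bound, not the 60% measured above.

```python
SLOTS = 5  # assumed number of ALU slots per VLIW bundle

def chain(key, length):
    """Serial dependency chain: op i of a key depends on op i-1 of the same key."""
    return [((key, i), (key, i - 1) if i > 0 else None) for i in range(length)]

def pack(ops, slots=SLOTS):
    """Greedy packing: an op may issue only if its dependency finished in an earlier bundle."""
    done = set()
    pending = list(ops)
    bundles = []
    while pending:
        bundle, rest = [], []
        for name, dep in pending:
            if len(bundle) < slots and (dep is None or dep in done):
                bundle.append(name)
            else:
                rest.append((name, dep))
        done.update(bundle)  # results become visible to the next bundle
        bundles.append(bundle)
        pending = rest
    return bundles

def utilization(bundles, slots=SLOTS):
    """Fraction of ALU slots that hold an instruction."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * slots)

one_key = pack(chain(0, 64))
five_keys = pack([op for k in range(5) for op in chain(k, 64)])
print(utilization(one_key))    # 0.2
print(utilization(five_keys))  # 1.0
```

Both schedules take 64 bundles; the five-key version simply does five times the work in the same number of bundles.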

Why can't brcc or the CAL runtime do that on its own?

Can we have a hint for it, like "process 10 logical threads in 1 physical thread"? If the CAL backend sees this hint, it would use 10 times more registers and copy all commands 10 times.
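To make the requested transformation concrete, here is a rough sketch of what such a hint would have to do to a straight-line kernel body: rename the registers for each logical thread and interleave the copies. It is plain Python over a made-up instruction format, not anything brcc or CAL actually provides; `replicate`, `registers` and the three-instruction `body` are hypothetical.

```python
def replicate(program, copies):
    """Interleave `copies` renamed copies of a straight-line program.

    Each instruction is (dst, op, srcs). Registers get a per-copy suffix,
    so the copies never depend on one another.
    """
    out = []
    for dst, op, srcs in program:
        for c in range(copies):
            out.append((f"{dst}_{c}", op, tuple(f"{s}_{c}" for s in srcs)))
    return out

def registers(program):
    """Set of all register names a program reads or writes."""
    regs = set()
    for dst, _, srcs in program:
        regs.add(dst)
        regs.update(srcs)
    return regs

# Hypothetical kernel body that works on one key.
body = [
    ("t0", "xor", ("a", "b")),
    ("t1", "add", ("t0", "c")),
    ("a",  "rot", ("t1",)),
]

wide = replicate(body, 10)
print(len(body), len(registers(body)))  # 3 5
print(len(wide), len(registers(wide)))  # 30 50
```

As the post says, both the instruction count and the register count grow by the replication factor, so register pressure limits how far this can be pushed.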

Or is this already supposed to work, and I am doing something wrong?

5 Replies
gaurav_garg
Adept I

The CAL compiler always tries to exploit ILP by executing sets of instructions on the VLIW units. But it requires the instructions in each set to be independent of each other.

The compiler doesn't merge multiple logical threads into a single physical thread, but you can give it a hint for ILP with techniques such as using streams of vectorized data types.


If you are using float, try using float4. In my experience, the compiler tries to do a good job with float streams and often manages it, but not always.


Well, float4 will load just 4 ALU units, while there are 5 of them in hardware, right? :-S
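The arithmetic behind this question can be written out as a quick lane count. It is a sketch in plain Python, assuming 5 ALU slots per bundle and that each float4 component contributes exactly one independent operation per bundle; `peak_fraction` is a hypothetical helper. The model says nothing about whether the compiler finds other independent work for the fifth slot.

```python
SLOTS = 5  # assumed number of ALU slots per VLIW bundle

def peak_fraction(independent_lanes, slots=SLOTS):
    """Upper bound on slot usage when each bundle has `independent_lanes` independent ops."""
    return min(independent_lanes, slots) / slots

for name, lanes in [("float", 1), ("float4", 4), ("5 keys", 5)]:
    print(name, peak_fraction(lanes))
# float 0.2
# float4 0.8
# 5 keys 1.0
```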


Update: My previous report that the VLIW instructions were not 100% loaded is not correct; I was looking at HD2900 code 🙂


It seems something is really wrong with Kernel Analyzer. Do you see the same behavior with brcc?

======

Some of these issues are known: there are cases where the CAL compiler is not able to optimize the IL generated by brcc. Please post such test cases. They will be very helpful for fixing these bugs.

