I've just managed to get my MD5 brute-forcer (called BarsWF :-) ) working on an AMD card using Brook. (The CUDA and SSE2 cores are already done and run at close to 100% of theoretical performance.)
But if my kernel is "single-threaded" (i.e. it processes 1 key per thread), I only get around 20% of theoretical performance.
To reach 60% of theoretical performance, I had to MANUALLY copy the code 5 times inside the kernel so that it works on 5 keys at the same time and uses the VLIW units more effectively.
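To illustrate what the manual copying amounts to, here is a rough sketch in plain C. It is not the actual BarsWF kernel and not Brook code. It uses only the first four MD5 round-1 steps on a single message word, and the names `hash_one`, `hash_many` and `KEYS_PER_THREAD`, as well as the final XOR of the state words, are made up for the example. The point is only that each step for one key depends on the previous step, while the same step for 5 different keys is independent and can be packed into one VLIW bundle.

```c
#include <stdint.h>

#define KEYS_PER_THREAD 5  /* hypothetical name; 5 matches the number of manual copies */

static uint32_t rotl32(uint32_t x, unsigned s)
{
    return (x << s) | (x >> (32u - s));  /* s is always in 1..31 here */
}

/* One MD5 round-1 step: a = b + rotl(a + F(b,c,d) + m + t, s) */
static uint32_t md5_step_f(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                           uint32_t m, uint32_t t, unsigned s)
{
    uint32_t f = (b & c) | (~b & d);
    return b + rotl32(a + f + m + t, s);
}

/* "Single-threaded" variant: one key, every step waits for the previous one. */
static uint32_t hash_one(uint32_t key)
{
    uint32_t a = 0x67452301u, b = 0xefcdab89u, c = 0x98badcfeu, d = 0x10325476u;
    a = md5_step_f(a, b, c, d, key, 0xd76aa478u, 7);
    d = md5_step_f(d, a, b, c, key, 0xe8c7b756u, 12);
    c = md5_step_f(c, d, a, b, key, 0x242070dbu, 17);
    b = md5_step_f(b, c, d, a, key, 0xc1bdceeeu, 22);
    return a ^ b ^ c ^ d;  /* toy combination, not part of MD5 */
}

/* Interleaved variant: the same step is issued for all keys before moving on,
   so there are KEYS_PER_THREAD independent operations available at each step. */
static void hash_many(const uint32_t key[KEYS_PER_THREAD],
                      uint32_t out[KEYS_PER_THREAD])
{
    uint32_t a[KEYS_PER_THREAD], b[KEYS_PER_THREAD];
    uint32_t c[KEYS_PER_THREAD], d[KEYS_PER_THREAD];
    int i;

    for (i = 0; i < KEYS_PER_THREAD; i++) {
        a[i] = 0x67452301u; b[i] = 0xefcdab89u;
        c[i] = 0x98badcfeu; d[i] = 0x10325476u;
    }
    for (i = 0; i < KEYS_PER_THREAD; i++)
        a[i] = md5_step_f(a[i], b[i], c[i], d[i], key[i], 0xd76aa478u, 7);
    for (i = 0; i < KEYS_PER_THREAD; i++)
        d[i] = md5_step_f(d[i], a[i], b[i], c[i], key[i], 0xe8c7b756u, 12);
    for (i = 0; i < KEYS_PER_THREAD; i++)
        c[i] = md5_step_f(c[i], d[i], a[i], b[i], key[i], 0x242070dbu, 17);
    for (i = 0; i < KEYS_PER_THREAD; i++)
        b[i] = md5_step_f(b[i], c[i], d[i], a[i], key[i], 0xc1bdceeeu, 22);
    for (i = 0; i < KEYS_PER_THREAD; i++)
        out[i] = a[i] ^ b[i] ^ c[i] ^ d[i];
}
```

Calling `hash_many` on an array of 5 keys gives the same values as calling `hash_one` on each key separately; only the instruction order changes. In the real kernel the fixed-length loops are what I wrote out by hand as 5 copies.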
Why can't brcc or the CAL runtime do that on its own?
Could we have a hint for it, something like "process 10 logical threads in 1 physical thread"? If the CAL backend saw this hint, it would use 10 times more registers and replicate every instruction 10 times.
Or is this already supposed to work, and I am doing something wrong?