Kernels "combining"

Discussion created by BarsMonster on Jan 5, 2009
Latest reply on Jan 5, 2009 by BarsMonster

I've just managed to get my MD5 brute-forcer (called BarsWF :-) ) working on an AMD card using Brook. (The CUDA and SSE2 cores are already done and run at around 100% of theoretical performance.)

But if my kernel is "single-threaded" (i.e. each thread processes 1 key), I only get around 20% of theoretical performance.

So to get 60% of theoretical performance, I had to MANUALLY copy the code 5 times inside the kernel so that it works on 5 keys at the same time and utilizes the VLIW units more effectively.
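
To illustrate what I mean by copying the code by hand, here is a minimal C++ sketch. It is not the actual BarsWF kernel and not Brook code: `toy_step` is a made-up stand-in for a hash round (it is NOT real MD5), and `hash_one` / `hash_two` are hypothetical names. It shows two keys instead of five to keep it short; the idea is the same.

```cpp
#include <cassert>
#include <cstdint>

using u32 = std::uint32_t;

// Rotate left by s bits; s must be in 1..31.
static inline u32 rotl(u32 x, int s) {
    assert(s > 0 && s < 32);
    return (x << s) | (x >> (32 - s));
}

// Hypothetical stand-in for one hash step. This is NOT real MD5.
static inline u32 toy_step(u32 a, u32 b, u32 k) {
    return b + rotl(a + (b ^ k) + 0x5A827999u, 7);
}

// One key per thread: every step needs the result of the previous step,
// so there is a single serial dependency chain.
u32 hash_one(u32 key) {
    u32 a = 0x67452301u, b = 0xEFCDAB89u;
    for (u32 i = 0; i < 16; ++i) {
        u32 t = toy_step(a, b, key + i);
        a = b;
        b = t;
    }
    return a ^ b;
}

// Two keys per thread, copied by hand: the two chains never read each
// other's values, so a compiler is free to schedule them side by side.
void hash_two(u32 key0, u32 key1, u32& out0, u32& out1) {
    u32 a0 = 0x67452301u, b0 = 0xEFCDAB89u;
    u32 a1 = 0x67452301u, b1 = 0xEFCDAB89u;
    for (u32 i = 0; i < 16; ++i) {
        u32 t0 = toy_step(a0, b0, key0 + i);
        u32 t1 = toy_step(a1, b1, key1 + i);
        a0 = b0; b0 = t0;
        a1 = b1; b1 = t1;
    }
    out0 = a0 ^ b0;
    out1 = a1 ^ b1;
}
```

The results are identical to running `hash_one` on each key separately; the only difference is that each iteration now contains independent instructions that can be packed into the same VLIW bundle.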

Why can't brcc or the CAL runtime do that on its own?

Can we have a hint for it, like "process 10 logical threads in 1 physical thread"? If the CAL backend sees this hint, it would use 10 times more registers and emit every instruction 10 times, once per logical thread.
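
As a rough picture of what such a hint could expand to, here is a C++ sketch where the number of logical threads per physical thread is a compile-time parameter. This is only an analogy: as far as I know, brcc/CAL has no such option, and `hash_lanes`, `hash_single`, and `toy_round` are hypothetical names (again a toy round, NOT real MD5).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u32 = std::uint32_t;

// Rotate left by s bits; s must be in 1..31.
static inline u32 rotl32(u32 x, int s) {
    assert(s > 0 && s < 32);
    return (x << s) | (x >> (32 - s));
}

// Hypothetical stand-in for one hash round. This is NOT real MD5.
static inline u32 toy_round(u32 a, u32 b, u32 k) {
    return b + rotl32(a + (b ^ k) + 0x5A827999u, 7);
}

// Reference: one logical thread, one key.
u32 hash_single(u32 key) {
    u32 a = 0x67452301u, b = 0xEFCDAB89u;
    for (u32 i = 0; i < 16; ++i) {
        u32 t = toy_round(a, b, key + i);
        a = b;
        b = t;
    }
    return a ^ b;
}

// N logical threads in one physical thread. The state arrays hold N times
// as many values (the "N times more registers"), and the inner loop repeats
// every round N times, once per independent lane.
template <std::size_t N>
std::array<u32, N> hash_lanes(const std::array<u32, N>& keys) {
    static_assert(N > 0, "need at least one lane");
    std::array<u32, N> a, b, out;
    a.fill(0x67452301u);
    b.fill(0xEFCDAB89u);
    for (u32 i = 0; i < 16; ++i) {
        for (std::size_t lane = 0; lane < N; ++lane) {
            u32 t = toy_round(a[lane], b[lane], keys[lane] + i);
            a[lane] = b[lane];
            b[lane] = t;
        }
    }
    for (std::size_t lane = 0; lane < N; ++lane) {
        out[lane] = a[lane] ^ b[lane];
    }
    return out;
}
```

With something like `hash_lanes<10>(keys)` the source is written once and the compiler is responsible for the duplication, which is what I would like the hint to do for a Brook kernel.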

Or is this already supposed to work, and I am doing something wrong?