5 Replies Latest reply on Jan 5, 2009 8:37 AM by BarsMonster

    Kernels "combining"

    BarsMonster

      I've just got my MD5 brute-forcer (called BarsWF :-) ) working on an AMD card using Brook. (The CUDA and SSE2 cores are already done and run at around 100% of theoretical performance.)

      But if my kernel is "single-threaded" (i.e. it processes 1 key per thread), I get only around 20% of theoretical performance.

      So to get 60% of theoretical performance I had to MANUALLY copy the code 5 times inside the kernel so that it works on 5 keys at the same time and utilizes the VLIW units more effectively.
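
      A minimal sketch of the manual "combining" described above, written in plain C rather than Brook. The step function is a toy stand-in, not MD5, and the names toy_step, hash_one and hash_five are hypothetical; they are not taken from BarsWF. It only illustrates the shape of the change: one serial dependency chain per thread versus five independent chains interleaved in the same loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEYS_PER_THREAD 5   /* the factor mentioned in the post */
#define ROUNDS 64           /* arbitrary round count for the toy function */

/* Toy stand-in for one hash step (NOT MD5). Each call depends on the
 * previous state, so one key forms a strictly serial chain. */
static uint32_t toy_step(uint32_t state, uint32_t round)
{
    state += round * 0x9E3779B9u;
    state ^= (state << 7) | (state >> 25);  /* xor with own rotate-left by 7 */
    return state;
}

/* One key per logical thread: a single dependency chain. */
static uint32_t hash_one(uint32_t key)
{
    uint32_t s = key;
    for (uint32_t r = 0; r < ROUNDS; ++r)
        s = toy_step(s, r);
    return s;
}

/* Five keys per logical thread: the same work copied five times, so each
 * round contains five operations that do not depend on each other. */
static void hash_five(const uint32_t *keys, uint32_t *out)
{
    assert(keys != NULL && out != NULL);

    uint32_t s0 = keys[0], s1 = keys[1], s2 = keys[2],
             s3 = keys[3], s4 = keys[4];

    for (uint32_t r = 0; r < ROUNDS; ++r) {
        s0 = toy_step(s0, r);
        s1 = toy_step(s1, r);
        s2 = toy_step(s2, r);
        s3 = toy_step(s3, r);
        s4 = toy_step(s4, r);
    }

    out[0] = s0; out[1] = s1; out[2] = s2; out[3] = s3; out[4] = s4;
}
```

      Both versions compute the same result per key; only the amount of independent work visible inside one thread changes. With a single chain, a five-wide VLIW unit can fill roughly one slot per instruction, which would be consistent with the ~20% figure, though that reading is an assumption and not something measured here.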

      Why can't the brcc/CAL runtime do that on its own?

      Can we have a hint for it, like "process 10 logical threads in 1 physical thread"? If the CAL backend sees this hint, it would use 10 times more registers and copy all commands 10 times.

      Or is this already supposed to work, and I am doing something wrong?