I'm currently using Brook+ on my 3870x2 to write a texture synthesis program that takes three 64x64 images as input. I've implemented my first step, but I was a bit disappointed by the performance. When I checked GPU-Z, the GPU load meter was only high (80+%) for about a second and then stayed around 5% for the rest of the execution, which takes about 10-15 minutes!
This is my code to call my kernel (CompareCross):
for(int xy = 0; xy < 64; xy++)
    for(int xx = 0; xx < 64; xx++)
        for(int yy = 0; yy < 64; yy++)
            for(int yx = 0; yx < 64; yx++)
                CompareCross(int2(xx, xy), stmExemplarX, int2(yx, yy), stmExemplarY, stmExemplarZ, stmOutput);
I'm wondering why this is the case. Is it the nested loops? I figure if my kernel were badly written, I'd still see a lot of GPU activity. If anyone has some advice or tips, they'd be greatly appreciated.
The initial spike in GPU load was most likely caused by data transfers to the GPU. The problem is that your data is so small that each kernel call barely stresses the GPU: the four nested loops issue 64^4 (about 16.7 million) separate calls, each over a tiny domain. You should think about moving one or more of the loops into the kernel so that more of the execution happens on the GPU.
I.e. instead of running 64x64 threads per call, run 64x64x64 (or 512x512) threads and do 64 CompareCross evaluations in parallel.
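To make the 512x512 idea a bit more concrete, here is a rough sketch of the index arithmetic in plain C++ (not Brook+). It assumes each CompareCross call covers a 64x64 domain and that the innermost yx loop is the one folded into the kernel, with the 512x512 domain treated as an 8x8 grid of 64x64 tiles, one tile per yx value. The names kTile, kTilesPerRow, Work, decode and launchCount are made up for illustration and are not part of Brook+ or the original program.

```cpp
#include <cassert>

// Hypothetical layout: a 512x512 domain seen as an 8x8 grid of 64x64 tiles.
// Each tile handles one value of the folded loop variable yx (0..63).
const int kTile = 64;
const int kTilesPerRow = 8;
const int kDomain = kTile * kTilesPerRow;  // 512

// What a single thread at domain position (gx, gy) would work on.
struct Work {
    int px;  // x position inside the 64x64 tile
    int py;  // y position inside the 64x64 tile
    int yx;  // value of the loop variable that was moved into the kernel
};

// Map a position in the 512x512 domain back to (pixel, yx).
inline Work decode(int gx, int gy) {
    assert(gx >= 0 && gx < kDomain && gy >= 0 && gy < kDomain);
    Work w;
    w.px = gx % kTile;
    w.py = gy % kTile;
    w.yx = (gy / kTile) * kTilesPerRow + (gx / kTile);
    return w;
}

// Number of kernel calls when hostLoops of the 64-iteration loops stay on the host.
inline long long launchCount(int hostLoops) {
    long long n = 1;
    for (int i = 0; i < hostLoops; ++i) n *= 64;
    return n;
}
```

With all four loops on the host, launchCount(4) gives 16,777,216 calls. Folding yx into the domain leaves launchCount(3) = 262,144 calls, each covering 512x512 = 262,144 threads. Inside a real kernel you would do the same decoding from the thread's position in the output stream.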