tlrmchlsmth
Journeyman III

Large Allocation Makes Kernel Run Slowly

I'm using a Radeon HD 5870 with Windows 7 and the ATI Stream SDK 2.1.

I wrote a matrix multiplication kernel, and if I declare a 64x64 matrix of floats inside it, the kernel takes approximately 3 times longer to run, even if I never touch the array. Does anybody know why this is?

0 Likes
3 Replies
ryta1203
Journeyman III

It takes 3x longer than what? The CPU version?

For starters, 64x64 is way too small for a GPU. Try 8192x8192 and see if the GPU still takes 3x longer than the CPU, or just try 2048x2048 or even 1024x1024.

Also, as a side note, use float4s instead of floats.

0 Likes

To clarify,

If I do:

 __local float junk[64][64];

inside my OpenCL kernel, the kernel runs about 3x slower than if that matrix is not declared. This is true even if I do not touch the matrix at all. It also holds across all sizes of the matrices being multiplied, so the additional time is not some constant overhead.

 

Any ideas?

0 Likes

If you can provide a test case, we can see if it can be fixed in time for our next release.
0 Likes