
notyou
Adept III

Slow Computation/Crashing Driver

Hello everyone,

I've been working on advancing my matrix multiplication code and I now have the algorithm working 100% on my OpenCL CPU device. The issue I'm having is that the computation is very slow when compared to the provided MM sample or even host based computation. Keep in mind that I am still very new to OpenCL, so any tips to increase my application's performance are welcome.

Also, it seems that when moving to the GPU device, once I hit a size of 256x256, the driver will crash or at least seem to hang for a very long time. This leads to an execution time greater than both the host CPU doing the calculation and the OpenCL CPU device doing the calculation.

    __kernel void globalMM(__global int *A, __global int *B, __global int *C,
                           int dimensions, int block_size)
    {
        int group_id0 = get_group_id(0);
        int group_id1 = get_group_id(1);
        int local_id0 = get_local_id(0);
        int local_id1 = get_local_id(1);

        int row = (group_id0 * block_size) + local_id0;
        int col = (group_id1 * block_size) + local_id1;

        for(int k = 0; k < dimensions; k++)
        {
            C[row * dimensions + col] = 0;
            for(int j = 0; j < dimensions; j++)
                C[row * dimensions + col] += A[row * dimensions + j] * B[j * dimensions + col];
        }
    }

notyou
Adept III

Bump. Does anyone see anything wrong here? Would any system information help debug this problem?


Sorry to say it, but everything is wrong with your code. It's probably the worst possible implementation of matrix mul. You are wasting bandwidth and FPU resources like hell.

The first thing to do would be to remove the unnecessary integer multiplications from inside the main loop (this is one of the most basic optimizations!).
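A minimal sketch of that first suggestion, written as plain C so it can be compiled and checked on the host. The names `mm_work_item` and `mm_all`, and the explicit `row`/`col` parameters, are assumptions for illustration: in an OpenCL kernel `row` and `col` would come from `get_global_id(0)` and `get_global_id(1)`, and the pointers would be `__global`. The row offset is computed once, the index into B advances by addition, and the sum is kept in a private variable so that C is written only once.

```c
/* Work done by ONE work-item: compute C[row][col] of an n x n product.
 * Hypothetical host-side stand-in for a kernel body, not the original code. */
void mm_work_item(const int *A, const int *B, int *C, int n, int row, int col)
{
    const int *a_row = A + row * n;  /* start of A's row, computed once */
    int b_idx = col;                 /* index of B[0][col] */
    int sum = 0;                     /* private accumulator, no global traffic */

    for (int j = 0; j < n; j++) {
        sum += a_row[j] * B[b_idx];
        b_idx += n;                  /* step to the next row of B by adding */
    }
    C[row * n + col] = sum;          /* single write to C */
}

/* Host-side emulation of a 2D NDRange: one call per (row, col) pair. */
void mm_all(const int *A, const int *B, int *C, int n)
{
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++)
            mm_work_item(A, B, C, n, row, col);
}
```

Compared with the posted kernel, the inner loop no longer recomputes `row * dimensions` or `j * dimensions` on every iteration, and it no longer reads and writes C in global memory on every iteration.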

The second is to change the algorithm to block matrix multiplication (saving bandwidth).

To write an efficient matrix mul you really need to think about GPU architecture, cache usage, and available FP resources (like mad).

Get CAL++ (http://sourceforge.net/projects/calpp/). Its examples include the fastest known matrix multiplication on ATI's cards. You can try to convert it to OpenCL.

 


notyou,
There are slides on how to write Matrix Multiplication on the GPU in our documentation section here:
http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf

Please refer to them for hints on how to run matmult efficiently.

Originally posted by: hazeman Sorry to say it, but everything is wrong with your code. It's probably the worst possible implementation of matrix mul. You are wasting bandwidth and FPU resources like hell.

The first thing to do would be to remove the unnecessary integer multiplications from inside the main loop (this is one of the most basic optimizations!).

The second is to change the algorithm to block matrix multiplication (saving bandwidth).

To write an efficient matrix mul you really need to think about GPU architecture, cache usage, and available FP resources (like mad).

Get CAL++ (http://sourceforge.net/projects/calpp/). Its examples include the fastest known matrix multiplication on ATI's cards. You can try to convert it to OpenCL.



Don't feel sorry for me, I know my code is junk, I just don't know why (since I'm a new, self-"taught" OpenCL coder) - that's why I'm asking here for tips.

By block, do you mean accessing memory one row at a time, rather than in a block like I have now? E.g., for a 16x16 matrix, access row 1 (16 elements) instead of an X by X block (as I believe I'm doing now).
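For reference, "block" in block matrix multiplication usually means square tiles of the matrices, not rows. The result is computed tile by tile, and each tile of A and B is loaded once and then reused for a whole tile of C. A rough sketch in plain C follows, so it can be checked on the host. The name `mm_blocked` and the tile edge `TILE` are assumptions for illustration. On a GPU, each `(bi, bj)` pair would correspond to one work-group, `tA` and `tB` would be `__local` arrays, and a `barrier(CLK_LOCAL_MEM_FENCE)` would be needed after loading the tiles and again before overwriting them.

```c
#define TILE 2  /* hypothetical tile edge, kept small so the example is easy to check */

/* Blocked n x n integer matrix multiply. Assumes n is a multiple of TILE. */
void mm_blocked(const int *A, const int *B, int *C, int n)
{
    for (int bi = 0; bi < n; bi += TILE) {
        for (int bj = 0; bj < n; bj += TILE) {
            int acc[TILE][TILE] = {{0}};  /* partial sums for one tile of C */

            for (int bk = 0; bk < n; bk += TILE) {
                int tA[TILE][TILE], tB[TILE][TILE];

                /* Stage one tile of A and one tile of B (stand-in for local memory). */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++) {
                        tA[i][j] = A[(bi + i) * n + (bk + j)];
                        tB[i][j] = B[(bk + i) * n + (bj + j)];
                    }

                /* Each staged element is reused TILE times here. */
                for (int i = 0; i < TILE; i++)
                    for (int j = 0; j < TILE; j++)
                        for (int k = 0; k < TILE; k++)
                            acc[i][j] += tA[i][k] * tB[k][j];
            }

            /* Write the finished tile of C once. */
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++)
                    C[(bi + i) * n + (bj + j)] = acc[i][j];
        }
    }
}
```

The bandwidth saving comes from the reuse: a straightforward kernel reads about 2n^3 values from A and B in total, while the tiled form reads about 2n^3 / TILE, because every staged value feeds TILE multiply-adds.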

Originally posted by: MicahVillmow notyou, There are slides on how to write Matrix Multiplication on the GPU in our documentation section here: http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf Please refer to them for hints on how to run matmult efficiently.


Thanks for the paper. Do you have any other resources that may really help a newbie?


@hazeman

I didn't even realize that the outer loop could be removed. *facepalms for missing that one* It made a world of difference in execution time.
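The size of that difference can be estimated from the posted kernel. Each of the n x n work-items ran the inner j loop n times inside the outer k loop, resetting its element of C on every pass, so the whole product was recomputed n times. The small C sketch below just counts inner-loop multiply-adds; the function names are made up for illustration, and the numbers are a back-of-the-envelope estimate, not a measurement.

```c
#include <stdint.h>

/* Multiply-adds for an n x n product when every work-item also runs the
 * redundant outer k loop: n*n work-items, each doing n*n inner iterations. */
uint64_t madds_with_outer_loop(uint64_t n)
{
    return n * n * n * n;
}

/* Same count once the outer loop is removed: n*n work-items, n iterations each. */
uint64_t madds_without_outer_loop(uint64_t n)
{
    return n * n * n;
}
```

For n = 256 this gives 4,294,967,296 multiply-adds with the outer loop and 16,777,216 without it, a factor of n = 256, which is consistent with the very slow runs reported at 256x256.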

 

Originally posted by: MicahVillmow notyou, There are other slides/presentations here: http://developer.amd.com/gpu/A...ages/Publications.aspx and here: http://developer.amd.com/gpu/A...ges/Documentation.aspx


Thanks for the links. I'd seen those pages before but you guys (AMD) have too many articles and I hadn't looked through them all.
