Archives Discussions

notyou · ‎04-05-2010

Hello everyone,

I've been working on advancing my matrix multiplication code and I now have the algorithm working 100% on my OpenCL CPU device. The issue I'm having is that the computation is very slow compared to the provided MM sample, or even to host-based computation. Keep in mind that I am still very new to OpenCL, so any tips to increase my application's performance are welcome.

Also, it seems that when moving to the GPU device, once I hit a size of 256x256, the driver will crash or at least seem to hang for a very long time. This leads to an execution time greater than both the host CPU doing the calculation and the OpenCL CPU device doing the calculation.

__kernel void globalMM(__global int *A, __global int *B, __global int *C, int dimensions, int block_size)
{
    int group_id0 = get_group_id(0);
    int group_id1 = get_group_id(1);
    int local_id0 = get_local_id(0);
    int local_id1 = get_local_id(1);

    int row = (group_id0 * block_size) + local_id0;
    int col = (group_id1 * block_size) + local_id1;

    for(int k = 0; k < dimensions; k++)
    {
        C[row * dimensions + col] = 0;
        for(int j = 0; j < dimensions; j++)
            C[row * dimensions + col] += A[row * dimensions + j] * B[j * dimensions + col];
    }
}

notyou · ‎04-08-2010

Bump. Does anyone see anything wrong here? Would any system information help debug this problem?

hazeman · ‎04-08-2010

Sorry to say it, but everything is wrong with your code. It's probably the worst possible implementation of matrix mul. You are wasting bandwidth and FPU resources like hell.

The first thing to do would be to remove the unnecessary integer multiplications from inside the main loop (this is one of the most basic optimizations!).
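A minimal sketch of what hoisting the index arithmetic could look like. It is written as plain C standing in for one work-item, so `row` and `col` are passed in rather than read from `get_global_id`; the names `mm_item_hoisted` and `mm_all_hoisted` are hypothetical and not from the original post. The row offset is computed once, the index into B advances by addition, and the sum is kept in a private variable so that C is written only once.

```c
#include <assert.h>

/* Host-side C stand-in for one work-item (hypothetical name).
   On a device, row and col would come from get_global_id(0) and (1). */
void mm_item_hoisted(const int *A, const int *B, int *C,
                     int n, int row, int col)
{
    assert(n > 0 && row >= 0 && row < n && col >= 0 && col < n);

    const int row_base = row * n;  /* row * n computed once, not per iteration */
    int b_index = col;             /* index of B[0][col]; advanced by addition */
    int sum = 0;                   /* private accumulator instead of C[...] += */

    for (int j = 0; j < n; j++) {
        sum += A[row_base + j] * B[b_index];
        b_index += n;              /* move down one row of B without multiplying */
    }
    C[row_base + col] = sum;       /* single write to the output element */
}

/* Emulates the 2D range of work-items by visiting every (row, col). */
void mm_all_hoisted(const int *A, const int *B, int *C, int n)
{
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++)
            mm_item_hoisted(A, B, C, n, row, col);
}
```

A compiler may already perform some of this strength reduction on its own, so the measurable gain is an open question; the private accumulator is the part that also cuts the repeated reads and writes of C.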

The second is to change the algorithm to block matrix multiplication (saving bandwidth).
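For readers unfamiliar with the term, here is a rough sketch of block (tiled) multiplication in plain C. It is an illustration of the idea only, not the CAL++ implementation: the function name `mm_blocked` and the tile edge `TILE` are hypothetical, and the sketch assumes the matrix size is a multiple of `TILE`. Each small block of A and B is reused many times while it is still close to the processor. In an OpenCL kernel the same idea is usually expressed by having a work-group stage tiles in `__local` memory, with `barrier(CLK_LOCAL_MEM_FENCE)` between loading a tile and using it.

```c
#include <assert.h>
#include <string.h>

#define TILE 4  /* hypothetical tile edge; the best value depends on the device */

/* Blocked multiplication C = A * B for square n x n row-major matrices.
   Assumes n is a positive multiple of TILE. */
void mm_blocked(const int *A, const int *B, int *C, int n)
{
    assert(n > 0 && n % TILE == 0);
    memset(C, 0, (size_t)n * (size_t)n * sizeof *C);

    for (int i0 = 0; i0 < n; i0 += TILE)
        for (int j0 = 0; j0 < n; j0 += TILE)
            for (int k0 = 0; k0 < n; k0 += TILE)
                /* Multiply one TILE x TILE block of A by one block of B
                   and accumulate into the matching block of C. */
                for (int i = i0; i < i0 + TILE; i++)
                    for (int k = k0; k < k0 + TILE; k++) {
                        const int a = A[i * n + k];  /* reused across the j loop */
                        for (int j = j0; j < j0 + TILE; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The arithmetic is the same as in the straightforward version; only the order of memory accesses changes, which is where the bandwidth saving comes from.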

To write an efficient matrix mul you really need to think about GPU architecture, cache usage, and available FP resources (like mad).

Get CAL++ (http://sourceforge.net/projects/calpp/). Its examples include the fastest known matrix multiplication on ATI's cards. You can try to convert it to OpenCL.

MicahVillmow · ‎04-08-2010

notyou,
There are slides on how to write Matrix Multiplication on the GPU in our documentation section here:
http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf

Please refer to them for hints on how to run matmult efficiently.

notyou · ‎04-08-2010

Originally posted by: hazeman Sorry to say it, but everything is wrong with your code. It's probably the worst possible implementation of matrix mul. You are wasting bandwidth and FPU resources like hell.

The first thing to do would be to remove the unnecessary integer multiplications from inside the main loop (this is one of the most basic optimizations!).
The second is to change the algorithm to block matrix multiplication (saving bandwidth).
To write an efficient matrix mul you really need to think about GPU architecture, cache usage, and available FP resources (like mad).
Get CAL++ (http://sourceforge.net/projects/calpp/). Its examples include the fastest known matrix multiplication on ATI's cards. You can try to convert it to OpenCL.

Don't feel sorry for me, I know my code is junk, I just don't know why (since I'm a new, self-"taught" OpenCL coder) - that's why I'm asking here for tips.

By block, do you mean to access memory in a row, rather than a block like I have now? E.g., for a 16x16 matrix, access row 1 (16 elements) instead of an X by X block (as I believe I'm doing now).

Originally posted by: MicahVillmow notyou, There are slides on how to write Matrix Multiplication on the GPU in our documentation section here: http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf Please refer to them for hints on how to run matmult efficiently.

Thanks for the paper. Do you have any other resources that may really help a newbie?

MicahVillmow · ‎04-09-2010

notyou,
There are other slides/presentations here:
http://developer.amd.com/gpu/A...ages/Publications.aspx
and here:
http://developer.amd.com/gpu/A...ges/Documentation.aspx

notyou · ‎04-09-2010

@hazeman

I didn't even realize that the outer loop could be removed. *facepalm's self for missing that one* It made a world of difference in execution time.
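To make the effect concrete, here is a small sketch in plain C that stands in for one work-item (the function names are hypothetical, and `row` and `col` are passed in rather than read from the device). The first function keeps the outer `k` loop from the posted kernel, which zeroes and recomputes the same output element on every pass; the second drops it. Both return the number of multiply-adds performed so the difference can be checked. For a 256x256 matrix that is 65,536 versus 256 multiply-adds per element, or roughly 4.3 billion versus 16.8 million in total, which may well be related to the apparent hang on the GPU.

```c
#include <assert.h>

/* As posted: the k loop resets and recomputes the same element n times.
   Returns the number of multiply-adds performed (n * n). */
long mm_item_redundant(const int *A, const int *B, int *C,
                       int n, int row, int col)
{
    assert(n > 0);
    long ops = 0;
    for (int k = 0; k < n; k++) {
        C[row * n + col] = 0;
        for (int j = 0; j < n; j++) {
            C[row * n + col] += A[row * n + j] * B[j * n + col];
            ops++;
        }
    }
    return ops;
}

/* Outer loop removed: one dot product per output element.
   Returns the number of multiply-adds performed (n). */
long mm_item_single(const int *A, const int *B, int *C,
                    int n, int row, int col)
{
    assert(n > 0);
    long ops = 0;
    C[row * n + col] = 0;
    for (int j = 0; j < n; j++) {
        C[row * n + col] += A[row * n + j] * B[j * n + col];
        ops++;
    }
    return ops;
}
```

Both versions leave the same value in C, since every pass of the outer loop starts again from zero; only the amount of work differs.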

Originally posted by: MicahVillmow notyou, There are other slides/presentations here: http://developer.amd.com/gpu/A...ages/Publications.aspx and here: http://developer.amd.com/gpu/A...ges/Documentation.aspx

Thanks for the links. I'd seen those pages before but you guys (AMD) have too many articles and I hadn't looked through them all.

Slow Computation/Crashing Driver