
wudenni
Journeyman III

MatrixMult: OpenCL CPU vs OpenMP

I compared an OpenMP version of MatrixMultiplication with the OpenCL and naïve implementations provided in the SDK.  Here are the benchmark results:

 

1) CPU original: 1211s

2) OpenCL: 253s

3) CPU cache friendly: 210s

4) CPU cache friendly + multithreading using OpenMP: 139s

 

The OpenCL implementation (2) is 4.8X faster than the reference implementation (1).  Implementation (3) breaks the matrix multiplication into 8x8 sub-blocks, as (2) does, to improve memory locality and reduce cache misses.  (3) is single-threaded and already beats OpenCL by 1.2X.  With OpenMP multithreading (4) on a dual-core machine, OpenCL is 1.8X slower.  I was multiplying a 2048x2048 matrix by a 2048x2048 matrix.

 

Any idea why OpenCL is slower in this example?

 

I’m wondering how OpenCL threads are scheduled on the CPU.  Is it guaranteed that a processor will complete one work group before moving on to threads in another workgroup?  If not, that might explain some of the degradation.

 

Thanks!

 

genaganna
Journeyman III

Originally posted by: wudenni

[...]

Wudenni,

         Could you please increase the block size to 16 and 32 in the OpenCL implementation and see if you get any improvements?
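One common way to try different block sizes without editing the kernel source is to pass the size as a compile-time define when building the program. This is just a sketch with a hypothetical helper name; the SDK sample may wire the block size up differently (e.g. through a host-side constant or a command-line option):

#include <CL/cl.h>
#include <cstdio>

// Hypothetical helper: rebuild the matmul program with the block size as
// a build-option define, so a kernel tile declared as
//   __local float tile[BLOCK_SIZE][BLOCK_SIZE];
// resizes with it.
cl_int buildWithBlockSize(cl_program program, cl_device_id device, int blockSize)
{
    char options[64];
    std::snprintf(options, sizeof(options), "-D BLOCK_SIZE=%d", blockSize);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}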

 


Hi genaganna,

 

Thanks for the quick response.  Here are the results for block sizes 16 and 32; the non-OpenCL runs also had their block sizes increased to 16 and 32.

 

CPU Original: 1214 (independent of block size)

                             Block 8   Block 16   Block 32
OpenCL CPU                     253        489        204
CPU Cache Friendly             210        202        232
CPU Cache Friendly + OpenMP    139        164        158

All times in seconds.

 

 

 

The best-performing OpenCL run used 32x32 blocks and finished in 204s.  The best-performing OpenMP run used 8x8 blocks and finished in 139s.  There is still a 1.5X performance gap.  Any other ideas?

 

I added the source files to this thread in case it helps.  The entry point for the OpenMP version is matrixMultiplicationCPUCacheFriendly_OpenMP().

 

Thanks!

// Compute the product of a blockSize x blockSize sub-block of input0 with
// a blockSize x blockSize sub-block of input1, accumulating into output.
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_SubBlock(
    cl_float *input0, cl_float *input1,
    const int y, const int x, const int z,
    const int in0_block_col, const int in0_block_row,
    const int in1_block_col, const int in1_block_row,
    cl_float *output)
{
    const int block_width = blockSize;

    int in0_blockWidth  = blockSize;
    int in0_blockHeight = blockSize;
    int in1_blockWidth  = blockSize;

    // Shrink the block dimensions at the matrix boundaries, where a full
    // block does not fit.
    if (block_width * (in0_block_col + 1) > x)
        in0_blockWidth = x - block_width * in0_block_col;
    if (block_width * (in0_block_row + 1) > y)
        in0_blockHeight = y - block_width * in0_block_row;
    if (block_width * (in1_block_col + 1) > z)
        in1_blockWidth = z - block_width * in1_block_col;

    // Linear offsets of the three blocks in the row-major matrices.
    int output_offset = in0_block_row * block_width * z + in1_block_col * block_width;
    int input0_offset = in0_block_row * block_width * x + in0_block_col * block_width;
    int input1_offset = in1_block_row * block_width * z + in1_block_col * block_width;

    for (int i = 0; i < in1_blockWidth; i++) {
        for (int j = 0; j < in0_blockHeight; j++) {
            cl_float sum = 0.0f;
            for (int k = 0; k < in0_blockWidth; k++) {
                sum += input0[input0_offset + j * x + k]
                     * input1[input1_offset + k * z + i];
            }
            output[output_offset + j * z + i] += sum;
        }
    }
}

// Multiply all sub-blocks of input0 in rows in0_block_row_start through
// in0_block_row_end with all sub-blocks of input1.
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_Parallel(
    cl_float *output, cl_float *input0, cl_float *input1,
    const cl_uint y, const cl_uint x, const cl_uint z,
    int in0_block_row_start, int in0_block_row_end)
{
    const int block_width = blockSize;

    int in0_num_block_cols = x / block_width;
    if (x % block_width != 0) in0_num_block_cols++;
    int in1_num_block_cols = z / block_width;
    if (z % block_width != 0) in1_num_block_cols++;

    for (int in0_block_row = in0_block_row_start;
         in0_block_row <= in0_block_row_end; in0_block_row++) {
        for (int in1_block_col = 0; in1_block_col < in1_num_block_cols; in1_block_col++) {
            for (int j = 0; j < in0_num_block_cols; j++) {
                matrixMultiplicationCPUCacheFriendly_SubBlock(
                    input0, input1, (int)y, (int)x, (int)z,
                    j, in0_block_row, in1_block_col, j, output);
            }
        }
    }
}

// Single-threaded driver: process every block row of input0.
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly(
    cl_float *output, cl_float *input0, cl_float *input1,
    const cl_uint y, const cl_uint x, const cl_uint z)
{
    const int block_width = blockSize;
    int in0_num_block_rows = y / block_width;
    if (y % block_width != 0) in0_num_block_rows++;

    memset(output, 0, sizeof(cl_float) * y * z);
    matrixMultiplicationCPUCacheFriendly_Parallel(
        output, input0, input1, y, x, z, 0, in0_num_block_rows - 1);
}

// OpenMP driver: distribute the block rows of input0 across threads.
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_OpenMP(
    cl_float *output, cl_float *input0, cl_float *input1,
    const cl_uint y, const cl_uint x, const cl_uint z)
{
    const int block_width = blockSize;
    int in0_num_block_rows = y / block_width;
    if (y % block_width != 0) in0_num_block_rows++;

    memset(output, 0, sizeof(cl_float) * y * z);

    // One thread per input0 block row, capped at max_num_threads.
    int num_threads = in0_num_block_rows;
    int block_rows_per_thread = 1;
    const int max_num_threads = 128;
    if (num_threads > max_num_threads) {
        // Too many block rows: give each thread several rows instead.
        block_rows_per_thread = in0_num_block_rows / max_num_threads;
        if (in0_num_block_rows % max_num_threads != 0)
            block_rows_per_thread++;
        num_threads = in0_num_block_rows / block_rows_per_thread;
        if (in0_num_block_rows % block_rows_per_thread != 0)
            num_threads++;
    }
    omp_set_num_threads(num_threads);
    // printf("num_threads: %d blocks_per_thread: %d\n", num_threads, block_rows_per_thread);

    int id;
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        int block_start = id * block_rows_per_thread;
        int block_end = (id + 1) * block_rows_per_thread - 1;
        if (block_end >= in0_num_block_rows)
            block_end = in0_num_block_rows - 1;
        // printf("thread: %d block_start: %d block_end: %d\n", id, block_start, block_end);
        matrixMultiplicationCPUCacheFriendly_Parallel(
            output, input0, input1, y, x, z, block_start, block_end);
    }
}
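For comparison, the manual thread-count bookkeeping above can usually be replaced by a single OpenMP parallel-for over the block rows. This is only a sketch (the _OpenMP_For name is hypothetical and is not the code I benchmarked), reusing the same _Parallel helper:

#include <omp.h>
#include <cstring>

// Hypothetical simplified driver: let the OpenMP runtime split the block
// rows across threads. Each block row writes a disjoint set of output
// rows, so the iterations are independent and need no synchronization.
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_OpenMP_For(
    cl_float *output, cl_float *input0, cl_float *input1,
    const cl_uint y, const cl_uint x, const cl_uint z)
{
    const int block_width = blockSize;
    int in0_num_block_rows = (y + block_width - 1) / block_width; // round up

    memset(output, 0, sizeof(cl_float) * y * z);

    #pragma omp parallel for schedule(static)
    for (int r = 0; r < in0_num_block_rows; r++)
        matrixMultiplicationCPUCacheFriendly_Parallel(
            output, input0, input1, y, x, z, r, r);
}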

n0thing
Journeyman III

Originally posted by: wudenni

I’m wondering how OpenCL threads are scheduled on the CPU.  Is it guaranteed that a processor will complete one work group before moving on to threads in another workgroup?  If not, that might explain some of the degradation.

I don't think that's true. See this post:

 

Originally posted by: MicahVillmow

The mapping between a hardware thread and a work-group is not a 1-1 mapping. A single hardware thread can run all the work-groups, or the work-groups can be split up among the hardware threads.

riza_guntur
Journeyman III

Originally posted by: wudenni

[...]

Have you used float4 yet?

The example doesn't use it, and I also found the example is very slow because the OpenCL kernel is compiled at runtime (yes, at runtime!).
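Something like this is what I mean by float4. Just a sketch, not the SDK's kernel: the matmul_float4 name is made up, and it assumes the host pre-transposes input1 so both operands are read row-wise, with the shared dimension a multiple of 4.

// Hypothetical float4-vectorized kernel: one work-item per output element,
// accumulating four partial dot products per iteration.
__kernel void matmul_float4(__global const float4 *input0,
                            __global const float4 *input1T, // input1 transposed
                            __global float *output,
                            const int width4)               // row length / 4
{
    const int row = get_global_id(1);
    const int col = get_global_id(0);

    float4 acc = (float4)(0.0f);
    for (int k = 0; k < width4; k++)
        acc += input0[row * width4 + k] * input1T[col * width4 + k];

    // reduce the four partial sums into one scalar
    output[row * get_global_size(0) + col] = acc.x + acc.y + acc.z + acc.w;
}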


n0thing, Thanks for reposting that info here.  I'm a bit surprised by the answer.

In the GPU case, work-items in the same work-group are mapped to processing elements in the same compute unit.  They are encouraged to operate on common/local data to reduce global memory accesses.

In the CPU case, it seems cache misses would be reduced by scheduling work-items in the same work-group onto the same core (since they are encouraged to work on shared data).  What is the benefit of splitting work-items in the same work-group across different processors?

Thanks!

 


If I use multi-level block sizes, 64x64 to match the L2 cache and 16x16 to match the L1 cache, I get the following improvements:

CPU Cache Friendly (single thread):  187s
CPU Cache Friendly + OpenMP (2 threads): 93s
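The loop structure looks roughly like this. A sketch only (the matmulTwoLevel name is hypothetical; the real code goes through the sub-block helpers above), assuming square row-major matrices whose size is a multiple of 64 and an output that starts zeroed:

// Two-level tiled multiply, C += A * B, all n x n row-major.
// The 64x64 outer tiles are sized for L2, the 16x16 inner tiles for L1.
void matmulTwoLevel(const float *A, const float *B, float *C, int n)
{
    const int T2 = 64;  // L2 tile
    const int T1 = 16;  // L1 tile
    for (int ii = 0; ii < n; ii += T2)
    for (int kk = 0; kk < n; kk += T2)
    for (int jj = 0; jj < n; jj += T2)
        for (int i = ii; i < ii + T2; i += T1)
        for (int k = kk; k < kk + T2; k += T1)
        for (int j = jj; j < jj + T2; j += T1)
            // 16x16x16 micro-kernel in i-k-j order: the B row stays hot
            // in L1 across the innermost loop
            for (int bi = i; bi < i + T1; bi++)
                for (int bk = k; bk < k + T1; bk++) {
                    float a = A[bi * n + bk];
                    for (int bj = j; bj < j + T1; bj++)
                        C[bi * n + bj] += a * B[bk * n + bj];
                }
}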

 
