6 Replies Latest reply on Dec 3, 2009 11:38 PM by wudenni

    MatrixMult: OpenCL CPU vs OpenMP

    wudenni

      I compared an OpenMP version of MatrixMultiplication with the OpenCL and naïve implementations provided in the SDK.  Here are the benchmark results:

       

      1) CPU original: 1211s

      2) OpenCL: 253s

      3) CPU cache friendly: 210s

      4) CPU cache friendly + multithreading using OpenMP: 139s

       

      The OpenCL implementation (2) is 4.8X faster than the reference implementation (1).  (3) breaks the matrix multiplication into 8x8 sub-blocks, as in (2), to improve memory locality and reduce cache misses.  (3) is single-threaded and already beats OpenCL by 1.2X.  With OpenMP multithreading added (4), OpenCL is 1.8X slower on a dual-core machine.  I was multiplying a 2048x2048 matrix by a 2048x2048 matrix.
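      For reference, the loop structure behind (3)/(4) is roughly the following (a minimal sketch of the blocking idea with illustrative names, not the actual source; the real code is attached later in the thread):

// Sketch of a cache-blocked matrix multiply C = A * B (all n x n, row-major),
// with an optional OpenMP parallel loop over block rows for version (4).
// Illustrative only; the attached code is structured differently.
void blockedMatMul(const float *A, const float *B, float *C, int n, int bs)
{
    // C is assumed to be zero-initialized by the caller.
    #pragma omp parallel for                      // (4): threads split the block rows of A
    for (int bi = 0; bi < n; bi += bs)
        for (int bk = 0; bk < n; bk += bs)
            for (int bj = 0; bj < n; bj += bs)
                // multiply one bs x bs sub-block pair so the data stays in cache
                for (int i = bi; i < bi + bs && i < n; ++i)
                    for (int k = bk; k < bk + bs && k < n; ++k)
                    {
                        float a = A[i*n + k];
                        for (int j = bj; j < bj + bs && j < n; ++j)
                            C[i*n + j] += a * B[k*n + j];
                    }
}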

       

      Any idea why OpenCL is slower in this example?

       

      I’m wondering how OpenCL work-items are scheduled on the CPU.  Is it guaranteed that a processor will complete one work-group before moving on to work-items in another work-group?  If not, that might explain some of the degradation.

       

      Thanks!

       

        • MatrixMult: OpenCL CPU vs OpenMP
          genaganna

           

          Originally posted by: wudenni

          Any idea why OpenCL is slower in this example?

          Wudenni,

                   Could you please increase the block size to 16 and 32 in the OpenCL implementation and see whether you get any improvement?
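           If you change it by hand, note that the block size has to agree in two places: the compile-time value used by the kernel and the local work-group size passed on the host.  A rough sketch of the idea (illustrative names only, not the SDK sample's exact code):

/* Sketch only: rebuild the program with a block-size define and launch with a
 * matching local work-group size.  runWithBlockSize and BLOCK_SIZE are
 * illustrative names, not part of the SDK sample. */
#include <stdio.h>
#include <CL/cl.h>

static cl_int runWithBlockSize(cl_program program, cl_device_id device,
                               cl_command_queue queue, const char *kernelName,
                               size_t widthC, size_t heightC, int block)
{
    char options[64];
    sprintf(options, "-D BLOCK_SIZE=%d", block);          /* try 8, 16, 32 */

    cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    cl_kernel kernel = clCreateKernel(program, kernelName, &err);
    if (err != CL_SUCCESS) return err;
    /* ... clSetKernelArg calls for the matrix buffers go here ... */

    size_t global[2] = { widthC, heightC };               /* one work-item per output element */
    size_t local[2]  = { (size_t)block, (size_t)block };  /* one work-group per block x block tile */
    err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

    clReleaseKernel(kernel);
    return err;
}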

           

            • MatrixMult: OpenCL CPU vs OpenMP
              wudenni

              Hi genaganna,

               

              Thanks for the quick response.  Here are the results for block sizes 16 and 32.  The non-OpenCL runs also had their block sizes increased to 16 and 32.

               

              CPU Original: 1214

              Block Size 8
                OpenCL CPU:                   253
                CPU Cache Friendly:           210
                CPU Cache Friendly + OpenMP:  139

              Block Size 16
                OpenCL CPU:                   489
                CPU Cache Friendly:           202
                CPU Cache Friendly + OpenMP:  164

              Block Size 32
                OpenCL CPU:                   204
                CPU Cache Friendly:           232
                CPU Cache Friendly + OpenMP:  158

              The best performing OpenCL run used 32x32 blocks and finished in 204s.  The best performing OpenMP run used 8x8 blocks and finished in 139s.  There is still a 1.5X performance gap.  Any other ideas?

               

              I added the source files to this thread in case it helps.  The entry point for the OpenMP version is matrixMultiplicationCPUCacheFriendly_OpenMP().

               

              Thanks!

// compute matrix multiplication between a blockSize x blockSize subblock in input0 with
// a blockSize x blockSize subblock in input1
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_SubBlock(
    cl_float *input0,
    cl_float *input1,
    const int y,
    const int x,
    const int z,
    const int in0_block_col,
    const int in0_block_row,
    const int in1_block_col,
    const int in1_block_row,
    cl_float *output)
{
    const int block_width = blockSize;

    cl_float output_sublock[block_width*block_width];
    memset(output_sublock, 0, sizeof(cl_float)*block_width*block_width);

    int in0_blockWidth  = blockSize;
    int in0_blockHeight = blockSize;
    int in1_blockWidth  = blockSize;
    int in1_blockHeight = blockSize;

    // adjust blockWidth/blockHeight for boundary case where full block size is not used
    if (block_width*in0_block_col > x)
    {
        in0_blockWidth  = x - (block_width*in0_block_col-1);
        in1_blockHeight = in0_blockWidth;
    }
    if (block_width*in0_block_row > y)
        in0_blockHeight = y - (block_width*in0_block_row-1);
    if (block_width*in1_block_col > z)
        in1_blockWidth = z - (block_width*in1_block_col-1);

    // compute block boundary
    int output_offset = in0_block_row*block_width*z + in1_block_col*block_width;
    int input0_offset = in0_block_row*block_width*x + in0_block_col*block_width;
    int input1_offset = in1_block_row*block_width*z + in1_block_col*block_width;

    for (int i=0; i<in1_blockWidth; i++)
    {
        for (int j=0; j<in0_blockHeight; j++)
        {
            cl_float sum = 0.0;
            for (int k=0; k<in0_blockWidth; k++)
            {
                sum += input0[input0_offset + j*x + k] * input1[input1_offset + k*z + i];
            }
            output[output_offset + j*z + i] += sum;
        }
    }
}

// compute matrix multiplication between all subblocks in input0 between rows
// in0_block_row_start to in0_block_row_end with all subblocks in input1.
void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_Parallel(
    cl_float * output,
    cl_float * input0,
    cl_float * input1,
    const cl_uint y,
    const cl_uint x,
    const cl_uint z,
    int in0_block_row_start,
    int in0_block_row_end)
{
    const int block_width = blockSize;

    int in0_num_block_cols = x / block_width;
    if (x % block_width != 0) in0_num_block_cols++;
    int in0_num_block_rows = y / block_width;
    if (y % block_width != 0) in0_num_block_rows++;
    int in1_num_block_cols = z / block_width;
    if (z % block_width != 0) in1_num_block_cols++;
    int in1_num_block_rows = in0_num_block_cols;

    cl_float * output_subblocks = (cl_float *)malloc(sizeof(cl_float)*block_width*block_width
        *in0_num_block_cols*in0_num_block_rows*in1_num_block_cols*in1_num_block_rows);

    for (int in0_block_row=in0_block_row_start; in0_block_row<=in0_block_row_end; in0_block_row++)
    {
        for (int in1_block_col=0; in1_block_col<in1_num_block_cols; in1_block_col++)
        {
            for (int j=0; j<in0_num_block_cols; j++)
            {
                matrixMultiplicationCPUCacheFriendly_SubBlock(input0, input1,
                    (int)y, (int)x, (int)z,
                    j, in0_block_row, in1_block_col, j,
                    output);
            }
        }
    }
}

void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly(
    cl_float * output,
    cl_float * input0,
    cl_float * input1,
    const cl_uint y,
    const cl_uint x,
    const cl_uint z)
{
    const int block_width = blockSize;
    int in0_num_block_rows = y / block_width;
    if (y % block_width != 0) in0_num_block_rows++;

    memset(output, 0, sizeof(cl_float)*y*z);

    matrixMultiplicationCPUCacheFriendly_Parallel(output, input0, input1, y, x, z,
        0, in0_num_block_rows-1);
}

void MatrixMultiplication::matrixMultiplicationCPUCacheFriendly_OpenMP(
    cl_float * output,
    cl_float * input0,
    cl_float * input1,
    const cl_uint y,
    const cl_uint x,
    const cl_uint z)
{
    const int block_width = blockSize;
    int in0_num_block_rows = y / block_width;
    if (y % block_width != 0) in0_num_block_rows++;

    memset(output, 0, sizeof(cl_float)*y*z);

    // one thread for each input0 block row
    int num_threads = in0_num_block_rows;
    int block_rows_per_thread = 1;
    const int max_num_threads = 128;
    if (num_threads > max_num_threads)
    {
        // limit number of threads: compute new block_rows_per_thread
        block_rows_per_thread = in0_num_block_rows / max_num_threads;
        if (in0_num_block_rows % num_threads == 0) block_rows_per_thread++;
        num_threads = in0_num_block_rows / block_rows_per_thread;
        if (in0_num_block_rows % block_rows_per_thread != 0) num_threads++;
    }
    omp_set_num_threads(num_threads);
    // printf("num_treads: %d blocks_per_thread: %d\n", num_threads, block_rows_per_thread);

    int id;
    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        int block_start = id*block_rows_per_thread;
        int block_end   = (id+1)*block_rows_per_thread-1;
        if (block_end >= in0_num_block_rows) block_end = in0_num_block_rows-1;
        // printf("thread: %d block_start: %d block_end: %d\n", id, block_start, block_end);

        matrixMultiplicationCPUCacheFriendly_Parallel(output, input0, input1, y, x, z,
            block_start, block_end);
    }
}
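              As an aside (a sketch only, not part of the attached files): the manual thread and block-row bookkeeping in matrixMultiplicationCPUCacheFriendly_OpenMP could be handed to the OpenMP runtime instead, by replacing the body with a plain parallel for over block rows:

// Sketch of an equivalent body for matrixMultiplicationCPUCacheFriendly_OpenMP.
// Each iteration handles one block row of input0; different iterations write
// disjoint rows of output, so there is no data race.
const int block_width = blockSize;
int in0_num_block_rows = y / block_width;
if (y % block_width != 0) in0_num_block_rows++;

memset(output, 0, sizeof(cl_float)*y*z);

#pragma omp parallel for schedule(static)
for (int row = 0; row < in0_num_block_rows; row++)
{
    matrixMultiplicationCPUCacheFriendly_Parallel(output, input0, input1,
                                                  y, x, z, row, row);
}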

            • MatrixMult: OpenCL CPU vs OpenMP
              n0thing

               

              Originally posted by: wudenni

              I’m wondering how OpenCL work-items are scheduled on the CPU.  Is it guaranteed that a processor will complete one work-group before moving on to work-items in another work-group?  If not, that might explain some of the degradation.

              I don't think that's true.  See this post:

               

               

              Originally posted by: MicahVillmow
              The mapping between a hardware thread and a work-group is not a 1-1 mapping. A single hardware thread can run all the work-groups or the work groups can be split up among the hardware threads.

              • MatrixMult: OpenCL CPU vs OpenMP
                riza.guntur

                 

                Originally posted by: wudenni

                Any idea why OpenCL is slower in this example?

                Have you used float4 yet?

                The example doesn't use it, and I also found the example very slow because the OpenCL kernel is compiled at runtime (yes, at runtime!).
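                Something along these lines, just a sketch of the float4 idea and not the SDK kernel (it assumes the width of the output matrix is a multiple of 4):

// Sketch of a float4-vectorized kernel: each work-item computes 4 adjacent
// elements of one output row.  B and C are viewed as float4 arrays; widthB
// (the width of C) is assumed to be a multiple of 4.
__kernel void matMul_float4(__global const float  *A,
                            __global const float4 *B,
                            __global       float4 *C,
                            int widthA, int widthB)
{
    int col4 = get_global_id(0);            // which group of 4 output columns
    int row  = get_global_id(1);            // which output row

    float4 sum = (float4)(0.0f);
    for (int k = 0; k < widthA; k++)
    {
        float  a = A[row * widthA + k];         // one scalar from A
        float4 b = B[k * (widthB / 4) + col4];  // 4 columns of B at once
        sum += a * b;                           // 4 multiply-adds per iteration
    }
    C[row * (widthB / 4) + col4] = sum;
}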

                  • MatrixMult: OpenCL CPU vs OpenMP
                    wudenni

                    n0thing, Thanks for reposting that info here.  I'm a bit surprised by the answer.

                    In the GPU case, work-items in the same work-group are mapped to processing elements in the same compute unit.  They are encouraged to operate on common/local data to reduce global memory accesses.

                    In the CPU case, it seems cache misses would be reduced by scheduling work-items from the same work-group onto the same core (since they are encouraged to work on shared data).  What is the benefit of splitting work-items from the same work-group across different processors?

                    Thanks!