18 Replies Latest reply on Sep 28, 2010 4:56 PM by himanshu.gautam

    Sobel Filter and Simple Convolution

    ryta1203

      Why in the samples do these two samples run so poorly over 2D when compared to 1D?

        • Sobel Filter and Simple Convolution
          genaganna

           

          Originally posted by: ryta1203 Why in the samples do these two samples run so poorly over 2D when compared to 1D?

           

          Could you please give us more details like which options you are using and Device details?

            • Sobel Filter and Simple Convolution
              ryta1203

              Sure,

                I am running a 5870, Cat 10.8, SDK 2.2, WinXP 32-bit.

                I am running the kernel over 2D for both samples (I believe the samples originally ran over 1D), so I have changed both the globalThreads and localThreads variables in the sample to two-element arrays instead of one-element arrays.

                That is not the problem, though. The problem is that after making that change I see a huge performance difference between 256x1 localThreads and, say, 16x16 or 8x8 localThreads.

                • Sobel Filter and Simple Convolution
                  redditisgreat

                  The 2D approach has a worse memory access pattern. With a 16x16 work-group it reads 16+2 contiguous regions of global memory per work-group, instead of 1+2 for a 256x1 work-group.

                  The +2 comes from the filter, which reads one row above and one row below the current pixel row.
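To make the 16+2 versus 1+2 count concrete, here is a minimal sketch. It assumes a 3x3 filter (radius 1) and 4-byte pixels; the function names are hypothetical and not taken from the SDK sample.

```python
def rows_touched(local_h, radius=1):
    """Count the distinct image rows one work-group reads.

    Each image row is one contiguous run in memory, so this is also the
    number of separate contiguous global-memory regions the group reads.
    """
    # The group covers local_h rows; the filter reaches `radius` rows
    # above the first row and `radius` rows below the last one.
    return local_h + 2 * radius


def bytes_per_row(local_w, radius=1, bytes_per_pixel=4):
    """Length in bytes of each contiguous run (group width plus halo).

    bytes_per_pixel=4 is an assumption about the pixel format.
    """
    return (local_w + 2 * radius) * bytes_per_pixel


for w, h in [(256, 1), (16, 16), (64, 1), (8, 8)]:
    print(f"{w}x{h}: {rows_touched(h)} regions of {bytes_per_row(w)} bytes")
```

A 256x1 group reads a few long runs, while a 16x16 group reads many short ones for the same number of pixels.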

                    • Sobel Filter and Simple Convolution
                      ryta1203

                      It's interesting because this behavior does not exist when those samples are run on Nvidia GPUs.

                        • Sobel Filter and Simple Convolution
                          n0thing

                          I don't see any difference in the SobelFilter example between 256x1 and 16x16 localThreads. Here is the output from the profiler:

                           

                          {    512     512       1}{   16    16     1}        0.88062


                           

                          {    512     512       1}{  256     1     1}        0.86707


                          I am using Radeon 5650 + Driver 8.753.1 + SDK 2.2

                           

                           

                            • Sobel Filter and Simple Convolution
                              ryta1203

                               

                              Originally posted by: n0thing I don't see any difference in the SobelFilter example between 256x1 and 16x16 localThreads. Here is the output from the profiler:

                              {    512     512       1}{   16    16     1}        0.88062
                              {    512     512       1}{  256     1     1}        0.86707

                              I am using Radeon 5650 + Driver 8.753.1 + SDK 2.2

                               

                               

                              Please increase the global size to 4k*4k and then post the results. Your problem size is so small that you might not see a big difference.
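As a rough sense of scale between the two global sizes, here is a small sketch that counts wavefronts. It assumes a wavefront width of 64 work-items and work-group sizes that are multiples of it; the helper name is hypothetical.

```python
WAVEFRONT_SIZE = 64  # assumed wavefront width


def wavefronts(global_w, global_h, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to cover a 2D global size."""
    work_items = global_w * global_h
    # Ceiling division: a partly filled wavefront still occupies a slot.
    return -(-work_items // wavefront_size)


print(wavefronts(512, 512))    # 4096
print(wavefronts(4096, 4096))  # 262144
```

Under that assumption the 4k*4k run launches 64 times as many wavefronts as the 512x512 run, which gives memory effects far more room to show up in the timing.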

                                • Sobel Filter and Simple Convolution
                                  himanshu.gautam
                                  {   4096    4096       1}{   64     1     1}       55.52287
                                  {   4096    4096       1}{    8     8     1}       55.63685
                                    • Sobel Filter and Simple Convolution
                                      ryta1203

                                      Himanshu, where did you get those numbers? What card are you using? Those are the numbers I get for the ReadBuffer at that size, but not for the kernel run. Here are the numbers I get using the SDK Sobel Filter sample, all verified with the sample's verifyResults() function.

                                      {4096 4096} {256 1 1}  9.20506
                                      {4096 4096} {16 16 1} 18.42737
                                      {4096 4096} {64 1 1}  13.69119
                                      {4096 4096} {8 8 1}   15.65674

                                      That is a large difference between the 256x1 and the 16x16 runs: roughly a factor of two.
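For reference, the slowdown of each local size relative to the 256x1 run can be worked out from the four kernel times posted above. This is plain arithmetic on those figures; the variable names are only for illustration.

```python
# Kernel times reported above for a 4096x4096 global size, keyed by local size.
times = {
    (256, 1): 9.20506,
    (16, 16): 18.42737,
    (64, 1): 13.69119,
    (8, 8): 15.65674,
}

baseline = times[(256, 1)]
slowdown = {shape: t / baseline for shape, t in times.items()}

# Print fastest first, with the ratio against the 256x1 baseline.
for shape, ratio in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{shape[0]:>3}x{shape[1]:<3} {times[shape]:9.5f}  {ratio:.2f}x")
```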

                                      Again, Himanshu, I'm not sure where you got your numbers, can you please explain more how you got those numbers?

                                      • Sobel Filter and Simple Convolution
                                        ryta1203

                                        Furthermore,

                                        From the profiler, the ALU Busy is halved when you go from 1D to 2D in the Sobel Filter. The Fetch Busy more than triples (22.2 to 71.85), but the number of fetch instructions somehow remains the same (which I think is also odd).

                                        So both runs have the same number of ALU ops, fetch ops (apparently), and write ops, but simply going from 1D to 2D halves performance.

                                        Where are you getting your numbers? I'm not able to achieve what you are getting. Are you changing the kernel at all?

                                        Profiler output, 256x1 run:

                                        Method                 ExecutionOrder  Time       DataTransferSize
                                        WriteBuffer            1               125.23941  65536
                                        sobel_filter_00DB7830  2               9.20506
                                        ReadBuffer             3               55.54961   65536

                                        Profiler output, 16x16 run:

                                        Method                 ExecutionOrder  Time       DataTransferSize
                                        WriteBuffer            1               124.93881  65536
                                        sobel_filter_00DB7830  2               18.42737
                                        ReadBuffer             3               53.20153   65536

                                        Kernel counters for sobel_filter_00DB7830 (GlobalWorkSize { 4096 4096 1} in both runs):

                                        Counter            { 256 1 1}   { 16 16 1}
                                        Time               9.20506      18.42737
                                        LDSSize            0            0
                                        GPRs               7            7
                                        ScratchRegs        0            0
                                        FCStacks           2            2
                                        Wavefronts         262144       262144
                                        ALUInsts           143.94       144
                                        FetchInsts         8            8
                                        WriteInsts         1            1
                                        LDSFetchInsts      0            0
                                        LDSWriteInsts      0            0
                                        ALUBusy            99.92        48.33
                                        ALUFetchRatio      18           18
                                        ALUPacking         80.14        80.14
                                        FetchSize          212120.38    343136.38
                                        CacheHit           0            0
                                        FetchUnitBusy      22.2         71.85
                                        FetchUnitStalled   0            61.11
                                        WriteUnitStalled   0            56.78
                                        FastPath           65503        65507.38
                                        CompletePath       0            0
                                        PathUtilization    100          100
                                        ALUStalledByLDS    0            0
                                        LDSBankConflict    0            0

                                          • Sobel Filter and Simple Convolution
                                            himanshu.gautam

                                            ryta,

                                            I think the performance depends on a lot of parameters, among which the access pattern of global memory is one of the most important.

                                            In my case performance is not deviating that much (57.37855 and 55.53019). Still, a 2D layout can be much slower than a 1D one: in the 1D case, successive elements are accessed, so all the channels are used and data is fetched from different banks. In the 2D case, bank conflicts or channel conflicts occur, which tend to serialize the accesses and add delay.

                                            In the two cases you have given above, ALUBusy is 48.33 (2D case) versus 99.92 (1D case). In my view the reason is the FetchUnitBusy counter (22.2 for 1D, 71.85 for 2D): the ALUs are presumably waiting on the fetch unit.

                                            For more optimization-related information, refer to the OpenCL Programming Guide.
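To illustrate the channel argument, here is a toy model, not a description of the actual hardware. It assumes 8 memory channels interleaved every 256 bytes, 4-byte pixels, a 4096-pixel-wide image starting at address 0, and a work-group at the image origin. All of these parameters are assumptions, and the filter halo is ignored.

```python
def channels_touched(local_w, local_h, image_width=4096, bytes_per_pixel=4,
                     num_channels=8, interleave_bytes=256):
    """Return the set of memory channels hit by one work-group's pixels.

    Assumed model: the channel is chosen by
    (byte_address // interleave_bytes) % num_channels.
    """
    row_pitch = image_width * bytes_per_pixel
    channels = set()
    for y in range(local_h):
        for x in range(local_w):
            address = y * row_pitch + x * bytes_per_pixel
            channels.add((address // interleave_bytes) % num_channels)
    return channels


print(sorted(channels_touched(256, 1)))   # [0, 1, 2, 3]
print(sorted(channels_touched(16, 16)))   # [0]
```

In this model the row pitch is a multiple of the full channel stride, so every row of a 16x16 group starts on the same channel, while a 256x1 group spreads across several. Real scheduling runs many work-groups at once, so this only hints at why the 2D shape could serialize accesses.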

                                              
                                              • Sobel Filter and Simple Convolution
                                                ryta1203

                                                Sadly, this isn't really covered that much in the Guide.

                                                The DCT and MatrixTranspose both use 2D and probably wouldn't benefit from 1D as you suggest.

                                                Many of the other samples, such as BlackScholes and MersenneTwister, are totally unaffected by the change in dimension.

                                                So it's the access pattern of the Sobel and Simple Convolution samples that is causing bank conflicts/contention?

                                                • Sobel Filter and Simple Convolution
                                                  ryta1203

                                                   

                                                  Originally posted by: himanshu.gautam ryta,

                                                  I think the performance depends on a lot of parameters, among which the access pattern of global memory is one of the most important.

                                                  Although in my case performance is not deviating that much (57.37855 and 55.53019).



                                                  What do you mean by "in your case"? What card are you using? Are you sure you didn't accidentally look at the ReadBuffer times? I'm simply using the samples, so I don't see how our code could differ.