Archives Discussions

ryta1203
Journeyman III

Sobel Filter and Simple Convolution

Why do these two samples run so poorly over 2D when compared to 1D?

0 Likes
18 Replies
genaganna
Journeyman III

Originally posted by: ryta1203 Why do these two samples run so poorly over 2D when compared to 1D?


Could you please give us more details, like which options you are using and your device details?

0 Likes

Sure,

  I am running a 5870, Cat 10.8, SDK 2.2, WinXP 32-bit.

  I am running the kernel over 2D for both samples (I believe the samples originally ran over 1D), so I have modified both the globalThreads and localThreads variables in the sample to be arrays of size 2 instead of 1.

  This is not the problem though; the problem is that after I do that, I see such a huge difference in performance between running 256x1 localThreads and, say, 16x16 or 8x8 localThreads.

0 Likes

The 2D approach has a worse memory access pattern: it accesses 16+2 contiguous areas in global memory per work-group instead of 1+2.

The +2 comes from the filter, which accesses one line above and one line below the current pixel line.

0 Likes

It's interesting because this behavior does not exist when those samples are run on Nvidia GPUs.

0 Likes

I don't see any difference in the SobelFilter example when running with 256x1 and 16x16 localThreads. Here is the output from the profiler:

{    512     512       1}{   16    16     1}        0.88062


{    512     512       1}{  256     1     1}        0.86707


I am using Radeon 5650 + Driver 8.753.1 + SDK 2.2


0 Likes

Originally posted by: n0thing I don't see any difference in the SobelFilter example when running with 256x1 and 16x16 localThreads. Here is the output from the profiler:

{    512     512       1}{   16    16     1}        0.88062
{    512     512       1}{  256     1     1}        0.86707

I am using Radeon 5650 + Driver 8.753.1 + SDK 2.2


Please increase the thread count to 4k x 4k and then post the results. Your thread count is so small that you might not be seeing a big difference.

0 Likes

{   4096    4096       1}{   64     1     1}       55.52287
{   4096    4096       1}{    8     8     1}       55.63685
0 Likes

Himanshu, where did you get those numbers? What card are you using? Those are the numbers I get for the ReadBuffer at that size, but not for the kernel run. Here are the numbers I get using the SDK Sobel Filter sample, all verified using the sample's verifyResults() function.

{4096 4096 1} {256 1 1}   9.20506

{4096 4096 1} {16 16 1}  18.42737

{4096 4096 1} {64 1 1}   13.69119

{4096 4096 1} {8 8 1}    15.65674

That's a large difference between the 256x1 and the 16x16 runs.

Again, Himanshu, I'm not sure where you got your numbers; can you please explain how you got them?

0 Likes

I think you copied down the memory transfer times by accident.

0 Likes

Furthermore,

 From the profiler, the ALU Busy is halved when you go from 1D to 2D in the Sobel Filter. The Fetch Busy doubles, but the number of fetches somehow remains the same (which I also find odd).

So they both have the same number of ALU ops, fetch ops (apparently), and write ops, but simply going from 1D to 2D halves performance.

Where are you getting your numbers? I'm not able to reproduce what you are getting. Are you changing the kernel at all?

Profiler output for the two runs, with the kernel counters side by side:

WriteBuffer:  Time 125.23941, DataTransferSize 65536  (256x1 run)
WriteBuffer:  Time 124.93881, DataTransferSize 65536  (16x16 run)
ReadBuffer:   Time  55.54961, DataTransferSize 65536  (256x1 run)
ReadBuffer:   Time  53.20153, DataTransferSize 65536  (16x16 run)

sobel_filter_00DB7830, GlobalWorkSize {4096 4096 1}:

Counter             {256 1 1}    {16 16 1}
Time                9.20506      18.42737
LDSSize             0            0
GPRs                7            7
ScratchRegs         0            0
FCStacks            2            2
Wavefronts          262144       262144
ALUInsts            143.94       144
FetchInsts          8            8
WriteInsts          1            1
LDSFetchInsts       0            0
LDSWriteInsts       0            0
ALUBusy             99.92        48.33
ALUFetchRatio       18           18
ALUPacking          80.14        80.14
FetchSize           212120.38    343136.38
CacheHit            0            0
FetchUnitBusy       22.2         71.85
FetchUnitStalled    0            61.11
WriteUnitStalled    0            56.78
FastPath            65503        65507.38
CompletePath        0            0
PathUtilization     100          100
ALUStalledByLDS     0            0
LDSBankConflict     0            0

0 Likes

ryta,

I think the performance depends on a lot of parameters, among which the access pattern of global memory is one of the most important.

Although in my case the performance is not deviating that much (57.37855 and 55.53019). Anyway, it is a fact that 2D arrays will be much slower than 1D arrays, because with a 1D array, when successive elements are accessed, all the channels are used and data is fetched from different banks. But with 2D arrays, bank conflicts or channel conflicts occur, which tend to serialize the accesses and create delays.

In the two cases you have given above, it appears that ALUBusy is 48.33 (2D case) and 99.92 (1D case). The reason behind this, in my view, is the FetchUnitBusy counter (22.2 for 1D, 71.85 for 2D).

For more optimization-related info, refer to the OpenCL Programming Guide.

0 Likes

Sadly, this isn't really covered that much in the Guide.

The DCT and MatrixTranspose samples both use 2D and probably wouldn't benefit from 1D as you suggest.

Many of the other samples, such as Scholes and Twister, are totally unaffected by the change in dimension.

So it's the access pattern of the Sobel and Simple Convolution samples that is causing bank conflicts/contention?

0 Likes

Originally posted by: himanshu.gautam ryta,

I think the performance depends on a lot of parameters, among which the access pattern of global memory is one of the most important.

Although in my case the performance is not deviating that much (57.37855 and 55.53019).



What do you mean "in your case"? What card are you using? Are you sure you didn't accidentally look at the ReadBuffer times? I'm simply using the samples, so I don't see how our code could differ.

0 Likes

Hi ryta1203,
I am sure I checked the kernel execution times and reported them above. That said, different GPUs can use different optimization techniques, and communication with the host depends on the CPU, OS, device driver, and many other factors, so we might get different results.


0 Likes

Honestly, that doesn't make sense as far as affecting the kernel run times.

I am running on a 5870, you still haven't told me which card you got the above numbers on.

The reply about the change in accesses is correct; however, I was confused at first since I didn't see an increase in the number of fetch instructions... that said, I suppose there shouldn't be one. What the 2D layout does do is increase the SIZE of each fetch, causing delays.

8x8 has the biggest fetch size, while 16x16 is about 0.75 of that.

Essentially, redditisgreat is correct.

0 Likes

I got those times on an HD 4870 X2, and I just rechecked them; they are what they are.

OK, please send your modified code of the sample and I will try to run it.

0 Likes

Originally posted by: himanshu.gautam I got those times on an HD 4870 X2, and I just rechecked them; they are what they are.

OK, please send your modified code of the sample and I will try to run it.

There's really nothing "modified" about the sample, other than that I changed "localThreads[1] = x" to "localThreads[2] = {x, x}" AND the clEnqueueNDRangeKernel call from 1D to 2D; that's it.

0 Likes

Ryta,

In that case, please send the code to streamdeveloper@amd.com,

with your complete system configuration.

0 Likes