Archives Discussions

bubu · ‎06-01-2010

Hi,

Imagine I want to apply an image filter to a 640x480 image using 128x128 local work groups. As the height (480) is not divisible by 128, a problem is going to occur!

Do I need to do something special to deal with this, or is the OpenCL implementation clever enough to skip the padding/pitch pixels automatically?

And... is 640x480 / 128x128 optimal? I'll get 5x4=20 blocks in total, and my 5750 has 9 compute units, so I think there will be enough data to be optimal (unless each compute unit can process several work groups in parallel!)

thx

LeeHowes · ‎06-01-2010

You want each workgroup to be 128x128? That's far too many wavefronts for a single core. If that's not what you mean, I'm not entirely sure what you're asking.

The OpenCL implementation will not automatically mask out work-items that are outside your range; you have to test against the global id yourself. It certainly won't guess the size of your image, so it can't possibly mask out work-items that fall outside the 480-pixel vertical dimension of the image.
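A minimal sketch of the usual workaround, written in plain C so it can run on the host: round the global size up to a multiple of the work-group size, then let every work-item compare its global id with the real image size. The helper names `round_up` and `count_in_range` are made up for illustration. In an actual kernel the loop indices would be `get_global_id(0)` and `get_global_id(1)`, with the width and height passed in as kernel arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of group (group > 0). */
size_t round_up(size_t n, size_t group)
{
    assert(group > 0);
    return ((n + group - 1) / group) * group;
}

/* Host-side stand-in for the kernel guard: walk the padded global range
   and count the work-items that fall inside the real image. In OpenCL C
   the same test would read
       if (get_global_id(0) >= width || get_global_id(1) >= height) return; */
size_t count_in_range(size_t width, size_t height,
                      size_t local_x, size_t local_y)
{
    size_t global_x = round_up(width, local_x);
    size_t global_y = round_up(height, local_y);
    size_t inside = 0;

    for (size_t y = 0; y < global_y; ++y) {
        for (size_t x = 0; x < global_x; ++x) {
            if (x >= width || y >= height)
                continue;   /* padding work-item: nothing to do */
            ++inside;       /* real pixel: the filter would run here */
        }
    }
    return inside;
}
```

With the numbers from the question, `round_up(480, 128)` gives 512, so 640 x 32 work-items are launched only to exit at the guard, and `count_in_range(640, 480, 128, 128)` still covers exactly 640 x 480 pixels.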

20 groups is a pretty small number. On the 5750 that's a couple per core which is probably ok for latency hiding, though this depends on whether you're really trying to put such big groups on the device because you need few enough waves that they have registers available to them.

bubu · ‎06-01-2010

Originally posted by: LeeHowes

20 groups is a pretty small number. On the 5750 that's a couple per core which is probably ok for latency hiding, though this depends on whether you're really trying to put such big groups on the device because you need few enough waves that they have registers available to them.

So what's a good work group size then? 16x16, so I get 1200 blocks using a grid of 640x480 pixels?

Or 8x8, to match the wavefront size (64 threads)?

How many blocks (or wavefronts) can each compute unit process in parallel?
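The arithmetic behind the candidate sizes mentioned so far can be checked with a short C sketch. The function names are hypothetical, and the wavefront size of 64 is the figure used in this thread for the HD 5750.

```c
#include <assert.h>
#include <stddef.h>

#define WAVEFRONT_SIZE 64   /* as stated in the thread for the HD 5750 */

/* Number of work-groups needed to cover `extent` pixels along one axis. */
size_t groups_along(size_t extent, size_t local)
{
    assert(local > 0);
    return (extent + local - 1) / local;
}

/* Total work-groups for a width x height image. */
size_t total_groups(size_t width, size_t height,
                    size_t local_x, size_t local_y)
{
    return groups_along(width, local_x) * groups_along(height, local_y);
}

/* Wavefronts per work-group (a partial wavefront counts as a whole one). */
size_t waves_per_group(size_t local_x, size_t local_y)
{
    size_t items = local_x * local_y;
    return (items + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}
```

For a 640x480 image this gives 20 groups of 256 wavefronts each at 128x128, 1200 groups of 4 wavefronts at 16x16, and 4800 groups of 1 wavefront at 8x8.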

LeeHowes · ‎06-01-2010

You want at least 4 waves per core to allow latency hiding. Whether you want that in one or more groups depends on whether you have barriers. If you have barriers you want more than one group on the core too so that when one group is reducing in size from barriers the other one can keep running.

It's not obvious to me why you're only looking at square groups. Is that vital for your filtering kernel in some way? More likely, because you need to unroll to overcome control overhead anyway, you'll want something like a 16x4 or 32x4 group and unroll in the vertical dimension to make up the difference. I don't think I've ever seen a CUDA or OpenCL image filter kernel for which a square block is optimal.
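One way to read the "unroll in the vertical dimension" advice, sketched in plain C as a host-side simulation: each work-item handles several vertically adjacent pixels, so the global range shrinks in y. The unroll factor `ROWS_PER_ITEM`, the function names, and the invert "filter" are all assumptions made for illustration, not anything specified in the post.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS_PER_ITEM 4   /* assumed unroll factor, not from the thread */

/* Stand-in for one work-item: (gx, gy) play the role of
   get_global_id(0) and get_global_id(1). Each work-item filters
   ROWS_PER_ITEM vertically adjacent pixels of column gx. */
void filter_item(const unsigned char *src, unsigned char *dst,
                 size_t width, size_t height, size_t gx, size_t gy)
{
    if (gx >= width)
        return;                         /* horizontal padding */
    for (size_t r = 0; r < ROWS_PER_ITEM; ++r) {
        size_t y = gy * ROWS_PER_ITEM + r;
        if (y >= height)
            break;                      /* vertical padding */
        /* placeholder filter: invert the pixel */
        dst[y * width + gx] = (unsigned char)(255 - src[y * width + gx]);
    }
}

/* Host-side loop standing in for the NDRange launch. Returns the number
   of work-items needed in y, before rounding up to the group height. */
size_t run_filter(const unsigned char *src, unsigned char *dst,
                  size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    size_t items_y = (height + ROWS_PER_ITEM - 1) / ROWS_PER_ITEM;
    for (size_t gy = 0; gy < items_y; ++gy)
        for (size_t gx = 0; gx < width; ++gx)
            filter_item(src, dst, width, height, gx, gy);
    return items_y;
}
```

Under these assumptions a 640x480 image needs a 640x120 global range, and a 16x4 group is exactly one 64-wide wavefront that covers a 16x16 tile of pixels.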

As for number of waves, I'll quote my earlier post on the subject:

http://developer.amd.com/gpu_assets/Heterogeneous_Computing_OpenCL_and_the_ATI_Radeon_HD_5870_Architecture_201003.pdf

Each dispatcher can manage 248 waves in flight (hence the Juniper numbers), with two dispatchers on the Cypress die. I don't know if there is a limit to how many waves will run on a SIMD short of register space, but there is a maximum of 8 work groups per SIMD for other reasons. Note that on Redwood and Cedar (and in realistic cases Juniper and Cypress too) the total is lower than the dispatcher can handle because it will always be register limited on the SIMDs. The actual number will of course be lower depending on what waves and work groups are allocated by the dispatcher to each SIMD.

The short answer is 24 waves per SIMD.
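The figure of 24 appears to follow from spreading the dispatcher capacity evenly over the SIMDs. A small C sketch of that division, assuming Juniper has a single dispatcher (the post implies this but does not say so outright) and using the SIMD counts given later in this thread (10 for the 5770, 20 for the 5870); the function name is hypothetical.

```c
#include <assert.h>

#define WAVES_PER_DISPATCHER 248  /* figure quoted in the post */

/* Average waves in flight per SIMD if the dispatchers' capacity is
   spread evenly over all SIMDs (integer division rounds down). */
unsigned waves_per_simd(unsigned dispatchers, unsigned simds)
{
    assert(simds > 0);
    return (dispatchers * WAVES_PER_DISPATCHER) / simds;
}
```

Both `waves_per_simd(1, 10)` and `waves_per_simd(2, 20)` come to 24, since 248 / 10 and 496 / 20 are each 24.8.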

bubu · ‎06-02-2010

Originally posted by: LeeHowes

You want at least 4 waves per core to allow latency hiding. Whether you want that in one or more groups depends on whether you have barriers. If you have barriers you want more than one group on the core too so that when one group is reducing in size from barriers the other one can keep running.

Ok, I use no barriers though.

Originally posted by: LeeHowes

It's not obvious to me why you're only looking at square groups. Is that vital for your filtering kernel in some way?

I do not need square blocks really. Btw, I should never use blocks smaller than the wavefront size (64), right? For example, 16x2 is bad, isn't it?

Originally posted by: LeeHowes

More likely, because you need to unroll to overcome control overhead anyway, you'll want something like a 16x4 or 32x4 group and unroll in the vertical dimension to make up the difference. I don't think I've ever seen a CUDA or OpenCL image filter kernel for which a square block is optimal.

Ok, I'll try that.

Originally posted by: LeeHowes

As for number of waves, I'll quote my earlier post on the subject:

http://developer.amd.com/gpu_assets/Heterogeneous_Computing_OpenCL_and_the_ATI_Radeon_HD_5870_Architecture_201003.pdf

interesting, thx!

Btw, you should put all those things in a clear table like CUDA's docs:

Max active blocks per multiprocessor (core, compute unit): 8

Warp (wavefront) size: 32 threads

Max active warps per multiprocessor (core, compute unit): 32

Max multiprocessors: 30 (GTX 280), 24 (GTX 260), 12 (GT 240)

Cycles per rcpsqrt = 2, sqrt = 4, cos = 8, etc.

etc etc

The \ATIStream\docs folder is almost empty.

Ok, if I understand correctly, the values for ATI's OpenCL are:

Max active blocks per compute unit: 8

Wavefront size: 64 threads

Max active wavefronts per compute unit: 24

Max compute units: 20 (5870), 10 (5770), 9 (5750), 5 (5670)

???

LeeHowes · ‎06-02-2010

Originally posted by: bubu

I do not need square blocks really. Btw, I should never use blocks smaller than the wavefront size (64), right? For example, 16x2 is bad, isn't it?

Quite right. That would waste execution resources.
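To put a number on the waste, here is a rough C sketch that computes what fraction of the wavefront lanes a given group shape keeps busy. The function name is made up for illustration, and the wavefront size is passed in because it is 64 on most of these parts but 32 on Cedar, as noted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of wavefront lanes doing useful work for a group of
   local_x * local_y work-items (rounded down to a whole percent). */
unsigned lane_utilization_percent(size_t local_x, size_t local_y,
                                  size_t wavefront_size)
{
    size_t items = local_x * local_y;
    assert(wavefront_size > 0 && items > 0);
    /* A partial wavefront still occupies a whole wavefront's lanes. */
    size_t waves = (items + wavefront_size - 1) / wavefront_size;
    return (unsigned)((100 * items) / (waves * wavefront_size));
}
```

A 16x2 group fills only 50% of a 64-wide wavefront, while 16x4 fills it completely; on a 32-wide wavefront the same 16x2 group would be full.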

Originally posted by: bubu

Max active blocks per compute unit: 8

Wavefront size : 64 threads

Max. active wavefronts per compute unit: 24

Max compute units: 20 ( 5870), 10(5770), 9 (5750), 5 (5670)

That sounds about right, though I haven't checked the numbers for the 5750 and 5670. The only things to note are that max active wavefronts is not, I think, an actual limit; it's an average over the device as a whole, given the capability of the sequencer. Registers will tend to be the real limit. Also, the wavefront size on Cedar is only 32.
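Taken together, the numbers in this thread give a rough upper bound on how many waves a group size can keep resident on one compute unit. The C sketch below is only an estimate under those stated figures: it treats 24 as a cap even though it is an average rather than a hard limit, and it ignores register and local memory pressure, which the post says will usually be the real constraint. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WAVEFRONT_SIZE    64   /* 32 on Cedar, per the thread */
#define MAX_GROUPS_PER_CU 8    /* limit quoted in the thread */
#define AVG_WAVES_PER_CU  24   /* dispatcher average, not a hard limit */

/* Upper bound on resident waves per compute unit for a work-group of
   `group_items` work-items, ignoring registers and local memory.
   Returns 0 if a single group already exceeds the wave budget. */
size_t max_waves_per_cu(size_t group_items)
{
    assert(group_items > 0);
    size_t waves_per_group =
        (group_items + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
    size_t groups = AVG_WAVES_PER_CU / waves_per_group;
    if (groups > MAX_GROUPS_PER_CU)
        groups = MAX_GROUPS_PER_CU;   /* group-count limit kicks in */
    return groups * waves_per_group;
}
```

By this estimate a 64-item group (16x4 or 8x8) tops out at 8 waves because of the 8-group limit, a 128-item group at 16 waves, and a 256-item group (16x16) at 24 waves. All three clear the "at least 4 waves per core" guideline given earlier in the thread.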


grid/workgroup size question