
Pavel_Kudrin
Journeyman III

About the work-group scheduling mechanism

I have a Radeon HD 4870 X2 card (RV770 GPU), and I found experimentally that the number of simultaneously processing work-groups is 32.

I don't understand where this number comes from.

As far as I know, CL_DEVICE_MAX_COMPUTE_UNITS = 10 for the RV770. Why 32, then?

Additional info:

a) works fine:

globalWorkSize = 8192

localWorkSize = 256

b) doesn't work:

globalWorkSize = 8448

localWorkSize = 256

c) doesn't work:

globalWorkSize = 16384

localWorkSize = 256

The local work size is obtained from clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).
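For reference, a minimal host-side sketch of that query (not from the original post; "kernel" and "device" are assumed to have been created elsewhere, and the function name is just illustrative):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only: query the per-kernel work-group size limit for one device.
 * "kernel" and "device" are assumed to come from the usual clCreateKernel /
 * clGetDeviceIDs calls elsewhere in the host code. */
static size_t query_kernel_work_group_size(cl_kernel kernel, cl_device_id device)
{
    size_t wg_size = 0;
    cl_int err = clGetKernelWorkGroupInfo(kernel, device,
                                          CL_KERNEL_WORK_GROUP_SIZE,
                                          sizeof(wg_size), &wg_size, NULL);
    if (err != CL_SUCCESS)
        return 0;                      /* caller decides how to handle errors */
    printf("CL_KERNEL_WORK_GROUP_SIZE = %zu\n", wg_size);
    return wg_size;
}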

Then I have the following questions:

1) How many work-groups can execute simultaneously?

2) If that number exists and is finite, does the number of simultaneously processing work-groups depend on the GPU type?

3) And if that number exists and is finite, how can it be retrieved programmatically by querying the device (like the number of SIMD engines via clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...))?
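For completeness, the device query mentioned in question 3 looks roughly like this (a sketch only; "device" is assumed to come from an earlier clGetDeviceIDs call):

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only: query the number of compute units (SIMD engines). */
static cl_uint query_compute_units(cl_device_id device)
{
    cl_uint cu = 0;
    if (clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cu), &cu, NULL) != CL_SUCCESS)
        return 0;
    printf("CL_DEVICE_MAX_COMPUTE_UNITS = %u\n", cu);   /* 10 on an RV770 */
    return cu;
}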

0 Likes
12 Replies
nou
Exemplar

The 4870 has 10 compute units. Each has 16 five-way SIMD cores, so 10*16*5 = 800 stream processors, which matches the specification.

0 Likes
Fr4nz
Journeyman III

Originally posted by: Pavel Kudrin

Then I have the following questions:

1) How many work-groups can execute simultaneously?

2) If that number exists and is finite, does the number of simultaneously processing work-groups depend on the GPU type?

3) And if that number exists and is finite, how can it be retrieved programmatically by querying the device (like the number of SIMD engines via clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...))?



On current ATI hardware, every compute unit should be able to execute a work-group of 256 work-items at a given time. If you use fewer work-items per work-group, a compute unit may execute more than one work-group simultaneously, but this depends on how wavefronts are managed at the hardware level. So this is a question that only AMD staff can answer precisely... Anyway, you shouldn't worry about this when you design your kernel, given that this aspect can't be managed from the programmer's side.

0 Likes

 

Thank you for the reply!

nou, 

As I understood from the documentation, "software thread" and "hardware thread" have different meanings, and the number of hardware processing elements does not equal the number of software threads executing at a time. We have a work-group size of 256 software threads, but a SIMD engine has only 16 SIMD cores, or 16*5 = 80 hardware threads at a time. In any case, neither number equals 256.

10 compute units do not equal 32. Even if we account for the fact that the 4870 X2 has two chips, each with 10 compute units, 20 compute units still do not equal 32.

So I think the scheduler uses its own algorithm for mapping work-groups onto compute units.

Fr4nz,

I use 256 work-items per group, so the number of work-groups running on a compute unit should take its minimum value and be equal to 1. Also, experiments with my kernel showed that the result is insensitive to the work-group size (I tried work-group sizes of 128 and 64 with the same result as for 256).

Yes, it would be nice if AMD staff answered this question, because I am inclined to organize a mapping

device -> number of work-groups executing at a time

I need this number for my task to make my kernel work when globalWorkSize > 8192.
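(Not from the original post: one hedged workaround for large problem sizes is to split the work into several smaller enqueues. The sketch below assumes a hypothetical kernel that takes an explicit base-offset argument and adds it to get_global_id(0) itself, since OpenCL 1.0 requires the global_work_offset parameter of clEnqueueNDRangeKernel to be NULL; all names are illustrative.)

#include <CL/cl.h>

/* Hypothetical helper: run a large 1-D problem as several smaller enqueues.
 * Assumes the kernel takes a cl_uint base offset as argument 1, and that
 * total_items and chunk_items are both multiples of local_size
 * (OpenCL 1.0 rejects global sizes not divisible by the local size). */
static cl_int enqueue_in_chunks(cl_command_queue queue, cl_kernel kernel,
                                size_t total_items, size_t chunk_items,
                                size_t local_size)
{
    for (size_t base = 0; base < total_items; base += chunk_items) {
        size_t global = chunk_items;
        if (base + global > total_items)
            global = total_items - base;            /* last, smaller chunk */

        cl_uint base_arg = (cl_uint)base;
        cl_int err = clSetKernelArg(kernel, 1, sizeof(base_arg), &base_arg);
        if (err != CL_SUCCESS)
            return err;

        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global, &local_size, 0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
    }
    return clFinish(queue);                         /* wait for every chunk */
}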

 

0 Likes

Originally posted by: Pavel Kudrin

Thank you for the reply!

nou,

As I understood from the documentation, "software thread" and "hardware thread" have different meanings, and the number of hardware processing elements does not equal the number of software threads executing at a time. We have a work-group size of 256 software threads, but a SIMD engine has only 16 SIMD cores, or 16*5 = 80 hardware threads at a time. In any case, neither number equals 256.

That's why threads are implicitly organized into wavefronts (or warps, if you want to use CUDA terminology) at the hardware level. And that's why you don't have to worry about this: just find the work-group size that gives you the best performance with your kernel (keeping in mind that you get the best results when the size of every work-group is a multiple of the wavefront size).
Maybe you should look at these tutorial videos; they're very good if you want to shed some light on how threads are managed by video cards at the hardware level:

http://www.macresearch.org/opencl_episode1

through...

http://www.macresearch.org/opencl_episode6

In my opinion you're treating a nonexistent problem as a real one...
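As a small added illustration of the "multiple of the wavefront size" advice (sketch only; the 64-wide wavefront is the one discussed later in this thread, and the helper name is hypothetical):

#include <stddef.h>

/* Sketch only: pick a local size that is a multiple of the 64-wide wavefront
 * (e.g. 64, 128 or 256) and round the useful work size up to a multiple of it.
 * The extra work-items must exit early in the kernel, e.g.
 *   if (get_global_id(0) >= n) return;                                      */
static size_t round_up_to_multiple(size_t value, size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

/* Example: 1000 useful items with a local size of 256 gives a global size of 1024. */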
0 Likes

When a work-group executes on a SIMD engine, it is executed in 4 waves. And it is not 16*5 = 80 hardware threads but only 16 threads; however, each thread can execute 5 independent instructions at the same time.

0 Likes
Fr4nz
Journeyman III

Originally posted by: nou When a work-group executes on a SIMD engine, it is executed in 4 waves. And it is not 16*5 = 80 hardware threads but only 16 threads; however, each thread can execute 5 independent instructions at the same time.

 

So, if we want to draw a parallel with NVIDIA hardware, on ATI hardware we have warps of 64 threads, and each warp is made of 4 sub-warps of 16 threads each? Am I correct?

0 Likes
genaganna
Journeyman III

Originally posted by: Pavel Kudrin I have a Radeon HD 4870 X2 card (RV770 GPU), and I found experimentally that the number of simultaneously processing work-groups is 32.

I don't understand where this number comes from.

As far as I know, CL_DEVICE_MAX_COMPUTE_UNITS = 10 for the RV770. Why 32, then?

Could you please tell us how you arrived at this number of 32?

 

b) doesn't work:

globalWorkSize = 8448

localWorkSize = 256

What do you mean by "doesn't work"? Could you please give us a test case that shows this failure?

 

c) doesn't work:

globalWorkSize = 16384

localWorkSize = 256

What do you mean by "doesn't work"? Could you please give us a test case that shows this failure?

The local work size is obtained from clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).

 

Then I have the following questions:

1) How many work-groups can execute simultaneously?

I have no idea what the hardware limit on simultaneous work-groups per SIMD is.

I am sure it can run more than one work-group simultaneously on a single SIMD.

2) If that number exists and is finite, does the number of simultaneously processing work-groups depend on the GPU type?

It depends on the GPU type.

3) And if that number exists and is finite, how can it be retrieved programmatically by querying the device (like the number of SIMD engines via clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...))?

 

There is no way to retrieve this information from OpenCL. I am not sure whether it is possible from CAL.

 

Side note: In any case, we have no programmatic control over the number of work-groups running on a compute unit in any OpenCL implementation.

The number of work-groups running simultaneously on a SIMD depends on the resource usage of each work-group, such as registers and shared (local) memory. Of course, there is a limit on the maximum number of work-groups running on a compute unit (SIMD).
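(Added illustration, not part of the original reply: the local-memory half of that resource usage can at least be inspected from OpenCL; register usage cannot. The ratio below is only a rough upper bound under that assumption.)

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only: estimate an upper bound on work-groups per compute unit from
 * local-memory usage alone. The real limit also depends on registers and
 * wavefront slots, which OpenCL does not expose. */
static void print_local_mem_occupancy_hint(cl_kernel kernel, cl_device_id device)
{
    cl_ulong kernel_local = 0, device_local = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(kernel_local), &kernel_local, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(device_local), &device_local, NULL);
    if (kernel_local > 0)
        printf("local-memory bound: at most %llu work-groups per compute unit\n",
               (unsigned long long)(device_local / kernel_local));
    else
        printf("kernel uses no local memory; other resources set the limit\n");
}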

0 Likes

Fr4nz,

it seems you are right. The RV770 can run 64 threads on a single SIMD engine: 16 thread processors, each executing 4 threads at a time, 16*4 = 64. That way I get 128 simultaneously working groups (localWorkSize = 64, globalWorkSize = 8192).

When I use localWorkSize = 256 and each SIMD engine still runs 64 threads at a time, 4 SIMD engines are needed to execute 256 work-items at once, so the number of simultaneously processing work-groups is 4 times smaller and equals 32 (1/4 of 128) (globalWorkSize = 8192).

If I use localWorkSize = 32, I see the same effect as for localWorkSize = 64. No matter whether it is 64, 32, 16, etc., one SIMD engine is assigned to execute one work-group.

genaganna,

my kernel has loops, and "doesn't work" means that the GPU hangs while the work-groups execute these loops, and only a system reboot helps.

Why 32? Because it hangs when I use globalWorkSize > 8192, i.e. more than 32 work-groups (in the case of localWorkSize = 256).

In the case of localWorkSize = 64, the number of simultaneous groups is 128.

Even if one SIMD engine can execute more than one work-group at a time, why 128? 128 is not a multiple of 10.

Moreover, I have the impression that the GPU executes work-groups serially in batches of 128: first it executes groups 0...127, then 128...255, then 256...383, and so on.

0 Likes

Originally posted by: Pavel Kudrin

genaganna,

my kernel has loops, and "doesn't work" means that the GPU hangs while the work-groups execute these loops, and only a system reboot helps.

Why 32? Because it hangs when I use globalWorkSize > 8192, i.e. more than 32 work-groups (in the case of localWorkSize = 256).

In the case of localWorkSize = 64, the number of simultaneous groups is 128.

Even if one SIMD engine can execute more than one work-group at a time, why 128? 128 is not a multiple of 10.

Moreover, I have the impression that the GPU executes work-groups serially in batches of 128: first it executes groups 0...127, then 128...255, then 256...383, and so on.

Please give us both the kernel code and the host (runtime) code.

On 4xxx-series cards, use the local work-group size returned by clGetKernelWorkGroupInfo(CL_KERNEL_WORK_GROUP_SIZE), or 64.
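(One possible reading of that advice, added here as a sketch only, with an illustrative function name: take the kernel's reported limit, capped at one 64-wide wavefront.)

#include <CL/cl.h>

/* Sketch only: choose a local size no larger than the kernel's reported
 * work-group size limit, capped at 64. */
static size_t pick_local_size(cl_kernel kernel, cl_device_id device)
{
    size_t max_wg = 0;
    if (clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg), &max_wg, NULL) != CL_SUCCESS)
        return 64;
    return (max_wg < 64) ? max_wg : 64;
}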

0 Likes

I wrote another simple example.

Now localWorkSize = 64

And now I get another magic number of groups working at a time: 244.

It seems that this number varies on a given GPU but has a constant value for a specific kernel.

And as I understand it, there is no mechanism to determine this number?

And is the minimum value of this number equal to CL_DEVICE_MAX_COMPUTE_UNITS?

0 Likes

How do you measure these numbers?

0 Likes

Pavel Kudrin,

Could you be a little more specific about how you got these values? A test case would help us understand the problem better and answer your query.

0 Likes