12 Replies Latest reply on Feb 19, 2010 10:04 AM by omkaranathan

    About work-group scheduling mechanism

    PavelKudrin

      I have a Radeon HD 4870 X2 card (RV770 GPU), and I found experimentally that the number of simultaneously processing work-groups is 32.

      I don't understand where this number comes from.

      As far as I know, for the RV770 CL_DEVICE_MAX_COMPUTE_UNITS = 10. Why then 32?

      Additional info:

      a) works fine:

      globalWorkSize = 8192

      localWorkSize = 256

      b) doesn't work:

      globalWorkSize = 8448

      localWorkSize = 256

      c) doesn't work:

      globalWorkSize = 16384

      localWorkSize = 256

      The local work size is obtained from clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).

      Then I have the following questions:

      1) How many work-groups can execute simultaneously?

      2) If that number exists and is finite, does the number of simultaneously processing work-groups depend on the GPU type?

      3) Also, if that number exists and is finite, how can this number be retrieved programmatically by querying the device (like the number of SIMD engines via clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...))?

        • About work-group scheduling mechanism
          nou

          The 4870 has 10 compute units. Each has 16 five-way SIMD cores, so 10*16*5 = 800, which matches the specification.

          • About work-group scheduling mechanism
            Fr4nz

             

            Originally posted by: Pavel Kudrin

            Then I have the following questions:

            1) How many work-groups can execute simultaneously?

            2) If that number exists and is finite, does the number of simultaneously processing work-groups depend on the GPU type?

            3) Also, if that number exists and is finite, how can this number be retrieved programmatically by querying the device (like the number of SIMD engines via clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...))?



            On current ATI hardware, every compute unit should be able to execute a work-group of 256 work-items at a given time. If you use fewer work-items per work-group, a compute unit may execute more than one work-group simultaneously, but this depends on how wavefronts are managed at the hardware level. So this is a question that only AMD staff can answer precisely... Anyway, you shouldn't worry about this when designing your kernel, given that this aspect can't be managed from the programmer's side.

              • About work-group scheduling mechanism
                PavelKudrin

                 

                Thank you for reply!

                nou, 

                As I understood from the documentation, "software thread" and "hardware thread" have different meanings, and the number of hardware processing elements does not equal the number of software threads executing at a time. We have a work-group size of 256 software threads, but a SIMD engine has only 16 SIMD cores, or 16*5 = 80 hardware threads at a time. That number does not equal 256 either.

                10 compute units does not equal 32. Even accounting for the 4870 X2 having two chips, each with 10 compute units, 20 compute units still does not equal 32.

                So I think the scheduler uses its own algorithm for mapping work-groups onto compute units.

                Fr4nz,

                I use 256 work-items per group, so the number of work-groups running on a compute unit should be at its minimum, equal to 1. Also, experiments with my kernel show the result is independent of the work-group size (I've tried work-group sizes of 128 and 64; same result as for 256).

                Yes, it would be nice if AMD staff answered this question, because I'm inclined to build a mapping:

                device -> number of work-groups executing at a time

                I need this number for my task, to make my kernel work when globalWorkSize > 8192.

                 

                  • About work-group scheduling mechanism
                    Fr4nz

                     

                    Originally posted by: Pavel Kudrin

                    Thank you for reply!

                     

                    nou, 

                     

                    As I understood from the documentation, "software thread" and "hardware thread" have different meanings, and the number of hardware processing elements does not equal the number of software threads executing at a time. We have a work-group size of 256 software threads, but a SIMD engine has only 16 SIMD cores, or 16*5 = 80 hardware threads at a time. That number does not equal 256 either.

                    That's why threads are implicitly organized into wavefronts (or warps, in CUDA terminology) at the hardware level, and that's why you don't have to worry about this: just find the work-group size that gives you the best performance with your kernel (keeping in mind that you get the best results when the size of every work-group is a multiple of the wavefront size).
                    Maybe you should look at some tutorial videos; they're very good if you want to shed some light on how threads are managed by video cards at the hardware level.


                    In my opinion, you're treating a nonexistent problem as a problem...
                      • About work-group scheduling mechanism
                        nou

                        When a work-group executes on a SIMD engine, it is executed in 4 waves. And it is not 16*5 = 80 hardware threads, but only 16 threads; however, each thread can execute 5 independent instructions at the same time.

                          • About work-group scheduling mechanism
                            Fr4nz

                             

                            Originally posted by: nou When a work-group executes on a SIMD engine, it is executed in 4 waves. And it is not 16*5 = 80 hardware threads, but only 16 threads; however, each thread can execute 5 independent instructions at the same time.

                             

                            So, if we want to draw a parallel with NVIDIA hardware, on ATI hardware we have warps of 64 threads, and each warp is made of 4 sub-warps of 16 threads each? Am I correct?

                    • About work-group scheduling mechanism
                      genaganna

                       

                      Originally posted by: Pavel Kudrin I have a Radeon HD 4870 X2 card (RV770 GPU), and I found experimentally that the number of simultaneously processing work-groups is 32.

                      I don't understand where this number comes from.

                      As far as I know, for the RV770 CL_DEVICE_MAX_COMPUTE_UNITS = 10. Why then 32?

                      Could you please tell us how you arrived at this number 32?

                       

                      b) doesn't work:

                       

                      globalWorkSize = 8448

                       

                      localWorkSize = 256

                      What do you mean by "doesn't work"? Could you please give us a test case that shows this failure?

                       

                      c) doesn't work:

                       

                      globalWorkSize = 16384

                       

                      localWorkSize = 256

                       


                      What do you mean by "doesn't work"? Could you please give us a test case that shows this failure?


                      The local work size is obtained from clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).

                       

                      Then I have the following questions:

                      1) How many work-groups can execute simultaneously?

                      I have no idea what the hardware limit on simultaneous work-groups per SIMD is.

                      I am sure it can run more than one work-group simultaneously on a single SIMD.

                       

                      2) If that number exists and is finite, does the number of simultaneously processing work-groups depend on the GPU type?

                      It depends on the GPU type.

                       

                       

                      3) Also, if that number exists and is finite, how can this number be retrieved programmatically by querying the device (like the number of SIMD engines via clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...))?

                       

                      There is no way to retrieve this information from OpenCL. I am not sure whether it is possible from CAL.

                       

                      Side note: In any case, we have no programmatic control over the number of work-groups running on a compute unit in any OpenCL implementation.

                      The number of work-groups running simultaneously on a SIMD depends on each work-group's resource usage, such as registers and shared memory. Of course, there is also a limit on the maximum number of work-groups running on a compute unit (SIMD).

                        • About work-group scheduling mechanism
                          PavelKudrin

                          Fr4nz,

                          It seems you are right. The RV770 can run 64 threads on a single SIMD engine: 16 thread processors, each executing 4 threads at a time, so 16*4 = 64. That way I get 128 simultaneously working groups (localWorkSize = 64, globalWorkSize = 8192).

                          When I use localWorkSize = 256, each SIMD engine still runs 64 threads at a time, so 4 SIMD engines are needed to run 256 work-items at once, and the number of simultaneously processing work-groups is 4 times smaller, equal to 32 (1/4 of 128) (globalWorkSize = 8192).

                          If I use localWorkSize = 32, the effect is the same as for localWorkSize = 64. Whether it's 64, 32, 16, etc., one SIMD engine is assigned to run one work-group.

                          genaganna,

                          my kernel has loops, and "doesn't work" means the GPU hangs while work-groups execute these loops, and only a system reboot helps.

                          Why 32? Because it hangs when I use globalWorkSize > 8192, i.e. more than 32 work-groups (for localWorkSize = 256).

                          For localWorkSize = 64, the number of simultaneous groups is 128.

                          Even if one SIMD engine can run more than one work-group at a time, why 128? 128 is not a multiple of 10.

                          Moreover, I have a sense that the GPU executes work-groups in batches of 128, serially: first groups 0...127, then 128...255, then 256...383, and so on.

                            • About work-group scheduling mechanism
                              genaganna

                               

                              Originally posted by: Pavel Kudrin

                              genaganna,

                              my kernel has loops, and "doesn't work" means the GPU hangs while work-groups execute these loops, and only a system reboot helps.

                              Why 32? Because it hangs when I use globalWorkSize > 8192, i.e. more than 32 work-groups (for localWorkSize = 256).

                              For localWorkSize = 64, the number of simultaneous groups is 128.

                              Even if one SIMD engine can run more than one work-group at a time, why 128? 128 is not a multiple of 10.

                              Moreover, I have a sense that the GPU executes work-groups in batches of 128, serially: first groups 0...127, then 128...255, then 256...383, and so on.

                              Please give us both kernel code and runtime code.

                              On 4xxx-series cards, use the local work-group size returned by clGetKernelWorkGroupInfo(CL_KERNEL_WORK_GROUP_SIZE), or 64.