6 Replies Latest reply on Mar 4, 2010 2:43 AM by thatguymike

    Why 256 work_items on RV770?

    drstrip

      The RV770 shows 10 "compute units", and 256 work_items per dimension with 3 dimensions. In Brook+ you are allowed 1024 vector elements per dimension. Neither of these numbers actually relates to the number of thread processors, so where do they come from?

        • Why 256 work_items on RV770?
          genaganna

           

          Originally posted by: drstrip The RV770 shows 10 "compute units", and 256 work_items per dimension with 3 dimensions. In Brook+ you are allowed 1024 vector elements per dimension. Neither of these numbers actually relates to the number of thread processors, so where do they come from?


          Each compute unit can execute one or more work_groups concurrently, depending on the resources each work_group uses.

            • Why 256 work_items on RV770?
              drstrip

              but this doesn't answer the question of why the number is 256. 256 (per dimension) is not a multiple of the number of thread processors. Likewise Brook+ used 1024 per dimension (not quite the same notion, admittedly), which also is not a multiple of the thread processors. So, it seems these numbers are essentially arbitrary.

               

              So, perhaps I should rephrase my question. Is the choice of 256 as the max work_items an arbitrary choice by the implementor?

                • Why 256 work_items on RV770?
                  gaurav.garg

                   

                  Originally posted by: drstrip but this doesn't answer the question of why the number is 256. 256 (per dimension) is not a multiple of the number of thread processors. Likewise Brook+ used 1024 per dimension (not quite the same notion, admittedly), which also is not a multiple of the thread processors. So, it seems these numbers are essentially arbitrary. So, perhaps I should rephrase my question. Is the choice of 256 as the max work_items an arbitrary choice by the implementor?



                  One work-group is executed on a single compute unit (aka SIMD engine), which contains 16 processing elements (also called SPs). A compute unit executes 64 threads (1 wavefront) over 4 cycles, i.e. 16 threads per cycle. A work-group is always divided into wavefronts. That's why the work-group size is a multiple of the wavefront size, not of the number of compute units.
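To make the arithmetic in the reply above concrete, here is a minimal sketch. It assumes only the figures quoted in the post (64 work-items per wavefront, 16 processing elements per compute unit); the function names are made up for illustration and are not part of any OpenCL API.

```python
# Figures taken from the post above; treat them as assumptions for RV770.
WAVEFRONT_SIZE = 64        # work-items (threads) per wavefront
PROCESSING_ELEMENTS = 16   # processing elements per compute unit


def wavefronts_per_group(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to cover a work-group (rounded up)."""
    return (work_group_size + wavefront_size - 1) // wavefront_size


def cycles_per_wavefront(wavefront_size=WAVEFRONT_SIZE,
                         processing_elements=PROCESSING_ELEMENTS):
    """Cycles needed to issue one wavefront across the processing elements."""
    return wavefront_size // processing_elements


if __name__ == "__main__":
    print(wavefronts_per_group(256))   # wavefronts in a 256-item work-group
    print(cycles_per_wavefront())      # cycles to issue one wavefront
```

A work-group size that is not a multiple of 64 still occupies a whole number of wavefronts in this model, which is why the rounding goes up.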

                    • Why 256 work_items on RV770?
                      drstrip

                      This helps a lot, especially since I had somehow read right past the part of the spec that says all the work_items in a work_group execute on a single processor. Now the power-of-two limit on work_items makes sense to me, since it decouples the limit from the number of SIMD engines.

                      A question of clarification -

                      For the RV770, the max work_items for each dimension is 256. The max work group size is also 256. If I understand this now, that means I can assign 16x16x1 or 32x8x1, etc., to a work group. The max work group size is the total number of items, not items per dimension. Right?
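The reading in the question above can be checked with a small sketch. It assumes the two limits quoted for the RV770 (256 work_items per dimension, 256 per work group in total); `is_valid_work_group` is a hypothetical helper, not an OpenCL call.

```python
# Limits as quoted in the thread for RV770; assumptions for this sketch.
MAX_ITEMS_PER_DIM = (256, 256, 256)
MAX_WORK_GROUP_SIZE = 256


def is_valid_work_group(shape,
                        max_per_dim=MAX_ITEMS_PER_DIM,
                        max_total=MAX_WORK_GROUP_SIZE):
    """True if every dimension fits its limit and the product fits the total."""
    total = 1
    for size, limit in zip(shape, max_per_dim):
        if size < 1 or size > limit:
            return False
        total *= size
    return total <= max_total


if __name__ == "__main__":
    for shape in [(16, 16, 1), (32, 8, 1), (256, 1, 1), (256, 2, 1)]:
        print(shape, is_valid_work_group(shape))
```

Under these assumptions (256, 2, 1) is rejected even though each dimension is within its own limit, because the product exceeds 256.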

                       

                      Anyway, thanks for clearing up most of this for me.

                        • Why 256 work_items on RV770?
                          thatguymike

                          Correct, 256 work items total in a work group. The total global size can be much larger, but it must be made up of work groups no larger than 256 work items.

                           

                          For example, your total global size can be 8192x8192, made up of 16x16 work groups. In this case, you will have 262144 work groups (512x512), with each work group having 256 work items (16x16).
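The numbers in this example can be reproduced with a short sketch. It assumes the global size is evenly divisible by the work group size in each dimension; `count_work_groups` is a hypothetical helper name used only here.

```python
def count_work_groups(global_size, local_size):
    """Return (work groups per dimension, total work groups).

    Assumes each global dimension is an exact multiple of the matching
    local (work group) dimension; raises ValueError otherwise.
    """
    per_dim = []
    total = 1
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError("global size %d is not a multiple of %d" % (g, l))
        per_dim.append(g // l)
        total *= g // l
    return tuple(per_dim), total


if __name__ == "__main__":
    per_dim, total = count_work_groups((8192, 8192), (16, 16))
    print(per_dim, total)      # work groups per dimension and in total
    print(16 * 16)             # work items in each work group
```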

                  • Why 256 work_items on RV770?
                    MicahVillmow
                    256 is 4 wavefronts on the high-end and some mid-range graphics cards, 8 wavefronts on other mid-range cards, and 16 wavefronts on some low-end cards, so the number is not arbitrary. It will be higher in future releases, but the problem with allowing a lot of wavefronts is that resources disappear quickly. The limit will change in future revisions and will eventually equal the maximum the device allows for very simple kernels.
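The wavefront counts above imply a wavefront size for each class of card. The sketch below only does that division; the resulting sizes are inferred from the counts in the post, not quoted from a hardware specification, and the class labels are shorthand for this example.

```python
MAX_WORK_GROUP_SIZE = 256

# Wavefronts per 256-item work group, as stated in the post above.
WAVEFRONTS_PER_GROUP = {
    "high-end / mid-range": 4,
    "some mid-range": 8,
    "some low-end": 16,
}


def implied_wavefront_size(wavefronts, group_size=MAX_WORK_GROUP_SIZE):
    """Work-items per wavefront implied by a wavefront count (inferred)."""
    return group_size // wavefronts


if __name__ == "__main__":
    for card_class, count in WAVEFRONTS_PER_GROUP.items():
        print(card_class, implied_wavefront_size(count))
```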