16 Replies Latest reply on Sep 7, 2010 12:55 PM by rotor

    About execution domain dimensions

    Raistmer
      How to choose them properly?

      Hello.
      I have a very long kernel that should run over a big execution domain.
      If I enqueue it over the whole domain at once it causes a driver restart (the execution takes too long).
      So I split the execution domain into blocks and call the kernel over the smaller parts.

      When I use a (128x2) grid size it runs OK, but with (256x1) it causes a driver restart. What can be the reasons for such different behavior with the same number of threads?

      I use an HD4870. It has (AFAIK) 10 compute units, so for better GPU usage one of the dimensions should be divisible by the number of compute units (in my case, 10). But also, for better load, each wavefront should have 64 threads. Do I understand right that, to meet both requirements, the first (X-axis) dimension should be divisible by 64 while the second (Y-axis) one should be divisible by 10 (in my case)?

      In other words, is it true that, say, 128x10 will work better on my GPU than 10x128?
        • About execution domain dimensions
          rotor

          I think it is just because, when you run 256x1, you have to specify it as a one-dimensional range {256}, not as a two-dimensional range {256, 1}.

          • About execution domain dimensions
            genaganna

             

            Originally posted by: Raistmer Hello. I have a very long kernel that should run over a big execution domain. If I enqueue it over the whole domain at once it causes a driver restart (the execution takes too long). So I split the execution domain into blocks and call the kernel over the smaller parts. When I use a (128x2) grid size it runs OK, but with (256x1) it causes a driver restart. What can be the reasons for such different behavior with the same number of threads? I use an HD4870. It has (AFAIK) 10 compute units, so for better GPU usage one of the dimensions should be divisible by the number of compute units (in my case, 10). But also, for better load, each wavefront should have 64 threads. Do I understand right that, to meet both requirements, the first (X-axis) dimension should be divisible by 64 while the second (Y-axis) one should be divisible by 10? In other words, is it true that, say, 128x10 will work better on my GPU than 10x128?


            Do you have barriers in the kernel? If you have barriers, the workgroup size can be at most 64. Please send a test case so we can reproduce your issue.

            You can have at most 256 threads per group, so it is not possible to use 128x10 or 10x128.

             

              • About execution domain dimensions
                HarryH

                 

                Originally posted by: genaganna Do you have barriers in the kernel? If you have barriers, the workgroup size can be at most 64. Please send a test case to reproduce your issue.


                Is this documented anywhere? I have been having trouble with a memory-testing kernel that produces sporadic errors in about 0.7% of runs, which led me to believe my 5870 was defective. It turned out that clGetKernelWorkGroupInfo said a workgroup size of 256 was OK. Reducing the workgroup size to 128 got rid of all the errors. But according to your post even that is too large. Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases? This was a really difficult bug to find, and I only tried the smaller workgroup size out of desperation, because initially I trusted the result returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...)

                Edit: is 64 the limit for all ASICs, or is the size of a wavefront the limit?

                  • About execution domain dimensions
                    genaganna

                     

                    Originally posted by: HarryH Is this documented anywhere? I have been having trouble with a memory-testing kernel that produces sporadic errors in about 0.7% of runs, which led me to believe my 5870 was defective. It turned out that clGetKernelWorkGroupInfo said a workgroup size of 256 was OK. Reducing the workgroup size to 128 got rid of all the errors. But according to your post even that is too large. Is this an OpenCL limitation or an AMD/ATI thing? Will this be addressed in future releases? This was a really difficult bug to find, and I only tried the smaller workgroup size out of desperation, because initially I trusted the result returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...)

                    This is true only for 4xxx cards. clGetKernelWorkGroupInfo returns a trusted value. It is recommended to use the value returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...).

                      • About execution domain dimensions
                        HarryH

                        In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a workgroup size of 256 led to errors and reducing the workgroup size got rid of them. The errors are more or less random and occur in only 0.7% of cases. But the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by the kernel, leading clGetKernelWorkGroupInfo to produce a suggested workgroup size that is too large.

                          • About execution domain dimensions
                            genaganna

                             

                            Originally posted by: HarryH In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a workgroup size of 256 led to errors and reducing the workgroup size got rid of them. The errors are more or less random and occur in only 0.7% of cases. But the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by the kernel, leading clGetKernelWorkGroupInfo to produce a suggested workgroup size that is too large.

                            Please give us a test case to reproduce this issue; a test case will help us fix it.

                            • About execution domain dimensions
                              rotor

                               

                               

                              Originally posted by: HarryH In my case the value returned by clGetKernelWorkGroupInfo was incorrect, because executing the kernel with a workgroup size of 256 led to errors and reducing the workgroup size got rid of them. The errors are more or less random and occur in only 0.7% of cases. But the problem can be reproduced, and the observed errors seem to occur in bursts of about 1024 bits (slightly less on average). As this is a memory tester I first thought of soft errors, but now I suspect the problem lies with the compiler's estimate of the resources needed by the kernel, leading clGetKernelWorkGroupInfo to produce a suggested workgroup size that is too large.

                              It is really hard to run a kernel at the GPU's maximum supported workgroup size. Early SDK versions let you use the largest possible size, but recent versions force you to use a size <= an optimized maximum computed by the compiler when it compiles the kernel source. According to the ATI documentation, because resources (e.g., registers) are limited, the maximum number of work-items per workgroup depends on the kernel's resource utilization. So, to determine the maximum workgroup size for your kernel, query CL_KERNEL_WORK_GROUP_SIZE rather than CL_DEVICE_MAX_WORK_GROUP_SIZE.

                            • About execution domain dimensions
                              ryta1203

                               

                              Originally posted by: genaganna This is true only for 4xxx cards. clGetKernelWorkGroupInfo returns a trusted value. It is recommended to use the value returned by clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...)

                              Is this based on any variables? If not, then I don't think this is a good idea, since kernels vary dramatically depending on the block size chosen.

                                • About execution domain dimensions
                                  himanshu.gautam

                                  ryta1203,

                                  It is really hard to determine a dimension preference from a fixed rule. When using global buffers, the more linear the memory access pattern, the faster the kernel runs.

                                  But when using images, the data can be effectively cached in the L1 cache (which is inherently 2D), so 2D cache lines are used more effectively by 2D memory access patterns.

                                  Moreover, your performance depends on how the kernel is written and executed on the particular GPU you are using.

                            • About execution domain dimensions
                              Raistmer
                              Originally posted by: genaganna Do you have barriers in the kernel? If you have barriers, the workgroup size can be at most 64. Please send a test case to reproduce your issue. You can have at most 256 threads per group, so it is not possible to use 128x10 or 10x128.

                              Maybe I was not clear: I am speaking not about the workgroup size but about the grid size; the workgroup size is not set (left at the default).
                              There is no real issue here, just a question. Big grid sizes cause a driver restart for pretty obvious reasons.
                              The question is why 128x2 can execute faster than 256x1 on my hardware.
                              Can it be related to different memory access patterns, for example?







                                • About execution domain dimensions
                                  himanshu.gautam

                                  raistmer,

                                  Indeed you are right: it is the memory access patterns that determine performance in such cases. But these patterns depend on many factors:

                                  how are you fetching data inside your kernel?

                                  how are your wavefronts organised?

                                  how is your cache being used?

                                  Some cases favour a 1D approach while others favour a 2D approach. Therefore it is very hard to determine which way is more efficient.