11 Replies Latest reply on Jul 1, 2010 3:02 PM by FangQ

    maximum registers per thread?

    FangQ

      I have an OpenCL program that uses 54 registers per thread. It runs 3x slower on a 5870 compared with an Nvidia GTX 470 using similar configurations.

      I heard that the 5870 only allows ~30 registers per thread and that the rest are spilled to global memory. Is this true? Is there anything I can optimize?

        • maximum registers per thread?
          ryta1203

          1. How many registers does the Nvidia version use?

          2. YES, I'm sure there are a TON of optimizations you could make. For starters, are you vectorizing? How much control flow? Have you tried splitting the kernel? Etc, etc...

            • maximum registers per thread?
              FangQ

              I ran the same OpenCL code, so the number of registers is the same on both hardware platforms.

              My OpenCL kernel is almost identical to the CUDA kernel, which you can browse at  http://is.gd/d88jw

              The main kernel is "mcx_main_loop()". Any suggestions, particularly concerning the differences in how this kernel behaves on the two hardware platforms?

                • maximum registers per thread?
                  ryta1203

                   

                  Originally posted by: FangQ I ran the same OpenCL code, so the number of registers is the same on both hardware platforms.

                  My OpenCL kernel is almost identical to the CUDA kernel, which you can browse at  http://is.gd/d88jw

                  The main kernel is "mcx_main_loop()". Any suggestions, particularly concerning the differences in how this kernel behaves on the two hardware platforms?

                  1. Why? Register allocation is done by the compiler and you are using two different compilers so this "assumption" is false (though it may be that the two do, in fact, use the same number of registers).

                  2. You are not going to see good performance from AMD GPUs unless you vectorize your code (or you have no data dependency and no/little control flow).

                   

                    • maximum registers per thread?
                      FangQ

                       

                      1. Why? Register allocation is done by the compiler and you are using two different compilers so this "assumption" is false (though it may be that the two do, in fact, use the same number of registers).


                      I only know how to get the register count for Nvidia (nvcc --ptxas-options=-v). Can you teach me how to find this out for ATI Stream? (I work with Ubuntu Linux.) Thanks.

                      2. You are not going to see good performance from AMD GPUs unless you vectorize your code (or you have no data dependency and no/little control flow).


                       

                      I did run the shader analyzer on this code earlier this year, and most (80%) of the instructions were nicely packed to use the 5 VLIW slots simultaneously. I hope things have not changed too much lately.

                      Also, do you think connecting the video card to a display will have any impact on speed? My Nvidia card is dedicated (not used for display), but my ATI card is connected to dual monitors.

                        • maximum registers per thread?
                          ryta1203

                          The profiler or SKA can give you the GPR used... or you can simply count the GPR used in the ISA.

                          Yes, 80% is nice, but maybe if you vectorized your code you might get more... hey, 10% is 10%. Admittedly though, I haven't looked at your kernel(s).

                          I'm not positive (though I think you can find an answer if you search this or the ATI Stream forum) but I would imagine that it would have some impact on the performance, again though, don't quote me on that.

                          Is there any way to split your kernel into multiple kernels? I can't be certain but this may provide some benefit if you are doing a lot of spilling.

                          Also, I'm not sure how good the compiler is at register allocation (it is a well-researched topic, so I can't imagine it would be bad), but would vectorizing your code help the GPR count? Again, I would think not, but it's possible.

                            • maximum registers per thread?
                              hazeman

                              To get the ISA, set the environment variable GPU_DUMP_DEVICE_KERNEL=3.

                              At the end of the ISA there is info with number of registers used.
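A minimal sketch of how the variable could be set for a single run from a Python launcher. The variable name and the value 3 come from the post above. The helper names and the program name in the comment are hypothetical, and nothing here assumes where the runtime writes the dump.

```python
import os
import subprocess


def isa_dump_env(base_env=None):
    """Return a copy of the environment with the ISA dump variable set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_DUMP_DEVICE_KERNEL"] = "3"  # name and value taken from the post
    return env


def run_with_isa_dump(command):
    """Run `command` (a list of arguments) with the dump variable set."""
    return subprocess.run(command, env=isa_dump_env(), check=False)


# Example call, with a hypothetical program name:
#   run_with_isa_dump(["./my_opencl_app"])
```

Passing the modified copy through `env=` keeps the setting local to that one child process instead of changing the launcher's own environment.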

                              The IL->ISA compiler is also sometimes really stupid (or, more accurately, badly written) and can use an excessive number of registers. A sure way to trigger this problem is to compute values inside a loop that do not depend on the loop index. The IL compiler will try to precompute them before the loop and pass those values through registers. This way the kernel can use N extra registers, and from what I've seen there is no limit on N. This can totally kill kernel performance by spilling registers or by forcing you to limit the number of threads to 1 or 2 (which is slooow).
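The pattern described above can be illustrated with a small sketch. It is written in Python only to show the shape of the transformation; it is not OpenCL code and not the compiler's actual output, and the function names are hypothetical. The first function computes loop-invariant values inside the loop. The second is the hoisted form, where each invariant stays live for the whole loop and so would occupy a register throughout.

```python
import math


def inside_loop(xs, a, b, c):
    """Invariants recomputed on every iteration, as written in the source."""
    total = 0.0
    for x in xs:
        k1 = math.sqrt(a)  # does not depend on x
        k2 = b * c         # does not depend on x
        total += k1 * x + k2
    return total


def hoisted(xs, a, b, c):
    """Hoisted form: k1 and k2 are computed once and stay live across the loop."""
    k1 = math.sqrt(a)
    k2 = b * c
    total = 0.0
    for x in xs:
        total += k1 * x + k2
    return total
```

Both functions return the same result. The difference is that the hoisted form trades repeated arithmetic for values held across the loop, which is where the N extra registers would come from.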

                               

                      • maximum registers per thread?
                        hazeman

                        The number of available registers depends on the number of threads ( wavefronts/warps ).

                        2 threads ( work group size 128 ) - 128 registers

                        4 threads ( work group size 256 ) - 64 registers

                        8 threads ( work group size 512 ) - 32 registers

                        Usually with 4 threads you can achieve full performance.
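The three figures above are consistent with a fixed budget of 256 registers divided evenly among the resident wavefronts. A rough sketch of that arithmetic follows. The 256 total is inferred from the three data points in this post, not taken from vendor documentation, and the function names are hypothetical.

```python
# Assumption: a budget of 256 registers shared evenly by the resident
# wavefronts. The post only gives the 2/4/8 data points.
TOTAL_REGISTERS = 256


def registers_per_thread(wavefronts: int) -> int:
    """Registers each thread may use with `wavefronts` resident wavefronts."""
    if wavefronts <= 0:
        raise ValueError("wavefronts must be positive")
    return TOTAL_REGISTERS // wavefronts


def max_wavefronts(registers_used: int) -> int:
    """Largest wavefront count that fits a kernel using `registers_used` registers."""
    if registers_used <= 0:
        raise ValueError("registers_used must be positive")
    return TOTAL_REGISTERS // registers_used


if __name__ == "__main__":
    for w in (2, 4, 8):
        print(w, "wavefronts ->", registers_per_thread(w), "registers")
    # A kernel using 54 registers, as in the original question
    print("54 registers ->", max_wavefronts(54), "wavefronts")
```

Under this model, a kernel using 54 registers fits at 4 wavefronts (64 registers available) but not at 8 (32 available).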

                          • maximum registers per thread?
                            FangQ

                             

                            Originally posted by: hazeman The number of available registers depends on the number of threads ( wavefronts/warps ).

                            2 threads ( work group size 128 ) - 128 registers

                            4 threads ( work group size 256 ) - 64 registers

                            8 threads ( work group size 512 ) - 32 registers

                            Usually with 4 threads you can achieve full performance.

                             

                            I could not set a work-group size greater than 256 with Catalyst 10.6 (on either the 4890 OC or the 5870), and CLInfo reported a maximum of 256x256x256 threads.

                            In your opinion, if I use a work-group size of 256, no spilling should happen, correct?

                        • maximum registers per thread?
                          MicahVillmow
                          FangQ,
                          There is an environment variable GPU_MAX_WORKGROUP_SIZE that you can tweak to raise that limit. However, this is not supported and you use it at your own risk.
                            • maximum registers per thread?
                              FangQ

                               

                              Originally posted by: MicahVillmow FangQ, There is an environment variable GPU_MAX_WORKGROUP_SIZE that you can tweak to raise that limit. However, this is not supported and you use it at your own risk.


                              Thank you for the tip. I will certainly play with it.