24 Replies Latest reply on Aug 27, 2011 9:49 AM by diapolo

    ALUBusy - an easy way to raise it? + Vec3 problems

    diapolo

      I'm working on a Bitcoin-Mining kernel and took a look at the output of AMD APP Profiler. I saw that the value of ALUBusy is only at ~68% and I guess the goal should be a higher number.

      What causes ALUBusy to get higher in general? Any hints?

      Dia

        • ALUBusy - an easy way to raise it?
          gat3way

          I happen to have written a bitcoin kernel too :)

          Basically it is an extremely ALU-bound kernel and you should not be having global or local memory reads at all (just global memory write at the end). Thus, easiest thing you could do is vectorize more. If you are using say uint2 (as most bitcoin kernels do), try uint4 or uint3 (uint3 was broken with SDK2.4 and generated bad ISA though - it is also opencl1.1 feature probably unsupported by some earlier SDKs).

          Increasing vector size, you provide more ALU operations that have no dependencies so that they can fit in a VLIW pipeline. Another thing is that some ALU ops can't operate on t unit (like bitalign and bfi_int) so that larger vectors == more chance to have enough alu operations to fill the x,y,z,w,t units and get closer to 100% ALUBusy.

          That said, bfi_int patching in general replaces a couple of instructions that can operate on t unit with a single one that does not work on t unit. To achieve better ALUPacking, you may need to reorder some stuff in your round function. I can give you no specific advice on that - just experiment, and profile, use GPU_DUMP_DEVICE_KERNEL=3, look at ISA dumps until you find the sweet spot.

          Also keep in mind that increasing vector size after a certain threshold does not help, just the opposite. That's because it involves more GPRs and makes the kernel bigger. Number of used GPRs limit the number of wavefront thus the device utilization. Larger kernels are slower because they don't fit in GPU instruction caches and I've heard that the OpenCL compiler gives up some register allocation optimizations once kernels get too big (though it might not be true).

          In my case, I found the sweet spot between uint2 and uint4 (and like I said, uint3 was broken). Thus I interlaced one uint2 sha256 operation and one uint sha256 operation and that ended to be fastest for VLIW5 hardware. On 69xx, using just uint2 was faster.

            • ALUBusy - an easy way to raise it?
              genaganna

              1. Make sure you have more more wavefronts per group

              2. Other things as per gat3way

              a. Vectorization

                      b. Avoid dependent statements if possible.

                • ALUBusy - an easy way to raise it?
                  diapolo

                   

                  Originally posted by: genaganna 1. Make sure you have more more wavefronts per group

                   

                  2. Other things as per gat3way

                   

                  a. Vectorization

                   

                          b. Avoid dependent statements if possible.

                   



                   

                  So it could be faster to have a value 2-times in different variables, if this makes the following comands independent?

                  Dia

                • ALUBusy - an easy way to raise it?
                  diapolo

                  Are 3 component vectors working with SDK 2.5 without the use of AMD_vec3 extension? I guess I will try this ... reorderning seems to help, but for me that's kind of trial and error, because I have no deep understanding of IL or ASM code :-/. What's your ALU OP usage and GPR usage for vec3?

                  How can I have more Wavefronts per group? Didn't get that statement ...

                  Thanks,

                  Dia

                    • ALUBusy - an easy way to raise it?
                      genaganna

                       

                      Originally posted by: diapolo Are 3 component vectors working with SDK 2.5 without the use of AMD_vec3 extension? I guess I will try this ... reorderning seems to help, but for me that's kind of trial and error, because I have no deep understanding of IL or ASM code :-/. What's your ALU OP usage and GPR usage for vec3?

                       

                      As per OpenCL 1.1 spec, 3 component vectors are in core spec means no need to use extensions.

                      3 component vectors should work in SDK2.5. I am not what was the problem with SDK2.4. As per my understanding, it should work in SDK2.4.

                      It is always recommanded to use vec4 instead of vec3 becuase more ALU untilization and less over head in initailzation of vec4 data.

                       

                      How can I have more Wavefronts per group? Didn't get that statement ...

                       

                      Make sure your work group size as big as possible and less than maximum allowed work group size as per device. i.e 256 for GPU in general.

                        • ALUBusy - an easy way to raise it?
                          gat3way

                          Well it generally works in 2.4 as I have used it other times. This time however, with that particular kernel (the bitcoin one) switching to uint3 from uint4 caused the runtime to crash. The kernel compiled fine with no errors - at least clBuildProgram() returned CL_SUCCESS. Then some time after that it crashes. I checked the ISA dump to find some error about used port (?!?) and with this error, isa dump ended.  Switching back to uint4 or uint2 produced valid kernel binary. I noticed the same thing with uint3 with one more kernel too. I don't know what causes it though.

                            • ALUBusy - an easy way to raise it?
                              genaganna

                               

                              Originally posted by: gat3way Well it generally works in 2.4 as I have used it other times. This time however, with that particular kernel (the bitcoin one) switching to uint3 from uint4 caused the runtime to crash. The kernel compiled fine with no errors - at least clBuildProgram() returned CL_SUCCESS. Then some time after that it crashes. I checked the ISA dump to find some error about used port (?!?) and with this error, isa dump ended.  Switching back to uint4 or uint2 produced valid kernel binary. I noticed the same thing with uint3 with one more kernel too. I don't know what causes it though.


                              Please send simplified code for us so that it will be fixed in future releases.

                              If you don't want to copy code here please file at ticket at

                              http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?Category=8

                                • ALUBusy - an easy way to raise it?
                                  diapolo

                                  I tried to implement 3-component vectors, but this just crashes ... could well be my fault, but is it working for someone in SDK 2.5?

                                  This is with Cat 11.7 / Win7 x64 and 2- and 4-component vectors work well, no errors or crashes. Tested with 5870 and 6550D! APP KernelAnalyzer doesn't show any errors, too. APP Profiler doesn't give anything useful because of the crash if uint3 is used!

                                  The ALUpacking value is round about 98,5%.

                                  Global Work-size: 134217728

                                  Work-Group-size: 256

                                  Dia

                                    • ALUBusy - an easy way to raise it?
                                      diapolo

                                      To the Vec3 thing, KernelAnalyzer outputs an error message directly in the Asembly tab: "Error: Another scalar op (gpr 6) has already used GPR read port 0 for chan 1 (gpr 127)".

                                      Any ideas?

                                      Dia

                                        • ALUBusy - an easy way to raise it?
                                          gat3way

                                          Looks like you had exactly the same problem as me with uint3. Is that using 2.5?

                                          As for __constant: do not copy anything. Just initialize a __private variable with the needed value.

                                           

                                          [quote]

                                          The ALUpacking value is round about 98,5%.

                                          Global Work-size: 134217728

                                          Work-Group-size: 256

                                          [/quote]

                                           

                                          By the way, I just tried to compile my bitcoin kernel using the offline devices extension. When I compile from source and do clBuildProgram() it is OK. The precompiled kernel for Barts does not execute though. Profiling it with sprofile displays the same "strange" global work size. I have an error about jump to non-existant address or something like that in the ISA dump.

                                            • ALUBusy - an easy way to raise it?
                                              diapolo

                                              Yes, this is for SDK 2.5 and it seem's to be some kind of bug. I opened a developer ticked, but got no response till now.

                                              If I get new informations, I will post here!

                                              Edit: Using __private instead of __constant leads only to 1 more GPR used for 58XX cards. No improvement in KernelAnalyzer.

                                              Dia

                                          • ALUBusy - an easy way to raise it?
                                            genaganna

                                             

                                            Originally posted by: diapolo I tried to implement 3-component vectors, but this just crashes ... could well be my fault, but is it working for someone in SDK 2.5?

                                            This is with Cat 11.7 / Win7 x64 and 2- and 4-component vectors work well, no errors or crashes. Tested with 5870 and 6550D! APP KernelAnalyzer doesn't show any errors, too. APP Profiler doesn't give anything useful because of the crash if uint3 is used!

                                            The ALUpacking value is round about 98,5%.

                                            Global Work-size: 134217728

                                            Work-Group-size: 256

                                            Is it crashing in Runtime code or kernel code?  Make sure alignment are proper for vec3 as it is required memory exactly same as vec4.

                                              • ALUBusy - an easy way to raise it?
                                                diapolo

                                                 

                                                Originally posted by: genaganna
                                                Originally posted by: diapolo I tried to implement 3-component vectors, but this just crashes ... could well be my fault, but is it working for someone in SDK 2.5?

                                                 

                                                This is with Cat 11.7 / Win7 x64 and 2- and 4-component vectors work well, no errors or crashes. Tested with 5870 and 6550D! APP KernelAnalyzer doesn't show any errors, too. APP Profiler doesn't give anything useful because of the crash if uint3 is used!

                                                 

                                                The ALUpacking value is round about 98,5%.

                                                 

                                                Global Work-size: 134217728

                                                 

                                                Work-Group-size: 256

                                                 

                                                 

                                                Is it crashing in Runtime code or kernel code?  Make sure alignment are proper for vec3 as it is required memory exactly same as vec4.

                                                 

                                                As I said in the KernelAnalyzer the right tab holds the object code, which is asembly for a specific device for which the OpenCL kernel is compiled. In this window I get the above mentioned error message (Error: Another scalar op (gpr 6) has already used GPR read port 0 for chan 1 (gpr 127)), if I try to use vec3 in my kernel.

                                                The generated object code is way to small, so it's clear now, why the application, which uses the kernel, crashes with Vec3 enabled.

                                                What do you mean the memory layout has to be the same as for vec3? There is 1 kernel parameter, which is uint3 for the vec3 version and uint4 for the vec4 version. They are filled with 0, 1, 2, 0, 1, 2 ... for vec3 and 0, 1, 2, 3, 0, 1, 2, 3 ... for vec4.

                                                Dia

                                                  • ALUBusy - an easy way to raise it?
                                                    genaganna

                                                     

                                                    Originally posted by: diapolo
                                                    Originally posted by:  

                                                     

                                                    As I said in the KernelAnalyzer the right tab holds the object code, which is asembly for a specific device for which the OpenCL kernel is compiled. In this window I get the above mentioned error message (Error: Another scalar op (gpr 6) has already used GPR read port 0 for chan 1 (gpr 127)), if I try to use vec3 in my kernel.

                                                    Is it possible for to copy kernel code here?

                                                     

                                                    What do you mean the memory layout has to be the same as for vec3? There is 1 kernel parameter, which is uint3 for the vec3 version and uint4 for the vec4 version. They are filled with 0, 1, 2, 0, 1, 2 ... for vec3 and 0, 1, 2, 3, 0, 1, 2, 3 ... for vec4.

                                                     

                                                    I am taking about alignment in runtime code. In kernel, compiler handles these properly.

                                                      • ALUBusy - an easy way to raise it?
                                                        diapolo

                                                        @genaganna:

                                                        I prefer to not post the kernel in here, but I could send it to you for a review. Just leave an E-Mail address.

                                                        The used Vec3 kernel parameter is an array, which consists of uints in the runtime, which works for vec2 and vec4 :-/.

                                                        Isn't any developer from AMD here, who can comment on the error I get in the KernelAnalyzer?

                                                        Dia

                                                          • ALUBusy - an easy way to raise it?
                                                            genaganna

                                                             

                                                            Originally posted by: diapolo @genaganna:

                                                             

                                                            I prefer to not post the kernel in here, but I could send it to you for a review. Just leave an E-Mail address.

                                                             

                                                            The used Vec3 kernel parameter is an array, which consists of uints in the runtime, which works for vec2 and vec4 :-/.

                                                             

                                                            Isn't any developer from AMD here, who can comment on the error I get in the KernelAnalyzer?

                                                             

                                                            I am from AMD.  It looks like you filed a developer ticket and did not select appropriate field.  I did not get an email for that ticket.  Please paste ticket number here.

                                                              • ALUBusy - an easy way to raise it? + Vec3 problems
                                                                diapolo

                                                                 

                                                                Originally posted by: genaganna
                                                                Originally posted by: diapolo @genaganna:

                                                                I prefer to not post the kernel in here, but I could send it to you for a review. Just leave an E-Mail address.

                                                                The used Vec3 kernel parameter is an array, which consists of uints in the runtime, which works for vec2 and vec4 :-/.

                                                                Isn't any developer from AMD here, who can comment on the error I get in the KernelAnalyzer?



                                                                I am from AMD.  It looks like you filed a developer ticket and did not select appropriate field.  I did not get an email for that ticket.  Please paste ticket number here.



                                                                Indeed, I filed a ticked, but didn't get any ID via e-mail :-/. Can I create a new one, with just a link to this thread?

                                                                Edit: Ticket ID 1481

                                                                Dia

                                                                  • ALUBusy - an easy way to raise it? + Vec3 problems
                                                                    genaganna

                                                                     

                                                                    Originally posted by: diapolo

                                                                     

                                                                     

                                                                    Indeed, I filed a ticked, but didn't get any ID via e-mail :-/. Can I create a new one, with just a link to this thread?

                                                                     

                                                                    Edit: Ticket ID 1481



                                                                    We are able to reproduce the issue. Issue is reported to developers.

                                                                    I have few questions.

                                                                    1. Where it is crashing when you run? Is it crashing in clEnqueueNDRangeKernel or somewhere else? Please tell us at which API, your code is crashing.

                                                                            2. What happens when you run on CPU device?