17 Replies Latest reply on Apr 7, 2012 9:13 AM by realhet

    7970 ISA Vector/Scalar instruction level parallelism

    realhet

      Hi,

       

      I'm planning to do an ISA optimization on my existing kernel, and I'm wondering if anybody has some info on how to feed the S and V ALUs to achieve maximum utilization.

      On the GCN .ppt slides I've seen that there is an instruction 'arbitrator' which feeds 4 vector ALUs and 1 scalar ALU. They also say that the S ALU works 4x faster than the V ALUs, which in total gives a 1:1 V:S ALU operation capacity.
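      If that description is right, the issue capacity works out like this (just a sketch of my understanding of the slides, not official numbers):

      ```c
      #include <stdio.h>

      int main(void) {
          /* Per the GCN slides: a CU has 4 SIMDs and 1 scalar unit. Each SIMD
           * issues one VALU op per 4-cycle window (executed over 4 cycles on
           * its 16 lanes), while the scalar unit can issue one SOP per cycle. */
          const int simds_per_cu  = 4;
          const int window_cycles = 4;

          int valu_issues = simds_per_cu;    /* 1 VALU op per SIMD per window */
          int salu_issues = window_cycles;   /* 1 SOP per cycle               */

          /* 4:4 per window, i.e. the 1:1 V:S capacity from the slides */
          printf("V:S = %d:%d\n", valu_issues, salu_issues);
          return 0;
      }
      ```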

       

      For example: what if I interleave independent V_ and S_ instructions in a pattern like SVSVSVSV?

      Will the instruction decoder in the SIMD engine be able to feed all four V ALUs every 4 cycles while also feeding the S ALU with one instruction every cycle (one for each of the 4 V ALUs)?

      What if I use larger VOP3 instructions and/or 32-bit immediate constants, so the instruction decoder has to look more dwords ahead? Are there any specifications on the capabilities/limitations of the instruction decoder/arbitration unit?

       

      Why am I doing this: I have a quite big kernel (25KB on the 7970) which works at 98% ALU utilization on the 6970, but unfortunately on the 7970 it runs out of VRegs. Above 128 VGPRs I noticed a 'task scheduler' bottleneck because it can't put 2 kernels in the queues: when a task is done, the SIMD engines can't immediately start another task. That means a 30% performance loss in my case. My plan is to use more SGPRs instead of some of the VGPRs to get below the 128-VGPR 'limit'. I can swap many calculations out to S registers, since those values are the same for every thread of the 16-wide SIMD; I just have to learn how to schedule the SOPs and VOPs effectively to achieve maximum ALU utilization.
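      For example, the kind of V->S swap I mean (hypothetical registers, just to illustrate; a wave-uniform value doesn't need a VGPR):

      ```
      ; before: every lane computes the same base offset, wasting VGPRs
      v_mul_lo_u32  v4, v10, v11        ; v10/v11 hold wave-uniform values
      v_add_i32     v5, vcc, v4, v12

      ; after: compute it once on the scalar ALU instead
      s_mul_i32     s4, s10, s11
      s_add_u32     s5, s4, s12
      ; an SGPR can be read directly as a source of most VALU instructions,
      ; so no extra v_mov is needed at the use site:
      v_add_i32     v6, vcc, s5, v20
      ```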

       

      In my worst expectations maybe there is no S/V parallelism, and in every 4 cycles either 1 VOP or 1 SOP can be executed. That way there would be no need for hardware dependency checks or special compiler instruction reordering.

       

      I also noticed the s_buffer_load_dwordx16 thing, another reason to go down below CAL.

       

      Thank you for your answers!

        • Re: 7970 ISA Vector/Scalar instruction level parallelism
          Skysnake

          Yeah, it would be nice to know more about this.

           

          Sorry for the OT question, but how big is your VALU busy value?

           

          I have ~80% ALU busy on my 5870 but only ~24% on my 7970 -.-

           

          Is there still a big benefit in using vector types? And if yes, what is the best size: 4, 8 or 16?

            • Re: 7970 ISA Vector/Scalar instruction level parallelism
              realhet

              I can't tell you sprofile values, sorry. (I'm not using VS nor OpenCL, and launching sprofile manually on my .exe just fails.)

               

              According to execution time measurements, I can utilize the V ALU at 98-99%, but only in special circumstances.

              On the 7xxx you can try these hints:

              - Unroll a big kernel (16KB ISA size is OK; max 32KB on 7xxx and max 48KB on 4xxx..6xxx)

              - Optimize your algo in a way that doesn't use more than 64 VRegs (on the 4xxx..6xxx hardware  128 registers will do fine)

              I guess that poor 24% ALU utilization can come from using too many VRegs on the 7xxx.

               

              "Is there still a big benefit in using vector types?"

              - on 4xxx..6xxx you only need to duplicate long dependency chains in order to let CalCompile feed the 4 or 5 VLIW lanes from them. If there is theoretically a way to utilize all VLIW slots, then CalCompile will find it

              - on 7xxx there is no ALU benefit from vectorizing code.

              - but of course you still need to vectorize for effective memory access.
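              For example (a hypothetical kernel, just to illustrate): the arithmetic gains nothing on 7xxx, but the float4 version moves 16 bytes per work-item in a single fetch/store:

              ```
              // scalar version: one 4-byte load/store per work-item
              __kernel void scale1(__global float *in, __global float *out, float my) {
                  int gid = get_global_id(0);
                  out[gid] = in[gid] * my;
              }

              // vectorized version: one 16-byte load/store per work-item;
              // on GCN the math still compiles to 4 separate VALU multiplies
              __kernel void scale4(__global float4 *in, __global float4 *out, float my) {
                  int gid = get_global_id(0);
                  out[gid] = in[gid] * my;
              }
              ```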

                • Re: 7970 ISA Vector/Scalar instruction level parallelism
                  Skysnake

                  realhet wrote:

                  According to execution time measurements, I can utilize the V ALU at 98-99%, but only in special circumstances.

                  On the 7xxx you can try these hints:

                  - Unroll a big kernel (16KB ISA size is OK; max 32KB on 7xxx and max 48KB on 4xxx..6xxx)

                  - Optimize your algo in a way that doesn't use more than 64 VRegs (on the 4xxx..6xxx hardware  128 registers will do fine)

                  My kernel is quite small, just under 1kB, and I use only 10 VRegs and 25 SRegs, ~40 VALU insts and ~12 SALU insts.

                   

                  realhet wrote:

                  "Is there still a big benefit in using vector types?"

                  - on 4xxx..6xxx you only need to duplicate long dependency chains in order to let CalCompile feed the 4 or 5 VLIW lanes from them. If there is theoretically a way to utilize all VLIW slots, then CalCompile will find it

                  - on 7xxx there is no ALU benefit from vectorizing code.

                  - but of course you still need to vectorize for effective memory access.

                  Exactly how I thought it should be...

                   

                  My program is a heatblade simulation, so access patterns are always in rows, one element next to the other.

                   

                  The funny part is, when I use LDS, my runtime increases. My cache hit rate goes from ~35% up to ~90%, but the runtime is 45% higher...

                   

                  Also, when I change the order of accesses to an inverted pattern (because I use a 2D array in LDS), nothing changes.

                   

                  Really strange... I hope AMD soon brings out the optimization guide for GCN.

                • Re: 7970 ISA Vector/Scalar instruction level parallelism
                  realhet

                  "I have ~80% ALU busy on my 5870 but only ~24% on my 7970 -.-"

                   

                  And what if you compare the computing performance of your algorithm to the raw teraflops/sec value of your two video cards? Does it show the same difference as the 80% versus 24% busy values?

                   

                  I mean, maybe we don't understand how the S/VALU busy measurement works compared to the 5xxx ALU busy, so what about actual vs. nominal performance on 5xxx and 7xxx?
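                  A quick sanity check of what those busy values would mean as sustained throughput, using the nominal peak rates (1600 lanes * 0.85 GHz * 2 flops/MAD for the 5870; 2048 lanes * 0.925 GHz * 2 flops/FMA for the 7970):

                  ```c
                  #include <stdio.h>

                  int main(void) {
                      double peak_5870 = 1600 * 0.850 * 2.0;   /* ~2720 GFLOP/s nominal */
                      double peak_7970 = 2048 * 0.925 * 2.0;   /* ~3789 GFLOP/s nominal */

                      /* if "busy" translated directly to throughput: */
                      printf("5870 @ 80%% busy: %.0f GFLOP/s\n", peak_5870 * 0.80);
                      printf("7970 @ 24%% busy: %.0f GFLOP/s\n", peak_7970 * 0.24);
                      return 0;
                  }
                  ```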

                    • Re: 7970 ISA Vector/Scalar instruction level parallelism
                      Skysnake

                      OK, it took some days, but now the results:

                       

                      HD5870: vanilla kernel

                       

                      __kernel void heatblade_kernel1( __global float *in, __global float *out, float my){

                          int gid0=get_global_id(0);

                          int gid1=get_global_id(1);

                       

                          int size_x=get_global_size(0);//width

                          //int size_y=get_global_size(1);//height

                       

                          int offset_X  = gid0+size_x+3;

                          int offset_X2 = size_x+2;

                          int offset_Y  = gid1*offset_X2;

                       

                          float myn=0.1;

                       

                          //We need size_x*size_y global workitems

                          out[offset_X+offset_Y]=in[offset_X+offset_Y]+myn*(

                                          in[offset_X+1+offset_Y]+

                                          in[offset_X-1+offset_Y]+

                                          in[offset_X+(gid1+1)*offset_X2]+

                                          in[offset_X+(gid1-1)*offset_X2]-

                                          4*in[offset_X+offset_Y]

                                          );

                       

                       

                      }

                       

                       


                      Method ExecutionOrder ThreadID CallIndex GlobalWorkSize WorkGroupSize Time LocalMemSize VGPRs SGPRs FCStacks ScratchRegs Wavefronts LDSFetchInsts LDSWriteInsts FetchSize CacheHit WriteUnitStalled LDSBankConflict ALUInsts FetchInsts WriteInsts ALUBusy ALUFetchRatio ALUPacking FetchUnitBusy FetchUnitStalled FastPath CompletePath PathUtilization

                      heatblade_kernel1__k1_Cypress1     1 5952 67 {   1920    1080       1} {   64     4     1}         0,17789           0     5 NA     0     0     32400,00         0,00         0,00      8123,50        46,55         0,00         0,00        15,00         5,00         1,00        65,25         3,00        56,00        88,65         1,77      8058,75         0,00       100,00

                      heatblade_kernel1__k1_Cypress1     2 5952 70 {   1920    1080       1} {   64     4     1}         0,17456           0     5 NA     0     0     32400,00         0,00         0,00      8123,50        46,75         0,02         0,00        15,00         5,00         1,00        65,33         3,00        56,00        88,74         1,72      8061,75         0,00       100,00

                       

                      kernel with LDS:

                      __kernel void heatblade_kernel3( __global float *in, __global float *out, float my){

                          int gid0=get_global_id(0);

                          int gid1=get_global_id(1);

                          int lid0=get_local_id(0);

                          int lid1=get_local_id(1);

                       

                          int size_x=get_global_size(0);//width

                          //int size_y=get_global_size(1);//height

                       

                          int offset_X  = gid0+size_x+3;

                          int offset_X2 = size_x+2;

                          int offset_Y  = gid1*offset_X2;

                       

                          __local float localBuffer[4+2][64+2];

                          float result;

                       

                          //TODO

                          float myn=0.1;

                       

                       

                          //the first elements. In every row 2 elements are left out, and the last two rows are left out completely

                          localBuffer[lid1][lid0]=in[gid1*offset_X2+gid0];

                       

                          //copy the last 2 rows

                          if(lid1==0 || lid1==1){

                              localBuffer[get_local_size(1)+lid1][lid0]=in[(gid1+get_local_size(1))*offset_X2+gid0];

                          }

                       

                          //copy now the last 2 elements in every row (not in the last 2 rows)

                          if(lid0==0 || lid0==1){

                              localBuffer[lid1][get_local_size(0)+lid0]=in[gid1*offset_X2+gid0+get_local_size(0)];

                          }

                       

                          //copy the last 4 elements

                          if((lid0==0 || lid0==1) && (lid1==0 || lid1==1)){

                              localBuffer[get_local_size(1)+lid1][get_local_size(0)+lid0]=in[(gid1+get_local_size(1))*offset_X2+gid0+get_local_size(0)];

                          }

                       

                          barrier(CLK_LOCAL_MEM_FENCE);

                       

                          //calculate the new value

                          result=localBuffer[lid1+1][lid0+1]+myn*(

                                          localBuffer[lid1+1][lid0+2]+

                                          localBuffer[lid1+1][lid0]+

                                          localBuffer[lid1+2][lid0+1]+

                                          localBuffer[lid1][lid0+1]-

                                          4*localBuffer[lid1+1][lid0+1]

                                          );

                       

                          //copy the result to the output

                          out[offset_X+offset_Y]=result;

                       

                       

                      }


                      Method ExecutionOrder ThreadID CallIndex GlobalWorkSize WorkGroupSize Time LocalMemSize VGPRs SGPRs FCStacks ScratchRegs Wavefronts LDSFetchInsts LDSWriteInsts FetchSize CacheHit WriteUnitStalled LDSBankConflict ALUInsts FetchInsts WriteInsts ALUBusy ALUFetchRatio ALUPacking FetchUnitBusy FetchUnitStalled FastPath CompletePath PathUtilization

                      heatblade_kernel3__k1_Cypress1     1 4204 67 {   1920    1080       1} {   64     4     1}         0,33744        1664     5 NA     0     1     32400,00         3,00         3,00      8123,50        47,72         0,00         0,00        42,00         3,00         1,00        95,64        14,00        33,10        17,08         0,00      8058,63         0,00       100,00

                      heatblade_kernel3__k1_Cypress1     2 4204 70 {   1920    1080       1} {   64     4     1}         0,33478        1664     5 NA     0     1     32400,00         3,00         3,00      8123,50        47,74         0,00         0,00        42,00         3,00         1,00        95,60        14,00        33,10        17,07         0,03      8057,75         0,00       100,00

                       

                       

                      HD7970 vanilla Kernel:


                      Method ExecutionOrder ThreadID CallIndex GlobalWorkSize WorkGroupSize Time LocalMemSize VGPRs SGPRs ScratchRegs FCStacks Wavefronts LDSFetchInsts LDSWriteInsts FetchSize CacheHit WriteUnitStalled LDSBankConflict VALUInsts SALUInsts VFetchInsts SFetchInsts VWriteInsts VALUUtilization VALUBusy MemUnitBusy MemUnitStalled WriteSize GDSInsts

                      heatblade_kernel1__k1_Tahiti1     1 5828 67 {   1920    1080       1} {   64     4     1}         0,09407           0     6    16     0 NA     32400,00         0,00         0,00      8124,13        35,88         9,51         0,00        27,00         7,00         5,00         7,00         1,00       100,00        22,36        68,77         4,95      7893,56         0,00

                      heatblade_kernel1__k1_Tahiti1     2 5828 70 {   1920    1080       1} {   64     4     1}         0,08726           0     6    16     0 NA     32400,00         0,00         0,00      8124,06        35,88        10,33         0,00        27,00         7,00         5,00         7,00         1,00       100,00        22,43        68,88         5,31      7898,13         0,00

                       

                      LDS kernel:


                      Method ExecutionOrder ThreadID CallIndex GlobalWorkSize WorkGroupSize Time LocalMemSize VGPRs SGPRs ScratchRegs FCStacks Wavefronts LDSFetchInsts LDSWriteInsts FetchSize CacheHit WriteUnitStalled LDSBankConflict VALUInsts SALUInsts VFetchInsts SFetchInsts VWriteInsts VALUUtilization VALUBusy MemUnitBusy MemUnitStalled WriteSize GDSInsts

                      heatblade_kernel3__k1_Tahiti1     1 4852 67 {   1920    1080       1} {   64     4     1}         0,13585        1664    10    24     0 NA     32400,00         0,00         6,00      8124,31        35,88         7,90         0,00        41,50        17,00         3,00         7,00         1,00        73,74        23,02        66,15         1,53      7874,56         0,00

                      heatblade_kernel3__k1_Tahiti1     2 4852 70 {   1920    1080       1} {   64     4     1}         0,12770        1664    10    24     0 NA     32400,00         0,00         6,00      8124,25        35,88         7,31         0,00        41,50        17,00         3,00         7,00         1,00        73,74        22,76        66,50         1,39      7879,59         0,00

                       

                      Yes, I know I could increase the amount of work per thread, but I wanted to see whether it is possible to utilize Tahiti without vector datatypes or more work items per thread, since that would make it much easier to utilize the GPU above 50%.

                       

                      And yes, I know that I could reuse the data in LDS more often if I calculated a complete row per workgroup.

                  • Re: 7970 ISA Vector/Scalar instruction level parallelism
                    realhet

                    But please let's get back to the topic.

                     

                    So far I've found out that the V and S instructions can be interleaved in a 1:1 ratio and they will execute simultaneously.

                     

                    For example, repeating this over and over:

                      s_xor_b64 s0, s1, s2

                      v_xor_b32 v0, v1, v2

                    eats only 4 cycles, not 8.

                     

                    Another one:

                      s_mul_u32 s0, s1, s2

                      v_mad_u24_u32 v0, v1, v2,v3

                    also runs in parallel (in this case there is an instruction decoder bottleneck because of the two 64-bit instructions, so the S:V ratio can be at most 3:4)

                     

                    With this I was able to get an additional 88 Gops of mul32 performance out of the 7970.

                     

                    Now this raises a question: is there a hardware mechanism that ensures that no S and V instructions are executed in parallel when they depend on each other? Or is it the compiler's responsibility never to send register-dependent S/V code to the GPU, like in the example below?

                     

                      s_xor_b64 s0(!!!!), s1, s2

                      v_xor_b32 v0, s0(!!!!), v2   ;does the instruction decoder delay the V instruction until the result is calculated by the S instruction? Or do both execute, with the V instruction using the previous s0 register value?

                     

                    (My opinion is that there is no extra hardware for this (so more transistors can do math) and the compiler must take care of these situations, but I'm not sure.)

                     

                    If anyone has accurate info on this, please tell me!

                      • Re: 7970 ISA Vector/Scalar instruction level parallelism
                        jeff_golds

                        The compiler will handle the dependencies by either scheduling work between the dependent operations or inserting a wait command if needed.

                        • Re: 7970 ISA Vector/Scalar instruction level parallelism
                          LeeHowes

                          My understanding is that the hardware does not schedule an s and v simultaneously from the same wavefront, only from two separate waves. So there cannot be any dependency between the two, and the latency of that code will be as it appears but the throughput would be double as you suggest.

                            • Re: 7970 ISA Vector/Scalar instruction level parallelism
                              realhet

                              That must be it!

                              Your answer also explains why I measured such slow times when testing a big kernel filled with svsvsvsv instructions while using more than 128 vector registers: there was only one wave, and no S+V parallel execution at all.

                               

                              Thank you all for answers!

                                • Re: 7970 ISA Vector/Scalar instruction level parallelism
                                  jeff_golds

                                  Even with more than 128 vector registers, you can still schedule one wavefront per SIMD, i.e. four per CU.

                                    • Re: 7970 ISA Vector/Scalar instruction level parallelism
                                      realhet

                                      Let's say 256 VRegs/wave -> 4 bytes * 256 regs * 64 wavesize * 4 SIMDs/CU = 256KB, which is exactly the amount of V register RAM in a CU. Exactly like in http://developer.amd.com/afds/assets/presentations/2620_final.pdf
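                                      That arithmetic, spelled out:

                                      ```c
                                      #include <assert.h>
                                      #include <stdio.h>

                                      int main(void) {
                                          /* 4 bytes/reg * 256 regs/lane * 64 lanes/wave * 4 SIMDs/CU */
                                          long vgpr_bytes = 4L * 256 * 64 * 4;
                                          assert(vgpr_bytes == 256 * 1024);   /* exactly 256 KB per CU */
                                          printf("%ld KB of VGPR per CU\n", vgpr_bytes / 1024);
                                          return 0;
                                      }
                                      ```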

                                       

                                      But I'm pretty sure that it's not good to use more than 128, and the best performance can be achieved when not using more than 64 VRegs. I just don't know why. I'll do some simple tests tomorrow to point out this behaviour.

                                        • Re: 7970 ISA Vector/Scalar instruction level parallelism
                                          jeff_golds

                                          If you only have 4 wavefronts per CU, then that will limit how much latency you can hide.  Of course, if your kernel is completely ALU-bound, then that won't matter.  Just note that latency comes in different flavors based on the type of operation.  For example, global memory fetches have a different class of latency than local memory (LDS) fetches.

                                           

                                          Also, don't forget about instruction cache latency.  If you have a long, unrolled kernel and only 4 waves per CU, then all the waves will feel the pain of filling the instruction cache.  It's better to use loops so that you get instruction cache hits.

                                            • Re: 7970 ISA Vector/Scalar instruction level parallelism
                                              realhet

                                              Hello again,

                                               

                                              I'm only testing raw computing power, no lds/gds/uav at all, and my tests fit well in all of the caches (instruction and data).

                                               

                                              I've run some tests:

                                              The test kernel was this in ISA:

                                               

                                                s_mov_b32 s10,24

                                                label_1:

                                               

                                                   ;128 times repeated(unrolled) [inner_code]

                                                s_add_u32 s10,s10,-1

                                                s_cbranch_scc1 label_1   ;break on borrow, so it's a 25x loop

                                                s_endpgm

                                               

                                               

                                              The inner_code was basically 4 V instructions interleaved with 0..3 S instructions.

                                              I tried a mixture of 32-bit and 64-bit S/V instructions. A capital letter V or S means 64-bit, and lowercase letters mean 32-bit.

                                              For example: VSVsvv stands for

                                                v_mad_i32_i24 v0,v0,v1,v2     ;big 3 operand V instr

                                                s_mul_i32 s0,12345,s1         ;big 2 operand S instr with 32bit immed

                                                v_mad_i32_i24 v0,v0,v1,v2

                                                s_mul_i32 s0,s0,s1              ;small 2 operand S instr

                                                v_xor_b32 v0,v0,v1                ;small 2 operand V instr

                                                v_xor_b32 v0,v0,v1
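                                              A little helper to count the code dwords of such a pattern (my assumption: capitals are 2-dword encodings, i.e. VOP3 or with a 32-bit literal; lowercase are 1-dword):

                                              ```c
                                              #include <assert.h>
                                              #include <stdio.h>

                                              /* dwords of ISA code for a pattern string like "VSVsvv" */
                                              int pattern_dwords(const char *p) {
                                                  int dw = 0;
                                                  for (; *p; ++p)
                                                      dw += (*p == 'V' || *p == 'S') ? 2 : 1;   /* capital = 64-bit */
                                                  return dw;
                                              }

                                              int main(void) {
                                                  assert(pattern_dwords("VSVsvv") == 9);   /* 2+2+2+1+1+1 */
                                                  printf("VSVsvv needs %d dwords\n", pattern_dwords("VSVsvv"));
                                                  return 0;
                                              }
                                              ```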

                                               

                                              The kernel was issued for 40 000 000 workitems (64 workitems/workgroup) and the second time measurement was taken.

                                               

                                              Here are the diagrams:

                                              http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

                                              http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

                                              Also the raw data in excel:

                                              http://x.pgy.hu/~worm/het/7970_isa_test/7970_isa_speedtest.xls

                                               

                                              My conclusions:

                                              One simply CANNOT interleave 4 S instructions with 4 V instructions without slowing down the V performance by at least 1-2%.

                                              Also, sending more code dwords to the S ALU causes a slowdown at higher numVGPRs kernel settings.

                                              It can be partitioned into 4 parts:

                                              16..64 numVGPRs -> excellent V and S parallelism (my guess is this is when only 4 waves are in the CU)

                                              65..84 numVGPRs -> this starts to suffer when it gets many S instruction dwords (I think 3 waves can sit in a CU)

                                              85..128 numVGPRs -> 2 small S for 4 V is OK, but it starts to get slow (2 waves)

                                              129..255 numVGPRs -> try to avoid ALL S instructions if possible, or else there will be terrible stalls (1 wave)

                                               

                                              If there is a 4-stage pipeline in there, then multiply my waves-in-the-CU estimates by 4, of course.
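                                              My waves-per-CU guesses above happen to match a simple register-budget formula (assumptions on my part: 256 physical VGPRs per SIMD, allocation rounded up to a multiple of 4 regs, a hardware cap of 10 waves per SIMD; the 16..64 region is presumably capped by something else in my test):

                                              ```c
                                              #include <assert.h>
                                              #include <stdio.h>

                                              /* waves per SIMD limited by VGPR pressure alone */
                                              int waves_per_simd(int vgprs) {
                                                  int alloc = (vgprs + 3) / 4 * 4;   /* round up to granularity 4 */
                                                  int waves = 256 / alloc;
                                                  return waves > 10 ? 10 : waves;    /* hardware cap */
                                              }

                                              int main(void) {
                                                  assert(waves_per_simd(64)  == 4);
                                                  assert(waves_per_simd(65)  == 3);  /* 65..84   -> 3 waves */
                                                  assert(waves_per_simd(84)  == 3);
                                                  assert(waves_per_simd(85)  == 2);  /* 85..128  -> 2 waves */
                                                  assert(waves_per_simd(128) == 2);
                                                  assert(waves_per_simd(129) == 1);  /* 129..255 -> 1 wave  */
                                                  printf("ok\n");
                                                  return 0;
                                              }
                                              ```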

                                               

                                              But I think all this S instruction 'sensitivity' comes from the complicated instruction decoding/scheduling mechanism. I'd love to know more about it, if it's not a secret.

                                              I don't even want to think about what rules apply for the v_add instruction when it reads and writes from/to both S and V registers.

                                               

                                              Anyway, it's really a fun puzzle to figure these things out. (I still have so many questions to ask lol, I just don't wanna flood the forums.)