16 Replies Latest reply on Dec 7, 2012 7:26 AM by realhet

    Small temporary arrays in OpenCL

    realhet

      Hi,

       

      Does OpenCL take advantage of the following techniques when using small local arrays?

      - On VLIW -> indexed_temp_arrays (x0[n]) (aka. R55[A0.x] indirect register addressing in ISA)

      - On GCN -> v_movrel_b32 instruction

       

      Or if OpenCL always uses LDS memory for local arrays, is there an extension to enable those faster techniques?

       

      Thanks in advance.

        • Re: Small temporary arrays in OpenCL
          binying

          To find out the answer, I think you can write a simple kernel that uses a small local array, and compile it with the Kernel Analyzer. Then check the result in the output window...
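          For example, a kernel along these lines (just a sketch; all names are made up) is enough to see in the ISA whether the array lands in indexed registers or in scratch:

            // Small per-work-item array with an index that is unknown at
            // compile time, so the compiler must either use indexed register
            // addressing or spill the array somewhere slower.
            __kernel void small_array_test(__global const int *in,
                                           __global int *out)
            {
                int gid = get_global_id(0);
                int tmp[16];
                for (int i = 0; i < 16; i++)
                    tmp[i] = in[gid * 16 + i];
                out[gid] = tmp[in[gid] & 15];   // runtime index
            }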

            • Re: Small temporary arrays in OpenCL
              realhet

              I'm not that lazy, but all I have right now is an HD4850, and on that OpenCL is terribly beta-ish.

              So now I need 160 dwords of this kind of fast 'memory' for my project (implementing it with amd_il + indexed_temp_array), and I just wonder whether OpenCL can do it.

              I know that for GCN I'll have to use some hybrid LDS + register-array scheme to stay inside the 128-vreg (dword) limit. But on VLIW this register-array thing is just awesome (the limit there is 128 GPRs × 4 dwords = 512 dwords).

            • Re: Small temporary arrays in OpenCL
              hazeman

              I've tested this feature on a 58xx card. The OpenCL compiler generates indexed_temp_array in the IL.

              The problem is what goes on in the IL compiler. Almost randomly (it depends slightly on the size of the array and on whether you use parts (.x, .y, ...) of the indexed 4-vector), it will either use the A indexing register or use scratch memory (painfully slow) to implement the array.

              Unfortunately, in my kernel I couldn't trick the IL compiler into reliably using A-register indexing, and I had to change the kernel design so I wouldn't use this feature.

                • Re: Small temporary arrays in OpenCL
                  realhet

                  Hi,

                  That's cool that OCL can use A0 index.

                   I've played with it a little and found out that on the HD4850 it will always use A0 indexing when the total NumGPRs <= 118. If NumGPRs would go above 118 with the array included, it uses scratch instead. And I only used the .x part; I think it doesn't check which parts we use, as it only addresses 128-bit array elements. Maybe your kernel is around that NumGPRs limit.

                   It's approx. 400 instantly accessible dwords (118 GPRs × 4 dwords = 472, minus whatever the rest of the kernel needs)...

                   But on GCN I think we can address only ~100 dwords (depending on other vreg usage) without running out of 128 vregs. I need 160 dwords total, and that fits into neither LDS (160 dwords × 4 bytes × 64 lanes = 40 KB for a wavefront) nor vregs. I'm afraid I'll have to mix the two if I want to avoid slow memory access.
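                   Something like this hybrid split is what I have in mind, as an OpenCL C sketch (the 96/64 split point and all names are invented):

                     // 160 dwords per work-item: the hottest 96 in a private
                     // (register) array, the remaining 64 in LDS.
                     __kernel void hybrid(__global const int *in, __global int *out,
                                          __local int *lds)     // 64 ints per work-item
                     {
                         int gid = get_global_id(0);
                         int lid = get_local_id(0);
                         int regs[96];
                         for (int i = 0; i < 96; i++)
                             regs[i] = in[gid * 160 + i];
                         for (int i = 0; i < 64; i++)
                             lds[lid * 64 + i] = in[gid * 160 + 96 + i];
                         int k = (int)((uint)in[gid] % 160u);   // dynamic index, 0..159
                         int v = (k < 96) ? regs[k]
                                          : lds[lid * 64 + (k - 96)];
                         out[gid] = v;
                     }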

                • Re: Small temporary arrays in OpenCL
                  drallan

                   GCN can use up to 256 vgprs/thread while still running 4 waves per CU (one per SIMD) to keep the SIMDs busy.

                  The maximum vgpr array size then depends on what other vgprs the compiler needs to use.

                   In one case, I saw the compiler waste 74 vgprs on temporaries by loading blocks of data before writing them to the array.

                  Even here, int array[160] was not a problem.
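                   (For the arithmetic, assuming the usual GCN figures of a 64 KB vgpr file per SIMD and 4 SIMDs per CU:)

                     256 vgprs × 64 lanes × 4 bytes = 64 KB = one SIMD's whole register file,
                     so one 256-vgpr wave per SIMD × 4 SIMDs = 4 waves per CU.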

                   

                   But beware the devil. When the array indices are not known at compile time, both GCN and VLIW access the registers serially, one thread at a time.

                   

                   GCN scans the 'lanes' looking for a thread's index. It then reads/writes all threads with the same index in parallel using v_movreld/s, and repeats until all threads are processed. Worst case, all 64 indices in a wave are different = 64 read/write loops (yes, branching too). Best case, all indices are the same and there is only one read/write. (Actually that's pretty cool.) Although VLIW uses the A0 register, it also does something similar to serially access different indices.
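                   A minimal OpenCL illustration of the two extremes (a sketch; the uniform_idx parameter is invented):

                     // Best vs. worst case for movrel-style register indexing.
                     __kernel void index_divergence(__global const int *in,
                                                    __global int *out,
                                                    int uniform_idx)
                     {
                         int gid = get_global_id(0);
                         int arr[16];
                         for (int i = 0; i < 16; i++)
                             arr[i] = in[gid * 16 + i];
                         // Best case: the index is the same for the whole wave,
                         // so one read serves all 64 lanes.
                         int fast = arr[uniform_idx & 15];
                         // Worst case: a different index per lane, so the access
                         // is serialized, up to 64 read/write steps per wave.
                         int slow = arr[(gid * 7) & 15];
                         out[gid] = fast + slow;
                     }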

                   

                   LDS might be faster, but there's not enough of it.

                    • Re: Small temporary arrays in OpenCL
                      realhet

                       (Unfortunately I can't accept two answers, though each of them proved one part of my question (VLIW & GCN).)

                       

                      "But beware the devil."

                       Where the h3ll did I get the idea that v_movreld and A0 can access all 64 lanes of register memory individually in ONE CYCLE?! That would be so many wires and transistors just for these rare instructions. Thx for opening my eyes, haha!

                       

                       Btw my case, of course, is that every lane accesses different regs.

                       

                      "GCN can use up to 256 vgprs/thread with 4 waves per CU for full occupancy."

                       That's only true when the instruction stream is not too dense. Pls take a look at these charts:

                      http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

                      http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

                       If your GCN code [uses a few S instructions and also some big 64-bit instructions] AND [uses more than 128 regs], your kernel can end up running at half of its estimated ideal performance. That's why I try to avoid 128+ regs (right now I can't) and aim for under 84 or even 64.

                        • Re: Small temporary arrays in OpenCL
                          drallan

                          "But beware the devil."

                           Where the h3ll did I get the idea that v_movreld and A0 can access all 64 lanes of register memory individually in ONE CYCLE?! That would be so many wires and transistors just for these rare instructions. Thx for opening my eyes, haha!

                           

                           I know your eyes are wide open, but some might not realize how the compiler implements C-style arrays in a GPU environment.

                          I was a bit surprised when I first saw it.

                           

                          I look forward to your solution, it's a tough problem!

                            • Re: Small temporary arrays in OpenCL
                              realhet

                              My actual struggle in a picture ->

                              indexed_temp_array.JPG

                               7 clocks instead of 1. This seemed like an easy 2-3x boost for my prog, but ouch.

                               And it's not just the 4xxx; I've noticed it on the 6xxx too.

                               

                              With A0 the exact same thing is around 10% slower:

                                 ushr r999.x, dwIdx, 2                            // element index = dwIdx / 4
                                 iand r999.z, dwIdx, 1                            // bit 0 of dwIdx (odd/even component)
                                 iand r999.w, dwIdx, 2                            // bit 1 of dwIdx (which half of the 4-vector)
                                 mov  r998  , x0[r999.x]                          // read the whole 128-bit element via the A0 index register
                                 cmov_logical r999.xy, r999.ww, r998.zw, r998.xy  // pick the .zw or .xy half according to bit 1
                                 cmov_logical res, r999.z, r999.y, r999.x         // pick the odd or even component according to bit 0

                               

                               Another discovery: when I compared the above x0[] dword access using a uniform index (across the wavefront) against cb0 access (done the same way), cb0 was faster. (It used a VTEX clause, but was still slightly faster than A0.)

                              • Re: Small temporary arrays in OpenCL
                                realhet

                                Finally I had the chance to do some experiments on a 7970:

                                 

                                 - v_movrels_b32 does nothing with the contents of its source operand; it only uses the operand's register index, so all lanes read from the same register. Maybe A0 indexing can access different regs per lane, but now I'm sure that movrel can't.

                                 - ds_read2 is pretty effective (with different addresses for all lanes)! I interleaved it with 10-12 vector instructions and all the latency was hidden. (Make sure to set up the M0 register before using DS_ stuff! I wasted like an hour on this lol.) There's an OpenCL-level sketch of this after the list.

                                 - The amd_il compiler can't deal with indexed arrays effectively: it always shuffles the contents of the indexed array around with unoptimized movs before using them. (x0[const1] += x0[const2] uses 3 movs and an add.)
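                                 In OpenCL C terms, the LDS pattern that worked well is roughly this (a sketch; it assumes a power-of-two work-group size, and whether the compiler really emits ds_read2 for the paired reads is up to it):

                                   // Per-lane dynamic indexing into LDS; consecutive-pair
                                   // reads like lds[2k] / lds[2k+1] are the pattern the
                                   // two-address LDS read instruction is made for.
                                   __kernel void lds_pairs(__global const int *in,
                                                           __global int *out,
                                                           __local int *lds)  // 2 ints per work-item
                                   {
                                       int lid = get_local_id(0);
                                       int gid = get_global_id(0);
                                       lds[lid * 2]     = in[gid * 2];
                                       lds[lid * 2 + 1] = in[gid * 2 + 1];
                                       barrier(CLK_LOCAL_MEM_FENCE);
                                       int k = in[gid] & (int)(get_local_size(0) - 1); // differs per lane
                                       out[gid] = lds[k * 2] + lds[k * 2 + 1];         // paired LDS read
                                   }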

                                  • Re: Small temporary arrays in OpenCL
                                    himanshu.gautam

                                    Hi realhet,

                                     Can you please share some code which can help us reproduce the issue?

                                    I will ask someone more knowledgeable for directions here.

                                    Thanks

                                      • Re: Small temporary arrays in OpenCL
                                        realhet

                                        Hi!

                                         

                                        I've managed to narrow it down: This is the simple operation it does over and over:

                                         

                                          dcl_indexed_temp_array x0[![(bufLen+3)>>2]]

                                         

                                          //array initialization goes here 

                                         

                                          //shuffle the elements of the array

                                           forLoop(i,0,10000)  //a loop too big for the optimizer to unroll

                                            iadd x0[0].w,x0[0].w,x0[0].x

                                            iadd x0[0].x,x0[0].x,x0[0].y

                                            iadd x0[0].y,x0[0].y,x0[0].z

                                            iadd x0[0].z,x0[0].z,x0[0].w

                                          endloop
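                                         (For reference, the same shuffle loop as an OpenCL C kernel would be roughly this; a sketch with made-up names:)

                                           // The one-element "array" becomes a private int4.
                                           __kernel void shuffle_loop(__global int4 *buf)
                                           {
                                               int gid = get_global_id(0);
                                               int4 x0 = buf[gid];
                                               for (int i = 0; i < 10000; i++) { // too big to unroll
                                                   x0.w += x0.x;
                                                   x0.x += x0.y;
                                                   x0.y += x0.z;
                                                   x0.z += x0.w;
                                               }
                                               buf[gid] = x0;
                                           }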

                                         

                                         And the suboptimal code is triggered by the way I initialize the array:

                                         

                                         If I do this: mov x0[0], cb0[0]   then it compiles to perfect code (only add instructions are in the inner loop).

                                         But if I initialize it with a dword-indexing macro:

                                         

                                          XWrite(x0, 0, cb0[0].x)

                                          XWrite(x0, 1, cb0[0].y)

                                          XWrite(x0, 2, cb0[0].z)

                                          XWrite(x0, 3, cb0[0].w)

                                         

                                         Where the XWrite C-style macro is this (it writes a dword into any array (cb0, x0, ...) at any dword position):

                                         

                                        #define XWrite(XName,dwIdx,val)         \\

                                          ushr r999.x, dwIdx, 2                 \\

                                          iand r999.y, dwIdx, 3                 \\

                                          ifieq(r999.y,0) mov XName[r999.x].x, val \ endif \\

                                          ifieq(r999.y,1) mov XName[r999.x].y, val \ endif \\

                                          ifieq(r999.y,2) mov XName[r999.x].z, val \ endif \\

                                          ifieq(r999.y,3) mov XName[r999.x].w, val \ endif \\

                                         

                                        #define ifieq(a,b)      \\

                                        ieq r999.w, a, b        \\

                                        if_logicalnz r999.w     \\

                                         

                                         So if I touch the array with that flexible dword-addressable thing, the compiler does the following:

                                         - It realizes that the dword address is a constant, so the ushr, iand calculations are constant too.

                                         - It also drops the 3 non-matching IFs and leaves a single specific mov instruction behind.

                                         So it can optimize the whole XWrite(array, const, anything) macro into a single mov instruction (e.g. XWrite(x0, 2, cb0[0].z) collapses to mov x0[0].z, cb0[0].z), which is great.

                                         But later, when it gets to the "iadd x0[0].w,x0[0].w,x0[0].x..." main loop, it does this:

                                          mov tmp1, x0[0].x  

                                          mov tmp2, x0[0].w

                                          add tmp2, tmp1, tmp2

                                          mov x0[0].w, tmp2

                                         And this is triggered by the XWrite(x0,0,1234) macro: the movs left over from those 4 IFs aren't optimized further, even when the operands are specified exactly (x0[0].x).

                                         

                                        ----------------------------------------------------------------------------------------------------

                                        Bad:

                                        ; --------  Disassembly --------------------

                                        00 ALU: ADDR(32) CNT(10) KCACHE0(CB0:0-15)    //initialize with XWrite(x0,0,cb2[0].x) ...and so on

                                              0  x: MOV         R0.x,  KC0[0].x     

                                                 y: MOV         R0.y,  KC0[0].y     

                                                 z: MOV         R0.z,  KC0[0].z     

                                                 w: MOV         R0.w,  KC0[0].w     

                                              1  x: MOV         R4.x,  R0.x     

                                              2  y: MOV         R4.y,  R0.y     

                                              3  z: MOV         R4.z,  R0.z     

                                              4  w: MOV         R4.w,  R0.w     

                                              5  w: MOV         R1.w,  (0xFFFFFFFF, -1.#QNANf).x     

                                        01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)

                                            02 ALU: ADDR(42) CNT(3)

                                                  6  w: ADD_INT     R1.w,  R1.w,  1     

                                                  7  x: PREDGE_INT  ____,  10000,  R1.w      UPDATE_EXEC_MASK BREAK UPDATE_PRED

                                            03 ALU: ADDR(45) CNT(15)

                                                  8  x: MOV         R0.x,  R4.x                  // iadd x0[0].w,x0[0].w,x0[0].x  ...and so on

                                                     w: MOV         R0.w,  R4.w     

                                                  9  z: MOV         R0.z,  R4.z     

                                                 10  w: ADD_INT     R0.w,  R0.x,  R0.w     

                                                 11  w: MOV         R4.w,  R0.w     

                                                 12  x: MOV         R0.x,  R4.x     

                                                     y: MOV         R0.y,  R4.y     

                                                 13  x: ADD_INT     R0.x,  R0.x,  R0.y     

                                                     z: ADD_INT     R1.z,  R0.w,  R0.z     

                                                 14  x: MOV         R4.x,  R0.x     

                                                 15  y: MOV         R0.y,  R4.y     

                                                     z: MOV         R0.z,  R4.z     

                                                 16  y: ADD_INT     R0.y,  R0.y,  R0.z     

                                                 17  y: MOV         R4.y,  R0.y     

                                                 18  z: MOV         R4.z,  R1.z     

                                        04 ENDLOOP i0 PASS_JUMP_ADDR(2)

                                        05 ALU: ADDR(60) CNT(5) KCACHE0(CB1:0-15)

                                             19  x: MOV         R0.x,  R4.x     

                                             20  y: MULADD_UINT24  R127.y,  0.0f,  4,  KC0[0].x     

                                             21  x: LSHR        R1.x,  PV20.y,  2     

                                        06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R1].x___, R0, ARRAY_SIZE(4)  VPM

                                        07 END

                                        END_OF_PROGRAM

                                         

                                        Good:

                                        ; --------  Disassembly --------------------

                                        00 ALU: ADDR(32) CNT(6) KCACHE0(CB0:0-15)   //initialized with mov x0[0],cb2[0]

                                              0  x: MOV         R1.x,  KC0[0].x     

                                                 y: MOV         R0.y,  KC0[0].y     

                                                 z: MOV         R0.z,  KC0[0].z     

                                                 w: MOV         R0.w,  KC0[0].w     

                                              1  w: MOV         R1.w,  (0xFFFFFFFF, -1.#QNANf).x     

                                        01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)

                                            02 ALU: ADDR(38) CNT(3)

                                                  2  w: ADD_INT     R1.w,  R1.w,  1     

                                                  3  x: PREDGE_INT  ____,  10000,  R1.w      UPDATE_EXEC_MASK BREAK UPDATE_PRED

                                            03 ALU: ADDR(41) CNT(4)

                                                  4  x: ADD_INT     R1.x,  R1.x,  R0.y     

                                                     y: ADD_INT     R0.y,  R0.y,  R0.z     

                                                     w: ADD_INT     R0.w,  R1.x,  R0.w     

                                                  5  z: ADD_INT     R0.z,  R0.z,  PV4.w     

                                        04 ENDLOOP i0 PASS_JUMP_ADDR(2)

                                        05 ALU: ADDR(45) CNT(4) KCACHE0(CB1:0-15)

                                              6  y: MULADD_UINT24  R127.y,  0.0f,  4,  KC0[0].x     

                                              7  x: LSHR        R0.x,  PV6.y,  2     

                                        06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0].x___, R1, ARRAY_SIZE(4)  VPM

                                        07 END

                                        END_OF_PROGRAM

                                        ---------------------------------------------------------------------------------------------

                                         

                                        On the GCN it also does this:

                                          v_mov_b32     v4, v40                                     // 00001C9C: 7E080328

                                          v_add_i32     v3, vcc, v37, v4                            // 00001CA0: 4A060925

                                          v_mov_b32     v40, v3                                     // 00001CA4: 7E500303

                                          v_mov_b32     v4, v41                                     // 00001CA8: 7E080329

                                          v_mov_b32     v5, v38                                     // 00001CAC: 7E0A0326

                                          v_add_i32     v4, vcc, v4, v5                             // 00001CB0: 4A080B04

                                          v_mov_b32     v41, v4                                     // 00001CB4: 7E520304

                                          v_mov_b32     v5, v42                                     // 00001CB8: 7E0A032A

                                          v_mov_b32     v6, v39                                     // 00001CBC: 7E0C0327

                                          v_add_i32     v5, vcc, v5, v6                             // 00001CC0: 4A0A0D05

                                          v_mov_b32     v42, v5                                     // 00001CC4: 7E540305

                                          v_mov_b32     v6, v43                                     // 00001CC8: 7E0C032B

                                          v_add_i32     v3, vcc, v3, v6                             // 00001CCC: 4A060D03

                                         But I failed to reproduce it with a small test program. It needs more 'pressure': it could be high vreg usage, big program code, or whatever. For small arrays it optimizes fine.
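                                         If someone wants to try, here is a hypothetical way to add that 'pressure' in an OpenCL repro (pure guesswork about what tips the compiler over; assumes buf is large enough):

                                           // Keep many values live around the shuffle loop to
                                           // push vgpr usage up near the array accesses.
                                           __kernel void pressure(__global int *buf)
                                           {
                                               int gid = get_global_id(0);
                                               int live[32];            // many simultaneously live dwords
                                               for (int i = 0; i < 32; i++)
                                                   live[i] = buf[gid + i];
                                               int4 x0 = vload4(gid, buf);
                                               for (int i = 0; i < 10000; i++) {
                                                   x0.w += x0.x;
                                                   x0.x += x0.y;
                                                   x0.y += x0.z;
                                                   x0.z += x0.w;
                                               }
                                               int s = x0.x + x0.y + x0.z + x0.w;
                                               for (int i = 0; i < 32; i++)
                                                   s += live[i];
                                               buf[gid] = s;
                                           }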

                                         

                                        -----------------------------------------------------------------------------------------------------------------------

                                        (Attaching an HD6970 compatible AMD_IL code.)

                                          • Re: Small temporary arrays in OpenCL
                                            himanshu.gautam

                                            Hi Realhet,

                                             I will forward this to the appropriate team. Can you let me know some more details:

                                             1. Platform: win32 / win64 / lin32 / lin64, or something else?

                                                 Win7, Vista, or Win8; similarly for Linux, your distribution.

                                             2. Version of the driver.

                                             3. CPU(s) or GPU(s) you worked on. I think this is the HD 6970 and HD 7970. Please confirm.

                                              • Re: Small temporary arrays in OpenCL
                                                realhet

                                                Hi!

                                                 

                                                I've tried with the latest driver also (no changes).

                                                Attaching many files to make it easy to reproduce/analyze.

                                                 

                                                Thank You

                                                 

                                                -------------------------------------------------------------------------------------------------------------------------------------------

                                                This test in a nutshell:

                                                 

                                                GPU: HD6970

                                                OS: win7 64

                                                Cat: 12-10 and 13-1 (no differences in result)

                                                 

                                                 Take an indexed array x0 of length 1.

                                                 

                                                 I do the following operation on it in a loop:

                                                  x0[0].x+=x0[0].y;

                                                  x0[0].y+=x0[0].x;    //note the constant indexing

                                                 

                                                 The compiled ISA loop differs based on the way I use the array.

                                                 

                                                 1) When I initialize it with constant indexing:

                                                    x0[0].xy=cb2[0].xy 

                                                  Then it will compile the loop to:

                                                    3  y: ADD_INT     R0.y,  R1.x,  R0.y     

                                                    4  x: ADD_INT     R1.x,  R1.x,  PV3.y      //2 cycles is the best time for this dependency chain

                                                 

                                                 2) When I initialize it with register indexing:

                                                    loop r1.x from 0 to 1 do  

                                                      if(r1.x%4=0) x0[r1.x/4].x=cb2[r1.x/4].x

                                                      if(r1.x%4=1) x0[r1.x/4].y=cb2[r1.x/4].y

                                                      if(r1.x%4=2) x0[r1.x/4].z=cb2[r1.x/4].z

                                                      if(r1.x%4=3) x0[r1.x/4].w=cb2[r1.x/4].w

                                                    endloop 

                                                   This is enough for the compiler to mark the array as variably accessed, and it then compiles the loop to:

                                                    5  x: MOV         R0.x,  R4.x     

                                                       y: MOV         R0.y,  R4.y     

                                                    6  x: MOV         R1.x,  R4.x     

                                                    7  y: ADD_INT     R0.y,  R0.x,  R0.y     

                                                    8  x: ADD_INT     R1.x,  PV7.y,  R1.x     

                                                    9  y: MOV         R4.y,  R0.y     

                                                   10  x: MOV         R4.x,  R1.x