OpenCL

himanshu_gautam
Grandmaster

Re: Small temporary arrays in OpenCL

Hi realhet,

Can you please share some code that would help us reproduce the issue?

I will ask someone more knowledgeable for directions here.

Thanks

realhet
Miniboss

Re: Small temporary arrays in OpenCL

Hi!

I've managed to narrow it down: This is the simple operation it does over and over:

  dcl_indexed_temp_array x0[![(bufLen+3)>>2]]

  //array initialization goes here 

  //shuffle the elements of the array

  forLoop(i,0,10000)  //a loop too big for the optimizer to unroll

    iadd x0[0].w,x0[0].w,x0[0].x

    iadd x0[0].x,x0[0].x,x0[0].y

    iadd x0[0].y,x0[0].y,x0[0].z

    iadd x0[0].z,x0[0].z,x0[0].w

  endloop

The suboptimal code is triggered by the way I initialize the array:

If I do this: mov x0[0], cb0[0]   then it compiles perfect code (the inner loop contains only add instructions).

But if I initialize it with a dword indexing macro:

  XWrite(x0, 0, cb0[0].x)

  XWrite(x0, 1, cb0[0].y)

  XWrite(x0, 2, cb0[0].z)

  XWrite(x0, 3, cb0[0].w)

Here XWrite is the following C-style macro (it writes a dword into any array (cb0, x0, ...) at any dword position):

#define XWrite(XName,dwIdx,val)         \\

  ushr r999.x, dwIdx, 2                 \\

  iand r999.y, dwIdx, 3                 \\

  ifieq(r999.y,0) mov XName[r999.x].x, val \ endif \\

  ifieq(r999.y,1) mov XName[r999.x].y, val \ endif \\

  ifieq(r999.y,2) mov XName[r999.x].z, val \ endif \\

  ifieq(r999.y,3) mov XName[r999.x].w, val \ endif \\

#define ifieq(a,b)      \\

ieq r999.w, a, b        \\

if_logicalnz r999.w     \\

So if I touch the array with that flexible dword-addressing macro, the compiler does the following:

- It realizes that the dword address is a constant, so the ushr and iand calculations are constant too.

- It also drops the three IFs that cannot match and leaves one specific mov instruction behind.

So it can optimize the whole XWrite(array, const, anything) macro into a single mov instruction, which is great.
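To make the addressing explicit, here is a minimal Python sketch of what XWrite does. It is only a model, not AMD IL: the names `split_dword_index` and `xwrite` and the sample values are made up for illustration. The array is a list of 4-component vectors, and a flat dword index is split into a vector index (`>> 2`) and a component index (`& 3`), just as the ushr/iand pair does in the macro.

```python
def split_dword_index(dw_idx):
    """Return (vector index, component index), like ushr ..,2 and iand ..,3."""
    return dw_idx >> 2, dw_idx & 3

def xwrite(array, dw_idx, val):
    """Write one dword; mirrors the four ifieq/mov branches of the macro."""
    vec, comp = split_dword_index(dw_idx)
    for c in range(4):            # components x, y, z, w
        if comp == c:             # exactly one of the four branches fires
            array[vec][c] = val

x0 = [[0, 0, 0, 0] for _ in range(2)]   # stands in for x0[0..1]
cb0_0 = [11, 22, 33, 44]                # made-up values for cb0[0].xyzw
for i, v in enumerate(cb0_0):
    xwrite(x0, i, v)                    # XWrite(x0, 0..3, cb0[0].x..w)
xwrite(x0, 6, 99)                       # dword 6 lands in x0[1].z
print(x0)  # [[11, 22, 33, 44], [0, 0, 99, 0]]
```

When `dw_idx` is a literal, both the shift and the mask are known at compile time, which is why three of the four branches can be dropped.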

But later, when it gets to the "iadd x0[0].w,x0[0].w,x0[0].x..." main loop, it does this:

  mov tmp1, x0[0].x  

  mov tmp2, x0[0].w

  add tmp2, tmp1, tmp2

  mov x0[0].w, tmp2

And this is triggered by the XWrite(x0,0,1234) macro: the mov instructions that result from those 4 IFs aren't optimized further, even when the operands are specified exactly (x0[0].x).
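As a rough illustration of the cost, the Python sketch below models the four iadds in two ways: operating directly on the components, and going through the mov/mov/add/mov pattern shown above. The function names and the per-operation counting are assumptions made for this sketch; real hardware packs instructions differently (the ISA listings report CNT(15) versus CNT(4) for the loop body), so the counts only show the trend.

```python
MASK = 0xFFFFFFFF  # iadd is a 32-bit integer add, so results wrap

def shuffle_direct(v, iterations):
    """The four iadds applied straight to x0[0]; v is [x, y, z, w]."""
    x, y, z, w = v
    ops = 0
    for _ in range(iterations):
        w = (w + x) & MASK   # iadd x0[0].w, x0[0].w, x0[0].x
        x = (x + y) & MASK   # iadd x0[0].x, x0[0].x, x0[0].y
        y = (y + z) & MASK   # iadd x0[0].y, x0[0].y, x0[0].z
        z = (z + w) & MASK   # iadd x0[0].z, x0[0].z, x0[0].w
        ops += 4
    return [x, y, z, w], ops

def shuffle_via_temps(v, iterations):
    """Same arithmetic, but each iadd becomes load, load, add, store."""
    X, Y, Z, W = 0, 1, 2, 3
    arr = list(v)
    ops = 0
    for _ in range(iterations):
        for dst, src in ((W, X), (X, Y), (Y, Z), (Z, W)):
            tmp1 = arr[src]                 # mov tmp1, x0[0].src
            tmp2 = arr[dst]                 # mov tmp2, x0[0].dst
            tmp2 = (tmp1 + tmp2) & MASK     # add tmp2, tmp1, tmp2
            arr[dst] = tmp2                 # mov x0[0].dst, tmp2
            ops += 4
    return arr, ops

print(shuffle_direct([1, 2, 3, 4], 1))     # ([3, 5, 8, 5], 4)
print(shuffle_via_temps([1, 2, 3, 4], 1))  # ([3, 5, 8, 5], 16)
```

Both versions compute the same values, so the extra movs are pure overhead in this model.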

----------------------------------------------------------------------------------------------------

Bad:

; --------  Disassembly --------------------

00 ALU: ADDR(32) CNT(10) KCACHE0(CB0:0-15)    //initialize with XWrite(x0,0,cb2[0].x) ...and so on

      0  x: MOV         R0.x,  KC0[0].x     

         y: MOV         R0.y,  KC0[0].y     

         z: MOV         R0.z,  KC0[0].z     

         w: MOV         R0.w,  KC0[0].w     

      1  x: MOV         R4.x,  R0.x     

      2  y: MOV         R4.y,  R0.y     

      3  z: MOV         R4.z,  R0.z     

      4  w: MOV         R4.w,  R0.w     

      5  w: MOV         R1.w,  (0xFFFFFFFF, -1.#QNANf).x     

01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)

    02 ALU: ADDR(42) CNT(3)

          6  w: ADD_INT     R1.w,  R1.w,  1     

          7  x: PREDGE_INT  ____,  10000,  R1.w      UPDATE_EXEC_MASK BREAK UPDATE_PRED

    03 ALU: ADDR(45) CNT(15)

          8  x: MOV         R0.x,  R4.x                  // iadd x0[0].w,x0[0].w,x0[0].x  ...and so on

             w: MOV         R0.w,  R4.w     

          9  z: MOV         R0.z,  R4.z     

         10  w: ADD_INT     R0.w,  R0.x,  R0.w     

         11  w: MOV         R4.w,  R0.w     

         12  x: MOV         R0.x,  R4.x     

             y: MOV         R0.y,  R4.y     

         13  x: ADD_INT     R0.x,  R0.x,  R0.y     

             z: ADD_INT     R1.z,  R0.w,  R0.z     

         14  x: MOV         R4.x,  R0.x     

         15  y: MOV         R0.y,  R4.y     

             z: MOV         R0.z,  R4.z     

         16  y: ADD_INT     R0.y,  R0.y,  R0.z     

         17  y: MOV         R4.y,  R0.y     

         18  z: MOV         R4.z,  R1.z     

04 ENDLOOP i0 PASS_JUMP_ADDR(2)

05 ALU: ADDR(60) CNT(5) KCACHE0(CB1:0-15)

     19  x: MOV         R0.x,  R4.x     

     20  y: MULADD_UINT24  R127.y,  0.0f,  4,  KC0[0].x     

     21  x: LSHR        R1.x,  PV20.y,  2     

06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R1].x___, R0, ARRAY_SIZE(4)  VPM

07 END

END_OF_PROGRAM

Good:

; --------  Disassembly --------------------

00 ALU: ADDR(32) CNT(6) KCACHE0(CB0:0-15)   //initialized with mov x0[0],cb2[0]

      0  x: MOV         R1.x,  KC0[0].x     

         y: MOV         R0.y,  KC0[0].y     

         z: MOV         R0.z,  KC0[0].z     

         w: MOV         R0.w,  KC0[0].w     

      1  w: MOV         R1.w,  (0xFFFFFFFF, -1.#QNANf).x     

01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)

    02 ALU: ADDR(38) CNT(3)

          2  w: ADD_INT     R1.w,  R1.w,  1     

          3  x: PREDGE_INT  ____,  10000,  R1.w      UPDATE_EXEC_MASK BREAK UPDATE_PRED

    03 ALU: ADDR(41) CNT(4)

          4  x: ADD_INT     R1.x,  R1.x,  R0.y     

             y: ADD_INT     R0.y,  R0.y,  R0.z     

             w: ADD_INT     R0.w,  R1.x,  R0.w     

          5  z: ADD_INT     R0.z,  R0.z,  PV4.w     

04 ENDLOOP i0 PASS_JUMP_ADDR(2)

05 ALU: ADDR(45) CNT(4) KCACHE0(CB1:0-15)

      6  y: MULADD_UINT24  R127.y,  0.0f,  4,  KC0[0].x     

      7  x: LSHR        R0.x,  PV6.y,  2     

06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0].x___, R1, ARRAY_SIZE(4)  VPM

07 END

END_OF_PROGRAM

---------------------------------------------------------------------------------------------

On the GCN it also does this:

  v_mov_b32     v4, v40                                     // 00001C9C: 7E080328

  v_add_i32     v3, vcc, v37, v4                            // 00001CA0: 4A060925

  v_mov_b32     v40, v3                                     // 00001CA4: 7E500303

  v_mov_b32     v4, v41                                     // 00001CA8: 7E080329

  v_mov_b32     v5, v38                                     // 00001CAC: 7E0A0326

  v_add_i32     v4, vcc, v4, v5                             // 00001CB0: 4A080B04

  v_mov_b32     v41, v4                                     // 00001CB4: 7E520304

  v_mov_b32     v5, v42                                     // 00001CB8: 7E0A032A

  v_mov_b32     v6, v39                                     // 00001CBC: 7E0C0327

  v_add_i32     v5, vcc, v5, v6                             // 00001CC0: 4A0A0D05

  v_mov_b32     v42, v5                                     // 00001CC4: 7E540305

  v_mov_b32     v6, v43                                     // 00001CC8: 7E0C032B

  v_add_i32     v3, vcc, v3, v6                             // 00001CCC: 4A060D03

But I failed to reproduce it with a small test program. It needs more 'pressure': it could be high VReg usage, big program code, or something else. For small arrays it optimizes fine.

-----------------------------------------------------------------------------------------------------------------------

(Attaching HD6970-compatible AMD_IL code.)

himanshu_gautam
Grandmaster

Re: Small temporary arrays in OpenCL

Hi Realhet,

I will forward this to the appropriate team. Can you give me some more details:

1. Platform: win32 / win64 / lin32 / lin64, or something else?

    Win7, Win Vista, or Win8? Similarly for Linux, which distribution?

2. Version of driver

3. CPU(s) or GPU(s) you worked on. I think this is HD 6970 and HD 7970. Please confirm.

realhet
Miniboss

Re: Small temporary arrays in OpenCL

Hi!

I've also tried the latest driver (no change).

Attaching many files to make it easy to reproduce/analyze.

Thank You

-------------------------------------------------------------------------------------------------------------------------------------------

This test in a nutshell:

GPU: HD6970

OS: win7 64

Catalyst: 12-10 and 13-1 (no difference in the results)

I have an indexed array x0 with length=1.

I do the following operation on that in a loop:

  x0[0].x+=x0[0].y;

  x0[0].y+=x0[0].x;    //note the constant indexing

The compiled ISA loop differs based on the way I use that array.

1) When I initialize it with constant indexing:

    x0[0].xy=cb2[0].xy 

  Then it will compile the loop to:

    3  y: ADD_INT     R0.y,  R1.x,  R0.y     

    4  x: ADD_INT     R1.x,  R1.x,  PV3.y      //2 cycles is the best time for this dependency chain

2) When I initialize it with register indexing:

    loop r1.x from 0 to 1 do  

      if(r1.x%4=0) x0[r1.x/4].x=cb2[r1.x/4].x

      if(r1.x%4=1) x0[r1.x/4].y=cb2[r1.x/4].y

      if(r1.x%4=2) x0[r1.x/4].z=cb2[r1.x/4].z

      if(r1.x%4=3) x0[r1.x/4].w=cb2[r1.x/4].w

    endloop 

  This is enough for the compiler to mark the array as variably accessed, and then it compiles the loop to:

    5  x: MOV         R0.x,  R4.x     

       y: MOV         R0.y,  R4.y     

    6  x: MOV         R1.x,  R4.x     

    7  y: ADD_INT     R0.y,  R0.x,  R0.y     

    8  x: ADD_INT     R1.x,  PV7.y,  R1.x     

    9  y: MOV         R4.y,  R0.y     

   10  x: MOV         R4.x,  R1.x     
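To underline that the two initializations are equivalent in meaning, here is a small Python model of this test. It is a sketch under assumptions: "from 0 to 1" is taken as inclusive, the cb2 values are invented, and the function names are hypothetical. Both paths fill x0[0].xy identically, so the main loop could in principle compile to the same two adds in either case.

```python
MASK = 0xFFFFFFFF  # model 32-bit wraparound

def init_constant(cb2):
    """x0[0].xy = cb2[0].xy (constant indexing)."""
    x0 = [[0, 0, 0, 0]]
    x0[0][0] = cb2[0][0]
    x0[0][1] = cb2[0][1]
    return x0

def init_register_indexed(cb2):
    """The loop over r1.x with four component tests (register indexing)."""
    x0 = [[0, 0, 0, 0]]
    for r1x in range(0, 2):  # assumed inclusive "from 0 to 1"
        if r1x % 4 == 0: x0[r1x // 4][0] = cb2[r1x // 4][0]
        if r1x % 4 == 1: x0[r1x // 4][1] = cb2[r1x // 4][1]
        if r1x % 4 == 2: x0[r1x // 4][2] = cb2[r1x // 4][2]
        if r1x % 4 == 3: x0[r1x // 4][3] = cb2[r1x // 4][3]
    return x0

def run_loop(x0, iterations):
    """x0[0].x += x0[0].y; x0[0].y += x0[0].x, with constant indexing."""
    for _ in range(iterations):
        x0[0][0] = (x0[0][0] + x0[0][1]) & MASK
        x0[0][1] = (x0[0][1] + x0[0][0]) & MASK
    return x0

cb2 = [[5, 7, 0, 0]]  # invented constant-buffer contents
a = run_loop(init_constant(cb2), 3)
b = run_loop(init_register_indexed(cb2), 3)
print(a, b)  # [[81, 131, 0, 0]] [[81, 131, 0, 0]]
```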

himanshu_gautam
Grandmaster

Re: Small temporary arrays in OpenCL

Thank you for the test case. I have reported the issue to the AMD OpenCL compiler team. I will update the thread once the issue has been fixed.
