3 Replies Latest reply on May 19, 2011 1:46 PM by techforums123

    Bug in compiling ushort kernel for GPU

    omion
      I'm writing a kernel that is designed to take the most advantage of the SSE2 instruction set, but I was doing a sanity check on the GPU and getting the wrong answers. It looks like some optimization in the compilation is getting the wrong answer.

      I traced the problem down and got the smallest kernel that showed it. The code below produced the following Cypress assembly:

      ; -------- Disassembly --------------------
      00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:0-15)
      0 x: LSHR R1.x, KC0[0].x, (0x00000002, 2.802596929e-45f).x
      t: MOV R0.x, (0x00010001, 9.183689746e-41f).y
      01 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].x___, R0, ARRAY_SIZE(4) VPM
      END_OF_PROGRAM

      Basically all it does is store 0x00010001 into the buffer, which is the wrong value (each ushort should be 0x5555 at the end)


      However, if I manually unroll the entire loop it gives the following (correct) assembly:

      ; -------- Disassembly --------------------
      00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:0-15)
      0 x: LSHR R1.x, KC0[0].x, (0x00000002, 2.802596929e-45f).x
      t: MOV R0.x, (0x55555555, 1.466015469e13f).y
      01 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].x___, R0, ARRAY_SIZE(4) VPM
      END_OF_PROGRAM


      Compiling to CPU assembly doesn't reduce all the computations but does produce the right answer.

      __kernel void __attribute__((reqd_work_group_size(1,1,1))) small_test_1_wu ( __global ushort2 *output ) { ushort2 xor_this = (ushort2)0; ushort2 input = (ushort2)0x4000; ushort2 factor = (ushort2)4; ushort2 hi; uint i; for(i = 0; i < 8; i++) { xor_this ^= ((as_ushort2((factor & (ushort2)1) == (ushort2)1)) & input); hi = as_ushort2(as_short2(input) >> (short)15); input <<= (ushort)1; input ^= hi & (ushort2)0x5555; factor = factor >> (ushort)1; } output[0] = xor_this; }