8 Replies Latest reply on Nov 13, 2009 6:43 PM by CaptainN

    Shared register not updated as it ought to be?

    lpw

      I'm working on a CAL program that performs a reduction using globally shared registers.  After reading the docs and the forum, I decided to start by implementing the following three-pass algorithm:
      1 (init).  Run one wavefront per SIMD to initialize the shared registers to 0.
      2 (update).  Run a bunch of threads that increment the value in a shared register.
      3 (fetch).  Run one wavefront per SIMD to dump the shared registers to a global buffer.

      I'm using calCtxRunProgramGridArray to run the three kernels.  Each kernel uses 1 shared register, no LDS, and 64 threads per group.  The card is a 4870X2 (kernels run on device 0).

      The init kernel looks like this:

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_shared_temp sr1
      dcl_literal l0, 0x0, 0x0, 0x0, 0x0
      mov sr0, l0
      end

      The update kernel looks like this:

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_shared_temp sr1
      dcl_cb cb0[1]
      dcl_literal l0, 0x1, 0x1, 0x1, 0x1
      ; cb0[0].x contains total number of threads in the execution domain
      ult r17, vaTid.x, cb0[0].x
      if_logicalnz r17.x
          iadd sr0, sr0, l0
      endif
      end

      The fetch kernel looks like this:

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_shared_temp sr1
      mov g[vaTid.x], sr0
      end

      The init and fetch kernels are executed with 640 threads each (numSIMDs * wavefrontSize).  Is this the correct way of launching one wavefront per SIMD?  The update kernel can be executed with any number of threads.

      The fetch kernel dumps the SRs to a global buffer (640 quad words).  I would expect that, if I added the x components of the 640 quad words, they should add up to the number of threads in the update kernel.  But this is not the case.  It appears that not all SRs are correctly incremented.
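
The check described above can be sketched on the host side. This is a minimal illustration in Python rather than CAL/IL; `fetched`, `sum_x_components` and `reduction_is_consistent` are hypothetical names, and the lists stand in for the 640 quad words read back from the global buffer.

```python
# Hypothetical host-side check, not part of the CAL API.
# Each entry of `fetched` is one (x, y, z, w) quad word from the global buffer.

def sum_x_components(fetched):
    """Add up the x component of every quad word."""
    return sum(quad[0] for quad in fetched)

def reduction_is_consistent(fetched, num_update_threads):
    """True when the per-register counts add up to the update thread count."""
    return sum_x_components(fetched) == num_update_threads

# 17 update threads: 17 registers hold 1, the remaining 623 hold 0.
fetched_17 = [(1, 1, 1, 1)] * 17 + [(0, 0, 0, 0)] * 623
print(reduction_is_consistent(fetched_17, 17))   # True

# 1297 update threads, as reported further down: 17 twos and 623 ones.
fetched_1297 = [(2, 2, 2, 2)] * 17 + [(1, 1, 1, 1)] * 623
print(sum_x_components(fetched_1297))            # 657, not 1297
```

The order of the entries does not matter for this check, so it works regardless of how threads map to registers.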

      Following is a set of outputs generated by printing the fetched global buffer.  I'm printing only the x components of each quad word (all four components are the same).

      If the update kernel is executed with 17 threads, the following values are produced (correct):

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

      If the update kernel is executed with 640 threads, the following values are produced (correct):

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      In fact, running between 0 and 640 threads seems to always work.  There are always k ones in the output, where k is the number of threads in the update kernel.  This situation corresponds to at most one wavefront per SIMD.

      However, things get wonky when k is greater than 640 (more than one wavefront per SIMD).

      For instance, running between 640 and 1280 threads in the update kernel produces the above output of all ones (incorrect, since I would expect that some SRs should be incremented to 2).  Running more than 1280 threads, the registers again appear to get incremented, but some increments were lost.  Here's the output for 1297 threads:

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

      I would expect to see something like this instead (each register incremented at least twice, with 17 of them incremented thrice):

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

      3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
      2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
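
The expected pattern above follows from simple arithmetic. Here is a rough Python sketch, assuming (as the post does) that every batch of 640 update threads increments each of the 640 register slots exactly once and that no increment is lost. `SLOTS` and `expected_histogram` are illustrative names only.

```python
SLOTS = 640  # numSIMDs * wavefrontSize on this card, per the post

def expected_histogram(num_update_threads, slots=SLOTS):
    """Map each increment count to the number of slots expected to hold it.

    Assumes each batch of `slots` threads touches every slot exactly once
    and that no increment is lost.
    """
    full_passes, remainder = divmod(num_update_threads, slots)
    hist = {}
    if remainder:
        # Slots also reached by the final, partial batch.
        hist[full_passes + 1] = remainder
    hist[full_passes] = slots - remainder
    return hist

print(expected_histogram(1297))  # {3: 17, 2: 623}
print(expected_histogram(657))   # {2: 17, 1: 623}
```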

      I dug a little deeper, and have generated results that seem to suggest that the shared registers are written correctly but are not read correctly by subsequent wavefronts.

      The shared register is now a quadruple which contains the following items:
      x: the value (as above)
      y: absolute thread id of the thread that initialized it (set by init kernel)
      z: absolute thread id of the thread that updated it (set by update kernel)
      w: absolute thread id of the thread that fetched it (set by fetch kernel)

      The new init kernel:

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_shared_temp sr1
      dcl_literal l0, 0x0, 0x0, 0xffffffff, 0xffffffff
      mov sr0.x, l0.x
      mov sr0.y, vaTid.x
      mov sr0.z, l0.z
      mov sr0.w, l0.w
      end

      The new update kernel (liberally sprinkled with fence_sr's):

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_shared_temp sr1
      dcl_cb cb0[1]
      dcl_literal l0, 0x1, 0x1, 0x1, 0x1
      ; cb0[0].x contains total number of threads in the execution domain
      ult r17, vaTid.x, cb0[0].x
      if_logicalnz r17.x
          fence_sr
          mov r0, sr0
          iadd r0.x, r0.x, l0.x
          mov r0.z, vaTid.x
          fence_sr
          mov sr0, r0
      endif
      end

      The new fetch kernel:

      il_cs_2_0
      dcl_num_thread_per_group 64
      dcl_shared_temp sr1
      mov r0, sr0
      mov r0.w, vaTid.x
      mov g[vaTid.x], r0
      end


      I ran the new kernels with 657 threads (640 + 17), expecting to see 623 ones and 17 twos.  All I see, however, are ones.  These results are a little lengthy, but the interesting parts are near the bottom: the 17 rows whose z component is 640 to 656.

      ( 1,  64,  64,   0)
      ( 1,  65,  65,   1)
      ( 1,  66,  66,   2)
      ( 1,  67,  67,   3)
      ( 1,  68,  68,   4)
      ( 1,  69,  69,   5)
      ( 1,  70,  70,   6)
      ( 1,  71,  71,   7)
      ( 1,  72,  72,   8)
      ( 1,  73,  73,   9)
      ( 1,  74,  74,  10)
      ( 1,  75,  75,  11)
      ( 1,  76,  76,  12)
      ( 1,  77,  77,  13)
      ( 1,  78,  78,  14)
      ( 1,  79,  79,  15)
      ( 1,  80,  80,  16)
      ( 1,  81,  81,  17)
      ( 1,  82,  82,  18)
      ( 1,  83,  83,  19)
      ( 1,  84,  84,  20)
      ( 1,  85,  85,  21)
      ( 1,  86,  86,  22)
      ( 1,  87,  87,  23)
      ( 1,  88,  88,  24)
      ( 1,  89,  89,  25)
      ( 1,  90,  90,  26)
      ( 1,  91,  91,  27)
      ( 1,  92,  92,  28)
      ( 1,  93,  93,  29)
      ( 1,  94,  94,  30)
      ( 1,  95,  95,  31)
      ( 1,  96,  96,  32)
      ( 1,  97,  97,  33)
      ( 1,  98,  98,  34)
      ( 1,  99,  99,  35)
      ( 1, 100, 100,  36)
      ( 1, 101, 101,  37)
      ( 1, 102, 102,  38)
      ( 1, 103, 103,  39)
      ( 1, 104, 104,  40)
      ( 1, 105, 105,  41)
      ( 1, 106, 106,  42)
      ( 1, 107, 107,  43)
      ( 1, 108, 108,  44)
      ( 1, 109, 109,  45)
      ( 1, 110, 110,  46)
      ( 1, 111, 111,  47)
      ( 1, 112, 112,  48)
      ( 1, 113, 113,  49)
      ( 1, 114, 114,  50)
      ( 1, 115, 115,  51)
      ( 1, 116, 116,  52)
      ( 1, 117, 117,  53)
      ( 1, 118, 118,  54)
      ( 1, 119, 119,  55)
      ( 1, 120, 120,  56)
      ( 1, 121, 121,  57)
      ( 1, 122, 122,  58)
      ( 1, 123, 123,  59)
      ( 1, 124, 124,  60)
      ( 1, 125, 125,  61)
      ( 1, 126, 126,  62)
      ( 1, 127, 127,  63)

      ( 1, 128, 128,  64)
      ( 1, 129, 129,  65)
      ( 1, 130, 130,  66)
      ( 1, 131, 131,  67)
      ( 1, 132, 132,  68)
      ( 1, 133, 133,  69)
      ( 1, 134, 134,  70)
      ( 1, 135, 135,  71)
      ( 1, 136, 136,  72)
      ( 1, 137, 137,  73)
      ( 1, 138, 138,  74)
      ( 1, 139, 139,  75)
      ( 1, 140, 140,  76)
      ( 1, 141, 141,  77)
      ( 1, 142, 142,  78)
      ( 1, 143, 143,  79)
      ( 1, 144, 144,  80)
      ( 1, 145, 145,  81)
      ( 1, 146, 146,  82)
      ( 1, 147, 147,  83)
      ( 1, 148, 148,  84)
      ( 1, 149, 149,  85)
      ( 1, 150, 150,  86)
      ( 1, 151, 151,  87)
      ( 1, 152, 152,  88)
      ( 1, 153, 153,  89)
      ( 1, 154, 154,  90)
      ( 1, 155, 155,  91)
      ( 1, 156, 156,  92)
      ( 1, 157, 157,  93)
      ( 1, 158, 158,  94)
      ( 1, 159, 159,  95)
      ( 1, 160, 160,  96)
      ( 1, 161, 161,  97)
      ( 1, 162, 162,  98)
      ( 1, 163, 163,  99)
      ( 1, 164, 164, 100)
      ( 1, 165, 165, 101)
      ( 1, 166, 166, 102)
      ( 1, 167, 167, 103)
      ( 1, 168, 168, 104)
      ( 1, 169, 169, 105)
      ( 1, 170, 170, 106)
      ( 1, 171, 171, 107)
      ( 1, 172, 172, 108)
      ( 1, 173, 173, 109)
      ( 1, 174, 174, 110)
      ( 1, 175, 175, 111)
      ( 1, 176, 176, 112)
      ( 1, 177, 177, 113)
      ( 1, 178, 178, 114)
      ( 1, 179, 179, 115)
      ( 1, 180, 180, 116)
      ( 1, 181, 181, 117)
      ( 1, 182, 182, 118)
      ( 1, 183, 183, 119)
      ( 1, 184, 184, 120)
      ( 1, 185, 185, 121)
      ( 1, 186, 186, 122)
      ( 1, 187, 187, 123)
      ( 1, 188, 188, 124)
      ( 1, 189, 189, 125)
      ( 1, 190, 190, 126)
      ( 1, 191, 191, 127)

      ( 1, 192, 192, 128)
      ( 1, 193, 193, 129)
      ( 1, 194, 194, 130)
      ( 1, 195, 195, 131)
      ( 1, 196, 196, 132)
      ( 1, 197, 197, 133)
      ( 1, 198, 198, 134)
      ( 1, 199, 199, 135)
      ( 1, 200, 200, 136)
      ( 1, 201, 201, 137)
      ( 1, 202, 202, 138)
      ( 1, 203, 203, 139)
      ( 1, 204, 204, 140)
      ( 1, 205, 205, 141)
      ( 1, 206, 206, 142)
      ( 1, 207, 207, 143)
      ( 1, 208, 208, 144)
      ( 1, 209, 209, 145)
      ( 1, 210, 210, 146)
      ( 1, 211, 211, 147)
      ( 1, 212, 212, 148)
      ( 1, 213, 213, 149)
      ( 1, 214, 214, 150)
      ( 1, 215, 215, 151)
      ( 1, 216, 216, 152)
      ( 1, 217, 217, 153)
      ( 1, 218, 218, 154)
      ( 1, 219, 219, 155)
      ( 1, 220, 220, 156)
      ( 1, 221, 221, 157)
      ( 1, 222, 222, 158)
      ( 1, 223, 223, 159)
      ( 1, 224, 224, 160)
      ( 1, 225, 225, 161)
      ( 1, 226, 226, 162)
      ( 1, 227, 227, 163)
      ( 1, 228, 228, 164)
      ( 1, 229, 229, 165)
      ( 1, 230, 230, 166)
      ( 1, 231, 231, 167)
      ( 1, 232, 232, 168)
      ( 1, 233, 233, 169)
      ( 1, 234, 234, 170)
      ( 1, 235, 235, 171)
      ( 1, 236, 236, 172)
      ( 1, 237, 237, 173)
      ( 1, 238, 238, 174)
      ( 1, 239, 239, 175)
      ( 1, 240, 240, 176)
      ( 1, 241, 241, 177)
      ( 1, 242, 242, 178)
      ( 1, 243, 243, 179)
      ( 1, 244, 244, 180)
      ( 1, 245, 245, 181)
      ( 1, 246, 246, 182)
      ( 1, 247, 247, 183)
      ( 1, 248, 248, 184)
      ( 1, 249, 249, 185)
      ( 1, 250, 250, 186)
      ( 1, 251, 251, 187)
      ( 1, 252, 252, 188)
      ( 1, 253, 253, 189)
      ( 1, 254, 254, 190)
      ( 1, 255, 255, 191)

      ( 1, 256, 256, 192)
      ( 1, 257, 257, 193)
      ( 1, 258, 258, 194)
      ( 1, 259, 259, 195)
      ( 1, 260, 260, 196)
      ( 1, 261, 261, 197)
      ( 1, 262, 262, 198)
      ( 1, 263, 263, 199)
      ( 1, 264, 264, 200)
      ( 1, 265, 265, 201)
      ( 1, 266, 266, 202)
      ( 1, 267, 267, 203)
      ( 1, 268, 268, 204)
      ( 1, 269, 269, 205)
      ( 1, 270, 270, 206)
      ( 1, 271, 271, 207)
      ( 1, 272, 272, 208)
      ( 1, 273, 273, 209)
      ( 1, 274, 274, 210)
      ( 1, 275, 275, 211)
      ( 1, 276, 276, 212)
      ( 1, 277, 277, 213)
      ( 1, 278, 278, 214)
      ( 1, 279, 279, 215)
      ( 1, 280, 280, 216)
      ( 1, 281, 281, 217)
      ( 1, 282, 282, 218)
      ( 1, 283, 283, 219)
      ( 1, 284, 284, 220)
      ( 1, 285, 285, 221)
      ( 1, 286, 286, 222)
      ( 1, 287, 287, 223)
      ( 1, 288, 288, 224)
      ( 1, 289, 289, 225)
      ( 1, 290, 290, 226)
      ( 1, 291, 291, 227)
      ( 1, 292, 292, 228)
      ( 1, 293, 293, 229)
      ( 1, 294, 294, 230)
      ( 1, 295, 295, 231)
      ( 1, 296, 296, 232)
      ( 1, 297, 297, 233)
      ( 1, 298, 298, 234)
      ( 1, 299, 299, 235)
      ( 1, 300, 300, 236)
      ( 1, 301, 301, 237)
      ( 1, 302, 302, 238)
      ( 1, 303, 303, 239)
      ( 1, 304, 304, 240)
      ( 1, 305, 305, 241)
      ( 1, 306, 306, 242)
      ( 1, 307, 307, 243)
      ( 1, 308, 308, 244)
      ( 1, 309, 309, 245)
      ( 1, 310, 310, 246)
      ( 1, 311, 311, 247)
      ( 1, 312, 312, 248)
      ( 1, 313, 313, 249)
      ( 1, 314, 314, 250)
      ( 1, 315, 315, 251)
      ( 1, 316, 316, 252)
      ( 1, 317, 317, 253)
      ( 1, 318, 318, 254)
      ( 1, 319, 319, 255)

      ( 1, 320, 320, 256)
      ( 1, 321, 321, 257)
      ( 1, 322, 322, 258)
      ( 1, 323, 323, 259)
      ( 1, 324, 324, 260)
      ( 1, 325, 325, 261)
      ( 1, 326, 326, 262)
      ( 1, 327, 327, 263)
      ( 1, 328, 328, 264)
      ( 1, 329, 329, 265)
      ( 1, 330, 330, 266)
      ( 1, 331, 331, 267)
      ( 1, 332, 332, 268)
      ( 1, 333, 333, 269)
      ( 1, 334, 334, 270)
      ( 1, 335, 335, 271)
      ( 1, 336, 336, 272)
      ( 1, 337, 337, 273)
      ( 1, 338, 338, 274)
      ( 1, 339, 339, 275)
      ( 1, 340, 340, 276)
      ( 1, 341, 341, 277)
      ( 1, 342, 342, 278)
      ( 1, 343, 343, 279)
      ( 1, 344, 344, 280)
      ( 1, 345, 345, 281)
      ( 1, 346, 346, 282)
      ( 1, 347, 347, 283)
      ( 1, 348, 348, 284)
      ( 1, 349, 349, 285)
      ( 1, 350, 350, 286)
      ( 1, 351, 351, 287)
      ( 1, 352, 352, 288)
      ( 1, 353, 353, 289)
      ( 1, 354, 354, 290)
      ( 1, 355, 355, 291)
      ( 1, 356, 356, 292)
      ( 1, 357, 357, 293)
      ( 1, 358, 358, 294)
      ( 1, 359, 359, 295)
      ( 1, 360, 360, 296)
      ( 1, 361, 361, 297)
      ( 1, 362, 362, 298)
      ( 1, 363, 363, 299)
      ( 1, 364, 364, 300)
      ( 1, 365, 365, 301)
      ( 1, 366, 366, 302)
      ( 1, 367, 367, 303)
      ( 1, 368, 368, 304)
      ( 1, 369, 369, 305)
      ( 1, 370, 370, 306)
      ( 1, 371, 371, 307)
      ( 1, 372, 372, 308)
      ( 1, 373, 373, 309)
      ( 1, 374, 374, 310)
      ( 1, 375, 375, 311)
      ( 1, 376, 376, 312)
      ( 1, 377, 377, 313)
      ( 1, 378, 378, 314)
      ( 1, 379, 379, 315)
      ( 1, 380, 380, 316)
      ( 1, 381, 381, 317)
      ( 1, 382, 382, 318)
      ( 1, 383, 383, 319)

      ( 1, 384, 384, 320)
      ( 1, 385, 385, 321)
      ( 1, 386, 386, 322)
      ( 1, 387, 387, 323)
      ( 1, 388, 388, 324)
      ( 1, 389, 389, 325)
      ( 1, 390, 390, 326)
      ( 1, 391, 391, 327)
      ( 1, 392, 392, 328)
      ( 1, 393, 393, 329)
      ( 1, 394, 394, 330)
      ( 1, 395, 395, 331)
      ( 1, 396, 396, 332)
      ( 1, 397, 397, 333)
      ( 1, 398, 398, 334)
      ( 1, 399, 399, 335)
      ( 1, 400, 400, 336)
      ( 1, 401, 401, 337)
      ( 1, 402, 402, 338)
      ( 1, 403, 403, 339)
      ( 1, 404, 404, 340)
      ( 1, 405, 405, 341)
      ( 1, 406, 406, 342)
      ( 1, 407, 407, 343)
      ( 1, 408, 408, 344)
      ( 1, 409, 409, 345)
      ( 1, 410, 410, 346)
      ( 1, 411, 411, 347)
      ( 1, 412, 412, 348)
      ( 1, 413, 413, 349)
      ( 1, 414, 414, 350)
      ( 1, 415, 415, 351)
      ( 1, 416, 416, 352)
      ( 1, 417, 417, 353)
      ( 1, 418, 418, 354)
      ( 1, 419, 419, 355)
      ( 1, 420, 420, 356)
      ( 1, 421, 421, 357)
      ( 1, 422, 422, 358)
      ( 1, 423, 423, 359)
      ( 1, 424, 424, 360)
      ( 1, 425, 425, 361)
      ( 1, 426, 426, 362)
      ( 1, 427, 427, 363)
      ( 1, 428, 428, 364)
      ( 1, 429, 429, 365)
      ( 1, 430, 430, 366)
      ( 1, 431, 431, 367)
      ( 1, 432, 432, 368)
      ( 1, 433, 433, 369)
      ( 1, 434, 434, 370)
      ( 1, 435, 435, 371)
      ( 1, 436, 436, 372)
      ( 1, 437, 437, 373)
      ( 1, 438, 438, 374)
      ( 1, 439, 439, 375)
      ( 1, 440, 440, 376)
      ( 1, 441, 441, 377)
      ( 1, 442, 442, 378)
      ( 1, 443, 443, 379)
      ( 1, 444, 444, 380)
      ( 1, 445, 445, 381)
      ( 1, 446, 446, 382)
      ( 1, 447, 447, 383)

      ( 1, 448, 448, 384)
      ( 1, 449, 449, 385)
      ( 1, 450, 450, 386)
      ( 1, 451, 451, 387)
      ( 1, 452, 452, 388)
      ( 1, 453, 453, 389)
      ( 1, 454, 454, 390)
      ( 1, 455, 455, 391)
      ( 1, 456, 456, 392)
      ( 1, 457, 457, 393)
      ( 1, 458, 458, 394)
      ( 1, 459, 459, 395)
      ( 1, 460, 460, 396)
      ( 1, 461, 461, 397)
      ( 1, 462, 462, 398)
      ( 1, 463, 463, 399)
      ( 1, 464, 464, 400)
      ( 1, 465, 465, 401)
      ( 1, 466, 466, 402)
      ( 1, 467, 467, 403)
      ( 1, 468, 468, 404)
      ( 1, 469, 469, 405)
      ( 1, 470, 470, 406)
      ( 1, 471, 471, 407)
      ( 1, 472, 472, 408)
      ( 1, 473, 473, 409)
      ( 1, 474, 474, 410)
      ( 1, 475, 475, 411)
      ( 1, 476, 476, 412)
      ( 1, 477, 477, 413)
      ( 1, 478, 478, 414)
      ( 1, 479, 479, 415)
      ( 1, 480, 480, 416)
      ( 1, 481, 481, 417)
      ( 1, 482, 482, 418)
      ( 1, 483, 483, 419)
      ( 1, 484, 484, 420)
      ( 1, 485, 485, 421)
      ( 1, 486, 486, 422)
      ( 1, 487, 487, 423)
      ( 1, 488, 488, 424)
      ( 1, 489, 489, 425)
      ( 1, 490, 490, 426)
      ( 1, 491, 491, 427)
      ( 1, 492, 492, 428)
      ( 1, 493, 493, 429)
      ( 1, 494, 494, 430)
      ( 1, 495, 495, 431)
      ( 1, 496, 496, 432)
      ( 1, 497, 497, 433)
      ( 1, 498, 498, 434)
      ( 1, 499, 499, 435)
      ( 1, 500, 500, 436)
      ( 1, 501, 501, 437)
      ( 1, 502, 502, 438)
      ( 1, 503, 503, 439)
      ( 1, 504, 504, 440)
      ( 1, 505, 505, 441)
      ( 1, 506, 506, 442)
      ( 1, 507, 507, 443)
      ( 1, 508, 508, 444)
      ( 1, 509, 509, 445)
      ( 1, 510, 510, 446)
      ( 1, 511, 511, 447)

      ( 1, 512, 512, 448)
      ( 1, 513, 513, 449)
      ( 1, 514, 514, 450)
      ( 1, 515, 515, 451)
      ( 1, 516, 516, 452)
      ( 1, 517, 517, 453)
      ( 1, 518, 518, 454)
      ( 1, 519, 519, 455)
      ( 1, 520, 520, 456)
      ( 1, 521, 521, 457)
      ( 1, 522, 522, 458)
      ( 1, 523, 523, 459)
      ( 1, 524, 524, 460)
      ( 1, 525, 525, 461)
      ( 1, 526, 526, 462)
      ( 1, 527, 527, 463)
      ( 1, 528, 528, 464)
      ( 1, 529, 529, 465)
      ( 1, 530, 530, 466)
      ( 1, 531, 531, 467)
      ( 1, 532, 532, 468)
      ( 1, 533, 533, 469)
      ( 1, 534, 534, 470)
      ( 1, 535, 535, 471)
      ( 1, 536, 536, 472)
      ( 1, 537, 537, 473)
      ( 1, 538, 538, 474)
      ( 1, 539, 539, 475)
      ( 1, 540, 540, 476)
      ( 1, 541, 541, 477)
      ( 1, 542, 542, 478)
      ( 1, 543, 543, 479)
      ( 1, 544, 544, 480)
      ( 1, 545, 545, 481)
      ( 1, 546, 546, 482)
      ( 1, 547, 547, 483)
      ( 1, 548, 548, 484)
      ( 1, 549, 549, 485)
      ( 1, 550, 550, 486)
      ( 1, 551, 551, 487)
      ( 1, 552, 552, 488)
      ( 1, 553, 553, 489)
      ( 1, 554, 554, 490)
      ( 1, 555, 555, 491)
      ( 1, 556, 556, 492)
      ( 1, 557, 557, 493)
      ( 1, 558, 558, 494)
      ( 1, 559, 559, 495)
      ( 1, 560, 560, 496)
      ( 1, 561, 561, 497)
      ( 1, 562, 562, 498)
      ( 1, 563, 563, 499)
      ( 1, 564, 564, 500)
      ( 1, 565, 565, 501)
      ( 1, 566, 566, 502)
      ( 1, 567, 567, 503)
      ( 1, 568, 568, 504)
      ( 1, 569, 569, 505)
      ( 1, 570, 570, 506)
      ( 1, 571, 571, 507)
      ( 1, 572, 572, 508)
      ( 1, 573, 573, 509)
      ( 1, 574, 574, 510)
      ( 1, 575, 575, 511)

      ( 1, 576, 576, 512)
      ( 1, 577, 577, 513)
      ( 1, 578, 578, 514)
      ( 1, 579, 579, 515)
      ( 1, 580, 580, 516)
      ( 1, 581, 581, 517)
      ( 1, 582, 582, 518)
      ( 1, 583, 583, 519)
      ( 1, 584, 584, 520)
      ( 1, 585, 585, 521)
      ( 1, 586, 586, 522)
      ( 1, 587, 587, 523)
      ( 1, 588, 588, 524)
      ( 1, 589, 589, 525)
      ( 1, 590, 590, 526)
      ( 1, 591, 591, 527)
      ( 1, 592, 592, 528)
      ( 1, 593, 593, 529)
      ( 1, 594, 594, 530)
      ( 1, 595, 595, 531)
      ( 1, 596, 596, 532)
      ( 1, 597, 597, 533)
      ( 1, 598, 598, 534)
      ( 1, 599, 599, 535)
      ( 1, 600, 600, 536)
      ( 1, 601, 601, 537)
      ( 1, 602, 602, 538)
      ( 1, 603, 603, 539)
      ( 1, 604, 604, 540)
      ( 1, 605, 605, 541)
      ( 1, 606, 606, 542)
      ( 1, 607, 607, 543)
      ( 1, 608, 608, 544)
      ( 1, 609, 609, 545)
      ( 1, 610, 610, 546)
      ( 1, 611, 611, 547)
      ( 1, 612, 612, 548)
      ( 1, 613, 613, 549)
      ( 1, 614, 614, 550)
      ( 1, 615, 615, 551)
      ( 1, 616, 616, 552)
      ( 1, 617, 617, 553)
      ( 1, 618, 618, 554)
      ( 1, 619, 619, 555)
      ( 1, 620, 620, 556)
      ( 1, 621, 621, 557)
      ( 1, 622, 622, 558)
      ( 1, 623, 623, 559)
      ( 1, 624, 624, 560)
      ( 1, 625, 625, 561)
      ( 1, 626, 626, 562)
      ( 1, 627, 627, 563)
      ( 1, 628, 628, 564)
      ( 1, 629, 629, 565)
      ( 1, 630, 630, 566)
      ( 1, 631, 631, 567)
      ( 1, 632, 632, 568)
      ( 1, 633, 633, 569)
      ( 1, 634, 634, 570)
      ( 1, 635, 635, 571)
      ( 1, 636, 636, 572)
      ( 1, 637, 637, 573)
      ( 1, 638, 638, 574)
      ( 1, 639, 639, 575)

      ( 1,   0, 640, 576)
      ( 1,   1, 641, 577)
      ( 1,   2, 642, 578)
      ( 1,   3, 643, 579)
      ( 1,   4, 644, 580)
      ( 1,   5, 645, 581)
      ( 1,   6, 646, 582)
      ( 1,   7, 647, 583)
      ( 1,   8, 648, 584)
      ( 1,   9, 649, 585)
      ( 1,  10, 650, 586)
      ( 1,  11, 651, 587)
      ( 1,  12, 652, 588)
      ( 1,  13, 653, 589)
      ( 1,  14, 654, 590)
      ( 1,  15, 655, 591)
      ( 1,  16, 656, 592)
      ( 1,  17,  17, 593)
      ( 1,  18,  18, 594)
      ( 1,  19,  19, 595)
      ( 1,  20,  20, 596)
      ( 1,  21,  21, 597)
      ( 1,  22,  22, 598)
      ( 1,  23,  23, 599)
      ( 1,  24,  24, 600)
      ( 1,  25,  25, 601)
      ( 1,  26,  26, 602)
      ( 1,  27,  27, 603)
      ( 1,  28,  28, 604)
      ( 1,  29,  29, 605)
      ( 1,  30,  30, 606)
      ( 1,  31,  31, 607)
      ( 1,  32,  32, 608)
      ( 1,  33,  33, 609)
      ( 1,  34,  34, 610)
      ( 1,  35,  35, 611)
      ( 1,  36,  36, 612)
      ( 1,  37,  37, 613)
      ( 1,  38,  38, 614)
      ( 1,  39,  39, 615)
      ( 1,  40,  40, 616)
      ( 1,  41,  41, 617)
      ( 1,  42,  42, 618)
      ( 1,  43,  43, 619)
      ( 1,  44,  44, 620)
      ( 1,  45,  45, 621)
      ( 1,  46,  46, 622)
      ( 1,  47,  47, 623)
      ( 1,  48,  48, 624)
      ( 1,  49,  49, 625)
      ( 1,  50,  50, 626)
      ( 1,  51,  51, 627)
      ( 1,  52,  52, 628)
      ( 1,  53,  53, 629)
      ( 1,  54,  54, 630)
      ( 1,  55,  55, 631)
      ( 1,  56,  56, 632)
      ( 1,  57,  57, 633)
      ( 1,  58,  58, 634)
      ( 1,  59,  59, 635)
      ( 1,  60,  60, 636)
      ( 1,  61,  61, 637)
      ( 1,  62,  62, 638)
      ( 1,  63,  63, 639)

      The y, z, w values make sense according to my expectations, but the x value is not updated as I think it ought to be.  The rows with z = 640 to 656 were produced by threads whose ids indicate that they were part of the second wavefront, yet they did not correctly increment the value (they did, however, correctly set their ids in the shared register).
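
With a dump this long, it may help to pick out the relevant rows mechanically. The following Python sketch assumes the output has been saved as text in the `( x, y, z, w)` format shown above; `parse_rows` and `second_batch_rows` are hypothetical helper names, and the sample holds three rows copied from the listing.

```python
import re

# Matches one "( x, y, z, w)" row of unsigned integers.
ROW = re.compile(r"\(\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)")

def parse_rows(text):
    """Return every (x, y, z, w) row found in the text as a tuple of ints."""
    return [tuple(int(v) for v in m.groups()) for m in ROW.finditer(text)]

def second_batch_rows(rows, slots=640):
    """Rows whose updater id (z) lies beyond the first `slots` threads."""
    return [row for row in rows if row[2] >= slots]

sample = """
( 1,   0, 640, 576)
( 1,  16, 656, 592)
( 1,  17,  17, 593)
"""
rows = parse_rows(sample)
for x, y, z, w in second_batch_rows(rows):
    print(f"fetched by {w}: init {y}, updater {z}, value {x}")
```

Rows selected this way were last written by a thread from the second batch of 640, so a value of 1 in x means the earlier increment is missing.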

      Can anyone shed some light on this?  Are my expectations incorrect?  What am I missing?

      Do I at least deserve an honourable mention for the longest post ever?


        • Shared register not updated as it ought to be?
          MicahVillmow
          lpw,
          You should read the thread 'calculating the bottleneck(threadid=115872)' and pay attention mainly to the discussion of the even & odd wavefronts. This explains the behavior you are seeing. If you have further questions after reading, please post them and I'll try to answer. A quick summary is that the first 640 threads run on even wavefronts and the next 640 threads run on odd wavefronts.
          • Shared register not updated as it ought to be?
            MicahVillmow
            lpw,
            Just went through our documentation. One very important piece of information is left out that will fix your problems. Access to shared registers is only atomic if done in a single instruction.

            For example,
            iadd sr0, sr0, sr1
            is correct, but
            mov r0, sr0
            mov r1, sr1
            iadd r2, r0, r1
            mov sr0, r2
            is incorrect because of the even/odd wavefront issue.
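
To illustrate why a multi-instruction read-modify-write can lose an update, here is a generic Python sketch. It is only a toy model of two wavefronts interleaving on one register, not a description of how the hardware actually schedules even and odd wavefronts; `split_increment` and `single_instruction_increment` are made-up names.

```python
def split_increment(shared, schedule):
    """Replay separate read and write steps against one shared register.

    `schedule` is a list of (wavefront_id, step) pairs. "read" copies the
    shared value into a private temp and adds 1; "write" stores that temp
    back into the shared register.
    """
    private = {}
    for wavefront, step in schedule:
        if step == "read":
            private[wavefront] = shared + 1
        elif step == "write":
            shared = private[wavefront]
        else:
            raise ValueError(f"unknown step: {step}")
    return shared

def single_instruction_increment(shared, wavefronts):
    """Each wavefront's increment happens as one indivisible step."""
    for _ in wavefronts:
        shared += 1
    return shared

# Both wavefronts read 0 before either writes, so one increment is lost.
overlapped = [("even", "read"), ("odd", "read"),
              ("even", "write"), ("odd", "write")]
print(split_increment(0, overlapped))                     # 1
print(single_instruction_increment(0, ["even", "odd"]))   # 2
```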
              • Shared register not updated as it ought to be?
                lpw

                Hi Micah,

                Thanks for the quick reply.  Using the information you provided, I managed to get it working.  My interpretation (perhaps naive) of the forum discussions and documentation was that I should use two shared registers to perform the reduction, a different one for odd and even wavefronts.  Here's what I came up with:


                The init kernel:

                il_cs_2_0
                dcl_num_thread_per_group 64
                dcl_shared_temp sr2
                dcl_literal l0, 0x0, 0x0, 0x0, 0x0
                mov sr0, l0   ; used for odd wavefronts
                mov sr1, l0   ; used for even wavefronts
                end


                The update kernel:

                il_cs_2_0
                dcl_num_thread_per_group 64
                dcl_shared_temp sr2
                dcl_cb cb0[1]
                dcl_literal l0, 0x1, 0x1, 0x1, 0x1
                dcl_literal l1, 0x280, 0x1, 0x0, 0x0
                ult r17, vaTid.x, cb0[0].x
                if_logicalnz r17.x
                    udiv r0.x, vaTid.x, l1.x  ; divide by 640 (numSIMDs * wavefrontSize)
                    iand r0.x, r0.x, l1.y     ; check if odd
                    if_logicalnz r0.x
                        iadd sr0, sr0, l0 ; accumulate odd wavefronts
                    else
                        iadd sr1, sr1, l0 ; accumulate even wavefronts
                    endif
                endif
                end

                The fetch kernel:

                il_cs_2_0
                dcl_num_thread_per_group 64
                dcl_shared_temp sr2
                iadd r0, sr0, sr1      ; add even + odd SRs
                mov g[vaTid.x], r0
                end
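
The selection logic in the update kernel can be modelled on the host to check the arithmetic. This Python sketch treats the 640 lanes as independent slots and assumes thread `tid` lands on slot `tid % 640`; that mapping is an assumption made for illustration, and `selects_sr0` and `simulate` are hypothetical names.

```python
SLOTS = 640  # 0x280, the value in literal l1.x

def selects_sr0(tid, slots=SLOTS):
    """Mirror the udiv + iand pair: True when tid // slots is odd."""
    return ((tid // slots) & 1) == 1

def simulate(num_update_threads, slots=SLOTS):
    """Accumulate into two per-slot counters, then add them as the fetch kernel does."""
    sr0 = [0] * slots  # odd batches
    sr1 = [0] * slots  # even batches
    for tid in range(num_update_threads):
        lane = tid % slots  # ASSUMPTION: simplified thread-to-slot mapping
        if selects_sr0(tid, slots):
            sr0[lane] += 1
        else:
            sr1[lane] += 1
    # Fetch kernel: iadd r0, sr0, sr1
    return [a + b for a, b in zip(sr0, sr1)]

fetched = simulate(1297)
print(sum(fetched))                          # 1297
print(fetched.count(3), fetched.count(2))    # 17 623
```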


                That works, but is this the most efficient way of doing this?  I'm concerned about the udiv instruction and the even/odd branch.

                Thanks again for your help.

              • Shared register not updated as it ought to be?
                MicahVillmow
                Well, you definitely do not want that division or the flow control in your code.
                You can get rid of the outer if statement by making your num_thread_per_group == cb0[0].x but this makes it hardcoded.
                You can get rid of the division/inner if by just writing to sr0 instead of sr0 and sr1.
                The fetch kernel can just have mov g[vaTid.x], sr0
                And then you can add a fourth pass which does the copy from GPU memory to PCIe memory. The copy kernel can also do a reduction by a factor of two to combine the even & odd wavefronts, similar to:
                il_cs_2_0
                dcl_num_thread_per_group 64
                dcl_shared_temp sr2
                iadd g[vaTid.x], g[vaTid.x], g[vaTid.x + 640]
                end

                Another way to solve this is to make your num_thread_per_group equal to 128, which would then make the even and odd wavefronts be consecutive instead of strided by 10.
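
The combining step of the fourth pass can be sketched in Python. This assumes the fetch pass was run with 1280 threads, so that the buffer holds 640 entries from the even wavefronts followed by 640 from the odd ones; that layout and the name `combine_even_odd` are assumptions for illustration.

```python
def combine_even_odd(buffer, stride=640):
    """Add entry i and entry i + stride, halving the buffer length."""
    if len(buffer) != 2 * stride:
        raise ValueError("expected one even half and one odd half")
    return [buffer[i] + buffer[i + stride] for i in range(stride)]

# Example: 657 update threads. The even half counts the first 640 threads,
# the odd half counts the remaining 17.
even_half = [1] * 640
odd_half = [1] * 17 + [0] * 623
combined = combine_even_odd(even_half + odd_half)
print(sum(combined))                           # 657
print(combined.count(2), combined.count(1))    # 17 623
```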
                  • Shared register not updated as it ought to be?
                    lpw

                     

                    Originally posted by: MicahVillmow
                    Well, you definitely do not want that division or the flow control in your code. You can get rid of the division/inner if by just writing to sr0 instead of sr0 and sr1. The fetch kernel can just have mov g[vaTid.x], sr0

                    Maybe my brain is fried, but that's what I had in the original kernels (the versions at the top of my original post), and the results were incorrect.

                  • Shared register not updated as it ought to be?
                    MicahVillmow
                    Well,
                    The problem with your kernels up top is that you weren't combining the results from the even and odd wavefronts, which adding a fourth pass that does a reduction and copy from local to PCIe memory should solve. Of course, this fourth pass is not the most efficient approach; having your third kernel write directly to PCIe space would be more efficient.