lpw
Journeyman III

Shared register not updated as it ought to be?

I'm working on a CAL program that performs a reduction using globally shared registers.  After reading the docs and the forum, I decided to start by implementing the following three-pass algorithm:
1 (init).  Run one wavefront per SIMD to initialize the shared registers to 0.
2 (update).  Run a bunch of threads that increment the value in a shared register.
3 (fetch).  Run one wavefront per SIMD to dump the shared registers to a global buffer.

I'm using calCtxRunProgramGridArray to run the three kernels.  Each kernel uses 1 shared register, no LDS, and 64 threads per group.  The card is a 4870X2 (kernels run on device 0).

The init kernel looks like this:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_literal l0, 0x0, 0x0, 0x0, 0x0
mov sr0, l0
end

The update kernel looks like this:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_cb cb0[1]
dcl_literal l0, 0x1, 0x1, 0x1, 0x1
; cb0[0].x contains total number of threads in the execution domain
ult r17, vaTid.x, cb0[0].x
if_logicalnz r17.x
    iadd sr0, sr0, l0
endif
end

The fetch kernel looks like this:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
mov g[vaTid.x], sr0
end

The init and fetch kernels are executed with 640 threads each (numSIMDs * wavefrontSize).  Is this the correct way of launching one wavefront per SIMD?  The update kernel can be executed with any number of threads.

The fetch kernel dumps the SRs to a global buffer (640 quad words).  I would expect that, if I added the x components of the 640 quad words, they should add up to the number of threads in the update kernel.  But this is not the case.  It appears that not all SRs are correctly incremented.
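A host-side sanity check for this expectation could look like the Python sketch below. It assumes the fetched buffer has already been copied into a list of 640 (x, y, z, w) tuples; the function name `check_reduction` and that buffer layout are illustrative assumptions, not part of CAL.

```python
def check_reduction(quads, num_update_threads):
    """Return (total, ok): the sum of the x components and whether it
    matches the number of threads launched in the update kernel."""
    total = sum(q[0] for q in quads)
    return total, total == num_update_threads

# Hypothetical data mirroring the 17-thread run: 17 registers hold 1, the rest 0.
quads_17 = [(1, 1, 1, 1)] * 17 + [(0, 0, 0, 0)] * 623
print(check_reduction(quads_17, 17))

# Hypothetical data mirroring the 1297-thread run: 17 twos and 623 ones.
quads_1297 = [(2, 2, 2, 2)] * 17 + [(1, 1, 1, 1)] * 623
print(check_reduction(quads_1297, 1297))
```

Applied to the 1297-thread output shown later in this post, the x components sum to 657 rather than 1297, which is the mismatch being asked about.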

Following is a set of outputs generated by printing the fetched global buffer.  I'm printing only the x components of each quad word (all four components are the same).

If the update kernel is executed with 17 threads, the following values are produced (correct):

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If the update kernel is executed with 640 threads, the following values are produced (correct):

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

In fact, running between 0 and 640 threads seems to always work.  There are always k ones in the output, where k is the number of threads in the update kernel.  This situation corresponds to at most one wavefront per SIMD.

However, things get wonky when k is greater than 640 (more than one wavefront per SIMD).

For instance, running between 640 and 1280 threads in the update kernel produces the above output of all ones (incorrect, since I would expect some SRs to be incremented to 2).  With more than 1280 threads, the registers again appear to get incremented, but some increments are lost.  Here's the output for 1297 threads:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

I would expect to see something like this instead (each register incremented at least twice, with 17 of them incremented thrice):

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
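The expected values above are a division with remainder: 1297 threads spread over 640 registers. A minimal Python sketch of that arithmetic follows, assuming (as the post does) that every increment lands and that increments are spread evenly over the registers. The name `expected_counts` is hypothetical, and the sketch arbitrarily gives the extra increments to the first registers, whereas on the device their position depends on how wavefronts map to SIMDs.

```python
REGISTERS = 640  # numSIMDs * wavefrontSize, as stated in the post

def expected_counts(num_threads, registers=REGISTERS):
    """Expected value of each register if every thread's increment landed:
    every register gets num_threads // registers, and
    num_threads % registers of them get one more."""
    base, extra = divmod(num_threads, registers)
    return [base + 1 if i < extra else base for i in range(registers)]

counts = expected_counts(1297)
print(counts.count(3), counts.count(2), sum(counts))  # 17 623 1297
```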

I dug a little deeper, and have generated results that seem to suggest that the shared registers are written correctly but are not read correctly by subsequent wavefronts.

The shared register is now a quadruple which contains the following items:
x: the value (as above)
y: absolute thread id of the thread that initialized it (set by init kernel)
z: absolute thread id of the thread that updated it (set by update kernel)
w: absolute thread id of the thread that fetched it (set by fetch kernel)

The new init kernel:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_literal l0, 0x0, 0x0, 0xffffffff, 0xffffffff
mov sr0.x, l0.x
mov sr0.y, vaTid.x
mov sr0.z, l0.z
mov sr0.w, l0.w
end

The new update kernel (liberally sprinkled with fence_sr's):

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_cb cb0[1]
dcl_literal l0, 0x1, 0x1, 0x1, 0x1
; cb0[0].x contains total number of threads in the execution domain
ult r17, vaTid.x, cb0[0].x
if_logicalnz r17.x
    fence_sr
    mov r0, sr0
    iadd r0.x, r0.x, l0.x
    mov r0.z, vaTid.x
    fence_sr
    mov sr0, r0
endif
end

The new fetch kernel:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
mov r0, sr0
mov r0.w, vaTid.x
mov g[vaTid.x], r0
end


I ran the new kernels with 657 threads (640 + 17), expecting to see 623 ones and 17 twos.  All I see, however, are ones.  These results are a little lengthy, but the interesting part is near the bottom: the first 17 entries of the last block, where z runs from 640 to 656.

( 1,  64,  64,   0)
( 1,  65,  65,   1)
( 1,  66,  66,   2)
( 1,  67,  67,   3)
( 1,  68,  68,   4)
( 1,  69,  69,   5)
( 1,  70,  70,   6)
( 1,  71,  71,   7)
( 1,  72,  72,   8)
( 1,  73,  73,   9)
( 1,  74,  74,  10)
( 1,  75,  75,  11)
( 1,  76,  76,  12)
( 1,  77,  77,  13)
( 1,  78,  78,  14)
( 1,  79,  79,  15)
( 1,  80,  80,  16)
( 1,  81,  81,  17)
( 1,  82,  82,  18)
( 1,  83,  83,  19)
( 1,  84,  84,  20)
( 1,  85,  85,  21)
( 1,  86,  86,  22)
( 1,  87,  87,  23)
( 1,  88,  88,  24)
( 1,  89,  89,  25)
( 1,  90,  90,  26)
( 1,  91,  91,  27)
( 1,  92,  92,  28)
( 1,  93,  93,  29)
( 1,  94,  94,  30)
( 1,  95,  95,  31)
( 1,  96,  96,  32)
( 1,  97,  97,  33)
( 1,  98,  98,  34)
( 1,  99,  99,  35)
( 1, 100, 100,  36)
( 1, 101, 101,  37)
( 1, 102, 102,  38)
( 1, 103, 103,  39)
( 1, 104, 104,  40)
( 1, 105, 105,  41)
( 1, 106, 106,  42)
( 1, 107, 107,  43)
( 1, 108, 108,  44)
( 1, 109, 109,  45)
( 1, 110, 110,  46)
( 1, 111, 111,  47)
( 1, 112, 112,  48)
( 1, 113, 113,  49)
( 1, 114, 114,  50)
( 1, 115, 115,  51)
( 1, 116, 116,  52)
( 1, 117, 117,  53)
( 1, 118, 118,  54)
( 1, 119, 119,  55)
( 1, 120, 120,  56)
( 1, 121, 121,  57)
( 1, 122, 122,  58)
( 1, 123, 123,  59)
( 1, 124, 124,  60)
( 1, 125, 125,  61)
( 1, 126, 126,  62)
( 1, 127, 127,  63)

( 1, 128, 128,  64)
( 1, 129, 129,  65)
( 1, 130, 130,  66)
( 1, 131, 131,  67)
( 1, 132, 132,  68)
( 1, 133, 133,  69)
( 1, 134, 134,  70)
( 1, 135, 135,  71)
( 1, 136, 136,  72)
( 1, 137, 137,  73)
( 1, 138, 138,  74)
( 1, 139, 139,  75)
( 1, 140, 140,  76)
( 1, 141, 141,  77)
( 1, 142, 142,  78)
( 1, 143, 143,  79)
( 1, 144, 144,  80)
( 1, 145, 145,  81)
( 1, 146, 146,  82)
( 1, 147, 147,  83)
( 1, 148, 148,  84)
( 1, 149, 149,  85)
( 1, 150, 150,  86)
( 1, 151, 151,  87)
( 1, 152, 152,  88)
( 1, 153, 153,  89)
( 1, 154, 154,  90)
( 1, 155, 155,  91)
( 1, 156, 156,  92)
( 1, 157, 157,  93)
( 1, 158, 158,  94)
( 1, 159, 159,  95)
( 1, 160, 160,  96)
( 1, 161, 161,  97)
( 1, 162, 162,  98)
( 1, 163, 163,  99)
( 1, 164, 164, 100)
( 1, 165, 165, 101)
( 1, 166, 166, 102)
( 1, 167, 167, 103)
( 1, 168, 168, 104)
( 1, 169, 169, 105)
( 1, 170, 170, 106)
( 1, 171, 171, 107)
( 1, 172, 172, 108)
( 1, 173, 173, 109)
( 1, 174, 174, 110)
( 1, 175, 175, 111)
( 1, 176, 176, 112)
( 1, 177, 177, 113)
( 1, 178, 178, 114)
( 1, 179, 179, 115)
( 1, 180, 180, 116)
( 1, 181, 181, 117)
( 1, 182, 182, 118)
( 1, 183, 183, 119)
( 1, 184, 184, 120)
( 1, 185, 185, 121)
( 1, 186, 186, 122)
( 1, 187, 187, 123)
( 1, 188, 188, 124)
( 1, 189, 189, 125)
( 1, 190, 190, 126)
( 1, 191, 191, 127)

( 1, 192, 192, 128)
( 1, 193, 193, 129)
( 1, 194, 194, 130)
( 1, 195, 195, 131)
( 1, 196, 196, 132)
( 1, 197, 197, 133)
( 1, 198, 198, 134)
( 1, 199, 199, 135)
( 1, 200, 200, 136)
( 1, 201, 201, 137)
( 1, 202, 202, 138)
( 1, 203, 203, 139)
( 1, 204, 204, 140)
( 1, 205, 205, 141)
( 1, 206, 206, 142)
( 1, 207, 207, 143)
( 1, 208, 208, 144)
( 1, 209, 209, 145)
( 1, 210, 210, 146)
( 1, 211, 211, 147)
( 1, 212, 212, 148)
( 1, 213, 213, 149)
( 1, 214, 214, 150)
( 1, 215, 215, 151)
( 1, 216, 216, 152)
( 1, 217, 217, 153)
( 1, 218, 218, 154)
( 1, 219, 219, 155)
( 1, 220, 220, 156)
( 1, 221, 221, 157)
( 1, 222, 222, 158)
( 1, 223, 223, 159)
( 1, 224, 224, 160)
( 1, 225, 225, 161)
( 1, 226, 226, 162)
( 1, 227, 227, 163)
( 1, 228, 228, 164)
( 1, 229, 229, 165)
( 1, 230, 230, 166)
( 1, 231, 231, 167)
( 1, 232, 232, 168)
( 1, 233, 233, 169)
( 1, 234, 234, 170)
( 1, 235, 235, 171)
( 1, 236, 236, 172)
( 1, 237, 237, 173)
( 1, 238, 238, 174)
( 1, 239, 239, 175)
( 1, 240, 240, 176)
( 1, 241, 241, 177)
( 1, 242, 242, 178)
( 1, 243, 243, 179)
( 1, 244, 244, 180)
( 1, 245, 245, 181)
( 1, 246, 246, 182)
( 1, 247, 247, 183)
( 1, 248, 248, 184)
( 1, 249, 249, 185)
( 1, 250, 250, 186)
( 1, 251, 251, 187)
( 1, 252, 252, 188)
( 1, 253, 253, 189)
( 1, 254, 254, 190)
( 1, 255, 255, 191)

( 1, 256, 256, 192)
( 1, 257, 257, 193)
( 1, 258, 258, 194)
( 1, 259, 259, 195)
( 1, 260, 260, 196)
( 1, 261, 261, 197)
( 1, 262, 262, 198)
( 1, 263, 263, 199)
( 1, 264, 264, 200)
( 1, 265, 265, 201)
( 1, 266, 266, 202)
( 1, 267, 267, 203)
( 1, 268, 268, 204)
( 1, 269, 269, 205)
( 1, 270, 270, 206)
( 1, 271, 271, 207)
( 1, 272, 272, 208)
( 1, 273, 273, 209)
( 1, 274, 274, 210)
( 1, 275, 275, 211)
( 1, 276, 276, 212)
( 1, 277, 277, 213)
( 1, 278, 278, 214)
( 1, 279, 279, 215)
( 1, 280, 280, 216)
( 1, 281, 281, 217)
( 1, 282, 282, 218)
( 1, 283, 283, 219)
( 1, 284, 284, 220)
( 1, 285, 285, 221)
( 1, 286, 286, 222)
( 1, 287, 287, 223)
( 1, 288, 288, 224)
( 1, 289, 289, 225)
( 1, 290, 290, 226)
( 1, 291, 291, 227)
( 1, 292, 292, 228)
( 1, 293, 293, 229)
( 1, 294, 294, 230)
( 1, 295, 295, 231)
( 1, 296, 296, 232)
( 1, 297, 297, 233)
( 1, 298, 298, 234)
( 1, 299, 299, 235)
( 1, 300, 300, 236)
( 1, 301, 301, 237)
( 1, 302, 302, 238)
( 1, 303, 303, 239)
( 1, 304, 304, 240)
( 1, 305, 305, 241)
( 1, 306, 306, 242)
( 1, 307, 307, 243)
( 1, 308, 308, 244)
( 1, 309, 309, 245)
( 1, 310, 310, 246)
( 1, 311, 311, 247)
( 1, 312, 312, 248)
( 1, 313, 313, 249)
( 1, 314, 314, 250)
( 1, 315, 315, 251)
( 1, 316, 316, 252)
( 1, 317, 317, 253)
( 1, 318, 318, 254)
( 1, 319, 319, 255)

( 1, 320, 320, 256)
( 1, 321, 321, 257)
( 1, 322, 322, 258)
( 1, 323, 323, 259)
( 1, 324, 324, 260)
( 1, 325, 325, 261)
( 1, 326, 326, 262)
( 1, 327, 327, 263)
( 1, 328, 328, 264)
( 1, 329, 329, 265)
( 1, 330, 330, 266)
( 1, 331, 331, 267)
( 1, 332, 332, 268)
( 1, 333, 333, 269)
( 1, 334, 334, 270)
( 1, 335, 335, 271)
( 1, 336, 336, 272)
( 1, 337, 337, 273)
( 1, 338, 338, 274)
( 1, 339, 339, 275)
( 1, 340, 340, 276)
( 1, 341, 341, 277)
( 1, 342, 342, 278)
( 1, 343, 343, 279)
( 1, 344, 344, 280)
( 1, 345, 345, 281)
( 1, 346, 346, 282)
( 1, 347, 347, 283)
( 1, 348, 348, 284)
( 1, 349, 349, 285)
( 1, 350, 350, 286)
( 1, 351, 351, 287)
( 1, 352, 352, 288)
( 1, 353, 353, 289)
( 1, 354, 354, 290)
( 1, 355, 355, 291)
( 1, 356, 356, 292)
( 1, 357, 357, 293)
( 1, 358, 358, 294)
( 1, 359, 359, 295)
( 1, 360, 360, 296)
( 1, 361, 361, 297)
( 1, 362, 362, 298)
( 1, 363, 363, 299)
( 1, 364, 364, 300)
( 1, 365, 365, 301)
( 1, 366, 366, 302)
( 1, 367, 367, 303)
( 1, 368, 368, 304)
( 1, 369, 369, 305)
( 1, 370, 370, 306)
( 1, 371, 371, 307)
( 1, 372, 372, 308)
( 1, 373, 373, 309)
( 1, 374, 374, 310)
( 1, 375, 375, 311)
( 1, 376, 376, 312)
( 1, 377, 377, 313)
( 1, 378, 378, 314)
( 1, 379, 379, 315)
( 1, 380, 380, 316)
( 1, 381, 381, 317)
( 1, 382, 382, 318)
( 1, 383, 383, 319)

( 1, 384, 384, 320)
( 1, 385, 385, 321)
( 1, 386, 386, 322)
( 1, 387, 387, 323)
( 1, 388, 388, 324)
( 1, 389, 389, 325)
( 1, 390, 390, 326)
( 1, 391, 391, 327)
( 1, 392, 392, 328)
( 1, 393, 393, 329)
( 1, 394, 394, 330)
( 1, 395, 395, 331)
( 1, 396, 396, 332)
( 1, 397, 397, 333)
( 1, 398, 398, 334)
( 1, 399, 399, 335)
( 1, 400, 400, 336)
( 1, 401, 401, 337)
( 1, 402, 402, 338)
( 1, 403, 403, 339)
( 1, 404, 404, 340)
( 1, 405, 405, 341)
( 1, 406, 406, 342)
( 1, 407, 407, 343)
( 1, 408, 408, 344)
( 1, 409, 409, 345)
( 1, 410, 410, 346)
( 1, 411, 411, 347)
( 1, 412, 412, 348)
( 1, 413, 413, 349)
( 1, 414, 414, 350)
( 1, 415, 415, 351)
( 1, 416, 416, 352)
( 1, 417, 417, 353)
( 1, 418, 418, 354)
( 1, 419, 419, 355)
( 1, 420, 420, 356)
( 1, 421, 421, 357)
( 1, 422, 422, 358)
( 1, 423, 423, 359)
( 1, 424, 424, 360)
( 1, 425, 425, 361)
( 1, 426, 426, 362)
( 1, 427, 427, 363)
( 1, 428, 428, 364)
( 1, 429, 429, 365)
( 1, 430, 430, 366)
( 1, 431, 431, 367)
( 1, 432, 432, 368)
( 1, 433, 433, 369)
( 1, 434, 434, 370)
( 1, 435, 435, 371)
( 1, 436, 436, 372)
( 1, 437, 437, 373)
( 1, 438, 438, 374)
( 1, 439, 439, 375)
( 1, 440, 440, 376)
( 1, 441, 441, 377)
( 1, 442, 442, 378)
( 1, 443, 443, 379)
( 1, 444, 444, 380)
( 1, 445, 445, 381)
( 1, 446, 446, 382)
( 1, 447, 447, 383)

( 1, 448, 448, 384)
( 1, 449, 449, 385)
( 1, 450, 450, 386)
( 1, 451, 451, 387)
( 1, 452, 452, 388)
( 1, 453, 453, 389)
( 1, 454, 454, 390)
( 1, 455, 455, 391)
( 1, 456, 456, 392)
( 1, 457, 457, 393)
( 1, 458, 458, 394)
( 1, 459, 459, 395)
( 1, 460, 460, 396)
( 1, 461, 461, 397)
( 1, 462, 462, 398)
( 1, 463, 463, 399)
( 1, 464, 464, 400)
( 1, 465, 465, 401)
( 1, 466, 466, 402)
( 1, 467, 467, 403)
( 1, 468, 468, 404)
( 1, 469, 469, 405)
( 1, 470, 470, 406)
( 1, 471, 471, 407)
( 1, 472, 472, 408)
( 1, 473, 473, 409)
( 1, 474, 474, 410)
( 1, 475, 475, 411)
( 1, 476, 476, 412)
( 1, 477, 477, 413)
( 1, 478, 478, 414)
( 1, 479, 479, 415)
( 1, 480, 480, 416)
( 1, 481, 481, 417)
( 1, 482, 482, 418)
( 1, 483, 483, 419)
( 1, 484, 484, 420)
( 1, 485, 485, 421)
( 1, 486, 486, 422)
( 1, 487, 487, 423)
( 1, 488, 488, 424)
( 1, 489, 489, 425)
( 1, 490, 490, 426)
( 1, 491, 491, 427)
( 1, 492, 492, 428)
( 1, 493, 493, 429)
( 1, 494, 494, 430)
( 1, 495, 495, 431)
( 1, 496, 496, 432)
( 1, 497, 497, 433)
( 1, 498, 498, 434)
( 1, 499, 499, 435)
( 1, 500, 500, 436)
( 1, 501, 501, 437)
( 1, 502, 502, 438)
( 1, 503, 503, 439)
( 1, 504, 504, 440)
( 1, 505, 505, 441)
( 1, 506, 506, 442)
( 1, 507, 507, 443)
( 1, 508, 508, 444)
( 1, 509, 509, 445)
( 1, 510, 510, 446)
( 1, 511, 511, 447)

( 1, 512, 512, 448)
( 1, 513, 513, 449)
( 1, 514, 514, 450)
( 1, 515, 515, 451)
( 1, 516, 516, 452)
( 1, 517, 517, 453)
( 1, 518, 518, 454)
( 1, 519, 519, 455)
( 1, 520, 520, 456)
( 1, 521, 521, 457)
( 1, 522, 522, 458)
( 1, 523, 523, 459)
( 1, 524, 524, 460)
( 1, 525, 525, 461)
( 1, 526, 526, 462)
( 1, 527, 527, 463)
( 1, 528, 528, 464)
( 1, 529, 529, 465)
( 1, 530, 530, 466)
( 1, 531, 531, 467)
( 1, 532, 532, 468)
( 1, 533, 533, 469)
( 1, 534, 534, 470)
( 1, 535, 535, 471)
( 1, 536, 536, 472)
( 1, 537, 537, 473)
( 1, 538, 538, 474)
( 1, 539, 539, 475)
( 1, 540, 540, 476)
( 1, 541, 541, 477)
( 1, 542, 542, 478)
( 1, 543, 543, 479)
( 1, 544, 544, 480)
( 1, 545, 545, 481)
( 1, 546, 546, 482)
( 1, 547, 547, 483)
( 1, 548, 548, 484)
( 1, 549, 549, 485)
( 1, 550, 550, 486)
( 1, 551, 551, 487)
( 1, 552, 552, 488)
( 1, 553, 553, 489)
( 1, 554, 554, 490)
( 1, 555, 555, 491)
( 1, 556, 556, 492)
( 1, 557, 557, 493)
( 1, 558, 558, 494)
( 1, 559, 559, 495)
( 1, 560, 560, 496)
( 1, 561, 561, 497)
( 1, 562, 562, 498)
( 1, 563, 563, 499)
( 1, 564, 564, 500)
( 1, 565, 565, 501)
( 1, 566, 566, 502)
( 1, 567, 567, 503)
( 1, 568, 568, 504)
( 1, 569, 569, 505)
( 1, 570, 570, 506)
( 1, 571, 571, 507)
( 1, 572, 572, 508)
( 1, 573, 573, 509)
( 1, 574, 574, 510)
( 1, 575, 575, 511)

( 1, 576, 576, 512)
( 1, 577, 577, 513)
( 1, 578, 578, 514)
( 1, 579, 579, 515)
( 1, 580, 580, 516)
( 1, 581, 581, 517)
( 1, 582, 582, 518)
( 1, 583, 583, 519)
( 1, 584, 584, 520)
( 1, 585, 585, 521)
( 1, 586, 586, 522)
( 1, 587, 587, 523)
( 1, 588, 588, 524)
( 1, 589, 589, 525)
( 1, 590, 590, 526)
( 1, 591, 591, 527)
( 1, 592, 592, 528)
( 1, 593, 593, 529)
( 1, 594, 594, 530)
( 1, 595, 595, 531)
( 1, 596, 596, 532)
( 1, 597, 597, 533)
( 1, 598, 598, 534)
( 1, 599, 599, 535)
( 1, 600, 600, 536)
( 1, 601, 601, 537)
( 1, 602, 602, 538)
( 1, 603, 603, 539)
( 1, 604, 604, 540)
( 1, 605, 605, 541)
( 1, 606, 606, 542)
( 1, 607, 607, 543)
( 1, 608, 608, 544)
( 1, 609, 609, 545)
( 1, 610, 610, 546)
( 1, 611, 611, 547)
( 1, 612, 612, 548)
( 1, 613, 613, 549)
( 1, 614, 614, 550)
( 1, 615, 615, 551)
( 1, 616, 616, 552)
( 1, 617, 617, 553)
( 1, 618, 618, 554)
( 1, 619, 619, 555)
( 1, 620, 620, 556)
( 1, 621, 621, 557)
( 1, 622, 622, 558)
( 1, 623, 623, 559)
( 1, 624, 624, 560)
( 1, 625, 625, 561)
( 1, 626, 626, 562)
( 1, 627, 627, 563)
( 1, 628, 628, 564)
( 1, 629, 629, 565)
( 1, 630, 630, 566)
( 1, 631, 631, 567)
( 1, 632, 632, 568)
( 1, 633, 633, 569)
( 1, 634, 634, 570)
( 1, 635, 635, 571)
( 1, 636, 636, 572)
( 1, 637, 637, 573)
( 1, 638, 638, 574)
( 1, 639, 639, 575)

( 1,   0, 640, 576)
( 1,   1, 641, 577)
( 1,   2, 642, 578)
( 1,   3, 643, 579)
( 1,   4, 644, 580)
( 1,   5, 645, 581)
( 1,   6, 646, 582)
( 1,   7, 647, 583)
( 1,   8, 648, 584)
( 1,   9, 649, 585)
( 1,  10, 650, 586)
( 1,  11, 651, 587)
( 1,  12, 652, 588)
( 1,  13, 653, 589)
( 1,  14, 654, 590)
( 1,  15, 655, 591)
( 1,  16, 656, 592)
( 1,  17,  17, 593)
( 1,  18,  18, 594)
( 1,  19,  19, 595)
( 1,  20,  20, 596)
( 1,  21,  21, 597)
( 1,  22,  22, 598)
( 1,  23,  23, 599)
( 1,  24,  24, 600)
( 1,  25,  25, 601)
( 1,  26,  26, 602)
( 1,  27,  27, 603)
( 1,  28,  28, 604)
( 1,  29,  29, 605)
( 1,  30,  30, 606)
( 1,  31,  31, 607)
( 1,  32,  32, 608)
( 1,  33,  33, 609)
( 1,  34,  34, 610)
( 1,  35,  35, 611)
( 1,  36,  36, 612)
( 1,  37,  37, 613)
( 1,  38,  38, 614)
( 1,  39,  39, 615)
( 1,  40,  40, 616)
( 1,  41,  41, 617)
( 1,  42,  42, 618)
( 1,  43,  43, 619)
( 1,  44,  44, 620)
( 1,  45,  45, 621)
( 1,  46,  46, 622)
( 1,  47,  47, 623)
( 1,  48,  48, 624)
( 1,  49,  49, 625)
( 1,  50,  50, 626)
( 1,  51,  51, 627)
( 1,  52,  52, 628)
( 1,  53,  53, 629)
( 1,  54,  54, 630)
( 1,  55,  55, 631)
( 1,  56,  56, 632)
( 1,  57,  57, 633)
( 1,  58,  58, 634)
( 1,  59,  59, 635)
( 1,  60,  60, 636)
( 1,  61,  61, 637)
( 1,  62,  62, 638)
( 1,  63,  63, 639)

The y, z, w values make sense according to my expectations, but the x value is not updated as I think it ought to be.  The highlighted results (the entries with z between 640 and 656) were produced by threads whose ids indicate that they were part of the second wavefront, yet they did not correctly increment the value (they did, however, correctly set their ids in the shared register).

Can anyone shed some light on this?  Are my expectations incorrect?  What am I missing?

Do I at least deserve an honourable mention for the longest post ever?


lpw,
You should read the thread 'calculating the bottleneck(threadid=115872)' and pay attention mainly to the discussion of the even & odd wavefronts. It explains the behavior you are seeing. If you have further questions after reading, please post them and I'll try to answer. In short, the first 640 threads run on even wavefronts and the next 640 threads run on odd wavefronts.

lpw,
Just went through our documentation. One very important piece of information is left out that will fix your problems. Access to shared registers is only atomic if done in a single instruction.

i.e.
iadd sr0, sr0, sr1 is correct
but
mov r0, sr0
mov r1, sr1
iadd r2, r0, r1
mov sr0, r2 is incorrect because of the even/odd wavefront issue.
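
The difference between the two forms can be illustrated with a toy Python model of a lost update. This is only a sketch of interleaved read-modify-write sequences in general, not a description of how the hardware schedules even and odd wavefronts; the function `run`, the schedule format, and the thread ids 0 and 640 are hypothetical and chosen to echo the output in the original post.

```python
def run(schedule):
    """Replay a read-modify-write schedule against one shared register.

    schedule is a list of (thread_id, step) pairs, where step is "read"
    or "write". The register is modelled as a dict with a counter x and
    the id z of the last writer, like sr0.x and sr0.z in the post."""
    shared = {"x": 0, "z": None}
    private = {}  # per-thread copy, standing in for r0
    for tid, step in schedule:
        if step == "read":
            private[tid] = shared["x"]
        else:  # "write": store the incremented private copy
            shared = {"x": private[tid] + 1, "z": tid}
    return shared

# Back-to-back updates: both increments survive.
print(run([(0, "read"), (0, "write"), (640, "read"), (640, "write")]))
# Interleaved updates: the second read sees the old value, so one increment is lost.
print(run([(0, "read"), (640, "read"), (0, "write"), (640, "write")]))
```

In the interleaved case the register ends with x = 1 and z = 640, the same pattern as the entries with z between 640 and 656 in the original post: the later thread's id is recorded, but its increment is not added on top of the earlier one.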

<never mind>


Hi Micah,

Thanks for the quick reply.  Using the information you provided, I managed to get it working.  My interpretation (perhaps naive) of the forum discussions and documentation was that I should use two shared registers to perform the reduction, a different one for odd and even wavefronts.  Here's what I came up with:


The init kernel:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
dcl_literal l0, 0x0, 0x0, 0x0, 0x0
mov sr0, l0   ; used for odd wavefronts
mov sr1, l0   ; used for even wavefronts
end


The update kernel:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
dcl_cb cb0[1]
dcl_literal l0, 0x1, 0x1, 0x1, 0x1
dcl_literal l1, 0x280, 0x1, 0x0, 0x0
ult r17, vaTid.x, cb0[0].x
if_logicalnz r17.x
    udiv r0.x, vaTid.x, l1.x  ; divide by 640 (numSIMDs * wavefrontSize)
    iand r0.x, r0.x, l1.y     ; check if odd
    if_logicalnz r0.x
        iadd sr0, sr0, l0 ; accumulate odd wavefronts
    else
        iadd sr1, sr1, l0 ; accumulate even wavefronts
    endif
endif
end

The fetch kernel:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
iadd r0, sr0, sr1      ; add even + odd SRs
mov g[vaTid.x], r0
end


That works, but is this the most efficient way of doing this?  I'm concerned about the udiv instruction and the even/odd branch.
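
For reference, the register selection that the udiv and iand instructions perform in the update kernel above can be written out on the host as the following Python sketch. The function name `register_for` is hypothetical; the constant 0x280 and the odd/even mapping are taken from the kernel.

```python
THREADS_PER_PASS = 0x280  # 640, the literal l1.x in the update kernel

def register_for(tid):
    """Mirror udiv + iand from the update kernel: pick sr0 when
    tid // 640 is odd, sr1 when it is even."""
    return "sr0" if (tid // THREADS_PER_PASS) & 1 else "sr1"

for tid in (0, 639, 640, 1279, 1280):
    print(tid, register_for(tid))
```

Threads 0 to 639 select sr1, threads 640 to 1279 select sr0, and thread 1280 selects sr1 again, so the choice only changes once every 640 thread ids.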

Thanks again for your help.


Well, you definitely do not want that division or the flow control in your code.
You can get rid of the outer if statement by making your num_thread_per_group == cb0[0].x but this makes it hardcoded.
You can get rid of the division/inner if by just writing to sr0 instead of sr0 and sr1.
The fetch kernel can just have mov g[vaTid.x], sr0
And then you can add a fourth pass which does the copy from GPU memory to PCIe memory; the copy kernel can also do a reduction by a factor of two to combine the even & odd wavefronts, similar to the following:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
iadd g[vaTid.x], g[vaTid.x], g[vaTid.x + 640]
end
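
The effect of that combining pass can be sketched on the host in Python. This is a sketch under an assumption the post does not state explicitly: that the fetch pass wrote 1280 entries, with the odd-wavefront values 640 entries after the even-wavefront ones. The name `combine_halves` and the sample data are hypothetical.

```python
def combine_halves(g, stride=640):
    """Add entry i + stride to entry i for the first `stride` entries,
    as in iadd g[vaTid.x], g[vaTid.x], g[vaTid.x + 640]."""
    return [g[i] + g[i + stride] for i in range(stride)]

# Hypothetical 1280-entry dump: first half from even wavefronts, second from odd.
g = [2] * 640 + [1] * 640
combined = combine_halves(g)
print(len(combined), sum(combined))
```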

Another way to solve this is to make your num_thread_per_group equal to 128, which would then make the even and odd wavefronts be consecutive instead of strided by 10.

Originally posted by: MicahVillmow
Well, you definitely do not want that division or the flow control in your code. You can get rid of the division/inner if by just writing to sr0 instead of sr0 and sr1. The fetch kernel can just have mov g[vaTid.x], sr0

Maybe my brain is fried, but that's what I had in the original kernels (the versions at the top of my original post) but the results were incorrect.


Well,
The problem with your kernels up top is that you weren't combining the results from the even and odd wavefronts, which adding a fourth pass that does a reduction and a copy from local to PCIe memory should solve. Of course, this fourth pass is not the most efficient option; having your third kernel write directly to PCIe space would be more efficient.

Dear lpw,

Could you please post the host code showing how you managed to use calCtxRunProgramGridArray to start one cs kernel after another? I am confused by the known limitations...
