I'm working on a CAL program that performs a reduction using globally shared registers. After reading the docs and the forum, I decided to start by implementing the following three-pass algorithm:
1 (init). Run one wavefront per SIMD to initialize the shared registers to 0.
2 (update). Run a bunch of threads that increment the value in a shared register.
3 (fetch). Run one wavefront per SIMD to dump the shared registers to a global buffer.
I'm using calCtxRunProgramGridArray to run the three kernels. Each kernel uses 1 shared register, no LDS, and 64 threads per group. The card is a 4870X2 (kernels run on device 0).
The init kernel looks like this:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_literal l0, 0x0, 0x0, 0x0, 0x0
mov sr0, l0
end
The update kernel looks like this:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_cb cb0[1]
dcl_literal l0, 0x1, 0x1, 0x1, 0x1
; cb0[0].x contains total number of threads in the execution domain
ult r17, vaTid.x, cb0[0].x
if_logicalnz r17.x
iadd sr0, sr0, l0
endif
end
The fetch kernel looks like this:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
mov g[vaTid.x], sr0
end
The init and fetch kernels are executed with 640 threads each (numSIMDs * wavefrontSize). Is this the correct way of launching one wavefront per SIMD? The update kernel can be executed with any number of threads.
The fetch kernel dumps the SRs to a global buffer (640 quad words). I would expect that, if I added the x components of the 640 quad words, they should add up to the number of threads in the update kernel. But this is not the case. It appears that not all SRs are correctly incremented.
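As a sanity check on that expectation, here is a minimal host-side model of the three passes in Python. It is not CAL or IL code. It assumes 10 SIMDs with 64 lanes each (the 640 figure above), one shared-register slot per lane, and round-robin placement of wavefronts onto SIMDs. The round-robin placement and the function name `simulate_three_pass` are assumptions made for illustration; the real scheduler may place wavefronts differently.

```python
NUM_SIMDS = 10        # assumed SIMD count behind the 640 figure
WAVEFRONT_SIZE = 64   # threads per wavefront (and per group)
NUM_SLOTS = NUM_SIMDS * WAVEFRONT_SIZE  # 640 shared-register slots


def simulate_three_pass(num_update_threads):
    """Model init/update/fetch assuming no increment is ever lost."""
    # Pass 1 (init): one wavefront per SIMD zeroes every slot.
    slots = [0] * NUM_SLOTS

    # Pass 2 (update): each thread increments the slot of its own lane.
    # Assumption: wavefront w runs on SIMD (w % NUM_SIMDS).
    for tid in range(num_update_threads):
        wavefront, lane = divmod(tid, WAVEFRONT_SIZE)
        simd = wavefront % NUM_SIMDS
        slots[simd * WAVEFRONT_SIZE + lane] += 1

    # Pass 3 (fetch): one wavefront per SIMD copies the slots out.
    return list(slots)


if __name__ == "__main__":
    for k in (17, 640, 1297):
        out = simulate_three_pass(k)
        print(k, sum(out), max(out))
```

In this idealized model the sum of the fetched x components always equals the number of update threads, which is the property the real kernels are being tested against.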
Following is a set of outputs generated by printing the fetched global buffer. I'm printing only the x components of each quad word (all four components are the same).
If the update kernel is executed with 17 threads, the following values are produced (correct):
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
If the update kernel is executed with 640 threads, the following values are produced (correct):
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In fact, running between 0 and 640 threads seems to always work. There are always k ones in the output, where k is the number of threads in the update kernel. This situation corresponds to at most one wavefront per SIMD.
However, things get wonky when k is greater than 640 (more than one wavefront per SIMD).
For instance, running between 640 and 1280 threads in the update kernel produces the above output of all ones (incorrect, since I would expect some SRs to be incremented to 2). With more than 1280 threads, the registers again appear to get incremented, but some increments are lost. Here's the output for 1297 threads:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I would expect to see something like this instead (each register incremented at least twice, with 17 of them incremented thrice):
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
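The gap between the two dumps can be put into numbers with a short Python sketch. It assumes the update threads are spread evenly over the 640 registers, so only the counts of each value matter and not their positions. The helper names are hypothetical, and the observed counts are read off the 1297-thread dump above (17 twos, 623 ones).

```python
NUM_SLOTS = 640  # numSIMDs * wavefrontSize, as used in the post


def expected_histogram(num_threads, num_slots=NUM_SLOTS):
    """Return {value: number of registers expected to hold that value}."""
    base, extra = divmod(num_threads, num_slots)
    hist = {base: num_slots - extra}
    if extra:
        hist[base + 1] = extra
    return hist


def total(hist):
    """Sum of all register values described by a histogram."""
    return sum(value * count for value, count in hist.items())


expected = expected_histogram(1297)   # {2: 623, 3: 17}
observed = {1: 623, 2: 17}            # counts read off the 1297-thread dump
lost = total(expected) - total(observed)
print(expected, total(observed), lost)
```

The observed values sum to 657 rather than 1297, so exactly 640 increments are missing, which is one full pass over the registers.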
I dug a little deeper, and have generated results that seem to suggest that the shared registers are written correctly but are not read correctly by subsequent wavefronts.
The shared register is now a quadruple which contains the following items:
x: the value (as above)
y: absolute thread id of the thread that initialized it (set by init kernel)
z: absolute thread id of the thread that updated it (set by update kernel)
w: absolute thread id of the thread that fetched it (set by fetch kernel)
The new init kernel:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_literal l0, 0x0, 0x0, 0xffffffff, 0xffffffff
mov sr0.x, l0.x
mov sr0.y, vaTid.x
mov sr0.z, l0.z
mov sr0.w, l0.w
end
The new update kernel (liberally sprinkled with fence_sr's):
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
dcl_cb cb0[1]
dcl_literal l0, 0x1, 0x1, 0x1, 0x1
; cb0[0].x contains total number of threads in the execution domain
ult r17, vaTid.x, cb0[0].x
if_logicalnz r17.x
fence_sr
mov r0, sr0
iadd r0.x, r0.x, l0.x
mov r0.z, vaTid.x
fence_sr
mov sr0, r0
endif
end
The new fetch kernel:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr1
mov r0, sr0
mov r0.w, vaTid.x
mov g[vaTid.x], r0
end
I ran the new kernels with 657 threads (640 + 17), expecting to see 623 ones and 17 twos. All I see, however, are ones. These results are a little lengthy, but the interesting parts are near the bottom (the entries whose z value is 640 or higher).
( 1, 64, 64, 0)
( 1, 65, 65, 1)
( 1, 66, 66, 2)
( 1, 67, 67, 3)
( 1, 68, 68, 4)
( 1, 69, 69, 5)
( 1, 70, 70, 6)
( 1, 71, 71, 7)
( 1, 72, 72, 8)
( 1, 73, 73, 9)
( 1, 74, 74, 10)
( 1, 75, 75, 11)
( 1, 76, 76, 12)
( 1, 77, 77, 13)
( 1, 78, 78, 14)
( 1, 79, 79, 15)
( 1, 80, 80, 16)
( 1, 81, 81, 17)
( 1, 82, 82, 18)
( 1, 83, 83, 19)
( 1, 84, 84, 20)
( 1, 85, 85, 21)
( 1, 86, 86, 22)
( 1, 87, 87, 23)
( 1, 88, 88, 24)
( 1, 89, 89, 25)
( 1, 90, 90, 26)
( 1, 91, 91, 27)
( 1, 92, 92, 28)
( 1, 93, 93, 29)
( 1, 94, 94, 30)
( 1, 95, 95, 31)
( 1, 96, 96, 32)
( 1, 97, 97, 33)
( 1, 98, 98, 34)
( 1, 99, 99, 35)
( 1, 100, 100, 36)
( 1, 101, 101, 37)
( 1, 102, 102, 38)
( 1, 103, 103, 39)
( 1, 104, 104, 40)
( 1, 105, 105, 41)
( 1, 106, 106, 42)
( 1, 107, 107, 43)
( 1, 108, 108, 44)
( 1, 109, 109, 45)
( 1, 110, 110, 46)
( 1, 111, 111, 47)
( 1, 112, 112, 48)
( 1, 113, 113, 49)
( 1, 114, 114, 50)
( 1, 115, 115, 51)
( 1, 116, 116, 52)
( 1, 117, 117, 53)
( 1, 118, 118, 54)
( 1, 119, 119, 55)
( 1, 120, 120, 56)
( 1, 121, 121, 57)
( 1, 122, 122, 58)
( 1, 123, 123, 59)
( 1, 124, 124, 60)
( 1, 125, 125, 61)
( 1, 126, 126, 62)
( 1, 127, 127, 63)
( 1, 128, 128, 64)
( 1, 129, 129, 65)
( 1, 130, 130, 66)
( 1, 131, 131, 67)
( 1, 132, 132, 68)
( 1, 133, 133, 69)
( 1, 134, 134, 70)
( 1, 135, 135, 71)
( 1, 136, 136, 72)
( 1, 137, 137, 73)
( 1, 138, 138, 74)
( 1, 139, 139, 75)
( 1, 140, 140, 76)
( 1, 141, 141, 77)
( 1, 142, 142, 78)
( 1, 143, 143, 79)
( 1, 144, 144, 80)
( 1, 145, 145, 81)
( 1, 146, 146, 82)
( 1, 147, 147, 83)
( 1, 148, 148, 84)
( 1, 149, 149, 85)
( 1, 150, 150, 86)
( 1, 151, 151, 87)
( 1, 152, 152, 88)
( 1, 153, 153, 89)
( 1, 154, 154, 90)
( 1, 155, 155, 91)
( 1, 156, 156, 92)
( 1, 157, 157, 93)
( 1, 158, 158, 94)
( 1, 159, 159, 95)
( 1, 160, 160, 96)
( 1, 161, 161, 97)
( 1, 162, 162, 98)
( 1, 163, 163, 99)
( 1, 164, 164, 100)
( 1, 165, 165, 101)
( 1, 166, 166, 102)
( 1, 167, 167, 103)
( 1, 168, 168, 104)
( 1, 169, 169, 105)
( 1, 170, 170, 106)
( 1, 171, 171, 107)
( 1, 172, 172, 108)
( 1, 173, 173, 109)
( 1, 174, 174, 110)
( 1, 175, 175, 111)
( 1, 176, 176, 112)
( 1, 177, 177, 113)
( 1, 178, 178, 114)
( 1, 179, 179, 115)
( 1, 180, 180, 116)
( 1, 181, 181, 117)
( 1, 182, 182, 118)
( 1, 183, 183, 119)
( 1, 184, 184, 120)
( 1, 185, 185, 121)
( 1, 186, 186, 122)
( 1, 187, 187, 123)
( 1, 188, 188, 124)
( 1, 189, 189, 125)
( 1, 190, 190, 126)
( 1, 191, 191, 127)
( 1, 192, 192, 128)
( 1, 193, 193, 129)
( 1, 194, 194, 130)
( 1, 195, 195, 131)
( 1, 196, 196, 132)
( 1, 197, 197, 133)
( 1, 198, 198, 134)
( 1, 199, 199, 135)
( 1, 200, 200, 136)
( 1, 201, 201, 137)
( 1, 202, 202, 138)
( 1, 203, 203, 139)
( 1, 204, 204, 140)
( 1, 205, 205, 141)
( 1, 206, 206, 142)
( 1, 207, 207, 143)
( 1, 208, 208, 144)
( 1, 209, 209, 145)
( 1, 210, 210, 146)
( 1, 211, 211, 147)
( 1, 212, 212, 148)
( 1, 213, 213, 149)
( 1, 214, 214, 150)
( 1, 215, 215, 151)
( 1, 216, 216, 152)
( 1, 217, 217, 153)
( 1, 218, 218, 154)
( 1, 219, 219, 155)
( 1, 220, 220, 156)
( 1, 221, 221, 157)
( 1, 222, 222, 158)
( 1, 223, 223, 159)
( 1, 224, 224, 160)
( 1, 225, 225, 161)
( 1, 226, 226, 162)
( 1, 227, 227, 163)
( 1, 228, 228, 164)
( 1, 229, 229, 165)
( 1, 230, 230, 166)
( 1, 231, 231, 167)
( 1, 232, 232, 168)
( 1, 233, 233, 169)
( 1, 234, 234, 170)
( 1, 235, 235, 171)
( 1, 236, 236, 172)
( 1, 237, 237, 173)
( 1, 238, 238, 174)
( 1, 239, 239, 175)
( 1, 240, 240, 176)
( 1, 241, 241, 177)
( 1, 242, 242, 178)
( 1, 243, 243, 179)
( 1, 244, 244, 180)
( 1, 245, 245, 181)
( 1, 246, 246, 182)
( 1, 247, 247, 183)
( 1, 248, 248, 184)
( 1, 249, 249, 185)
( 1, 250, 250, 186)
( 1, 251, 251, 187)
( 1, 252, 252, 188)
( 1, 253, 253, 189)
( 1, 254, 254, 190)
( 1, 255, 255, 191)
( 1, 256, 256, 192)
( 1, 257, 257, 193)
( 1, 258, 258, 194)
( 1, 259, 259, 195)
( 1, 260, 260, 196)
( 1, 261, 261, 197)
( 1, 262, 262, 198)
( 1, 263, 263, 199)
( 1, 264, 264, 200)
( 1, 265, 265, 201)
( 1, 266, 266, 202)
( 1, 267, 267, 203)
( 1, 268, 268, 204)
( 1, 269, 269, 205)
( 1, 270, 270, 206)
( 1, 271, 271, 207)
( 1, 272, 272, 208)
( 1, 273, 273, 209)
( 1, 274, 274, 210)
( 1, 275, 275, 211)
( 1, 276, 276, 212)
( 1, 277, 277, 213)
( 1, 278, 278, 214)
( 1, 279, 279, 215)
( 1, 280, 280, 216)
( 1, 281, 281, 217)
( 1, 282, 282, 218)
( 1, 283, 283, 219)
( 1, 284, 284, 220)
( 1, 285, 285, 221)
( 1, 286, 286, 222)
( 1, 287, 287, 223)
( 1, 288, 288, 224)
( 1, 289, 289, 225)
( 1, 290, 290, 226)
( 1, 291, 291, 227)
( 1, 292, 292, 228)
( 1, 293, 293, 229)
( 1, 294, 294, 230)
( 1, 295, 295, 231)
( 1, 296, 296, 232)
( 1, 297, 297, 233)
( 1, 298, 298, 234)
( 1, 299, 299, 235)
( 1, 300, 300, 236)
( 1, 301, 301, 237)
( 1, 302, 302, 238)
( 1, 303, 303, 239)
( 1, 304, 304, 240)
( 1, 305, 305, 241)
( 1, 306, 306, 242)
( 1, 307, 307, 243)
( 1, 308, 308, 244)
( 1, 309, 309, 245)
( 1, 310, 310, 246)
( 1, 311, 311, 247)
( 1, 312, 312, 248)
( 1, 313, 313, 249)
( 1, 314, 314, 250)
( 1, 315, 315, 251)
( 1, 316, 316, 252)
( 1, 317, 317, 253)
( 1, 318, 318, 254)
( 1, 319, 319, 255)
( 1, 320, 320, 256)
( 1, 321, 321, 257)
( 1, 322, 322, 258)
( 1, 323, 323, 259)
( 1, 324, 324, 260)
( 1, 325, 325, 261)
( 1, 326, 326, 262)
( 1, 327, 327, 263)
( 1, 328, 328, 264)
( 1, 329, 329, 265)
( 1, 330, 330, 266)
( 1, 331, 331, 267)
( 1, 332, 332, 268)
( 1, 333, 333, 269)
( 1, 334, 334, 270)
( 1, 335, 335, 271)
( 1, 336, 336, 272)
( 1, 337, 337, 273)
( 1, 338, 338, 274)
( 1, 339, 339, 275)
( 1, 340, 340, 276)
( 1, 341, 341, 277)
( 1, 342, 342, 278)
( 1, 343, 343, 279)
( 1, 344, 344, 280)
( 1, 345, 345, 281)
( 1, 346, 346, 282)
( 1, 347, 347, 283)
( 1, 348, 348, 284)
( 1, 349, 349, 285)
( 1, 350, 350, 286)
( 1, 351, 351, 287)
( 1, 352, 352, 288)
( 1, 353, 353, 289)
( 1, 354, 354, 290)
( 1, 355, 355, 291)
( 1, 356, 356, 292)
( 1, 357, 357, 293)
( 1, 358, 358, 294)
( 1, 359, 359, 295)
( 1, 360, 360, 296)
( 1, 361, 361, 297)
( 1, 362, 362, 298)
( 1, 363, 363, 299)
( 1, 364, 364, 300)
( 1, 365, 365, 301)
( 1, 366, 366, 302)
( 1, 367, 367, 303)
( 1, 368, 368, 304)
( 1, 369, 369, 305)
( 1, 370, 370, 306)
( 1, 371, 371, 307)
( 1, 372, 372, 308)
( 1, 373, 373, 309)
( 1, 374, 374, 310)
( 1, 375, 375, 311)
( 1, 376, 376, 312)
( 1, 377, 377, 313)
( 1, 378, 378, 314)
( 1, 379, 379, 315)
( 1, 380, 380, 316)
( 1, 381, 381, 317)
( 1, 382, 382, 318)
( 1, 383, 383, 319)
( 1, 384, 384, 320)
( 1, 385, 385, 321)
( 1, 386, 386, 322)
( 1, 387, 387, 323)
( 1, 388, 388, 324)
( 1, 389, 389, 325)
( 1, 390, 390, 326)
( 1, 391, 391, 327)
( 1, 392, 392, 328)
( 1, 393, 393, 329)
( 1, 394, 394, 330)
( 1, 395, 395, 331)
( 1, 396, 396, 332)
( 1, 397, 397, 333)
( 1, 398, 398, 334)
( 1, 399, 399, 335)
( 1, 400, 400, 336)
( 1, 401, 401, 337)
( 1, 402, 402, 338)
( 1, 403, 403, 339)
( 1, 404, 404, 340)
( 1, 405, 405, 341)
( 1, 406, 406, 342)
( 1, 407, 407, 343)
( 1, 408, 408, 344)
( 1, 409, 409, 345)
( 1, 410, 410, 346)
( 1, 411, 411, 347)
( 1, 412, 412, 348)
( 1, 413, 413, 349)
( 1, 414, 414, 350)
( 1, 415, 415, 351)
( 1, 416, 416, 352)
( 1, 417, 417, 353)
( 1, 418, 418, 354)
( 1, 419, 419, 355)
( 1, 420, 420, 356)
( 1, 421, 421, 357)
( 1, 422, 422, 358)
( 1, 423, 423, 359)
( 1, 424, 424, 360)
( 1, 425, 425, 361)
( 1, 426, 426, 362)
( 1, 427, 427, 363)
( 1, 428, 428, 364)
( 1, 429, 429, 365)
( 1, 430, 430, 366)
( 1, 431, 431, 367)
( 1, 432, 432, 368)
( 1, 433, 433, 369)
( 1, 434, 434, 370)
( 1, 435, 435, 371)
( 1, 436, 436, 372)
( 1, 437, 437, 373)
( 1, 438, 438, 374)
( 1, 439, 439, 375)
( 1, 440, 440, 376)
( 1, 441, 441, 377)
( 1, 442, 442, 378)
( 1, 443, 443, 379)
( 1, 444, 444, 380)
( 1, 445, 445, 381)
( 1, 446, 446, 382)
( 1, 447, 447, 383)
( 1, 448, 448, 384)
( 1, 449, 449, 385)
( 1, 450, 450, 386)
( 1, 451, 451, 387)
( 1, 452, 452, 388)
( 1, 453, 453, 389)
( 1, 454, 454, 390)
( 1, 455, 455, 391)
( 1, 456, 456, 392)
( 1, 457, 457, 393)
( 1, 458, 458, 394)
( 1, 459, 459, 395)
( 1, 460, 460, 396)
( 1, 461, 461, 397)
( 1, 462, 462, 398)
( 1, 463, 463, 399)
( 1, 464, 464, 400)
( 1, 465, 465, 401)
( 1, 466, 466, 402)
( 1, 467, 467, 403)
( 1, 468, 468, 404)
( 1, 469, 469, 405)
( 1, 470, 470, 406)
( 1, 471, 471, 407)
( 1, 472, 472, 408)
( 1, 473, 473, 409)
( 1, 474, 474, 410)
( 1, 475, 475, 411)
( 1, 476, 476, 412)
( 1, 477, 477, 413)
( 1, 478, 478, 414)
( 1, 479, 479, 415)
( 1, 480, 480, 416)
( 1, 481, 481, 417)
( 1, 482, 482, 418)
( 1, 483, 483, 419)
( 1, 484, 484, 420)
( 1, 485, 485, 421)
( 1, 486, 486, 422)
( 1, 487, 487, 423)
( 1, 488, 488, 424)
( 1, 489, 489, 425)
( 1, 490, 490, 426)
( 1, 491, 491, 427)
( 1, 492, 492, 428)
( 1, 493, 493, 429)
( 1, 494, 494, 430)
( 1, 495, 495, 431)
( 1, 496, 496, 432)
( 1, 497, 497, 433)
( 1, 498, 498, 434)
( 1, 499, 499, 435)
( 1, 500, 500, 436)
( 1, 501, 501, 437)
( 1, 502, 502, 438)
( 1, 503, 503, 439)
( 1, 504, 504, 440)
( 1, 505, 505, 441)
( 1, 506, 506, 442)
( 1, 507, 507, 443)
( 1, 508, 508, 444)
( 1, 509, 509, 445)
( 1, 510, 510, 446)
( 1, 511, 511, 447)
( 1, 512, 512, 448)
( 1, 513, 513, 449)
( 1, 514, 514, 450)
( 1, 515, 515, 451)
( 1, 516, 516, 452)
( 1, 517, 517, 453)
( 1, 518, 518, 454)
( 1, 519, 519, 455)
( 1, 520, 520, 456)
( 1, 521, 521, 457)
( 1, 522, 522, 458)
( 1, 523, 523, 459)
( 1, 524, 524, 460)
( 1, 525, 525, 461)
( 1, 526, 526, 462)
( 1, 527, 527, 463)
( 1, 528, 528, 464)
( 1, 529, 529, 465)
( 1, 530, 530, 466)
( 1, 531, 531, 467)
( 1, 532, 532, 468)
( 1, 533, 533, 469)
( 1, 534, 534, 470)
( 1, 535, 535, 471)
( 1, 536, 536, 472)
( 1, 537, 537, 473)
( 1, 538, 538, 474)
( 1, 539, 539, 475)
( 1, 540, 540, 476)
( 1, 541, 541, 477)
( 1, 542, 542, 478)
( 1, 543, 543, 479)
( 1, 544, 544, 480)
( 1, 545, 545, 481)
( 1, 546, 546, 482)
( 1, 547, 547, 483)
( 1, 548, 548, 484)
( 1, 549, 549, 485)
( 1, 550, 550, 486)
( 1, 551, 551, 487)
( 1, 552, 552, 488)
( 1, 553, 553, 489)
( 1, 554, 554, 490)
( 1, 555, 555, 491)
( 1, 556, 556, 492)
( 1, 557, 557, 493)
( 1, 558, 558, 494)
( 1, 559, 559, 495)
( 1, 560, 560, 496)
( 1, 561, 561, 497)
( 1, 562, 562, 498)
( 1, 563, 563, 499)
( 1, 564, 564, 500)
( 1, 565, 565, 501)
( 1, 566, 566, 502)
( 1, 567, 567, 503)
( 1, 568, 568, 504)
( 1, 569, 569, 505)
( 1, 570, 570, 506)
( 1, 571, 571, 507)
( 1, 572, 572, 508)
( 1, 573, 573, 509)
( 1, 574, 574, 510)
( 1, 575, 575, 511)
( 1, 576, 576, 512)
( 1, 577, 577, 513)
( 1, 578, 578, 514)
( 1, 579, 579, 515)
( 1, 580, 580, 516)
( 1, 581, 581, 517)
( 1, 582, 582, 518)
( 1, 583, 583, 519)
( 1, 584, 584, 520)
( 1, 585, 585, 521)
( 1, 586, 586, 522)
( 1, 587, 587, 523)
( 1, 588, 588, 524)
( 1, 589, 589, 525)
( 1, 590, 590, 526)
( 1, 591, 591, 527)
( 1, 592, 592, 528)
( 1, 593, 593, 529)
( 1, 594, 594, 530)
( 1, 595, 595, 531)
( 1, 596, 596, 532)
( 1, 597, 597, 533)
( 1, 598, 598, 534)
( 1, 599, 599, 535)
( 1, 600, 600, 536)
( 1, 601, 601, 537)
( 1, 602, 602, 538)
( 1, 603, 603, 539)
( 1, 604, 604, 540)
( 1, 605, 605, 541)
( 1, 606, 606, 542)
( 1, 607, 607, 543)
( 1, 608, 608, 544)
( 1, 609, 609, 545)
( 1, 610, 610, 546)
( 1, 611, 611, 547)
( 1, 612, 612, 548)
( 1, 613, 613, 549)
( 1, 614, 614, 550)
( 1, 615, 615, 551)
( 1, 616, 616, 552)
( 1, 617, 617, 553)
( 1, 618, 618, 554)
( 1, 619, 619, 555)
( 1, 620, 620, 556)
( 1, 621, 621, 557)
( 1, 622, 622, 558)
( 1, 623, 623, 559)
( 1, 624, 624, 560)
( 1, 625, 625, 561)
( 1, 626, 626, 562)
( 1, 627, 627, 563)
( 1, 628, 628, 564)
( 1, 629, 629, 565)
( 1, 630, 630, 566)
( 1, 631, 631, 567)
( 1, 632, 632, 568)
( 1, 633, 633, 569)
( 1, 634, 634, 570)
( 1, 635, 635, 571)
( 1, 636, 636, 572)
( 1, 637, 637, 573)
( 1, 638, 638, 574)
( 1, 639, 639, 575)
( 1, 0, 640, 576)
( 1, 1, 641, 577)
( 1, 2, 642, 578)
( 1, 3, 643, 579)
( 1, 4, 644, 580)
( 1, 5, 645, 581)
( 1, 6, 646, 582)
( 1, 7, 647, 583)
( 1, 8, 648, 584)
( 1, 9, 649, 585)
( 1, 10, 650, 586)
( 1, 11, 651, 587)
( 1, 12, 652, 588)
( 1, 13, 653, 589)
( 1, 14, 654, 590)
( 1, 15, 655, 591)
( 1, 16, 656, 592)
( 1, 17, 17, 593)
( 1, 18, 18, 594)
( 1, 19, 19, 595)
( 1, 20, 20, 596)
( 1, 21, 21, 597)
( 1, 22, 22, 598)
( 1, 23, 23, 599)
( 1, 24, 24, 600)
( 1, 25, 25, 601)
( 1, 26, 26, 602)
( 1, 27, 27, 603)
( 1, 28, 28, 604)
( 1, 29, 29, 605)
( 1, 30, 30, 606)
( 1, 31, 31, 607)
( 1, 32, 32, 608)
( 1, 33, 33, 609)
( 1, 34, 34, 610)
( 1, 35, 35, 611)
( 1, 36, 36, 612)
( 1, 37, 37, 613)
( 1, 38, 38, 614)
( 1, 39, 39, 615)
( 1, 40, 40, 616)
( 1, 41, 41, 617)
( 1, 42, 42, 618)
( 1, 43, 43, 619)
( 1, 44, 44, 620)
( 1, 45, 45, 621)
( 1, 46, 46, 622)
( 1, 47, 47, 623)
( 1, 48, 48, 624)
( 1, 49, 49, 625)
( 1, 50, 50, 626)
( 1, 51, 51, 627)
( 1, 52, 52, 628)
( 1, 53, 53, 629)
( 1, 54, 54, 630)
( 1, 55, 55, 631)
( 1, 56, 56, 632)
( 1, 57, 57, 633)
( 1, 58, 58, 634)
( 1, 59, 59, 635)
( 1, 60, 60, 636)
( 1, 61, 61, 637)
( 1, 62, 62, 638)
( 1, 63, 63, 639)
The y, z, and w values match my expectations, but the x value is not updated as I think it ought to be. The highlighted results (the entries with z from 640 to 656) were produced by threads whose ids indicate that they were part of the second wavefront, yet they did not correctly increment the value (they did, however, correctly set their ids in the shared register).
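The same check can be done mechanically over the fetched tuples. The Python sketch below flags any (x, y, z, w) record where the last updating thread id z implies a later pass over the registers, but x is too small to reflect it. It assumes each batch of 640 thread ids makes one pass over the registers; the function name is hypothetical, and the sample records are copied from the dump above.

```python
NUM_SLOTS = 640  # numSIMDs * wavefrontSize


def find_lost_increments(records, num_slots=NUM_SLOTS):
    """Return records whose count x is too low for the thread id in z."""
    suspicious = []
    for x, y, z, w in records:
        # If the last writer had id z, at least (z // num_slots + 1)
        # threads should have incremented this register (assumption:
        # one pass over all registers per num_slots thread ids).
        expected_min = z // num_slots + 1
        if x < expected_min:
            suspicious.append((x, y, z, w))
    return suspicious


sample = [
    (1, 639, 639, 575),  # updated by a first-pass thread only
    (1, 0, 640, 576),    # last updated by thread 640, x still 1
    (1, 16, 656, 592),   # last updated by thread 656, x still 1
    (1, 17, 17, 593),    # updated by a first-pass thread only
]
print(find_lost_increments(sample))
```

Only the two records with z of 640 and 656 are flagged, matching the 17 entries near the bottom of the dump.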
Can anyone shed some light on this? Are my expectations incorrect? What am I missing?
Do I at least deserve an honourable mention for the longest post ever?
hello
<never mind>
Hi Micah,
Thanks for the quick reply. Using the information you provided, I managed to get it working. My interpretation (perhaps naive) of the forum discussions and documentation was that I should use two shared registers to perform the reduction, a different one for odd and even wavefronts. Here's what I came up with:
The init kernel:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
dcl_literal l0, 0x0, 0x0, 0x0, 0x0
mov sr0, l0 ; used for odd wavefronts
mov sr1, l0 ; used for even wavefronts
end
The update kernel:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
dcl_cb cb0[1]
dcl_literal l0, 0x1, 0x1, 0x1, 0x1
dcl_literal l1, 0x280, 0x1, 0x0, 0x0
ult r17, vaTid.x, cb0[0].x
if_logicalnz r17.x
udiv r0.x, vaTid.x, l1.x ; divide by 640 (numSIMDs * wavefrontSize)
iand r0.x, r0.x, l1.y ; check if odd
if_logicalnz r0.x
iadd sr0, sr0, l0 ; accumulate odd wavefronts
else
iadd sr1, sr1, l0 ; accumulate even wavefronts
endif
endif
end
The fetch kernel:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_shared_temp sr2
iadd r0, sr0, sr1 ; add even + odd SRs
mov g[vaTid.x], r0
end
That works, but is this the most efficient way of doing this? I'm concerned about the udiv instruction and the even/odd branch.
Thanks again for your help.
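For reference, the register selection done by the udiv/iand pair in the update kernel above can be mirrored on the host in Python. This is only a model of the arithmetic, not IL, and the function name is hypothetical. Note that dividing by 0x280 (640) alternates registers per batch of 640 thread ids rather than per individual 64-thread wavefront.

```python
SLOTS_PER_PASS = 0x280  # 640, the l1.x literal in the update kernel


def select_register(va_tid):
    """Mirror udiv + iand: which shared register a thread accumulates into."""
    batch = va_tid // SLOTS_PER_PASS   # udiv r0.x, vaTid.x, l1.x
    is_odd = batch & 1                 # iand r0.x, r0.x, l1.y
    return "sr0" if is_odd else "sr1"  # if_logicalnz -> sr0, else -> sr1


if __name__ == "__main__":
    for tid in (0, 639, 640, 1279, 1280):
        print(tid, select_register(tid))
```

Thread ids 0 to 639 land in sr1, 640 to 1279 in sr0, and 1280 onward return to sr1, so consecutive passes over the registers never touch the same shared register.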
Maybe my brain is fried, but that's what I had in the original kernels (the versions at the top of my original post) but the results were incorrect.
Originally posted by: MicahVillmow
Well, you definitely do not want that division or the flow control in your code. You can get rid of the division/inner if by just writing to sr0 instead of sr0 and sr1. The fetch kernel can just have mov g[vaTid.x], sr0
Dear lpw,
Could you please post the host code showing how you managed to run calCtxRunProgramGridArray to launch one CS kernel after another? I am confused by the known limitations...