I'm trying to get the thread index in a compute shader (CS) kernel and write it to the global buffer at the address given by that index, as follows:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
mov g[vaTid.x], vAbsTidFlat.x
ret_dyn
end

Any ideas how to do this properly?
I'm doing something similar but I move the thread ID to a register first and use that register to index into the global buffer. I don't remember if it was necessary to move Tid to a register or I did it only because I actually have to multiply the Tid to get the proper global buffer offset.
Well, that's not working for me either. The following code only produces zero. I'm pretty stumped.
const char * HILKernel =
    "il_cs_2_0\n"
    "dcl_num_thread_per_group 64\n"
    "mov r7.z, vaTid.x\n"
    "mov g[vaTid.x], r7.zzzz\n"
    "ret_dyn\n"
    "end\n";

Also, this doesn't work:

const char * HILKernel =
    "il_cs_2_0\n"
    "dcl_num_thread_per_group 64\n"
    "mov r7.z, vaTid.x\n"
    "mov g[r7.z], r7.zzzz\n"
    "ret_dyn\n"
    "end\n";
Originally posted by: lipi I'm doing something similar but I move the thread ID to a register first and use that register to index into the global buffer. I don't remember if it was necessary to move Tid to a register or I did it only because I actually have to multiply the Tid to get the proper global buffer offset.
Can you post your kernel or at least the important parts... I have no idea why I can't simply get the thread ID and put it into the global buffer.
So this seems to work, in case anyone is interested. I'd be interested to hear from AMD why this is the case.
const char * HILKernel =
    "il_cs_2_0\n"
    "dcl_num_thread_per_group 64\n"
    "itof r7.z, vAbsTidFlat.x\n"
    "mov g[vAbsTidFlat.x], r7.zzzz\n"
    "ret_dyn\n"
    "end\n";
I'm using vTid currently but I remember using vaTid at one point and it worked. The relevant parts of my code are in the attached code.
Have you checked the disassembly to see if something got optimized away? I had to add fences to keep global memory access within the loops.
il_cs_2_0
dcl_num_thread_per_blk 64

;;; r1 -- per-thread constants ;;;
;;; x: thread ID
;;; y: 2 x thread ID
mov r1.x, vTid0.x
ishl r1.y, r1.x, l0.x

whileloop
    whileloop
        fence_memory ; keep RD_SCATTER in the loop
        mov r2, g[r1.y] ; RD_SCATTER
        ;; code omitted
        break_logicalz r2.w
    endloop
    ;; code omitted
    mov g[r1.y].y, r2.y
    iadd r4.x, r1.y, l0.x
    mov g[r4.x], r3
    fence_memory
endloop
end
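
For anyone reading the addressing in that kernel, here is a small CPU-side sketch in C of how the two global buffer slots per thread appear to be derived. The declaration of l0 is not shown in the post, so this assumes l0.x is 1, which matches the "2 x thread ID" comment (a left shift by 1). The names first_slot and second_slot are made up for illustration and are not part of the kernel or of any AMD API.

```c
#include <stdint.h>

/* Assumption: l0.x == 1, so `ishl r1.y, r1.x, l0.x` doubles the thread ID
   and `iadd r4.x, r1.y, l0.x` adds 1 to that result. */
enum { L0_X = 1 };

/* Mirrors `ishl r1.y, r1.x, l0.x`: index of the thread's first slot. */
static uint32_t first_slot(uint32_t tid) {
    return tid << L0_X;
}

/* Mirrors `iadd r4.x, r1.y, l0.x`: index of the thread's second slot. */
static uint32_t second_slot(uint32_t tid) {
    return first_slot(tid) + L0_X;
}
```

Under that assumption, thread t reads and writes g[2t] and writes g[2t + 1], so no two threads touch the same slot.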
lipi,
Thanks for your replies, I pretty much got it working (at least for 64x1 block size). I think I just forgot that the output needed to be a float and that the addressing needed to be ints.
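
A possible explanation for the zeros seen earlier, sketched on the CPU in C: if the integer thread ID is copied bit-for-bit into a float-typed buffer (a plain mov), small IDs become tiny subnormal floats that print as 0.000000, whereas a numeric conversion (what itof does) gives the expected value. This only models the bit patterns on the host; it assumes 32-bit IEEE 754 floats and says nothing definitive about how the hardware treats the value. The helper names are invented for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copies the raw bits of an integer into a float, which is roughly what a
   plain `mov` of an integer thread ID into a float buffer amounts to. */
static float bits_as_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Numeric int-to-float conversion, the CPU counterpart of IL `itof`. */
static float int_to_float(uint32_t value) {
    return (float)value;
}

/* Prints both interpretations for thread IDs 0 .. count-1. */
static void show_thread_ids(uint32_t count) {
    for (uint32_t tid = 0; tid < count; ++tid) {
        printf("tid %u: raw bits as float = %f, converted = %f\n",
               (unsigned)tid, bits_as_float(tid), int_to_float(tid));
    }
}
```

Calling show_thread_ids(64) shows the raw-bits column printing as 0.000000 for every ID in a 64-thread group, while the converted column counts up from 0 to 63.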