Hello!
I've been trying to write GCN ISA assembly code by hand and I just can't get the "DS_" instructions to work.
The docs said that the address shouldn't be the same in all threads, because it causes conflicts.
I tried to load the global id into the address register so there is no conflict, but it didn't work.
I also tried to initialize m0, as the doc says.
Here is what I tried:
; Initially v2 contains the global id
v_mov_b32 v6, v2
v_mul_i32_i24 v6, 4, v6
v_mov_b32 v7, 99
v_mov_b32 v8, 0
s_mov_b32 m0, 0xFFFFFFFF
; ds_write's operands: (vdst) (addr) (data0) (data1)
; v5 is just a placeholder, it shouldn't be used I think
v_mov_b32 v5, 0
ds_write_b32 v5, v6, v7, v7
ds_read_b32 v8, v6, v5, v5
I tried many variations of the above code, but in the end v8 always remains 0.
Does anybody know what I'm doing wrong?
Thanks in advance!
Hi,
Using the same DS address in all threads causes a bank conflict. That isn't an error, it's just slow. There are 32 banks, selected by the lowest bits of the dword offset.
After a ds_write you have to s_waitcnt expcnt to make sure the registers it used are free again. If the ds_write can't complete immediately, it still holds its data in those registers.
After a ds_read, use s_waitcnt lgkmcnt! It waits until the requested data is in the destination register.
I'm not sure about the params; maybe you should disassemble the binary and check whether they are OK. (ds_read/ds_write need only 2 params.)
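To make the bank mapping concrete, here is a minimal Python sketch. It is not hardware-accurate; it only assumes what is stated above, namely 32 banks selected by the lowest bits of the dword offset. The names bank_of and banks_touched, and the 32-lane window, are made up for illustration.

```python
NUM_BANKS = 32  # per the post above: 32 banks

def bank_of(byte_addr):
    """Bank index for a byte address: the low bits of the dword offset."""
    return (byte_addr // 4) % NUM_BANKS

def banks_touched(stride_bytes, base=0, lanes=32):
    """Set of banks hit when lane i accesses base + i * stride_bytes."""
    return {bank_of(base + lane * stride_bytes) for lane in range(lanes)}

print(len(banks_touched(4)))              # 32: address = id * 4, one bank per lane
print(len(banks_touched(0)))              # 1:  same address in every lane
print(len(banks_touched(32, base=1024)))  # 4:  a 32-byte stride reuses 4 banks
```

Under this model a 4-byte stride spreads 32 lanes over all 32 banks, while a larger power-of-two stride folds them onto a few banks.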
Hello!
Thanks for your reply! After I posted I realized I need to wait after the read, but I didn't know I also have to wait after write.
The params are different because in the assembler I use I have to pass all operands that the DS format takes (vdst, addr, data0, data1). It's not that flexible yet.
Unfortunately it still doesn't work. Here is the changed assembly:
; Initially v2 contains the global id
v_mov_b32 v6, v2
v_mul_i32_i24 v6, 32, v6
v_add_i32 v6, vcc, 1024, v6
v_mov_b32 v7, 99
v_mov_b32 v4, 1
s_mov_b32 m0, 0xFFFFFFFF
; ds_write's operands: (vdst) (addr) (data0) (data1)
; v5 is just a placeholder, it shouldn't be used I think
v_mov_b32 v5, 0
ds_write_b32 v5, v6, v7, v5
s_waitcnt expcnt(0)
ds_read_b32 v4, v6, v5, v5
s_waitcnt lgkmcnt(0)
v_mov_b32 v0, v4
I attached the full ISA file too. Thanks for the help!
Regards,
Daniel
Kick your code around until you see this parameter order in the disassembled isa:
//puts 'value' into LDS, then reads it back into 'result'. 'addr' contains get_local_id*4
v_lshlrev_b32 addr, 2, lid
s_mov_b32 m0, $FFFF
ds_write_b32 addr, value
ds_read_b32 result, addr offset:4 //read from a different location
s_waitcnt lgkmcnt(0)
This works well.
If the 'value' vector contains (0,1,2,3,4,5,....)
Then the corresponding 'result' vector will be (1,2,3,4,5,6,...)
(At every 63rd lane there will be garbage.)
And don't forget to declare LDS size!
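The shifted read-back above can be modelled in a few lines of Python. This is only a sketch of the addressing, assuming a 64-lane wavefront and treating LDS as a plain dictionary; the function name write_then_read_shifted is hypothetical, and "garbage" is represented as None because the model has no uninitialized memory.

```python
WAVE_SIZE = 64  # assumed wavefront size

def write_then_read_shifted(values, offset=4):
    """Each lane writes values[lane] at lane*4, then reads lane*4 + offset."""
    lds = {}  # byte address -> dword; addresses never written are absent
    for lane, value in enumerate(values):
        lds[lane * 4] = value                      # ds_write_b32 addr, value
    # ds_read_b32 result, addr offset:4
    return [lds.get(lane * 4 + offset) for lane in range(len(values))]

result = write_then_read_shifted(list(range(WAVE_SIZE)))
print(result[:4], result[-1])  # [1, 2, 3, 4] None
```

The last lane reads an address that no lane wrote, which is where the garbage value comes from.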
Thanks!
I think the last issue I have is not setting the LDS size. Here is the assembly now:
; User code starts here
; v2 == global_id
; Set value
#define value v3
v_mov_b32 value, 99
; Set address
#define addr v4
#define lid v5
v_mov_b32 lid, v2
v_lshlrev_b32 addr, 2, lid
; Set m0
s_mov_b32 m0, 0xFFFF
; LDS write/read
#define NULL v6
#define result v7
v_mov_b32 v6, 0
ds_write_b32 NULL, addr, value, NULL
ds_read_b32 result, addr, NULL, NULL offset0:4
s_waitcnt lgkmcnt(0)
v_mov_b32 v0, result
About the LDS size: is it 0 by default? I can't find which byte corresponds to it in the ATI CAL comment section of the ELF. I tried looking in your code too, but I couldn't understand this part:
//set prog3 notes:
with AOptions do begin
SetCalNote($80001041,numvgprs);
SetCalNote($80001042,numsgprs);
{ SetCalNote($8000001C,NumThreadPerGroup.x);
SetCalNote($8000001D,NumThreadPerGroup.y);
SetCalNote($8000001E,NumThreadPerGroup.z); not needed because of __attribute__((reqd_work_group_size}
SetCalNote($80000082,ldsSizeBytes);
//compute_pgm_rsrc2
SetCalNote($00002e13,(ldsSizeBytes+255)shr 8 shl 15,$FFF07FFF{and mask}); //lds size {256byte granularity}
SetCalNote($00002e13,1 shl 7,$FFFFFF7F); //tgid_x_en=1
end;
Could you tell me where the LDS size is declared? I found the VGPR and SGPR numbers, but the LDS is harder to spot just by looking at values.
Thanks again for your help, I really appreciate it. Also, I'm really sad that there seems to be no documentation about this.
(There was this thread: http://devgurus.amd.com/thread/166955, which contains a PDF, but ctrl+f-ing for "LDS" doesn't yield any results)
Regards,
Daniel
Here is some more info: http://www.multi2sim.org/svn/multi2sim/trunk/src/arch/southern-islands/asm/bin-file.h
SetCalNote($80001041,numvgprs);... These calls alter the note section. Its format is described in the ELF Format pdf.
SetCalNote($00002e13,(ldsSizeBytes+255)shr 8 shl 15,$FFF07FFF{and mask}); <- this updates the ldsSize field in the COMPUTE_PGM_RSRC2 structure.
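As a rough illustration of that arithmetic, here is the same computation in Python. The shift amounts and masks are copied from the SetCalNote calls quoted above; how SetCalNote combines the old value with the new one is an assumption (keep the bits selected by the mask, then OR in the new value), and merge_field and lds_size_field are hypothetical helper names.

```python
LDS_SIZE_KEEP_MASK = 0xFFF07FFF   # from the snippet: clears bits 15..19
TGID_X_EN_KEEP_MASK = 0xFFFFFF7F  # from the snippet: clears bit 7

def merge_field(old, value, keep_mask):
    """Assumed SetCalNote behaviour: keep masked bits of old, OR in value."""
    return ((old & keep_mask) | value) & 0xFFFFFFFF

def lds_size_field(lds_size_bytes):
    """Round the size up to 256-byte granules and move it to bit 15."""
    return ((lds_size_bytes + 255) >> 8) << 15

rsrc2 = 0
rsrc2 = merge_field(rsrc2, lds_size_field(1024), LDS_SIZE_KEEP_MASK)
rsrc2 = merge_field(rsrc2, 1 << 7, TGID_X_EN_KEEP_MASK)  # tgid_x_en = 1
print(hex(rsrc2))  # 0x20080
```

So 1024 bytes of LDS becomes 4 granules stored at bit 15, and any size from 1 to 256 bytes rounds up to a single granule.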
The LDS size and the register counts are always defined; there are no default values for them.
For testing you can use the GDS option with the ds_ instructions, because the full GDS is always 'allocated'.
You don't need any waits for LDS writes if your workgroup size is <= wavefront size. If your workgroup size is > wavefront size, then you need to wait for the LDS op to complete, and you also need a barrier to make sure no wavefronts in the workgroup move forward until all wavefronts in the workgroup have completed their LDS writes. But, yes, you need to specify the LDS size, otherwise your operations will be clamped/discarded.
Also, you don't need to wait on exports for LDS writes; that only applies to global memory. Just use "lgkmcnt(desired_cnt)".
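The rule above can be written down as a tiny decision helper. This is only a sketch of the rule as stated in this post, assuming a 64-lane wavefront; sync_after_lds_write is a hypothetical name, and the returned strings are just labels for the wait and barrier instructions, not output from any real tool.

```python
def sync_after_lds_write(workgroup_size, wavefront_size=64):
    """Synchronization needed after LDS writes, per the rule in this post."""
    if workgroup_size <= wavefront_size:
        # A single wavefront: no wait is needed after the write.
        return []
    # Several wavefronts: wait for the LDS op, then barrier the workgroup.
    return ["s_waitcnt lgkmcnt(0)", "s_barrier"]

print(sync_after_lds_write(64))   # []
print(sync_after_lds_write(256))  # ['s_waitcnt lgkmcnt(0)', 's_barrier']
```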
Thank you for clarifying!
I did a test that proved that the ds_write (LDS) instruction picks out the address and data values immediately from the regs, and it's no problem if I overwrite them.
But with gds_write it is a must that you don't touch the registers until expcnt.
v_lshlrev_b32 addr, 2, gid
s_mov_b32 m0, $FFFF
ds_write_b32 addr, value gds
s_waitcnt expcnt(0) //<--------- if this is not here
v_mov_b32 value,1234 //then this will alter the gds_write
ds_read_b32 result, addr gds
s_waitcnt lgkmcnt(0)
uavWrite(1,gid,result)
How come that they are (lds and gds) working differently? I thought they are almost the same hardware elements.
> I did a test that proved that the ds_write (LDS) instruction picks out the address and data values immediately from the regs, and it's no problem if I overwrite them.
> But with gds_write it is a must that you don't touch the registers until expcnt.
This is a good point, especially about the address registers, because it was not obvious (to me) that the address is being 'exported'.
But it is.
> How come that they are (lds and gds) working differently? I thought they are almost the same hardware elements.
I think anything that goes out of the compute unit must go through the export unit, which does not immediately read the instruction's data and address registers.