balidani
Adept I

DS_WRITE GCN instruction

Hello!

I've been trying to write GCN ISA assembly code by hand and I just can't get the "DS_" instructions to work.

The docs said that the address shouldn't be the same in all threads, because it causes conflicts.

I tried to load the global id into the address register so there is no conflict, but it didn't work.

I also tried to initialize m0, as the doc says.

Here is what I tried:

; Initially v2 contains the global id

v_mov_b32       v6, v2

v_mul_i32_i24   v6, 4, v6

v_mov_b32       v7, 99

v_mov_b32       v8, 0

s_mov_b32       m0, 0xFFFFFFFF


; ds_write's operands: (vdst) (addr) (data0) (data1)

; v5 is just a placeholder, it shouldn't be used I think

v_mov_b32       v5, 0

ds_write_b32    v5, v6, v7, v7

ds_read_b32     v8, v6, v5, v5

I tried many variations of the above code, but in the end v8 always remains 0.

Does anybody know what I'm doing wrong?

Thanks in advance!

0 Likes
8 Replies
realhet
Miniboss

Hi,

Same DS address in all threads -> it's a bank conflict. It's not wrong, just slow. There are 32 banks, selected by the lowest bits of the dword offset.
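
The bank mapping described above can be sketched in a few lines of Python. This is only an illustrative model that assumes 32 banks, each one dword wide, selected by the low bits of the dword offset; the names `lds_bank` and `banks_touched` are made up for this sketch and it does not model how the hardware actually schedules accesses.

```python
# Illustrative model only: 32 banks, one dword (4 bytes) wide each, selected by
# the lowest bits of the dword offset, as described in the reply above.
NUM_BANKS = 32
BYTES_PER_DWORD = 4


def lds_bank(byte_addr):
    """Return the (assumed) bank index for a byte address in LDS."""
    return (byte_addr // BYTES_PER_DWORD) % NUM_BANKS


def banks_touched(byte_addrs):
    """Count how many distinct banks a group of lanes touches."""
    return len({lds_bank(a) for a in byte_addrs})


# addr = lane_id * 4: 32 consecutive lanes land in 32 different banks.
stride4 = [lane * 4 for lane in range(32)]
# addr = lane_id * 128: every lane lands in bank 0.
stride128 = [lane * 128 for lane in range(32)]

print(banks_touched(stride4))    # 32
print(banks_touched(stride128))  # 1
```

With a 4-byte stride (address = id * 4) the lanes spread evenly over the banks, while a 128-byte stride puts every lane in the same bank.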

After a ds_write you have to use s_waitcnt expcnt to ensure that the registers it used are free again. If ds_write can't complete immediately, it still holds the data in the regs.

After a ds_read -> use s_waitcnt lgkmcnt! It will wait until the requested data is in the dst register.

I'm not sure about the params; maybe you should disassemble the binary and check whether the params are OK or not. (ds_r/w needs only 2 params)

0 Likes

Hello!

Thanks for your reply! After I posted I realized I need to wait after the read, but I didn't know I also have to wait after write.

The params are different because the assembler I use requires me to pass all operands that the DS format takes (vdst, addr, data0, data1). It's not that flexible yet.

Unfortunately it still doesn't work. Here is the changed assembly:

; Initially v2 contains the global id

v_mov_b32      v6, v2

v_mul_i32_i24  v6, 32, v6

v_add_i32      v6, vcc, 1024, v6

v_mov_b32      v7, 99

v_mov_b32      v4, 1

s_mov_b32      m0, 0xFFFFFFFF

; ds_write's operands: (vdst) (addr) (data0) (data1)

; v5 is just a placeholder, it shouldn't be used I think

v_mov_b32      v5, 0

ds_write_b32    v5, v6, v7, v5

s_waitcnt      expcnt(0)

ds_read_b32    v4, v6, v5, v5

s_waitcnt      lgkmcnt(0)

v_mov_b32      v0, v4

I attached the full ISA file too. Thanks for the help!

Regards,

Daniel

0 Likes

Kick your code around until you see this parameter order in the disassembled isa:

//puts 'value' into LDS, then reads it back into 'result'. 'addr' contains get_local_id*4

  v_lshlrev_b32 addr, 2, lid 

  s_mov_b32 m0, $FFFF

  ds_write_b32  addr, value

  ds_read_b32   result, addr offset:4 //read from a different location

  s_waitcnt     lgkmcnt(0) 

 

This works well.

If the 'value' vector contains (0,1,2,3,4,5,....)

Then the corresponding 'result' vector will be (1,2,3,4,5,6,...)

(In lane 63 of each wavefront there will be garbage, because it reads a location that no lane of that wavefront wrote.)

And don't forget to declare LDS size!
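
To see why the result is shifted by one, here is a toy Python model of the write followed by the `offset:4` read. It assumes a single 64-lane wavefront and an LDS that starts out empty; `write_then_read` and `GARBAGE` are hypothetical names for this sketch, and `None` merely stands in for whatever the unwritten dword happens to contain.

```python
# Toy model of one 64-lane wavefront; not hardware-accurate.
WAVEFRONT_SIZE = 64
GARBAGE = None  # stands in for the contents of an LDS dword nobody wrote


def write_then_read(values, read_offset_bytes=4):
    """Each lane writes values[lid] at lid*4, then reads lid*4 + offset."""
    lds = {}
    for lid, value in enumerate(values):
        lds[lid * 4] = value  # models: ds_write_b32 addr, value
    # models: ds_read_b32 result, addr offset:4
    return [lds.get(lid * 4 + read_offset_bytes, GARBAGE)
            for lid in range(len(values))]


result = write_then_read(list(range(WAVEFRONT_SIZE)))
print(result[:6])   # [1, 2, 3, 4, 5, 6]
print(result[63])   # None
```

Lane 63 reads byte address 256, which no lane in this model wrote, so its value is undefined.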

0 Likes

Thanks!

I think the last issue I have is not setting the LDS size. Here is the assembly now:

; User code starts here

; v2 == global_id

; Set value

#define value v3

v_mov_b32 value, 99

; Set address

#define addr v4

#define lid v5

v_mov_b32 lid, v2

v_lshlrev_b32 addr, 2, lid

; Set m0

s_mov_b32 m0, 0xFFFF

; LDS write/read

#define NULL v6

#define result v7

v_mov_b32 v6, 0

ds_write_b32 NULL, addr, value, NULL

ds_read_b32 result, addr, NULL, NULL offset0:4

s_waitcnt lgkmcnt(0)

v_mov_b32       v0, result

About the LDS size: is it 0 by default? I can't find which byte corresponds to it in the ATI CAL comment section of the ELF. I tried looking in your code too, but I couldn't understand this part:

//set prog3 notes:

  with AOptions do begin

    SetCalNote($80001041,numvgprs);

    SetCalNote($80001042,numsgprs);

{    SetCalNote($8000001C,NumThreadPerGroup.x);

    SetCalNote($8000001D,NumThreadPerGroup.y);

    SetCalNote($8000001E,NumThreadPerGroup.z);  not needed because of __attribute__((reqd_work_group_size}

    SetCalNote($80000082,ldsSizeBytes);

    //compute_pgm_rsrc2

    SetCalNote($00002e13,(ldsSizeBytes+255)shr 8 shl 15,$FFF07FFF{and mask}); //lds size {256byte granularity}

    SetCalNote($00002e13,1 shl 7,$FFFFFF7F);   //tgid_x_en=1

  end;

Could you tell me where the LDS size is declared? I found the VGPR and SGPR numbers, but the LDS is harder to spot just by looking at values.

Thanks again for your help, I really appreciate it. Also, I'm really sad that there seems to be no documentation about this.

(There was this thread: http://devgurus.amd.com/thread/166955, which contains a PDF, but ctrl+f-ing for "LDS" doesn't yield any results)

Regards,

Daniel

0 Likes

Here is some more info: http://www.multi2sim.org/svn/multi2sim/trunk/src/arch/southern-islands/asm/bin-file.h

SetCalNote($80001041,numvgprs);...  These calls alter the Note section. Its format is described in the ELF Format PDF.


SetCalNote($00002e13,(ldsSizeBytes+255)shr 8 shl 15,$FFF07FFF{and mask}); <- this updates the ldsSize field in the COMPUTE_PGM_RSRC2 structure.
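
The arithmetic in that line can be written out as a small Python sketch. It assumes that SetCalNote keeps the bits selected by the "and mask" and ORs in the new value, which is my reading of the `{and mask}` comment rather than something confirmed here; `lds_size_field` and `set_lds_size` are hypothetical helper names.

```python
# Mirrors the quoted Pascal expression only; the register layout itself is
# taken from that line, not from independent documentation.
LDS_SIZE_SHIFT = 15
LDS_SIZE_KEEP_MASK = 0xFFF07FFF  # the "and mask" from the quoted SetCalNote call


def lds_size_field(lds_size_bytes):
    """Round the size up to 256-byte granules and shift it into place."""
    return ((lds_size_bytes + 255) >> 8) << LDS_SIZE_SHIFT


def set_lds_size(rsrc2, lds_size_bytes):
    """Assumed SetCalNote semantics: keep the masked bits, OR in the new value."""
    return (rsrc2 & LDS_SIZE_KEEP_MASK) | lds_size_field(lds_size_bytes)


print(hex(lds_size_field(256)))        # 0x8000
print(hex(lds_size_field(257)))        # 0x10000
print(hex(set_lds_size(0x80, 1024)))   # 0x20080
```

In the last call the register starts out with bit 7 set (the tgid_x_en bit from the other SetCalNote line), and that bit survives the update.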


LDS size and numRegs must always be defined; there aren't any default values for those.

For testing you can use the GDS option with the ds_ instructions, because the full GDS is always 'allocated'.

0 Likes

You don't need any waits for LDS writes if your workgroup size is <= wavefront size.  If your workgroup size is > wavefront size, then you need to wait for the LDS op to complete and you also need a barrier to make sure no wavefronts in the workgroup move forward until all wavefronts in the workgroup have completed their LDS writes.  But, yes, you need to specify the LDS size otherwise your operations will be clamped/discarded.

Also, you don't need to wait on exports for LDS writes; that only applies to global memory. Just use "lgkmcnt(desired_cnt)".

0 Likes

Thank you for clarifying!

I did a test that proved that the ds_write (LDS) instruction picks out the address and data values immediately from the regs, and it's no problem if I overwrite them.

But with gds_write it is a must that you don't touch the registers until expcnt.

  v_lshlrev_b32 addr, 2, gid

  s_mov_b32 m0, $FFFF

  ds_write_b32  addr, value gds

  s_waitcnt expcnt(0)                     //<--------- if this is not here

  v_mov_b32 value,1234                 //then this will alter the gds_write

  ds_read_b32   result, addr gds 

  s_waitcnt     lgkmcnt(0)

  uavWrite(1,gid,result)

How come that they are (lds and gds) working differently? I thought they are almost the same hardware elements.

0 Likes


I did a test that proved that the ds_write (LDS) instruction picks out the address and data values immediately from the regs, and it's no problem if I overwrite them.


But with gds_write it is a must that you don't touch the registers until expcnt.



This is a good point, especially the address registers because it was not obvious (to me) that the address is being 'exported'.

But it is.


How come that they are (lds and gds) working differently? I thought they are almost the same hardware elements.



I think anything that goes out of the compute unit must go through the export unit, which does not immediately read the instruction's data and address registers.

0 Likes