schemes for reading/writing from/to UAVs

Discussion created by sgratton on Jan 15, 2011
Latest reply on Feb 5, 2011 by sgratton
Any recommendations for optimal performance?

 

Hi there,

 

I'm experimenting with the Stream KernelAnalyzer (SKA) to try to get good read/write performance at the individual shader level on newer hardware.  Any advice would be much appreciated.

 

Presumably a "raw" UAV is the best thing to use, declared like dcl_raw_uav_id(0) ?

 

And one should read/write float4s from/to it, using something like uav_raw_load_id(0) r1,r0.x / uav_raw_store_id(0) mem,r0.x,r1 ?

On the older hardware, one was always advised to read/write four float4s at once (this allowed e.g. burst writes to occur).  But I couldn't get anything similar to make much difference on the newer targets.  For example, the simple program:

 

il_cs_2_0

dcl_raw_uav_id(0)

dcl_literal l1,0,16,32,48
dcl_literal l2,64,80,96,112

uav_raw_load_id(0) r1,l1.x
uav_raw_load_id(0) r2,l1.y
uav_raw_load_id(0) r3,l1.z
uav_raw_load_id(0) r4,l1.w

uav_raw_store_id(0) mem,l2.x,r1
uav_raw_store_id(0) mem,l2.y,r2
uav_raw_store_id(0) mem,l2.z,r3
uav_raw_store_id(0) mem,l2.w,r4

ret_dyn
end

compiles to:

 

00 ALU: ADDR(32) CNT(6)
      0  x: MOV         R0.x,  (0x0000000C, 1.681558157e-44f).x     
         y: MOV         R0.y,  0.0f     
         z: MOV         R0.z,  (0x00000008, 1.121038771e-44f).y     
         w: MOV         R0.w,  (0x00000004, 5.605193857e-45f).z     
01 TEX: ADDR(48) CNT(4)
      1  VFETCH R1, R0.y, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      2  VFETCH R2, R0.w, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      3  VFETCH R3, R0.z, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      4  VFETCH R0, R0.x, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(38) CNT(2)
      5  x: MOV         R4.x,  (0x00000010, 2.242077543e-44f).x     
03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R4], R1, ARRAY_SIZE(4)  MARK  VPM
04 ALU: ADDR(40) CNT(2)
      6  x: MOV         R1.x,  (0x00000014, 2.802596929e-44f).x     
05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R2, ARRAY_SIZE(4)  MARK  VPM
06 ALU: ADDR(42) CNT(2)
      7  x: MOV         R1.x,  (0x00000018, 3.363116314e-44f).x     
07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R3, ARRAY_SIZE(4)  MARK  VPM
08 ALU: ADDR(44) CNT(2)
      8  x: MOV         R1.x,  (0x0000001C, 3.923635700e-44f).x     
09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4)  MARK  VPM
10 END
END_OF_PROGRAM

for an HD 6970 (using Catalyst 10.12).  (Incidentally, what is the fc154 in the VFETCH instruction for?  Is it an issue with the disassembler?  And which instruction in the ISA reference guide does VFETCH actually correspond to?  I couldn't quite find a match.)
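
As an aside on reading the listing: each literal operand is printed as a pair, the raw 32-bit pattern in hex and the same bits reinterpreted as a float. The integer values appear to be the IL byte offsets divided by four, i.e. dword indices; that is my inference from the numbers, not something I have confirmed in the documentation. A small Python sketch (the helper name bits_to_float is my own) that reproduces the pairs under that assumption:

```python
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer bit pattern as an IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Byte offsets taken from the dcl_literal lines of the IL program.
il_load_offsets = [0, 16, 32, 48]     # l1
il_store_offsets = [64, 80, 96, 112]  # l2

# Assumption: the ISA addresses are dword indices (byte offset / 4).
load_indices = [off // 4 for off in il_load_offsets]
store_indices = [off // 4 for off in il_store_offsets]

# Print each index the way the disassembler prints a literal operand.
for idx in load_indices + store_indices:
    print(f"(0x{idx:08X}, {bits_to_float(idx):.9e}f)")
```

The nonzero values printed this way match the literal pairs in the disassembly (the zero load address is shown there simply as 0.0f), so the tiny floats are just small integers viewed as denormals rather than real floating-point data.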

 

It seems that the new hardware (5870 and 6970) accesses data as float4s, but that no other optimizations are either available or used by the compiler.  Indeed, shader length seems to have gone up: the above program needs 11 control-flow instructions and 6 clauses on the 6970, 8 and 4 on the 5870, but only 5 and 1 on the 4870.  Might this change with time?
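
For reference, this is how I arrive at the 6970 figures. It is only a sketch: I am assuming that every numbered line in the dump (including END) is one control-flow instruction, and that a "clause" means an ALU or TEX entry, with the MEM_RAT stores not counted as clauses. The names CF_LISTING and CF_LINE are my own.

```python
import re

# Control-flow lines copied from the HD 6970 disassembly above
# (the per-clause ALU/VFETCH instructions are omitted).
CF_LISTING = """\
00 ALU: ADDR(32) CNT(6)
01 TEX: ADDR(48) CNT(4)
02 ALU: ADDR(38) CNT(2)
03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R4], R1, ARRAY_SIZE(4)  MARK  VPM
04 ALU: ADDR(40) CNT(2)
05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R2, ARRAY_SIZE(4)  MARK  VPM
06 ALU: ADDR(42) CNT(2)
07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R3, ARRAY_SIZE(4)  MARK  VPM
08 ALU: ADDR(44) CNT(2)
09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4)  MARK  VPM
10 END
"""

# A control-flow line starts with a two-digit index followed by the opcode.
CF_LINE = re.compile(r"^(\d\d) (\w+)", re.MULTILINE)
opcodes = [m.group(2) for m in CF_LINE.finditer(CF_LISTING)]

cf_count = len(opcodes)
# Assumption: only ALU and TEX entries start a clause.
clause_count = sum(op in ("ALU", "TEX") for op in opcodes)

print(cf_count, clause_count)
```

Four of the six clauses are single-instruction ALU clauses that only load a store address, which is where most of the growth relative to the 4870 seems to come from.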

 

Best wishes,

Steven.

 

 
