
sgratton
Adept I

schemes for reading/writing from/to UAVs

Any recommendations for optimal performance?

 

Hi there,

 

I'm experimenting with the Stream KernelAnalyzer (SKA) to try to get good read/write performance at the individual shader level on newer hardware. Any advice would be much appreciated.

 

Presumably a "raw" UAV is the best thing to use, declared like dcl_raw_uav_id(0) ?

 

And one should read/write float4's from/to it, using something like uav_raw_load_id(0) r1,lr0.x / uav_raw_store_id(0) mem,r0.x,r1 ?

On the older hardware, one was always advised to read/write 4 float4's at once (this allowed e.g. burst writes to occur).  But I couldn't seem to get anything similar to make much difference on the newer targets.  For example, the simple program:

 

il_cs_2_0

dcl_raw_uav_id(0)

dcl_literal l1,0,16,32,48
dcl_literal l2,64,80,96,112

uav_raw_load_id(0) r1,l1.x
uav_raw_load_id(0) r2,l1.y
uav_raw_load_id(0) r3,l1.z
uav_raw_load_id(0) r4,l1.w

uav_raw_store_id(0) mem,l2.x,r1
uav_raw_store_id(0) mem,l2.y,r2
uav_raw_store_id(0) mem,l2.z,r3
uav_raw_store_id(0) mem,l2.w,r4

ret_dyn
end

compiles to:

 

00 ALU: ADDR(32) CNT(6)
      0  x: MOV         R0.x,  (0x0000000C, 1.681558157e-44f).x     
         y: MOV         R0.y,  0.0f     
         z: MOV         R0.z,  (0x00000008, 1.121038771e-44f).y     
         w: MOV         R0.w,  (0x00000004, 5.605193857e-45f).z     
01 TEX: ADDR(48) CNT(4)
      1  VFETCH R1, R0.y, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      2  VFETCH R2, R0.w, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      3  VFETCH R3, R0.z, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      4  VFETCH R0, R0.x, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(38) CNT(2)
      5  x: MOV         R4.x,  (0x00000010, 2.242077543e-44f).x     
03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R4], R1, ARRAY_SIZE(4)  MARK  VPM
04 ALU: ADDR(40) CNT(2)
      6  x: MOV         R1.x,  (0x00000014, 2.802596929e-44f).x     
05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R2, ARRAY_SIZE(4)  MARK  VPM
06 ALU: ADDR(42) CNT(2)
      7  x: MOV         R1.x,  (0x00000018, 3.363116314e-44f).x     
07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R3, ARRAY_SIZE(4)  MARK  VPM
08 ALU: ADDR(44) CNT(2)
      8  x: MOV         R1.x,  (0x0000001C, 3.923635700e-44f).x     
09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4)  MARK  VPM
10 END
END_OF_PROGRAM

for an HD6970 (using Catalyst 10.12). (Incidentally, what is the fc154 for in the VFETCH instruction? Is it an issue with the disassembler? And what does a VFETCH instruction actually correspond to in the ISA reference guide? I couldn't quite find a match.)
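As an aid to reading the disassembly above, here is a minimal Python sketch of how the literal operands appear to relate to the IL byte offsets. It assumes (this is an inference from the listing, not something the ISA guide is quoted on here) that the ISA addresses the raw UAV in dwords, so each IL byte offset is divided by 4, and that the disassembler prints every 32-bit literal both as hex and as the same bits reinterpreted as a float. The helper name bits_as_float is made up for illustration.

```python
import struct

def bits_as_float(word):
    """Reinterpret a 32-bit integer bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

# Byte offsets taken from the IL literals l1 (loads) and l2 (stores).
load_bytes = [0, 16, 32, 48]
store_bytes = [64, 80, 96, 112]

# Assumption: the ISA address is a dword index, i.e. byte offset / 4.
load_dwords = [b // 4 for b in load_bytes]
store_dwords = [b // 4 for b in store_bytes]

# Print each address the way the disassembler shows its literals:
# the hex bit pattern, then the same bits read as a (denormal) float.
for w in load_dwords + store_dwords:
    print(f"0x{w:08X} -> {bits_as_float(w):.9e}")
```

Under that assumption the load addresses come out as 0x0, 0x4, 0x8, 0xC and the store addresses as 0x10, 0x14, 0x18, 0x1C, which are the literals that appear in the MOV instructions, and the tiny float values next to them are just those integers read as denormals.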

 

It seems that the new hardware (5870 and 6970) accesses data as float4's, but no other optimizations are either available or used by the compiler. Indeed, shader length seems to have gone up: the above program needs 11 control-flow instructions and 6 clauses on the 6970, 8 and 4 on the 5870, but only 5 and 1 on the 4870. Might this change with time?

 

Best wishes,

Steven.

 

 

0 Likes
6 Replies

sgratton,
This is a compiler performance issue; I'll report it to the correct team. Also, I would not compare clauses between different architectures, as they have hardware changes that affect how code is generated.
0 Likes

 

Hi Micah,

 

I see, thanks. So you do still recommend writing IL that accesses four float4's at a time, on the grounds that it will eventually be quicker?

 

Best,

Steven.

 

0 Likes

sgratton,
I believe that is what we recommend, but I would check in the programming guide for the exact recommendation. Whatever we recommend for OpenCL should also apply to IL.
0 Likes

sgratton,
We have some changes that we are considering making, but do you have a benchmark that we can use to verify that our changes will improve performance?
0 Likes

 

Hi Micah,

 

I'm afraid I don't have my application (Cholesky factorization) up and running with UAV's yet.  However, the two nice ways of implementing it on GPUs basically revolve around either:

1/ Given a big square matrix M (4000x4000, say) in a UAV, subtract the outer product of a vector v with itself from it, i.e. M'_ij=M_ij-v_i*v_j

 

2/ Given a big square-ish matrix M in a UAV and two vectors x and y, form y'_i=y_i-M_ij x_j (summed over j)

 

The vectors will generalise to "tall and thin" matrices to reduce the number of reads, but that is the idea. 
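For concreteness, the two updates above can be sketched on the CPU in plain Python. This is only a reference for checking results on small inputs, not the GPU kernel; the function names are made up for illustration, and nested lists stand in for the UAV contents.

```python
def rank1_downdate(M, v):
    """Case 1: M'_ij = M_ij - v_i * v_j, updating M in place."""
    for i in range(len(M)):
        for j in range(len(M[i])):
            M[i][j] -= v[i] * v[j]
    return M

def matvec_downdate(y, M, x):
    """Case 2: y'_i = y_i - sum_j M_ij * x_j, returning a new list."""
    return [y_i - sum(m_ij * x_j for m_ij, x_j in zip(row, x))
            for y_i, row in zip(y, M)]

# Tiny usage example with 2x2 data.
M1 = rank1_downdate([[4, 2], [2, 5]], [2, 1])
y1 = matvec_downdate([10, 10], [[1, 2], [3, 4]], [1, 1])
print(M1, y1)
```

Case 1 reads and writes every element of M once, so it is dominated by UAV traffic; case 2 reads M once and writes only the short vector.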

 

So if you have a decentish matrix multiply (s/dgemm) kernel handy that uses UAV's, then applying it with

1/ m=4000, n=4000, k=16

2/ m=4000, n=16, k=4000

with transposes appropriately set, should be a good test.

 

Note that case 1 is almost as simple as reading a big square portion of a UAV in, then writing it out again in-place.

 

If I get my code working in the next few days I'd be happy to supply it.  

 

Best wishes,

Steven.

 

0 Likes

 

Hi Micah,

 

So here is a suitable IL program to test against.  I've only been able to compile it in the shader analyzer so haven't been able to fully test it (see my recent post about issues with calclcompile --- any ideas?), but I think it should work properly.

The comments at the top should indicate how to call it in a test case.  Hope it is of some use.

 

Best,

Steven.

 

il_cs_2_0
; (c) 2011 Steven Gratton
; This is a program to compute C'=C-A B^T
; where C is mxn, A is 4xn, B is 4xm.
; m and n are assumed to be a multiple of 32.
; Matrices are in a tiled format,
; C is doubly tiled, split into 8x8 blocks, each 4x4
; A and B are split into 4x4 blocks
; C is in uav0, A in res 1, B in res 2.
; Each thread processes one 4x4 block.

dcl_num_thread_per_group 8,8,1

dcl_raw_uav_id(0)
dcl_resource_id(1)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

dcl_literal l0,6,0,0,0
ishl r40.x,vAbsTidFlat.x,l0.x
dcl_literal l1,0,16,32,48
iadd r40,r40.x,l1

dcl_literal l2,4,0,0,0
ishl r41.xy,VAbsTid.xy,l2.x
mov r42.x,r41.y

uav_raw_load_id(0) r0,r40.x
uav_raw_load_id(0) r1,r40.y
uav_raw_load_id(0) r2,r40.z
uav_raw_load_id(0) r3,r40.w

load_id(1) r10,r41.x
load_id(1)_aoffimmi(1.0,0.0,0.0) r11,r41.x
load_id(1)_aoffimmi(2.0,0.0,0.0) r12,r41.x
load_id(1)_aoffimmi(3.0,0.0,0.0) r13,r41.x

load_id(2) r20,r42.x
load_id(2)_aoffimmi(1.0,0.0,0.0) r21,r42.x
load_id(2)_aoffimmi(2.0,0.0,0.0) r22,r42.x
load_id(2)_aoffimmi(3.0,0.0,0.0) r23,r42.x

mad_ieee r0,r20.x,r10_neg(xyzw),r0
mad_ieee r0,r21.x,r11_neg(xyzw),r0
mad_ieee r0,r22.x,r12_neg(xyzw),r0
mad_ieee r0,r23.x,r13_neg(xyzw),r0

mad_ieee r1,r20.y,r10_neg(xyzw),r1
mad_ieee r1,r21.y,r11_neg(xyzw),r1
mad_ieee r1,r22.y,r12_neg(xyzw),r1
mad_ieee r1,r23.y,r13_neg(xyzw),r1

mad_ieee r2,r20.z,r10_neg(xyzw),r2
mad_ieee r2,r21.z,r11_neg(xyzw),r2
mad_ieee r2,r22.z,r12_neg(xyzw),r2
mad_ieee r2,r23.z,r13_neg(xyzw),r2

mad_ieee r3,r20.w,r10_neg(xyzw),r3
mad_ieee r3,r21.w,r11_neg(xyzw),r3
mad_ieee r3,r22.w,r12_neg(xyzw),r3
mad_ieee r3,r23.w,r13_neg(xyzw),r3

uav_raw_store_id(0) mem,r40.x,r0
uav_raw_store_id(0) mem,r40.y,r1
uav_raw_store_id(0) mem,r40.z,r2
uav_raw_store_id(0) mem,r40.w,r3

ret_dyn
end
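For anyone checking results against this kernel, here is a rough CPU-side sketch in Python of what a single thread does, read directly off the IL above. It covers only the UAV byte addressing (flat thread id shifted left by 6, plus 0, 16, 32, 48) and the sixteen mad_ieee instructions; the resource indexing for A and B is not modelled. The function names are hypothetical, and the lists a_regs and b_regs simply stand in for registers r10..r13 and r20..r23.

```python
def block_byte_offsets(flat_tid):
    """Byte offsets of the four float4 rows of this thread's 4x4 block.

    Mirrors: ishl r40.x,vAbsTidFlat.x,6 followed by iadd with (0,16,32,48).
    A 4x4 block of floats occupies 64 bytes, hence the shift by 6.
    """
    base = flat_tid << 6
    return [base + off for off in (0, 16, 32, 48)]

def update_block(c_rows, a_regs, b_regs):
    """Apply the sixteen mad_ieee steps to one 4x4 block, in place.

    c_rows[i] plays the role of r0..r3, a_regs[k] of r10..r13 and
    b_regs[k] of r20..r23, so that
        c_rows[i][j] -= b_regs[k][i] * a_regs[k][j]   for k = 0..3.
    """
    for i in range(4):          # one destination register per row
        for k in range(4):      # four mad_ieee instructions per row
            for j in range(4):  # the x, y, z, w components
                c_rows[i][j] -= b_regs[k][i] * a_regs[k][j]
    return c_rows

# Usage: thread 3 touches bytes 192..255 of the UAV.
print(block_byte_offsets(3))
```

Comparing update_block against the values read back from the UAV for a few small blocks should be enough to confirm the kernel once it runs outside the analyzer.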

0 Likes