Hi there,
I'm experimenting with the SKA analyzer to try and get good read/write performance at the individual shader level on newer hardware. Any advice would be much appreciated.
Presumably a "raw" UAV is the best thing to use, declared like dcl_raw_uav_id(0) ?
And one should read/write float4's from/to it, using something like uav_raw_load_id(0) r1,lr0.x / uav_raw_store_id(0) mem,r0.x,r1 ?
On the older hardware, one was always advised to read/write 4 float4's at once (this allowed e.g. burst writes to occur). But I couldn't seem to get anything similar to make much difference on the newer targets. For example, the simple program:
il_cs_2_0
dcl_raw_uav_id(0)
dcl_literal l1,0,16,32,48
dcl_literal l2,64,80,96,112
uav_raw_load_id(0) r1,l1.x
uav_raw_load_id(0) r2,l1.y
uav_raw_load_id(0) r3,l1.z
uav_raw_load_id(0) r4,l1.w
uav_raw_store_id(0) mem,l2.x,r1
uav_raw_store_id(0) mem,l2.y,r2
uav_raw_store_id(0) mem,l2.z,r3
uav_raw_store_id(0) mem,l2.w,r4
ret_dyn
end
compiles to:
00 ALU: ADDR(32) CNT(6)
      0  x: MOV R0.x, (0x0000000C, 1.681558157e-44f).x
         y: MOV R0.y, 0.0f
         z: MOV R0.z, (0x00000008, 1.121038771e-44f).y
         w: MOV R0.w, (0x00000004, 5.605193857e-45f).z
01 TEX: ADDR(48) CNT(4)
      1  VFETCH R1, R0.y, fc154 FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      2  VFETCH R2, R0.w, fc154 FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      3  VFETCH R3, R0.z, fc154 FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
      4  VFETCH R0, R0.x, fc154 FORMAT(32_32_32_32_FLOAT) MEGA(1)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(38) CNT(2)
      5  x: MOV R4.x, (0x00000010, 2.242077543e-44f).x
03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R4], R1, ARRAY_SIZE(4) MARK VPM
04 ALU: ADDR(40) CNT(2)
      6  x: MOV R1.x, (0x00000014, 2.802596929e-44f).x
05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R2, ARRAY_SIZE(4) MARK VPM
06 ALU: ADDR(42) CNT(2)
      7  x: MOV R1.x, (0x00000018, 3.363116314e-44f).x
07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R3, ARRAY_SIZE(4) MARK VPM
08 ALU: ADDR(44) CNT(2)
      8  x: MOV R1.x, (0x0000001C, 3.923635700e-44f).x
09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4) MARK VPM
10 END
END_OF_PROGRAM
for an HD6970 (using Catalyst 10.12). (Incidentally, what is the fc154 in the VFETCH instruction for? Is it an issue with the disassembler? And which instruction in the ISA reference guide does VFETCH actually correspond to? I couldn't quite find a match.)
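One way to read the constants in the listing above: each is printed as a hex bit pattern followed by the same 32 bits interpreted as a float, and the values match the IL byte offsets divided by 4. That suggests the compiled code addresses the buffer in dwords, though this is an inference from the numbers rather than something taken from the ISA guide. A small Python check, with illustrative function names:

```python
import struct

# Byte offsets taken from the IL program (literals l1 and l2).
load_byte_offsets = [0, 16, 32, 48]
store_byte_offsets = [64, 80, 96, 112]

def to_dword_index(byte_offset):
    """Byte offset -> 32-bit word index (assumed addressing unit)."""
    return byte_offset // 4

def bits_as_float(bits):
    """Reinterpret a 32-bit integer pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

load_indices = [to_dword_index(b) for b in load_byte_offsets]
store_indices = [to_dword_index(b) for b in store_byte_offsets]

# Compare with the hex constants in the MOV instructions of the listing.
print([hex(i) for i in load_indices])
print([hex(i) for i in store_indices])
# The tiny float printed next to 0x0000000C is just these bits as a denormal.
print(bits_as_float(0x0000000C))
```

The load indices come out as 0, 4, 8 and 12 (the 0.0f, 0x4, 0x8 and 0xC moved into R0) and the store indices as 0x10 to 0x1C.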
It seems that the new hardware (5870 and 6970) accesses data as float4's, but that no other optimizations are either available or used by the compiler. Indeed, shader length seems to have gone up: the above program needs 11 control-flow instructions and 6 clauses on the 6970, 8 and 4 on the 5870, but only 5 and 1 on the 4870. Might this change with time?
Best wishes,
Steven.
Hi Micah,
I see, thanks. So you do recommend still writing IL that accesses four float4's at a time, in that eventually it will be quicker?
Best,
Steven.
Hi Micah,
I'm afraid I don't have my application (Cholesky factorization) up and running with UAV's yet. However, the two nice ways of implementing it on GPUs basically revolve around either:
1/ Given a big square matrix (4000x4000 say) M in a UAV, subtract the outer product of a vector with itself from it, i.e. M'_ij=M_ij-v_i*v_j
2/ Given a big square-ish matrix M in a UAV and two vectors, form y'_i=y_i-M_ij x_j
The vectors will generalise to "tall and thin" matrices to reduce the number of reads, but that is the idea.
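For reference, the two updates can be written as a plain CPU sketch. This is only a model of the arithmetic with hypothetical function names, not a GPU kernel; the matrix is a list of rows here rather than a UAV.

```python
def rank1_update(M, v):
    """Case 1: M_ij <- M_ij - v_i * v_j, applied in place."""
    n = len(v)
    for i in range(n):
        for j in range(n):
            M[i][j] -= v[i] * v[j]
    return M

def matvec_update(y, M, x):
    """Case 2: y_i <- y_i - sum_j M_ij * x_j, applied in place."""
    for i in range(len(y)):
        y[i] -= sum(M[i][j] * x[j] for j in range(len(x)))
    return y

# Small example for case 1.
M = [[4.0, 2.0], [2.0, 5.0]]
rank1_update(M, [2.0, 1.0])      # M becomes [[0.0, 0.0], [0.0, 4.0]]

# Small example for case 2.
y = [1.0, 1.0]
matvec_update(y, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])   # y becomes [-2.0, -6.0]
```

Replacing the vectors by tall and thin matrices turns both loops into the matrix-multiply shapes given below.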
So if you have a decentish matrix multiply (s/dgemm) kernel handy that uses UAV's, then applying it with
1/ m=4000, n=4000, k=16
2/ m=4000, n=16, k=4000
with transposes appropriately set, should be a good test.
Note that case 1 is almost as simple as reading a big square portion of a UAV in, then writing it out again in-place.
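A rough count backs this up. The sketch below assumes single precision and that every matrix element is moved exactly once (C read and written, A and B read), so it is a best-case traffic estimate, not a measurement; the function name is illustrative.

```python
def gemm_update_cost(m, n, k, bytes_per_element=4):
    """Flops and minimum memory traffic for C (m x n) -= A (m x k) * B (k x n).

    Assumes every element is moved exactly once:
    C is read and written, A and B are only read.
    """
    flops = 2 * m * n * k
    traffic = bytes_per_element * (2 * m * n + m * k + k * n)
    return flops, traffic

cases = (("case 1", (4000, 4000, 16)), ("case 2", (4000, 16, 4000)))
for label, (m, n, k) in cases:
    flops, traffic = gemm_update_cost(m, n, k)
    print(label, flops, traffic, round(flops / traffic, 2))
```

Under these assumptions case 1 does 512,000,000 flops for 128,512,000 bytes of traffic, almost all of it the read and write of the big matrix, i.e. about 4 flops per byte.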
If I get my code working in the next few days I'd be happy to supply it.
Best wishes,
Steven.
Hi Micah,
So here is a suitable IL program to test against. I've only been able to compile it in the shader analyzer so haven't been able to fully test it (see my recent post about issues with calclcompile --- any ideas?), but I think it should work properly.
The comments at the top should indicate how to call it in a test case. Hope it is of some use.
Best,
Steven.
il_cs_2_0
; (c) 2011 Steven Gratton
; This is a program to compute C'=C-A B^T
; where C is mxn, A is 4xn, B is 4xm.
; m and n are assumed to be a multiple of 32.
; Matrices are in a tiled format,
; C is doubly tiled, split into 8x8 blocks, each 4x4
; A and B are split into 4x4 blocks
; C is in uav0, A in res 1, B in res 2.
; Each thread processes one 4x4 block.
dcl_num_thread_per_group 8,8,1
dcl_raw_uav_id(0)
dcl_resource_id(1)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0,6,0,0,0
ishl r40.x,vAbsTidFlat.x,l0.x
dcl_literal l1,0,16,32,48
iadd r40,r40.x,l1
dcl_literal l2,4,0,0,0
ishl r41.xy,VAbsTid.xy,l2.x
mov r42.x,r41.y
uav_raw_load_id(0) r0,r40.x
uav_raw_load_id(0) r1,r40.y
uav_raw_load_id(0) r2,r40.z
uav_raw_load_id(0) r3,r40.w
load_id(1) r10,r41.x
load_id(1)_aoffimmi(1.0,0.0,0.0) r11,r41.x
load_id(1)_aoffimmi(2.0,0.0,0.0) r12,r41.x
load_id(1)_aoffimmi(3.0,0.0,0.0) r13,r41.x
load_id(2) r20,r42.x
load_id(2)_aoffimmi(1.0,0.0,0.0) r21,r42.x
load_id(2)_aoffimmi(2.0,0.0,0.0) r22,r42.x
load_id(2)_aoffimmi(3.0,0.0,0.0) r23,r42.x
mad_ieee r0,r20.x,r10_neg(xyzw),r0
mad_ieee r0,r21.x,r11_neg(xyzw),r0
mad_ieee r0,r22.x,r12_neg(xyzw),r0
mad_ieee r0,r23.x,r13_neg(xyzw),r0
mad_ieee r1,r20.y,r10_neg(xyzw),r1
mad_ieee r1,r21.y,r11_neg(xyzw),r1
mad_ieee r1,r22.y,r12_neg(xyzw),r1
mad_ieee r1,r23.y,r13_neg(xyzw),r1
mad_ieee r2,r20.z,r10_neg(xyzw),r2
mad_ieee r2,r21.z,r11_neg(xyzw),r2
mad_ieee r2,r22.z,r12_neg(xyzw),r2
mad_ieee r2,r23.z,r13_neg(xyzw),r2
mad_ieee r3,r20.w,r10_neg(xyzw),r3
mad_ieee r3,r21.w,r11_neg(xyzw),r3
mad_ieee r3,r22.w,r12_neg(xyzw),r3
mad_ieee r3,r23.w,r13_neg(xyzw),r3
uav_raw_store_id(0) mem,r40.x,r0
uav_raw_store_id(0) mem,r40.y,r1
uav_raw_store_id(0) mem,r40.z,r2
uav_raw_store_id(0) mem,r40.w,r3
ret_dyn
end
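For checking results on the CPU, here is a small Python model of what one thread of the IL above appears to do, as I read it. The function names are hypothetical, the texels are passed in directly (the resource indexing is not modelled), and how the result maps onto C - A B^T depends on the tiled layout, which the model takes as given.

```python
def block_row_offsets(flat_tid):
    """Byte offsets of the four float4 rows of this thread's 4x4 block of C.

    Mirrors: ishl r40.x,vAbsTidFlat.x,6 then iadd r40,r40.x,(0,16,32,48).
    """
    base = flat_tid << 6              # 64 bytes = 16 floats per 4x4 block
    return [base + off for off in (0, 16, 32, 48)]

def update_block(c_rows, a_texels, b_texels):
    """Per-thread math: c[i][j] -= sum over k of b[k][i] * a[k][j].

    c_rows   : four float4 rows loaded from the UAV (r0..r3)
    a_texels : four float4 values from resource 1 (r10..r13)
    b_texels : four float4 values from resource 2 (r20..r23)
    """
    out = []
    for i in range(4):                # row i of the C block (r0..r3)
        row = list(c_rows[i])
        for k in range(4):            # one mad_ieee per k, in program order
            scale = b_texels[k][i]    # component i (x, y, z, w) of r20+k
            row = [scale * -a_texels[k][j] + row[j] for j in range(4)]
        out.append(row)
    return out

# With A set to the identity and C to zero, the result is minus the transpose of B.
identity = [[1.0 if k == j else 0.0 for j in range(4)] for k in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
b = [[float(4 * k + i + 1) for i in range(4)] for k in range(4)]
result = update_block(zeros, identity, b)
```

Feeding the same inputs to the kernel and comparing against `update_block` for a few thread ids would be one way to test it once it compiles.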