6 Replies Latest reply on Feb 5, 2011 1:34 PM by sgratton

    schemes for reading/writing from/to UAVs

    sgratton
      Any recommendations for optimal performance?

       

      Hi there,

       

      I'm experimenting with the Stream KernelAnalyzer (SKA) to try to get good read/write performance at the individual shader level on newer hardware.  Any advice would be much appreciated.

       

      Presumably a "raw" UAV is the best thing to use, declared like dcl_raw_uav_id(0) ?

       

      And should one read/write float4s from/to it, using something like uav_raw_load_id(0) r1,r0.x / uav_raw_store_id(0) mem,r0.x,r1 ?

      On the older hardware, one was always advised to read/write four float4s at once (this allowed e.g. burst writes to occur).  But nothing similar seems to make much difference on the newer targets.  For example, the simple program:

       

      il_cs_2_0

      dcl_raw_uav_id(0)

      dcl_literal l1,0,16,32,48
      dcl_literal l2,64,80,96,112

      uav_raw_load_id(0) r1,l1.x
      uav_raw_load_id(0) r2,l1.y
      uav_raw_load_id(0) r3,l1.z
      uav_raw_load_id(0) r4,l1.w

      uav_raw_store_id(0) mem,l2.x,r1
      uav_raw_store_id(0) mem,l2.y,r2
      uav_raw_store_id(0) mem,l2.z,r3
      uav_raw_store_id(0) mem,l2.w,r4

      ret_dyn
      end

      compiles to:

       

      00 ALU: ADDR(32) CNT(6)
            0  x: MOV         R0.x,  (0x0000000C, 1.681558157e-44f).x     
               y: MOV         R0.y,  0.0f     
               z: MOV         R0.z,  (0x00000008, 1.121038771e-44f).y     
               w: MOV         R0.w,  (0x00000004, 5.605193857e-45f).z     
      01 TEX: ADDR(48) CNT(4)
            1  VFETCH R1, R0.y, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
               FETCH_TYPE(NO_INDEX_OFFSET)
            2  VFETCH R2, R0.w, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
               FETCH_TYPE(NO_INDEX_OFFSET)
            3  VFETCH R3, R0.z, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
               FETCH_TYPE(NO_INDEX_OFFSET)
            4  VFETCH R0, R0.x, fc154  FORMAT(32_32_32_32_FLOAT) MEGA(1)
               FETCH_TYPE(NO_INDEX_OFFSET)
      02 ALU: ADDR(38) CNT(2)
            5  x: MOV         R4.x,  (0x00000010, 2.242077543e-44f).x     
      03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R4], R1, ARRAY_SIZE(4)  MARK  VPM
      04 ALU: ADDR(40) CNT(2)
            6  x: MOV         R1.x,  (0x00000014, 2.802596929e-44f).x     
      05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R2, ARRAY_SIZE(4)  MARK  VPM
      06 ALU: ADDR(42) CNT(2)
            7  x: MOV         R1.x,  (0x00000018, 3.363116314e-44f).x     
      07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R3, ARRAY_SIZE(4)  MARK  VPM
      08 ALU: ADDR(44) CNT(2)
            8  x: MOV         R1.x,  (0x0000001C, 3.923635700e-44f).x     
      09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4)  MARK  VPM
      10 END
      END_OF_PROGRAM

      for an HD 6970 (using Catalyst 10.12).  (Incidentally, what is the fc154 for in the VFETCH instruction?  Is it an issue with the disassembler?  And what does a VFETCH instruction actually correspond to in the ISA reference guide?  I couldn't quite find a match.)
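One way to read the constants in the listing above: they look like dword indices, i.e. the IL byte offsets divided by 4. This is only an observation from the numbers in the disassembly, not something taken from the ISA documentation. A minimal Python sketch of that mapping, where the helper name to_dword_index is made up for illustration:

```python
# Byte offsets copied from the dcl_literal lines of the IL program.
LOAD_BYTE_OFFSETS = (0, 16, 32, 48)      # l1
STORE_BYTE_OFFSETS = (64, 80, 96, 112)   # l2
BYTES_PER_DWORD = 4


def to_dword_index(byte_offset):
    """Convert a byte offset into a 32-bit word index (hypothetical helper)."""
    if byte_offset % BYTES_PER_DWORD != 0:
        raise ValueError("offset is not dword aligned")
    return byte_offset // BYTES_PER_DWORD


load_indices = [to_dword_index(b) for b in LOAD_BYTE_OFFSETS]
store_indices = [to_dword_index(b) for b in STORE_BYTE_OFFSETS]

# Print in the same hex style as the disassembly so the two can be compared.
print("loads :", ["0x%08X" % i for i in load_indices])
print("stores:", ["0x%08X" % i for i in store_indices])
```

The load indices match the values moved into R0 (0x0, 0x4, 0x8, 0xC) and the store indices match the values moved into R4.x and R1.x (0x10 to 0x1C).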

       

      It seems that the new hardware (5870 and 6970) accesses data as float4s, but that no other optimizations are either available or used by the compiler.  Indeed, shader length seems to have gone up: the above program needs 11 control-flow instructions and 6 clauses on the 6970, 8 and 4 on the 5870, but only 5 and 1 on the 4870.  Might this change with time?

       

      Best wishes,

      Steven.

       

       

        • schemes for reading/writing from/to UAVs
          MicahVillmow
          sgratton,
          This is a compiler performance issue; I'll report it to the correct team. Also, I would not compare clauses between different architectures, as they have hardware changes that affect how code is generated.
          • schemes for reading/writing from/to UAVs
            MicahVillmow
            sgratton,
            I believe that is what we recommend, but I would check in the programming guide for the exact recommendation. Whatever we recommend for OpenCL should also apply to IL.
            • schemes for reading/writing from/to UAVs
              MicahVillmow
              sgratton,
              We have some changes that we are considering making, but do you have a benchmark that we can use to verify that our changes will improve performance?
                • schemes for reading/writing from/to UAVs
                  sgratton

                   

                  Hi Micah,

                   

                  I'm afraid I don't have my application (Cholesky factorization) up and running with UAVs yet.  However, the two nice ways of implementing it on GPUs basically revolve around either:

                  1/ Given a big square matrix M (4000x4000, say) in a UAV, subtract the outer product of a vector v with itself from it, i.e. M'_ij=M_ij-v_i*v_j

                   

                  2/ Given a big square-ish matrix M in a UAV and two vectors x and y, form y'_i=y_i-M_ij x_j (summed over j)

                   

                  The vectors will generalise to "tall and thin" matrices to reduce the number of reads, but that is the idea. 
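For reference, the two updates above can be written out on the CPU as follows. This is a plain Python sketch for checking results on small inputs, not GPU code; the function names rank1_update and matvec_update are made up for illustration.

```python
def rank1_update(M, v):
    """Case 1: return M' with M'[i][j] = M[i][j] - v[i] * v[j]."""
    n = len(v)
    return [[M[i][j] - v[i] * v[j] for j in range(n)] for i in range(n)]


def matvec_update(y, M, x):
    """Case 2: return y' with y'[i] = y[i] - sum_j M[i][j] * x[j]."""
    return [y[i] - sum(M[i][j] * x[j] for j in range(len(x)))
            for i in range(len(y))]


# Tiny example; the real matrices would be around 4000x4000.
M = [[1, 2], [3, 4]]
print(rank1_update(M, [1, 2]))
print(matvec_update([10, 20], M, [1, 1]))
```

Replacing the vectors by "tall and thin" matrices turns both loops into small matrix products, which is the generalisation mentioned above.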

                   

                  So if you have a decentish matrix multiply (s/dgemm) kernel handy that uses UAVs, then applying it with

                  1/ m=4000, n=4000, k=16

                  2/ m=4000, n=16, k=4000

                  with transposes appropriately set, should be a good test.

                   

                  Note that case 1 is almost as simple as reading a big square portion of a UAV in, then writing it out again in-place.
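A rough count shows why case 1 behaves like a read-then-write of the big matrix. The sketch below assumes single precision (4 bytes per element), that C is read once and written once, and that A and B are each read once with perfect reuse; those assumptions and the helper name gemm_traffic are illustrative only.

```python
BYTES_PER_FLOAT = 4  # assumption: sgemm, single precision


def gemm_traffic(m, n, k):
    """Idealised flop and byte counts for C(mxn) -= A(mxk) * B(kxn)."""
    flops = 2 * m * n * k                          # one multiply and one add per term
    c_bytes = 2 * m * n * BYTES_PER_FLOAT          # C is read once and written once
    ab_bytes = (m * k + k * n) * BYTES_PER_FLOAT   # A and B are each read once
    return flops, c_bytes, ab_bytes


case1 = gemm_traffic(4000, 4000, 16)
case2 = gemm_traffic(4000, 16, 4000)

for name, (flops, c_bytes, ab_bytes) in (("case 1", case1), ("case 2", case2)):
    print(name, "flops:", flops, "C bytes:", c_bytes, "A+B bytes:", ab_bytes)
```

Under these assumptions both cases do the same arithmetic, but in case 1 nearly all of the memory traffic is the read and write of C, while in case 2 it is the read of the big input matrix.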

                   

                  If I get my code working in the next few days I'd be happy to supply it.  

                   

                  Best wishes,

                  Steven.

                   

                    • schemes for reading/writing from/to UAVs
                      sgratton

                       

                      Hi Micah,

                       

                      So here is a suitable IL program to test against.  I've only been able to compile it in the shader analyzer, so I haven't been able to fully test it (see my recent post about issues with calclCompile; any ideas?), but I think it should work properly.

                      The comments at the top should indicate how to call it in a test case.  Hope it is of some use.

                       

                      Best,

                      Steven.

                       

                      il_cs_2_0

                      ; (c) 2011 Steven Gratton
                      ; This is a program to compute C'=C-A B^T
                      ; where C is mxn, A is 4xn, B is 4xm.
                      ; m and n are assumed to be a multiple of 32.
                      ; Matrices are in a tiled format,
                      ; C is doubly tiled, split into 8x8 blocks, each 4x4
                      ; A and B are split into 4x4 blocks
                      ; C is in uav0, A in res 1, B in res 2.
                      ; Each thread processes one 4x4 block.

                      dcl_num_thread_per_group 8,8,1

                      dcl_raw_uav_id(0)
                      dcl_resource_id(1)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                      dcl_resource_id(2)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

                      dcl_literal l0,6,0,0,0
                      ishl r40.x,vAbsTidFlat.x,l0.x
                      dcl_literal l1,0,16,32,48
                      iadd r40,r40.x,l1

                      dcl_literal l2,4,0,0,0
                      ishl r41.xy,VAbsTid.xy,l2.x
                      mov r42.x,r41.y

                      uav_raw_load_id(0) r0,r40.x
                      uav_raw_load_id(0) r1,r40.y
                      uav_raw_load_id(0) r2,r40.z
                      uav_raw_load_id(0) r3,r40.w

                      load_id(1) r10,r41.x
                      load_id(1)_aoffimmi(1.0,0.0,0.0) r11,r41.x
                      load_id(1)_aoffimmi(2.0,0.0,0.0) r12,r41.x
                      load_id(1)_aoffimmi(3.0,0.0,0.0) r13,r41.x

                      load_id(2) r20,r42.x
                      load_id(2)_aoffimmi(1.0,0.0,0.0) r21,r42.x
                      load_id(2)_aoffimmi(2.0,0.0,0.0) r22,r42.x
                      load_id(2)_aoffimmi(3.0,0.0,0.0) r23,r42.x

                      mad_ieee r0,r20.x,r10_neg(xyzw),r0
                      mad_ieee r0,r21.x,r11_neg(xyzw),r0
                      mad_ieee r0,r22.x,r12_neg(xyzw),r0
                      mad_ieee r0,r23.x,r13_neg(xyzw),r0

                      mad_ieee r1,r20.y,r10_neg(xyzw),r1
                      mad_ieee r1,r21.y,r11_neg(xyzw),r1
                      mad_ieee r1,r22.y,r12_neg(xyzw),r1
                      mad_ieee r1,r23.y,r13_neg(xyzw),r1

                      mad_ieee r2,r20.z,r10_neg(xyzw),r2
                      mad_ieee r2,r21.z,r11_neg(xyzw),r2
                      mad_ieee r2,r22.z,r12_neg(xyzw),r2
                      mad_ieee r2,r23.z,r13_neg(xyzw),r2

                      mad_ieee r3,r20.w,r10_neg(xyzw),r3
                      mad_ieee r3,r21.w,r11_neg(xyzw),r3
                      mad_ieee r3,r22.w,r12_neg(xyzw),r3
                      mad_ieee r3,r23.w,r13_neg(xyzw),r3

                      uav_raw_store_id(0) mem,r40.x,r0
                      uav_raw_store_id(0) mem,r40.y,r1
                      uav_raw_store_id(0) mem,r40.z,r2
                      uav_raw_store_id(0) mem,r40.w,r3

                      ret_dyn
                      end
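To check the kernel's output on the host, the per-thread work can be mirrored in plain Python. This is a sketch of one reading of the instructions above, not a tested port: block_addresses follows the ishl/iadd address arithmetic (64 bytes per thread, one float4 every 16 bytes), and update_block follows the sixteen mad_ieee instructions, with row i of the C block playing the role of r0..r3, a_rows that of r10..r13 and b_rows that of r20..r23. Both function names are made up for illustration.

```python
FLOAT4_BYTES = 16


def block_addresses(flat_tid):
    """Byte addresses of the four float4 rows owned by one thread."""
    base = flat_tid << 6  # 64 bytes per thread, as in ishl by l0.x = 6
    return [base + row * FLOAT4_BYTES for row in range(4)]


def update_block(c_rows, a_rows, b_rows):
    """Mirror the mad_ieee chain: c[i] = c[i] - sum_k b[k][i] * a[k]."""
    result = []
    for i in range(4):
        row = list(c_rows[i])
        for k in range(4):
            scale = b_rows[k][i]  # scalar component of r2k (.x, .y, .z or .w)
            row = [row[j] - scale * a_rows[k][j] for j in range(4)]
        result.append(row)
    return result


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
zeros = [[0] * 4 for _ in range(4)]

print(block_addresses(3))
print(update_block(zeros, identity, identity))
```

With identity blocks for both inputs the update subtracts the identity from the C block, which makes a convenient first sanity check before trying random data.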