4 Replies Latest reply on Oct 25, 2011 4:42 PM by corry

    Bursting...What am I doing wrong?

    corry

      See the code below. As I said before, I'm writing something that does block processing. Performance was about what I expected when I limited the amount of data to 12 bytes/thread. I scaled up to 60 bytes, not even half my block size, and performance is abysmal! So I went to the handy-dandy disassembly and, to my shock and horror, despite reading consecutive addresses, I see no bursting (see attached code). For reference, l8.x = 16: as far as I can tell, UAV addresses are byte addresses, so +16 should advance by one full GPR, and indeed, when I print my data, I see my sequential data in order as I expect. r25.x is where I'm storing the address, and l120.x happens to correspond to 60, the buffer size I'm reading in.

      Everything I can see says this should burst. I'm reading sequential values into sequential registers, so what gives? I believe this is with Catalyst 11.10 preview 2.

       

      iadd r32.x, r25.x, l8.x
      iadd r32.y, r32.x, l8.x
      iadd r32.z, r32.y, l8.x
      iadd r32.w, r32.z, l8.x
      iadd r33.x, r32.w, l8.x
      iadd r33.y, r33.x, l8.x
      iadd r33.z, r33.y, l8.x
      iadd r33.w, r33.z, l8.x
      iadd r34.x, r33.w, l8.x
      iadd r34.y, r34.x, l8.x
      iadd r34.z, r34.y, l8.x
      iadd r34.w, r34.z, l8.x
      iadd r35.x, r34.w, l8.x
      iadd r35.y, r35.x, l8.x
      uav_raw_load_id(8) r6, r25.x
      uav_raw_load_id(8) r7, r32.x
      uav_raw_load_id(8) r8, r32.y
      uav_raw_load_id(8) r9, r32.z
      uav_raw_load_id(8) r10, r32.w
      uav_raw_load_id(8) r11, r33.x
      uav_raw_load_id(8) r12, r33.y
      uav_raw_load_id(8) r13, r33.z
      uav_raw_load_id(8) r14, r33.w
      uav_raw_load_id(8) r15, r34.x
      uav_raw_load_id(8) r16, r34.y
      uav_raw_load_id(8) r17, r34.z
      uav_raw_load_id(8) r18, r34.w
      uav_raw_load_id(8) r19, r35.x
      uav_raw_load_id(8) r20, r35.y
      iadd r25.x, r25.x, l120.x
      /////END IL....

      132 TEX: ADDR(22084) CNT(15)
          299 VFETCH R9, R0.y, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          300 VFETCH R38, R0.x, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          301 VFETCH R10, R0.z, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          302 VFETCH R11, R0.w, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          303 VFETCH R12, R1.y, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          304 VFETCH R13, R1.x, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          305 VFETCH R14, R1.z, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          306 VFETCH R15, R1.w, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          307 VFETCH R16, R2.y, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          308 VFETCH R17, R2.x, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          309 VFETCH R18, R2.z, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          310 VFETCH R19, R2.w, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          311 VFETCH R23, R3.y, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          312 VFETCH R21, R3.x, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
          313 VFETCH R20, R3.z, fc170 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
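
      The address arithmetic in the IL listing can be checked with a small host-side model. This is only a sketch, not anything the compiler or driver runs: the function names are made up for illustration, the base address of 0 is arbitrary, and the stride of 16 is the l8.x value quoted in the post. It mimics the iadd chain and confirms that the 15 loads would touch back-to-back 16-byte slots.

```python
# Hypothetical model of the IL address chain; all names are illustrative only.
STRIDE = 16      # l8.x in the post: one 128-bit GPR, in bytes
NUM_LOADS = 15   # number of uav_raw_load_id instructions in the listing


def load_addresses(base, stride=STRIDE, count=NUM_LOADS):
    """Mimic the iadd chain: each address is the previous one plus stride."""
    addrs = [base]
    for _ in range(count - 1):
        addrs.append(addrs[-1] + stride)
    return addrs


def is_contiguous(addrs, width=STRIDE):
    """True when every load starts exactly where the previous one ended."""
    return all(b - a == width for a, b in zip(addrs, addrs[1:]))


addrs = load_addresses(base=0)  # base 0 is an arbitrary choice for the demo
print("first/last address:", addrs[0], addrs[-1])
print("contiguous:", is_contiguous(addrs))
print("bytes covered:", NUM_LOADS * STRIDE)
```

      Under these assumptions the 15 loads cover 240 contiguous bytes, i.e. 60 32-bit values. Whether the hardware or compiler treats such a sequence as a burst candidate is exactly the open question of the thread; the sketch only shows that the addresses themselves are sequential.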

        • Bursting...What am I doing wrong?
          corry

          Odd: when I switch from UAV 8 to UAV 11, I see about a 12% performance increase. It's throwaway data, so I'm not caching, which is why I didn't just use 11 in the first place. It looks like the *only* thing that changed in the ISA code is the mysterious fc value: fc170 became fc176 (ISA attached). That's 12%, but still no bursting. What gives?

          117 TEX: ADDR(22100) CNT(15)
              299 VFETCH R9, R0.y, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              300 VFETCH R38, R0.x, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              301 VFETCH R10, R0.z, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              302 VFETCH R11, R0.w, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              303 VFETCH R12, R1.y, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              304 VFETCH R13, R1.x, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              305 VFETCH R14, R1.z, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              306 VFETCH R15, R1.w, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              307 VFETCH R16, R2.y, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              308 VFETCH R17, R2.x, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              309 VFETCH R18, R2.z, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              310 VFETCH R19, R2.w, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              311 VFETCH R23, R3.y, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              312 VFETCH R21, R3.x, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
              313 VFETCH R20, R3.z, fc176 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)

          • Bursting...What am I doing wrong?
            MicahVillmow
            UAV 8 is a typed surface that is aliased between 3 data types (8, 16 and 32 bit) and thus must go down a conversion path in the hardware. UAV 11 is a raw surface, where no conversions are required, and aliases 3 data types (32, 32x2 and 32x4).
              • Bursting...What am I doing wrong?
                corry

                 

                Originally posted by: MicahVillmow UAV 8 is a typed surface that is aliased between 3 data types(8, 16 and 32bit) and thus must go down a conversion path in the hardware. UAV 11 is a raw surface, where no conversions are required, and aliases 3 data types(32, 32x2 and 32x4).


                Interesting. I thought you had previously said that UAV 11 was 32-bit access, 12 was 16-bit, and 13 was 8-bit, and that the same pattern mapped to some of the other UAVs.

                I'm still not seeing the MEGA being tacked on, even with UAV 11, and my data isn't making it into my program for some reason. I'm still working on that; it's probably something simple. But the MEGA, which most of the online documentation seems to indicate marks bursting, is not being generated. That's my real problem. In this case it's only 15 elements, but it says it can burst read 16 elements at a time. As some cases read many more than 16 elements, I'd like to see it burst read 16 at a time. How do I get it to do this? The data is sequential...
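
                To make the hoped-for behaviour concrete, here is a minimal sketch of how a run of sequential element reads would split into groups if at most 16 elements can be fetched per burst. The limit of 16 is taken from the figure quoted in the post; the function name and the (start, count) representation are hypothetical, and nothing here reflects how the compiler actually schedules fetches.

```python
MAX_BURST = 16  # assumed limit, from the "16 elements at a time" figure in the post


def burst_groups(num_elements, max_burst=MAX_BURST):
    """Split sequential element indices into (start, count) groups of at most max_burst."""
    return [(start, min(max_burst, num_elements - start))
            for start in range(0, num_elements, max_burst)]


print(burst_groups(15))  # the 15-element case from the listing: one group
print(burst_groups(40))  # a longer read: two full groups plus a remainder
```

                With these assumptions, the 15-element read would fit in a single group, and a 40-element read would need three groups of 16, 16 and 8.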

                  • Bursting...What am I doing wrong?
                    corry

                    So, can I get a final answer on why I'm not getting the MEGA (15) like I would expect here, and perhaps a final answer on the UAV numbering dependencies? Please? I'll beg for the MEGA part :) (though it would probably be interpreted as whining, with nothing but text to give context)