6 Replies Latest reply on Nov 11, 2009 5:27 PM by MicahVillmow

    Fetch in compute shader

    ryta1203

      Ok, sorry about the "Horrible 5870 performance" but this goes to the same topic...

      ... why is the 64x1 block size performance so horrid?

      Compute shader might be faster but you really need to know how to get perfect texture fetch to make it so.

      Accessing naively (64x1) gives HORRIBLE performance... WAY worse than pixel shader mode. And if LDS isn't any faster... I mean how many applications out there really need LDS?

        • Fetch in compute shader
          ryta1203

          What's more curious to me is that the 4870 runs twice as fast as the 5870 when accessing that block size in compute shader mode.

          Also, for the same number of inputs in pixel shader mode and an increasing ALU:Fetch ratio, the 4870 changes from texture bound to ALU bound at a lower ALU:Fetch ratio than the 5870, though the overall execution is still faster on the 5870.

            • Fetch in compute shader
              MicahVillmow
              Ryta,
              One of the reasons why 64x1 gives such bad performance in compute shader is how the caches are set up on the 7XX series of cards. The caches are optimized to work in tiled mode but not in linear mode, which is what compute shader uses. This should be improved with the 8XX series of cards, but I have not had time to verify it myself. In order to get optimal cache re-use from the texture in CS mode on the 7XX series of cards, you need to reblock your thread IDs. A 16x4, 8x8, or 4x16 block should give you good enough blocking to get cache performance similar to your PS kernel. This is because a cache line can be thought of as a 4x2 block of data coming in at once. So in pixel shader, 64 threads are blocked in an 8x8 block, which uses exactly 8 cache lines. In compute shader, your 64x1 block pattern touches 16 cache lines but only uses half the data in each one. This is why the performance is worse in CS mode.
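
To make the cache line arithmetic above concrete, here is a minimal Python sketch of the counting argument. It is only a model, not real hardware behaviour: it assumes a cache line is an aligned 4x2 tile of elements, as described above, and that one 64-thread wavefront starts at the origin of a domain 64 elements wide. The names `reblock` and `cache_lines_touched` are hypothetical and do not come from the SDK.

```python
def reblock(linear_id, block_w, block_h, domain_w):
    """Map a linear thread ID to an (x, y) coordinate so that each run of
    block_w * block_h consecutive IDs covers one block_w x block_h tile.
    Hypothetical helper; assumes domain_w is a multiple of block_w."""
    threads_per_block = block_w * block_h
    block_id, local_id = divmod(linear_id, threads_per_block)
    blocks_per_row = domain_w // block_w
    block_y, block_x = divmod(block_id, blocks_per_row)
    local_y, local_x = divmod(local_id, block_w)
    return block_x * block_w + local_x, block_y * block_h + local_y


def cache_lines_touched(block_w, block_h, domain_w=64, line_w=4, line_h=2):
    """Count the distinct 4x2 cache lines (assumed shape) touched by the
    first block_w * block_h threads, plus the fraction of fetched data used."""
    lines = set()
    for tid in range(block_w * block_h):
        x, y = reblock(tid, block_w, block_h, domain_w)
        lines.add((x // line_w, y // line_h))
    used = block_w * block_h
    fetched = len(lines) * line_w * line_h
    return len(lines), used / fetched


if __name__ == "__main__":
    for w, h in [(64, 1), (16, 4), (8, 8), (4, 16)]:
        n_lines, utilization = cache_lines_touched(w, h)
        print(f"{w}x{h}: {n_lines} cache lines, {utilization:.0%} of fetched data used")
```

Under these assumptions the 64x1 pattern touches 16 lines at half utilization, while 16x4, 8x8, and 4x16 each touch 8 fully used lines, which matches the numbers in the post.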

              As for LDS, the LDS on the 7XX series is very specialized and does not work for many algorithms. When utilized correctly, it has performance characteristics very similar to the L1 cache, but the user gains more control over it. LDS_Transpose utilizes the 7XX series LDS very efficiently.

              The 8XX series, however, is a different story: its LDS is generic and can also perform many times faster than the 7XX series LDS.

              For the performance differences between 8XX and 7XX cards, if you have specific benchmarks that can show the performance difference, please let us know so we can analyze them and fix the performance issue.

              Hope this helps.
                • Fetch in compute shader
                  ryta1203

                  Micah,

                    In compute shader mode with a 64x1 block size, the 5870 is about twice as slow as the 4870 on a very simple benchmark. It would be great if you could verify this. The kernel code for all my benchmarks stays the same when I run it across the cards.

                    I do have several benchmarks that test different parameters/aspects of the newer generation cards; however, at the moment I'm only using a 64x1 block size. I will try an 8x8 block size shortly if I have time; I'm on a deadline.

                    • Fetch in compute shader
                      ryta1203

                      Micah,

                        I just have two more questions:

                      1. It doesn't seem that streaming store and global write are any different? Is that true?

                      2. I get better performance on the 5870 using streaming store and global read than I do using streaming store and texture fetch in pixel shader mode for float4 data types. Does that sound right? What I mean is that it appears for pixel shader mode texture fetch stays the bottleneck for even very high ALU:Fetch ratios, while this is not true if I use Global Read.

                        • Fetch in compute shader
                          ryta1203

                           

                          Originally posted by: ryta1203

                          Micah,

                            I just have two more questions:

                          1. It doesn't seem that streaming store and global write are any different? Is that true?

                          2. I get better performance on the 5870 using streaming store and global read than I do using streaming store and texture fetch in pixel shader mode for float4 data types. Does that sound right? What I mean is that it appears for pixel shader mode texture fetch stays the bottleneck for even very high ALU:Fetch ratios, while this is not true if I use Global Read.

                          1. I would still like an answer to this one.

                          2. What I mean is, for float4 on the 5870 (greater is better):

                          4x16 > 64x1 block size in compute shader mode

                          global read in pixel shader mode > texture fetch in pixel shader mode

                          4x16 in compute shader mode ~= global read in compute shader mode

                          Do these comparisons look accurate? It still seems that pixel shader mode is faster, regardless, even though ATI claims that theoretically it shouldn't be.

                          However, for the 4870, I found that 4x16 texture fetch was much better than global read. On the 4870, I found that global read was the same for float, float4, pixel and compute (any combo thereof); however, on the 5870, it seems pixel was a little faster than compute. I have a lot of results, and so a lot of questions, but I don't have the time to share them all to see if the results are accurate or not.

                            • Fetch in compute shader
                              MicahVillmow
                              Ryta,
                              There are known performance issues that we are working on for the 5XXX series of cards, and you should see improvements in the next release of the SDK. Theoretically, CS mode should be faster, but depending on the use case, the optimized paths in PS mode can outperform the general paths in CS mode. I haven't done enough testing myself to answer in greater detail for the 5XXX series of cards, but if you can give a little more detail on what exactly you are trying to do in these tests, I can ask a few people whether the results align with what we currently know.
                              "
                              4x16 > 64x1 block size in compute shader mode

                              global read in pixel shader mode > texture fetch in pixel shader mode

                              4x16 in compute shader mode ~= global read in compute shader mode
                              "

                              Are these copy tests, or something else?