10 Replies Latest reply on Nov 12, 2008 7:00 AM by sgratton

    Can LDS reads be "broadcast" within a wavefront?

    sgratton

      Hi there,

      I've finally gotten a compute shader working using LDS, doing matrix multiply as a test (code here). However, I am disappointed with its performance: about 200 GFLOP/s on a 4870, which is worse than e.g. the SDK pixel shader code.

      The idea was to store a block of one of the matrices in the LDS for each wavefront, to cut down on global memory access. Each wavefront computes a large subblock, again to reduce memory access. Then, all threads in the wavefront read the same element of the LDS at any given time (i.e. first all threads in the wavefront read element 0 of thread 0, then all threads read element 0 of thread 1, and so on). I had imagined that this might only count as 1 read of the LDS, rather than 64, and so be "fast" -- does anybody know if this is the case or not? I am wondering if this might be the bottleneck.
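      As a sketch of the pattern (in NumPy rather than IL; `blocked_matmul` is just an illustrative helper, not my actual kernel): the tile below plays the role of the per-wavefront LDS block, loaded once and then read element-by-element in lockstep by the whole stripe.

```python
import numpy as np

def blocked_matmul(A, B, Mb, Nb, Kb):
    """Blocked C = A @ B; block sizes must divide the matrix dimensions.
    The tile of B stands in for the per-wavefront LDS block: it is
    loaded once, and the whole A-stripe then reads each tile element
    "in lockstep" -- the broadcast pattern asked about above."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, Mb):
        for j in range(0, n, Nb):
            for p in range(0, k, Kb):
                tile = B[p:p+Kb, j:j+Nb]                    # one "LDS" load
                C[i:i+Mb, j:j+Nb] += A[i:i+Mb, p:p+Kb] @ tile
    return C
```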

      If reads aren't "broadcast" then it does suggest that the LDS is not necessarily ideally suited to being used as a "cache"; have people found other good uses for it that couldn't be done, say, by synchronising on global memory accesses? If shared registers were addressable in IL code then they could potentially play the role of a cache -- might this be in the pipeline?

      Thanks,
      Steven.


        • Can LDS reads be "broadcast" within a wavefront?
          eduardoschardong
          Have you tried:
          - Other sizes for the sub-matrices?
          - Shared registers?
          - _neighborExch (for data layout)?
            • Can LDS reads be "broadcast" within a wavefront?
              sgratton

              Hi there,

              Thanks a lot for the suggestions! Other sub-matrix sizes are a possibility, but they are highly constrained by the wavefront size, by using float4's, and by wanting to burst-write the result (for application to a future problem). I did what seemed good in theory; it takes me such a long time to get a working IL program that I can't really try all reasonable possibilities! I did think about shared registers, but couldn't see how to use them in this problem, especially as they are per SIMD rather than per wavefront, and non-indexable. I chose to do C=A^T B and laid out the accesses in a certain way so that I didn't need to use _NeighborExch, though for other layouts it could indeed be useful. I think more information and a detailed profiler would really help in situations like this, where each thread has to both access a lot of data and do a lot of calculation.

              Best,
              Steven.

            • Can LDS reads be
              MicahVillmow
              Steven,
              The issue is not exactly the usage of LDS, but how it is compiled to ISA. If you look at the resulting ISA code you will see that there is an LDS_WRITE followed by a couple of ALU instructions, then another LDS_WRITE, and so on for all of the LDS operations. This is the bottleneck and is a known issue. It cuts performance dramatically when using the LDS, but once write/read combining gets implemented in the compiler and proper ISA is generated, you should see an immediate performance improvement.
                • Can LDS reads be "broadcast" within a wavefront?
                  sgratton

                  Hi Micah,

                  Thanks for the info; glad to hear that there could well be a performance boost soon.

                  However, looking at the GPU ISA (I've put it here), the vast majority of the loop is ALU (1024 MULADDs in 16 clauses plus a bit of other ALU in 5 clauses, vs 17 SAMPLEs in 3 clauses and 65 LDS instructions in 17 clauses), so I thought the 4870 would really do well here. There is only one LDS_WRITE per loop but 16 LDS_READs (4 clauses of 4 instructions) that do seem pretty well integrated, hence my suspecting the intrinsic speed of reads rather than a problem with the write. Do you think the issue you mention would really slow it down that much? (Also, to use the LDS as a shared cache you need a sync between the write and the reads, so LDS read/write combining might not help.)

                  One other possibility is that each thread has ended up using too many GPRs (55) to hide all the latency. (This is an example where turning off "optimization" might be really helpful -- the code shouldn't need more than about 35 registers and could easily be trimmed down to perhaps 26 or so. Also note the suboptimal burst-writing at the end.)

                  Or, perhaps the program is just too big, at 12960 bytes?

                  Do you think either of these could also be an issue?

                  Thanks a lot,
                  Steven.
                • Can LDS reads be
                  MicahVillmow
                  Steven,
                  It is actually a combination of the first two. Each clause break has a penalty of some number of cycles, and the fact that you have 41 clause breaks per loop is pretty large. If you look at the simple_matmult ISA you can see that there are only 4-5 clause breaks, so right away you have ~10 times more hardware context switches. The other issue is that you are using a lot of registers, and therefore not enough wavefronts are executing to hide the memory/clause latency.

                  For LDS, you need a sync between a write and a read. If the compiler determines that a sync is not needed it will remove it. It determines this in the case where your group size is smaller than or equal to the wavefront size. The only time this is not required is when a read and a write are in the same clause. The hardware guarantees atomicity in a clause, but not between clauses.

                  One LDS write and 16 LDS reads can be put into a minimum of 3 clauses, so the fact that it is using 5 means it is sub-optimal. This is a known issue, as is the register bloat, and I have brought both to the attention of the compiler team.
                    • Can LDS reads be
                      sgratton

                      Hi Micah,

                      That's very interesting. The memory reads required for matrix multiplication go as (number of FLOPs)*(1/M + 1/N), where M and N are the submatrix sizes. I was trying to use big submatrices to get this number down; I thought matrix multiply in the SDK must be memory limited on both the 3800 and 4800, seeing as it seems to run at about 200/400 GFLOP/s rather than the theoretical 500/1200 GFLOP/s. If that is the case then, as (1/256+1/16) is under a fifth of (1/8+1/4), the larger blocks should have helped a lot (assuming the texture caching is efficient). But the other effects mentioned seem to outweigh this saving. I'll have to try the code again with the next release of the compiler!
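                      The saving is easy to check numerically (`traffic_per_flop` is just the per-FLOP traffic factor from the count above, written as a throwaway helper):

```python
def traffic_per_flop(M, N):
    # Elements read per FLOP for an MxN submatrix: 0.5*(1/M + 1/N)
    return 0.5 * (1.0 / M + 1.0 / N)

small = traffic_per_flop(8, 4)     # SDK-style subblock
large = traffic_per_flop(256, 16)  # the larger blocks tried here
print(large / small)               # ratio ~0.18
```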

                      Thanks a lot,
                      Steven.
                    • Can LDS reads be
                      MicahVillmow
                      Steven,
                      You are correct that we are memory limited, but the theoretical peaks are not 500/1200, but closer to 264/640.
                      This can be calculated by 2 * M * N * K / tex_time / (2^30).

                      You can find more information on how to calculate theoretical peaks here:
                      http://coachk.cs.ucf.edu/courses/CDA6938/
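                      In code (the helper name and the 1-second timing are illustrative; tex_time is the measured texture-fetch time in seconds):

```python
def measured_gflops(M, N, K, tex_time):
    # 2*M*N*K flops for an MxNxK multiply, over the measured texture time
    return 2.0 * M * N * K / tex_time / 2**30

print(measured_gflops(1024, 1024, 1024, 1.0))  # 2.0 GFLOP/s
```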
                        • Can LDS reads be
                          sgratton
                          Hi Micah,

                          The numbers I mentioned were the ALU peaks for the card; I can reproduce your numbers as being the best the texture units can do assuming an 8x4 submatrix. If you double the submatrix size, to 16x8, these double too, and for a sufficiently large submatrix you should be able to approach the ALU peak, which of course is a hard upper bound.

                          My calculation is as follows. An MxN submatrix is obtained by multiplying an Mxk by a kxN matrix, so needs k(M+N) elements read in via texture. There are (m/M)*(n/N) submatrices, so the total number of elements that need to be read in is (m/M)(n/N)k(M+N)=(2mnk)*0.5*(1/M+1/N), i.e. the number of FLOPs (2mnk) times 0.5*(1/M+1/N).

                          So the texture time is FLOPs*0.5*(1/M+1/N)*Bpp/(B/s), where Bpp is the number of bytes per element and B/s is the texturing rate in bytes per second.

                          Taking the texture time as a lower limit for the total time, we arrive at:

                          time > FLOPs*0.5*(1/M+1/N)*Bpp/(B/s), or

                          GFLOP/s < (GB/s)/(0.5*(1/M+1/N)*Bpp)
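                          Numerically (`gflops_bound` is just a throwaway helper for the bandwidth ceiling GFLOP/s < (GB/s)/(0.5*(1/M+1/N)*Bpp)):

```python
def gflops_bound(M, N, Bpp, GBps):
    # Texture-bandwidth ceiling on GFLOP/s for an MxN subblock:
    # GFLOP/s < (GB/s) / (0.5*(1/M + 1/N) * Bpp)
    return GBps / (0.5 * (1.0 / M + 1.0 / N) * Bpp)

print(gflops_bound(8, 4, 4, 480))   # 640.0: 8x4 floats at 480 GB/s
print(gflops_bound(8, 16, 4, 480))  # 1280.0: 8x16 floats at 480 GB/s
```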

                          Now, I have found it hard to work out what the theoretical texture rate actually is. For example, I've read that the 3870 can do 64-bit textures at "full speed" whereas the 4870 does them at "half speed", and that the 3870 can do unfiltered and filtered reads at the same time whereas the 4870 can't; but it isn't clear what this means for CAL, particularly whether it makes a difference if you're sampling float2's, float4's or double2's, and thus for the formulae presented in the pdf you mention. Still, from a "Terascale Graphics Engine" pdf by Hartog I think the 4870 can do 480 GB/s peak, and, up to the caveats above, the 3870 about 200 GB/s peak.

                          So for floats, and for an 8x4 subblock, we indeed get the 240/640 numbers you mention. However, for an 8x16 subblock, we already get 480/1280, i.e. around the ALU limit.

                          So, excepting possible latency issues, I don't see why your cards shouldn't really be going at full ALU speed in either single precision or double precision! Do you agree?

                          To really hit this one does of course need all the data in the L1 cache at the right time, i.e. one needs some sharing between wavefronts, which is tricky for me to reason about without knowing the details. I tried bigger subblocks even on a 3870 (see "matmult" and "bothstripedmmm" in here), using a pixel shader reading in partitioned matrices as a guess and outputting to global memory (thus avoiding the 8x4 limit), but it didn't show much advantage.

                          The number of wavefronts needed to avoid texture waits is a problem; hence a) my concern about inefficient register usage and b) the attempted use of the LDS to store some info -- I don't know if it is faster than, or independent of, the regular texturing path.

                          Perhaps our whole discussion is an example of where comprehensive hardware information, particularly about the memory system and about how wavefronts are actually scheduled and run, would really help developers implementing algorithms, especially given how long it takes to program in IL! I'm sure AMD would benefit from any risk taken in this way, as it would let people demonstrate that they can really take full advantage of the hardware. I think the latter is essential for HPC -- each GPU really has to do much better than a CPU to make using a significant GPU cluster worthwhile. Given the points you've already made, I think I can improve performance non-trivially with a few simple tweaks!

                          Best wishes,
                          Steven.

                        • Can LDS reads be
                          MicahVillmow
                          Steven,
                          Yeah, there are some assumptions that I am making that are not public. The simple_matmult example that we have is pretty much optimal for our hardware. It was written in a way that balances the number of wavefronts executing, the number of registers used per thread, and the L1 cache access pattern. There might be some tricks that can increase performance by minor percentage points (data tiling), but the algorithm and the structure will be the same.

                          Assuming unlimited registers, the larger block size is better, but since more registers means fewer wavefronts in flight, there is a pretty large performance penalty for using more.
                          I do agree that more hardware-specific information would be greatly beneficial, and that IL takes a while to program in, but rest assured that we are working on both of these aspects.

                          Good luck,
                            • Can LDS reads be
                              sgratton

                              Dear Micah,

                              I really do hope AMD basically fully disclose how the memory and caching work with the next SDK, to complement what is already known about the ALU side (though this needs updating from the 2900 era to the 4800 era...). If this doesn't happen, I unfortunately foresee a situation where there will be a "two-tier" development of code for AMD GPUs:

                              1) AMD (and perhaps selected partner companies?) who have the information necessary to create great algorithms and libraries for both "pure streaming" and also less trivial gpu applications

                              2) Everyone else who is told enough to be able to write good "pure streaming" applications and okay-performing less trivial applications

                              While this may be okay for home single-desktop use, the great problem is that unless AMD happens to have released a program that does just what one wants, if the task is at all non-trivial (e.g. Cholesky factorization using MPI on a multi-GPU cluster...), one may as well not bother: decent performance basically becomes a lottery, and you need close-to-optimal performance, as opposed to okay performance, to make GPGPU worthwhile in any non-single-desktop setting. This would be a great shame, because the hardware seems very capable in principle.

                              Best wishes,
                              Steven.