8 Replies Latest reply on Sep 2, 2011 7:36 AM by galmok

    Matrix inversion within single work item

    barno
      Matrix sizes up to 25x25, low performance

      Hi,
      I need to do a matrix inversion within each of my work items. Right now the matrices are 5x5 in size, but eventually they might be up to 25x25. I implemented a simple Gauss-Jordan elimination method, but it performs pretty badly.

      I've read somewhere that with its current implementation on AMD GPUs, private arrays are not allocated in private memory but in global memory instead. Is that still true, and might that be causing the low performance?

      Any other ideas?

      Thanks!
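      For reference, a minimal plain-C sketch of the Gauss-Jordan inversion described above, with partial pivoting added for stability (the function name and the singularity tolerance are illustrative; a kernel version would run the same logic in each work item):

```c
#include <math.h>
#include <stddef.h>

/* Invert an n x n row-major matrix via Gauss-Jordan with partial
 * pivoting. a is destroyed; inv receives the inverse.
 * Returns 0 on success, -1 if the matrix is (near-)singular. */
static int gj_invert(float *a, float *inv, size_t n)
{
    /* Start inv as the identity; row-reduce a to the identity and
     * apply the same row operations to inv. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            inv[i * n + j] = (i == j) ? 1.0f : 0.0f;

    for (size_t col = 0; col < n; col++) {
        /* Partial pivoting: pick the largest entry in this column. */
        size_t piv = col;
        for (size_t r = col + 1; r < n; r++)
            if (fabsf(a[r * n + col]) > fabsf(a[piv * n + col]))
                piv = r;
        if (fabsf(a[piv * n + col]) < 1e-12f)
            return -1;
        if (piv != col) {
            for (size_t j = 0; j < n; j++) {
                float t = a[col * n + j];
                a[col * n + j] = a[piv * n + j];
                a[piv * n + j] = t;
                t = inv[col * n + j];
                inv[col * n + j] = inv[piv * n + j];
                inv[piv * n + j] = t;
            }
        }
        /* Scale the pivot row, then eliminate the column elsewhere. */
        float d = a[col * n + col];
        for (size_t j = 0; j < n; j++) {
            a[col * n + j] /= d;
            inv[col * n + j] /= d;
        }
        for (size_t r = 0; r < n; r++) {
            if (r == col)
                continue;
            float f = a[r * n + col];
            for (size_t j = 0; j < n; j++) {
                a[r * n + j]   -= f * a[col * n + j];
                inv[r * n + j] -= f * inv[col * n + j];
            }
        }
    }
    return 0;
}
```

      Note that the pivot search and row swap branch on the data, which is exactly the kind of per-work-item divergence discussed below.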

        • Matrix inversion within single work item
          jeff_golds


          Originally posted by: barno Hi, I need to do a matrix inversion within each of my work items. Right now the matrices are 5x5 in size, but eventually they might be up to 25x25. I implemented a simple Gauss-Jordan elimination method, but it performs pretty badly.

          I've read somewhere that with its current implementation on AMD GPUs, private arrays are not allocated in private memory but in global memory instead. Is that still true, and might that be causing the low performance?



          It depends on the size of the private array.  In general, the compiler will attempt to fit the private array into registers, but if you're using a large array, then it will be placed in global memory.

          Jeff

            • Matrix inversion within single work item
              barno


              Originally posted by: jeff_golds It depends on the size of the private array.  In general, the compiler will attempt to fit the private array into registers, but if you're using a large array, then it will be placed in global memory.

              Jeff

              I see. Right now the biggest array used during the calculation is 50 floats in size. I guess that should fit into the registers of my HD6970...?!

              Any other ideas?

            • Matrix inversion within single work item
              notzed

               

              Originally posted by: barno I've read somewhere that with its current implementation on AMD GPUs, private arrays are not allocated in private memory but in global memory instead. Is that still true, and might that be causing the low performance?

              Any other ideas?

              Thanks!

              I doubt the memory is the problem on its own - Gauss-Jordan elimination uses loops which depend on the data, doesn't it?  i.e. each 'thread' follows divergent paths. This can't run quickly.

              For the memory issue, you could try a local array and just ensure each 'thread' accesses its own (non-conflicting) range.  LS is very fast.  Last time I looked, 'arrays' can only be registerised if the index can be resolved at compile time, but maybe AMD have some magic around this (it's not something I've ever seen a CPU ISA support efficiently).
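              One way to lay out such a local array, sketched here in plain C (N and WG_SIZE are illustrative): give each work item a strided slice, so that when every work item in a wavefront touches the same element (i, j), consecutive work items hit consecutive addresses - and hence different LDS banks - instead of colliding:

```c
#include <stddef.h>

#define N       5   /* matrix dimension, per work item */
#define WG_SIZE 64  /* work-group size */

/* Element (i, j) of work-item lid's scratch N x N matrix lives at
 * (i*N + j) * WG_SIZE + lid in a shared buffer of N*N*WG_SIZE
 * floats. The lid index varies fastest, so uniform accesses to the
 * same (i, j) by all work items are spread across adjacent banks. */
static size_t scratch_index(size_t lid, size_t i, size_t j)
{
    return (i * N + j) * WG_SIZE + lid;
}
```

              In a kernel the buffer would be a `__local float` array of N*N*WG_SIZE elements, with `lid = get_local_id(0)`.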

              But if you have a lot of work, it might make sense to run some parallel algorithm using a separate kernel step.  Trying to do too much work in a given 'thread' sometimes doesn't work very well - register spillage, running out of LS, or being forced to use a non-ideal work-group topology for that stage.


                • Matrix inversion within single work item
                  douglas125

                  Matrix inversion is quite unusual in real applications and has many, many special cases.

                  As pointed out in the previous comment, iterative methods will cause flow divergence, which means that, in practice, ALL solves will take as long as the LONGEST one in the warp.

                  Are your matrices symmetric positive definite? If yes, you should consider the very stable Cholesky factorization.

                  Even LU factorization won't perform very well, because you can't expect stability without pivoting, and pivoting will also cause flow divergence.


                  EDIT: What I mean by "matrix inversion is unusual" is that you don't explicitly form the inverse when you want to solve a linear system.

                    • Matrix inversion within single work item
                      galmok

                      My experience with matrix inversion on the GPU is that small matrices are next to impossible to invert faster on the GPU than on the CPU. But if the initial matrix is already on the GPU and the result also has to stay on the GPU, moving the data back to the CPU just for the matrix inversion will probably be too slow.

                      Gauss-Jordan will be too slow and is mostly used to grasp the general principle of matrix inversion. You should rather use LU (with pivoting) or Cholesky (both followed by two triangular solves), which produce quite reliable results.
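                      A plain-C sketch of that approach - LU with partial pivoting followed by the two triangular solves - for a single right-hand side (the name and the tolerance are illustrative):

```c
#include <math.h>

/* Solve A x = b via LU factorization with partial pivoting, then a
 * forward and a backward triangular solve. a is n x n row-major and
 * is overwritten with L (unit diagonal, strictly below) and U (on
 * and above the diagonal); perm records the row permutation.
 * Returns 0 on success, -1 if a pivot is (near-)zero. */
static int lu_solve(float *a, const float *b, float *x, int *perm, int n)
{
    for (int i = 0; i < n; i++)
        perm[i] = i;

    /* Factor: pick the largest pivot in each column, swap rows,
     * then eliminate below the diagonal. */
    for (int k = 0; k < n; k++) {
        int p = k;
        for (int r = k + 1; r < n; r++)
            if (fabsf(a[r * n + k]) > fabsf(a[p * n + k]))
                p = r;
        if (fabsf(a[p * n + k]) < 1e-12f)
            return -1;
        if (p != k) {
            for (int j = 0; j < n; j++) {
                float t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            int t = perm[k]; perm[k] = perm[p]; perm[p] = t;
        }
        for (int r = k + 1; r < n; r++) {
            float m = a[r * n + k] / a[k * n + k];
            a[r * n + k] = m;
            for (int j = k + 1; j < n; j++)
                a[r * n + j] -= m * a[k * n + j];
        }
    }

    /* Forward solve L y = P b (y is stored in x). */
    for (int i = 0; i < n; i++) {
        float s = b[perm[i]];
        for (int j = 0; j < i; j++)
            s -= a[i * n + j] * x[j];
        x[i] = s;
    }
    /* Backward solve U x = y. */
    for (int i = n - 1; i >= 0; i--) {
        float s = x[i];
        for (int j = i + 1; j < n; j++)
            s -= a[i * n + j] * x[j];
        x[i] = s / a[i * n + i];
    }
    return 0;
}
```

                      The pivot search is the data-dependent part that diverges between work items, as pointed out above.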

                      But implementing LU (or Cholesky) on the GPU and having it work well for such small matrices will be difficult, I guess.

                      You could take a look at the AMD Accelerated Parallel Processing Math Libraries (APPML) http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx which have some of the functions you need (triangular solves (TRSM) and matrix multiply (GEMM)), but not DGETRF (LU factorization) or DPOTRF (Cholesky). You can, however, find the Fortran source code for both of these functions (it is fairly easy to read) and try to implement an OpenCL version from that.