
barno
Journeyman III

Matrix inversion within single work item

Matrix sizes up to 25x25, low performance

Hi,
I need to do a matrix inversion within each of my work items. Right now the matrices are 5x5 in size, but eventually they might be up to 25x25. I implemented a simple Gauss-Jordan elimination method, but it performs pretty badly.
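Roughly along these lines - a stripped-down sketch of what the kernel does (size fixed at compile time, row-major storage, no pivoting or singularity check), not my exact code:

#define N 5   /* will eventually need to go up to 25 */

/* Invert one N x N matrix via classic Gauss-Jordan without pivoting.
   in/out point to private memory. */
void invert(const float *in, float *out)
{
    float a[N * N];                       /* private working copy */

    for (int i = 0; i < N * N; ++i) {
        a[i] = in[i];
        out[i] = 0.0f;
    }
    for (int i = 0; i < N; ++i)
        out[i * N + i] = 1.0f;            /* start from the identity */

    for (int col = 0; col < N; ++col) {
        float p = a[col * N + col];       /* no pivot search */
        for (int j = 0; j < N; ++j) {
            a[col * N + j]   /= p;
            out[col * N + j] /= p;
        }
        for (int row = 0; row < N; ++row) {
            if (row == col)
                continue;
            float f = a[row * N + col];
            for (int j = 0; j < N; ++j) {
                a[row * N + j]   -= f * a[col * N + j];
                out[row * N + j] -= f * out[col * N + j];
            }
        }
    }
}

/* One matrix per work item, stored contiguously in global memory. */
__kernel void invert_all(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    float m[N * N], inv[N * N];

    for (int i = 0; i < N * N; ++i)
        m[i] = in[gid * N * N + i];

    invert(m, inv);

    for (int i = 0; i < N * N; ++i)
        out[gid * N * N + i] = inv[i];
}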

I've read somewhere that with its current implementation on AMD GPUs, private arrays are not allocated in private memory but in global memory instead. Is that still true, and might that be causing the low performance?

 

Any other ideas?

Thanks!

0 Likes
8 Replies

Originally posted by: barno Hi, I need to do a matrix inversion within each of my work items. Right now the matrices are 5x5 in size, but eventually they might be up to 25x25. I implemented a simple Gauss-Jordan elimination method, but it performs pretty badly.

 

I've read somewhere that with its current implementation on AMD GPUs, private arrays are not allocated in private memory but in global memory instead. Is that still true, and might that be causing the low performance?



It depends on the size of the private array.  In general, the compiler will attempt to fit the private array into registers, but if you're using a large array, then it will be placed in global memory.
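For example (illustration only - where a private array actually ends up depends on the compiler version and, in particular, on whether the indexing can be resolved at compile time):

/* Small array, fixed trip count: the loop can be fully unrolled and the
   array can usually be kept entirely in registers. */
__kernel void small_private(__global float *out)
{
    float t[5];
    for (int i = 0; i < 5; ++i)
        t[i] = i * 2.0f;
    out[get_global_id(0)] = t[4];
}

/* Large array indexed with a value only known at run time: very likely to
   be spilled to scratch (global) memory. */
__kernel void big_private(__global float *out, int idx)
{
    float t[625];                         /* 25x25 worst case */
    for (int i = 0; i < 625; ++i)
        t[i] = i * 2.0f;
    out[get_global_id(0)] = t[idx];
}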

Jeff

0 Likes

Originally posted by: jeff_golds

It depends on the size of the private array.  In general, the compiler will attempt to fit the private array into registers, but if you're using a large array, then it will be placed in global memory.

Jeff

 

I see. Right now the biggest array used during calculation is 50 floats in size. I guess that should fit into the registers of my HD6970...?!

Any other ideas?

0 Likes

Please run the AMD APP Profiler and check the value of the ScratchRegs counter. From that we can tell whether your kernel requires more registers than are available.

0 Likes

Matrix inversion is notoriously unstable. Why not use something like LU?

0 Likes

Originally posted by: barno

I see. Right now the biggest array used during calculation is 50 floats in size. I guess that should fit into the registers of my HD6970...?!

Any other ideas?

Sure, 50 floats should fit, but you need to take into account whether you are using other registers as well.  As mentioned, use the profiler to see if you are using scratch memory.

Jeff

0 Likes
notzed
Challenger

Originally posted by: barno

I've read somewhere that with its current implementation on AMD GPUs, private arrays are not allocated in private memory but in global memory instead. Is that still true, and might that be causing the low performance?

Any other ideas?

Thanks!

I doubt the memory is the problem on its own - Gauss-Jordan elimination uses loops which depend on the data, doesn't it? i.e. each 'thread' follows divergent paths. This can't run quickly.

For the memory issue, you could try a local array and just ensure each 'thread' accesses its own (non-conflicting) range. LS is very fast. Last time I looked, 'arrays' can only be registerised if the index can be resolved at compile time, but maybe AMD have some magic around this (it's not something I've ever seen a CPU ISA be able to support efficiently).
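Something along these lines, say (rough sketch only: work-group size fixed at compile time, one non-overlapping slice of LS per 'thread', bank conflicts ignored - and a 25x25 matrix times 64 work items won't fit in LS, so you'd need a smaller work-group or another layout for that case):

#define N  5
#define WG 64                             /* work-group size for this sketch */

__kernel __attribute__((reqd_work_group_size(WG, 1, 1)))
void invert_all(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    /* One N*N slice of local memory per work item; ranges don't overlap,
       so no barriers are needed.  64 * 5 * 5 * 4 bytes = 6.25 KB of LS. */
    __local float scratch[WG * N * N];
    __local float *a = scratch + lid * N * N;

    for (int i = 0; i < N * N; ++i)
        a[i] = in[gid * N * N + i];

    /* ... do the elimination in place on a[] exactly as before ... */

    for (int i = 0; i < N * N; ++i)
        out[gid * N * N + i] = a[i];
}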

But if you have a lot of work it might make sense to run some parallel algorithm as a separate kernel step. Trying to do too much work in a given 'thread' sometimes doesn't work very well - register spillage, running out of LS, or being forced to use a non-ideal work-group topology for that stage.

 

0 Likes

Matrix inversion is quite unusual in real applications and has many, many special cases.

As pointed out in the previous comment, iterative methods will cause flow divergence, which means that, in practice, ALL solves will take the LONGEST time in the warp.

Are your matrices symmetric positive definite? If yes, you should consider the very stable Cholesky factorization.
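The factorization itself has fixed loop bounds for a fixed-size matrix, so it runs without divergence inside one work item. A rough sketch (OpenCL C style, size fixed at compile time, no check that the input really is SPD):

#define N 5

/* In-place lower-triangular Cholesky factorization, A = L * L^T.
   Only the lower triangle of a[] (row-major) is read and written. */
void cholesky(float *a)
{
    for (int j = 0; j < N; ++j) {
        float d = a[j * N + j];
        for (int k = 0; k < j; ++k)
            d -= a[j * N + k] * a[j * N + k];
        d = sqrt(d);                      /* NaN here means A was not SPD */
        a[j * N + j] = d;

        for (int i = j + 1; i < N; ++i) {
            float s = a[i * N + j];
            for (int k = 0; k < j; ++k)
                s -= a[i * N + k] * a[j * N + k];
            a[i * N + j] = s / d;
        }
    }
}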

Even LU factorization won't perform very well because you can't expect stability without pivoting, which will also cause flow divergence.

 

EDIT: What I mean by "matrix inversion is unusual" is that you don't explicitly form the inverse when you want to solve a linear system.

0 Likes

My experience with matrix inversion on the GPU is that small matrices are next to impossible to invert faster on the GPU than on the CPU. But if the initial matrix is already on the GPU and the result also has to stay on the GPU, moving the data back to the CPU just for the matrix inversion will probably be too slow.

Gauss-Jordan will be too slow and is really only used to illustrate the general principle of matrix inversion. You should rather use LU (with pivoting) or Cholesky (both followed by two triangle solvers), which produce quite reliable results.
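The two triangle solvers after the factorization are just short loops; for the Cholesky case something like this (sketch only, assuming the factor L sits in the lower triangle of a row-major array a[]):

#define N 5

/* Solve A x = b where A = L * L^T: forward substitution L y = b,
   then back substitution L^T x = y. */
void cholesky_solve(const float *a, const float *b, float *x)
{
    float y[N];

    for (int i = 0; i < N; ++i) {         /* L y = b */
        float s = b[i];
        for (int k = 0; k < i; ++k)
            s -= a[i * N + k] * y[k];
        y[i] = s / a[i * N + i];
    }

    for (int i = N - 1; i >= 0; --i) {    /* L^T x = y */
        float s = y[i];
        for (int k = i + 1; k < N; ++k)
            s -= a[k * N + i] * x[k];     /* L^T[i][k] == L[k][i] */
        x[i] = s / a[i * N + i];
    }
}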

But implementing LU (or Cholesky) on the GPU and having it work well for such small matrices will be difficult, I guess.

You could take a look at the AMD Accelerated Parallel Processing Math Libraries (APPML) http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx which have some of the functions you need (triangle solvers (TRSM) and matrix multiply (GEMM)), but not DGETRF (LU factorization) or DPOTRF (Cholesky). You can, however, find the Fortran source code for both of these functions (it is fairly easy to read) and try to implement an OpenCL version from that.

0 Likes