I need to do a matrix inversion within each of my work items. Right now the matrices are 5x5, but eventually they might be as large as 25x25. I implemented a simple Gauss-Jordan elimination method, but it performs pretty badly.
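For reference, here is a minimal sketch of the kind of Gauss-Jordan inversion I mean, written as plain C rather than as my actual kernel. The names `invert_gauss_jordan`, `abs_f`, and `MAX_N`, the row-major layout, and the partial pivoting step are illustrative assumptions. In a kernel, the scratch array `inv` would be the per-work-item private array I am asking about.

```c
#include <assert.h>

#define MAX_N 25 /* assumed upper bound, taken from the largest size mentioned */

/* Small helper so the sketch does not depend on the math library. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Inverts the n x n row-major matrix `a` in place using Gauss-Jordan
 * elimination with partial pivoting. Returns 0 on success, or -1 if a
 * pivot is numerically zero and the matrix is treated as singular. */
int invert_gauss_jordan(float *a, int n)
{
    float inv[MAX_N * MAX_N]; /* scratch copy that ends up holding the inverse */
    assert(n > 0 && n <= MAX_N);

    /* Start from the identity; it turns into the inverse as `a` is reduced. */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            inv[i * n + j] = (i == j) ? 1.0f : 0.0f;

    for (int col = 0; col < n; ++col) {
        /* Partial pivoting: pick the row with the largest entry in this column. */
        int pivot = col;
        for (int r = col + 1; r < n; ++r)
            if (abs_f(a[r * n + col]) > abs_f(a[pivot * n + col]))
                pivot = r;
        if (abs_f(a[pivot * n + col]) < 1e-12f)
            return -1;

        /* Swap the pivot row into place in both matrices. */
        if (pivot != col) {
            for (int j = 0; j < n; ++j) {
                float t = a[col * n + j];
                a[col * n + j] = a[pivot * n + j];
                a[pivot * n + j] = t;
                t = inv[col * n + j];
                inv[col * n + j] = inv[pivot * n + j];
                inv[pivot * n + j] = t;
            }
        }

        /* Scale the pivot row so the pivot element becomes 1. */
        float scale = 1.0f / a[col * n + col];
        for (int j = 0; j < n; ++j) {
            a[col * n + j] *= scale;
            inv[col * n + j] *= scale;
        }

        /* Eliminate this column from every other row. */
        for (int r = 0; r < n; ++r) {
            if (r == col)
                continue;
            float f = a[r * n + col];
            for (int j = 0; j < n; ++j) {
                a[r * n + j] -= f * a[col * n + j];
                inv[r * n + j] -= f * inv[col * n + j];
            }
        }
    }

    /* Copy the result back over the input. */
    for (int i = 0; i < n * n; ++i)
        a[i] = inv[i];
    return 0;
}
```

With n = 25 the scratch array alone is 625 floats per work item, which is why I suspect it may not fit in registers.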
I've read somewhere that, with the current implementation on AMD GPUs, private arrays are allocated in global memory rather than in private memory. Is that still true, and might it be causing the low performance?
Any other ideas?