That depends. You could use images on input to provide caching; that isn't supported in the current SDK, but it will be in the near future. If you choose a blocked matrix layout, with small blocks of 4 or maybe 16 floats read as a unit at a given point in the matrix, each memory read becomes a reasonably efficient vector read.
In your case you can probably fit several whole columns of the dense matrix in LDS at once. You can then read whole rows of the sparse matrix, or even several overlapping rows, and multiply them random-access-style against the dense matrix. That assumes your storage is some CSR-based structure. If you use a blocked CSR instead, you could multiply by multiple dense rows at once, getting even better use of LDS. It really depends on the properties of your sparse matrix.
Sparse matrix storage formats always have to be tweaked to get good performance on any given architecture, and AMD GPUs are no different. As posed, your question depends on too many parameters. Don't think so much about whether the architecture can do it; think about how to define your sparse data structure so that the architecture can do it.
Thanks for the reply. Looking forward to the new SDK release.