The current DGEMM API in ACML-GPU is identical to the standard BLAS DGEMM API, so it accepts pointers to host memory. The idea is that you can replace any BLAS library with ACML-GPU without modifying your code.
I understand the situation you described, where the input matrices are already on the GPU. I will pass the feature request along.
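
For reference, here is a rough sketch of the host-pointer calling convention described above. It is not ACML-GPU code: `ref_dgemm` is a hypothetical, naive stand-in that only mirrors the argument order of the reference BLAS DGEMM (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) with column-major storage, and `host_dgemm_example` is a made-up helper. The point is only that A, B, and C are ordinary arrays in host memory, which is why a drop-in BLAS replacement cannot take device pointers through this interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with the argument order of the reference BLAS DGEMM.
   Computes C := alpha * A * B + beta * C with column-major storage.
   Assumption: only the no-transpose case ('N', 'N') is handled here. */
static void ref_dgemm(char transa, char transb, int m, int n, int k,
                      double alpha, const double *a, int lda,
                      const double *b, int ldb,
                      double beta, double *c, int ldc)
{
    assert(transa == 'N' && transb == 'N');
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            c[i + (size_t)j * ldc] = alpha * sum + beta * c[i + (size_t)j * ldc];
        }
    }
}

/* Multiplies two 2x2 matrices that live in plain host memory. */
static void host_dgemm_example(double c[4])
{
    const double a[4] = {1.0, 3.0, 2.0, 4.0}; /* [[1, 2], [3, 4]] column-major */
    const double b[4] = {5.0, 7.0, 6.0, 8.0}; /* [[5, 6], [7, 8]] column-major */

    for (int i = 0; i < 4; ++i)
        c[i] = 0.0;

    /* Every matrix argument is a host pointer, as in any BLAS library. */
    ref_dgemm('N', 'N', 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);
}
```

Code written against this call shape can switch BLAS libraries by relinking, which is the substitution property mentioned above. Matrices that already reside on the GPU have no way to be passed through it.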