(NOTE: This is an informational post for users like me. It is not an official AMD post, and I do not work for AMD.)
The CAL SDK includes two samples, simple_matmult and double_matmult. If you run them as-is, the reported total GFLOPS is quite low: on a 3850, the default figures are a disappointing 12-20 GFLOPS, even though the GPU-only numbers are high.
I looked at the code, and it turns out timing is started BEFORE the input matrices are initialized with random values. You should move the timer.Start() call to JUST AFTER SetupUserData(), since we are only interested in measuring the matrix multiplication itself, not the time taken to initialize the input matrices. The time taken for data transfer and for rearranging the input matrices is, of course, still included.
Secondly, the best performance is obtained when the matrices are sufficiently large. You can specify the matrix size on the command line, for example: ./a.out -m 1024
Third, the best performance is achieved when the output matrices are in remote memory, and that remote memory must be cacheable. So find the line with "calResAllocRemote2D" and change the flag being passed from "flag" to CAL_RESALLOC_CACHEABLE.
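The edit is a one-line change. A sketch of what it looks like follows; the surrounding argument names are placeholders, since the exact variables differ between SDK versions:

```cpp
// In the sample, the allocation looks something like this, with
// "flag" as the last parameter (argument names are placeholders):
//   calResAllocRemote2D(&res, &device, 1, width, height, format, flag);
//
// Change the last argument so the remote allocation is cacheable:
calResAllocRemote2D(&res, &device, 1, width, height, format,
                    CAL_RESALLOC_CACHEABLE);
```

Cacheable remote memory lets the CPU read the output buffer through its cache instead of doing slow uncached reads, which is why the "total" figure improves.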
After making the above code changes and recompiling, run it as follows: ./a.out -m 2048 -o c
You should get more than 100 GFLOPS on "total" for any 3800 series card. You can make similar changes to double_matmult.