3 Replies Latest reply on May 16, 2014 10:21 AM by timmy.liu

    ACML 6 and A10-7850K HPL performance

    yurtesen

      Hello,

       

      I am trying to run the HPL benchmark with ACML 6 on an A10-7850K APU, and I seem to have hit some brick walls.

       

      I am using the HPL.dat calculator at http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html to generate my input.

       

      The first problem is memory usage. I am not able to go beyond the memory allocated to the Spectre GPU. Shouldn't ACML 6 be able to detect the GPU model and use host memory instead? Does it have to copy data around?

       

      The second problem is performance. I used a case created with the HPL calculator (link above) for 1 node, 1 core per node and 1024 MB of RAM. With 1 process I get about 19 GFLOPS, and with 4 processes I get 21.6 GFLOPS. With the GPU made inaccessible, 1 process gives 18 GFLOPS and 4 processes give 18.5 GFLOPS. I also set ACML_LOG_FILTER=1, and the log shows usegpu( 1 ) in some entries, though not in all of them.
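
      For reference, here is a rough sketch of how a problem size N is commonly derived from a memory budget. It is not the linked calculator's actual code. The 80% memory fraction and NB = 192 are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math

def hpl_problem_size(mem_bytes, nb, fraction=0.8, bytes_per_element=8):
    """Largest N that is a multiple of nb and whose N x N matrix fits
    in `fraction` of the given memory (8 bytes per double by default)."""
    usable_elements = int(mem_bytes * fraction) // bytes_per_element
    n_max = math.isqrt(usable_elements)  # largest N with N*N <= usable_elements
    return (n_max // nb) * nb            # round down to a multiple of NB

if __name__ == "__main__":
    # 1024 MB budget as in the post; NB = 192 is an assumed block size.
    print(hpl_problem_size(1024 * 1024**2, nb=192))
```

      Rounding N down to a multiple of NB keeps every block full-sized. A larger memory budget (for example, host memory rather than only the GPU's allocation) directly allows a larger N.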

       

      Anyway, what is the best way to get good results? Does anybody have better HPL results?

       

      Thanks!

        • Re: ACML 6 and A10-7850K HPL performance
          timmy.liu

          Hi yurtesen,

           

          Thanks for trying out HPL with ACML 6!

           

          May I ask whether you are running double precision HPL or single precision HPL (MHPL)? I ask because Kaveri's GPU has a pretty good single precision peak of 737 GFLOPS, but its double precision peak is no better than the CPU's peak. (AnandTech Portal | Floating point peak performance of Kaveri and other recent AMD and Intel chips)
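
          The 737 GFLOPS single precision figure can be reproduced with simple arithmetic. The sketch below assumes the commonly published A10-7850K GPU specs (512 shaders at 720 MHz) and counts one fused multiply-add as 2 flops per shader per cycle. These inputs come from public spec sheets, not from this thread, and the double precision figure is not derived here.

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak = execution units x flops per unit per cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * clock_ghz

if __name__ == "__main__":
    # Assumed specs: 512 shaders, 1 FMA (2 flops) per cycle each, 0.72 GHz.
    sp_peak = peak_gflops(512, 2, 0.72)
    print(f"Single precision GPU peak: {sp_peak:.0f} GFLOPS")
```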

           

          I have run both single and double precision HPL on KV (Kaveri) and was able to get bigger numbers. One thing I did was make sure "NB" is really big (such as 1024) and "NBMINs" is big (such as 64). This is because the GPU computation works better when the matrix is not too thin or too tall. In fact, in the lua scripts (/Spectre/gemm.lua) you can see the logic: if any one of m, n, k (which is NB) is smaller than 64, the computation is always sent to the CPU. Can you share your choices of N and NB?
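
          To illustrate why a small NB keeps the work on the CPU, here is a minimal sketch of the threshold rule described above. It is not the contents of gemm.lua. The function name is hypothetical, and the real script may apply further conditions.

```python
MIN_GPU_DIM = 64  # threshold quoted in the post

def gemm_target(m, n, k, min_dim=MIN_GPU_DIM):
    """Return 'cpu' if any GEMM dimension is below the threshold, else 'gpu'."""
    return "cpu" if min(m, n, k) < min_dim else "gpu"

if __name__ == "__main__":
    # In HPL's trailing update, k equals NB, so a small NB forces the CPU path.
    for nb in (32, 64, 1024):
        print(nb, gemm_target(8000, 8000, nb))
```

          Under this rule, even a very large m and n do not help if NB (and therefore k) is below 64.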

           

          Smarter memory management would definitely benefit the HPL benchmark. Currently, by default, ACML copies memory between the CPU and GPU. In this beta release there is a way to enable "USE_HOST_PTR" by assigning "2" to "memalloc_choice" inside /Spectre/gemm.lua. Note that this "hack" is under-tested, but I think it will allow you to allocate more memory on the host (a bigger N).

           

          Looking at the log file, you may find that lda, ldb and ldc are quite big while m, n and k are much smaller. OpenCL has an API (clEnqueueReadBufferRect) that copies only the useful part of the memory (memalloc_choice = 3 in the lua file), which allows N to be much bigger. However, there is a run-time bug related to this API on Kaveri in the current driver. I have filed an internal bug ticket, and it is fixed in the internal drivers. I believe it will be fixed in the public driver soon.
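
          The size difference can be estimated with a little arithmetic. The sketch below assumes column-major storage and compares the contiguous span that encloses an m x n submatrix with leading dimension lda against the m x n elements a rectangular copy would move. The example dimensions are hypothetical, and how ACML actually sizes its buffers is not stated here.

```python
def span_copy_bytes(m, n, lda, elem_size=8):
    """Bytes in the contiguous span covering an m x n column-major
    submatrix whose columns are lda elements apart."""
    return (lda * (n - 1) + m) * elem_size

def rect_copy_bytes(m, n, elem_size=8):
    """Bytes moved when only the m x n useful elements are copied."""
    return m * n * elem_size

if __name__ == "__main__":
    m, n, lda = 1024, 1024, 20000  # hypothetical values, doubles (8 bytes)
    full = span_copy_bytes(m, n, lda)
    rect = rect_copy_bytes(m, n)
    print(full, rect, round(full / rect, 1))
```

          The larger lda is relative to m, the more a rectangular copy saves, both in transfer time and in device memory.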

           

          Regards,

          Timmy