2 Replies Latest reply on Aug 8, 2018 4:19 AM by dipak

    Rocm performance low issue with HIP-Examples-master/gpu-burn

    j0hnny

      Hi,

      I am running into a low-performance issue with ROCm 1.8 + Ubuntu 18.04 LTS (16.04 shows the same issue) + 2 Ellesmere cards [Radeon RX 470/480].

      To check performance I ran the gpu-burn code from the HIP examples (HIP-Examples-master/gpu-burn), from GitHub - ROCm-Developer-Tools/HIP-Examples: Examples for HIP.

      While the GPU was burning, rocm-smi showed MCLK at only 500MHz, while SCLK increased to 1130MHz from the 300MHz idle value.

      I can use rocm-smi -d 1 --setmclk 2 to change MCLK to 2000MHz, and rocm-smi shows it changed to 2000MHz, but performance does not increase at all, so I think the GPU is not really working at 2000MHz.


      So my questions are:

      1. Why is GPU performance low under gpu-burn, and what can I try to increase it?

      2. Is it related to the 500MHz MCLK?

      3. Changing MCLK to 2000MHz does not really take effect, right?

      idle:

      GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD

      1 35c 30.178W 300Mhz 500Mhz 0.0% manual 0% 0%

      0 34c 32.244W 300Mhz 500Mhz 0.0% manual 0% 0%


      While burning:

      GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD

      1 45c 104.171W 1130Mhz 500Mhz 0.0% manual 0% 0%

      0 43c 98.192W 1130Mhz 500Mhz 0.0% manual 0% 0%


      After rocm-smi -d 1 --setmclk 2:

      GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD

      1 38c 32.153W 300Mhz 2000Mhz 0.0% manual 0% 0%

      0 36c 34.176W 300Mhz 2000Mhz 0.0% manual 0% 0%


      About the HIP examples gpu-burn: it keeps doing 512x512 matrix multiplications, like A[i] * B[i] = C[i], and all three matrices stay in device DDR memory allocated with hipMalloc. So in this case the GPU compute units should access DDR memory a lot, and MCLK should be critical for performance.

      Below is the burn kernel launch code.


      int BurnKernel::runComputeKernel()
      {
           int err = 0;
           for (int i = 0; mRunKernel && i < mNumIterations; ++i) {
                hipLaunchKernel(
                     /* Launch params */
                     HIP_KERNEL_NAME(hip_sgemm_kernel),
                     dim3(cRowSize/cBlockSize, cRowSize/cBlockSize, 1),
                     dim3(cBlockSize, cBlockSize, 1), 0, 0,
                     /* Kernel params */
                     cRowSize, cRowSize, cRowSize, cAlpha,
                     mDeviceAdata, cRowSize,
                     mDeviceBdata, cRowSize,
                     cBeta,
                     mDeviceCdata + i*cMatrixSize,
                     cRowSize);
           }
           checkError(hipDeviceSynchronize(), "Sync");
           return err;
      }

        • Re: Rocm performance low issue with HIP-Examples-master/gpu-burn
          j0hnny

          Some additional findings:

          If my gpu-burn kernel is a matrix-multiply kernel, like the sgemm below, MCLK stays at 500MHz. But if my kernel is a simple matrix-add kernel, like hip_add_kernel below, MCLK can reach 2000MHz.
          So my question is: why does MCLK stay at a lower value for the computationally intensive kernel? It also needs to read a lot of matrix elements; shouldn't this affect the performance?

          __global__ void hip_sgemm_kernel(hipLaunchParm lp, const int M,
                                           const int N, const int K,
                                           const float alpha,
                                           float *A, const int lda, float *B,
                                           const int ldb, const float beta,
                                           float *C, const int ldc)
          {
              // column major NN
              size_t idx_x = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
              size_t idx_y = hipBlockIdx_y * hipBlockDim_y + hipThreadIdx_y;
              size_t dim_x = hipGridDim_x * hipBlockDim_x;
              size_t myIdx = idx_y * dim_x + idx_x;

              float local_c = beta * C[myIdx];
              for (int k = 0; k < K; k++) {
                  local_c += alpha * A[idx_y + k * K] * B[idx_x * K + k];
              }
              C[myIdx] = local_c;
          }

          __global__ void hip_add_kernel(hipLaunchParm lp, const int M,
                                         const int N, const int K,
                                         const float alpha,
                                         float *A, const int lda, float *B,
                                         const int ldb, const float beta,
                                         float *C, const int ldc)
          {
              // column major NN
              size_t idx_x = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
              size_t idx_y = hipBlockIdx_y * hipBlockDim_y + hipThreadIdx_y;
              size_t dim_x = hipGridDim_x * hipBlockDim_x;
              size_t myIdx = idx_y * dim_x + idx_x;

              float local_c = beta * C[myIdx];
              local_c += alpha * A[idx_y + 0 * K] * B[idx_x * K + 0];
              C[myIdx] = local_c;
          }