Newcomers Start Here

siddhart · 10-04-2021

Hello,

I took the vectorAdd HIP example from GitHub (https://github.com/ROCm-Developer-Tools/HIP-Examples/blob/master/vectorAdd/vectoradd_hip.cpp) and added C++ timers around the compute-only part. I also added timers around a simple addition loop on the CPU. I see that the vector add on the CPU is much faster than on the GPU. Why is that?

Can someone help explain what I am missing here?

hipcc -w -g -std=c++14 vectoradd_hip.cpp -o out.exe

./out.exe

System minor 0
System major 9
agent prop name Vega 10 [Radeon Instinct MI25]
hip Device prop succeeded
GPU duration : 104.737 ms
CPU duration : 1.521 ms
PASSED!

Kernel code :

#define WIDTH 1024
#define HEIGHT 1024

#define NUM (WIDTH*HEIGHT)

#define THREADS_PER_BLOCK_X 16
#define THREADS_PER_BLOCK_Y 16
#define THREADS_PER_BLOCK_Z 1

__global__ void
vectoradd_float(float* __restrict__ a, const float* __restrict__ b, const float* __restrict__ c, int width, int height)
{
    int x = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
    int y = hipBlockDim_y * hipBlockIdx_y + hipThreadIdx_y;

    int i = y * width + x;
    if (i < (width * height)) {
        a[i] = b[i] + c[i];
    }
}

Host code :

GPU part :

{
    auto start = std::chrono::steady_clock::now();
    hipLaunchKernelGGL(vectoradd_float,
                       dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                       dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                       0, 0,
                       deviceA, deviceB, deviceC, WIDTH, HEIGHT);
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    printf("GPU duration : %0.3f ms\n", elapsed.count());
}

CPU part :
{
    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < NUM; i++) {
        cpu_out[i] = hostB[i] + hostC[i];
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    printf("CPU duration : %0.3f ms\n", elapsed.count());
}

dipak · 10-04-2021

Hi @siddhart

Thank you for the query. You have been whitelisted for the AMD Devgurus community.

For HIP-related support, the project's GitHub site is the right place to post queries or issues. Please use the links below:

https://github.com/ROCm-Developer-Tools/HIP/issues

https://github.com/ROCm-Developer-Tools/HIP-Examples/issues

As a suggestion, there are a few things to consider when comparing CPU and GPU timings: kernel launch overhead or latency, deferred resource allocation by the runtime (so the first kernel call can take much longer than subsequent calls), workload size, work-group size, etc.

So, as an experiment, I would suggest trying the steps below to see whether they affect the timing.

- Run the kernel in a loop (say 5 times) and see if there is any significant difference in timing between the first kernel call and subsequent calls

- Use a larger workload, i.e. larger HEIGHT and WIDTH values
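The first step can be sketched as a small host-side harness that times every call separately, so the first call can be compared with the later ones. This is only a minimal sketch: `time_each_call` and `steady_state_ms` are hypothetical helper names, not part of HIP. Note also that a HIP kernel launch returns to the host before the kernel has finished, so for the GPU case the timed callable is assumed to launch the kernel and then call hipDeviceSynchronize(); otherwise the timer does not cover the kernel's execution.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of HIP): runs `work` `iterations` times and
// returns the wall-clock duration of each individual call in milliseconds.
template <typename Work>
std::vector<double> time_each_call(Work&& work, int iterations) {
    assert(iterations > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(iterations));
    for (int it = 0; it < iterations; ++it) {
        auto start = std::chrono::steady_clock::now();
        // For a GPU kernel, `work` is assumed to launch the kernel and then
        // call hipDeviceSynchronize() so the kernel has completed here.
        work();
        auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Average of all calls except the first, so that any one-time start-up cost
// paid by the first call is excluded from the result.
double steady_state_ms(const std::vector<double>& ms) {
    assert(ms.size() > 1);
    double sum = 0.0;
    for (std::size_t k = 1; k < ms.size(); ++k) {
        sum += ms[k];
    }
    return sum / static_cast<double>(ms.size() - 1);
}
```

In the program from the question, the callable would wrap the existing hipLaunchKernelGGL(...) call followed by hipDeviceSynchronize(), and the same helper could wrap the CPU loop. Comparing the first entry of the returned vector with the steady-state average shows how much of the measured GPU time is a one-time cost.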

If you still observe the issue or have any questions, please use the GitHub links above.

Thanks.

siddhart · 12-03-2021

Hello,

I posted on the GitHub page as you suggested, after running the kernel for 10 iterations, and I still see a performance degradation. I'm still waiting to hear back, and it's been more than a month. Can someone help take a look?

https://github.com/ROCm-Developer-Tools/HIP/issues/2393

Thanks,

Sid

GPU (Radeon Instinct MI25) vs CPU performance using hip APIs