
Is APU always faster than GPU?

Question asked by daxh on Nov 26, 2013
Latest reply on Nov 29, 2013 by daxh

Hello everybody.

 

I am new to OpenCL and have run into the following problem: all my kernels run up to 10 times faster on the APU. I had always expected the GPU to be faster.

My platform is an HP Envy laptop running openSUSE 12.3, with an AMD A10-4600M APU at 2.3 GHz (3.2 GHz in Turbo mode) that includes 4 cores and an AMD Radeon HD 7660G, 8 GB of RAM, and a discrete AMD Radeon HD 7670M GPU with 2048 MB.

 

The size of the test array is 1024. I am using

 

krnlTest.getWorkGroupInfo< CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE > (device)

 

to determine the wavefront size, and I got the following results:

 

Preffered wg size multiple for Devastator is 64.
Preffered wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.



So I decided to split the 1024 items into at least 4 work groups (256 is a multiple of 64). But I got the following results:

 

work_group_size = 256

 

--> OpenCL test time is 0.6948ms DEVICE: Devastator

--> OpenCL test time is 0.0802ms DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


work_group_size = 128


--> OpenCL test time is 0.801ms DEVICE: Devastator

--> OpenCL test time is 0.0912ms DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


work_group_size = 64

--> OpenCL test time is 0.6846ms DEVICE: Devastator

--> OpenCL test time is 0.0946ms DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics
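
The three work-group sizes tried above (256, 128 and 64) are all multiples of the preferred multiple 64 that divide 1024 evenly. A minimal sketch of picking such a size programmatically might look like the following. Note that `pickWorkGroupSize` is a hypothetical helper, not part of the OpenCL API, and `max_wg_size` is an assumed stand-in for whatever upper limit the device reports.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL call): returns the largest work-group
// size that is a multiple of `multiple`, does not exceed `max_wg_size`,
// and divides `global_size` evenly. Returns 0 if no such size exists.
std::size_t pickWorkGroupSize(std::size_t global_size,
                              std::size_t multiple,
                              std::size_t max_wg_size)
{
    assert(global_size > 0 && multiple > 0);
    std::size_t best = 0;
    for (std::size_t wg = multiple;
         wg <= max_wg_size && wg <= global_size;
         wg += multiple)
    {
        if (global_size % wg == 0)
            best = wg;  // keep the largest candidate seen so far
    }
    return best;
}
```

With a global size of 1024, a multiple of 64 and an assumed limit of 256, this returns 256, which matches the first configuration tested above.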


Also, I have just realized that there should possibly be three OpenCL devices (APU, integrated GPU and discrete GPU). Am I missing something here? When I use aticonfig to change the device priority (between the power-saving GPU and the high-performance GPU), I always get just 2 OpenCL devices. One of them is always AMD A10-4600M APU with Radeon(tm) HD Graphics, while the other may be called Devastator or Tahiti (or something like that).


This is how I am measuring time:



typedef std::chrono::steady_clock::time_point t_time_point;
typedef std::chrono::duration<int,std::milli> t_millisecs;

inline t_time_point getTimePoint()
{
    return std::chrono::steady_clock::now();
}

inline unsigned int getTimeDifference_ms(t_time_point start)
{
    t_time_point end = getTimePoint();
    t_millisecs duration( std::chrono::duration_cast<t_millisecs>(end-start));
    return duration.count();
}

// Performing tests
t_time_point start = getTimePoint();
for(int iTest=0; iTest<numberOfTests;iTest++)
{
    // Creating buffers
    cl::Buffer clVectorA = cl::Buffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorA);
    cl::Buffer clVectorB = cl::Buffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorB);
    cl::Buffer clVectorC = cl::Buffer(context,
            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorC);

    // Set kernel args
    SET_KRNL_ARG_4(krnlTest, clVectorA, clVectorB, clVectorC, size)
    krnlTest.setArg(4, size * sizeof(cl_int), NULL);

    // Launching
    const int global_size = size;
    const int local_size = size/16;
    queue.enqueueNDRangeKernel(
        krnlTest,
        cl::NullRange,
        cl::NDRange(global_size),
        cl::NDRange(local_size));

    // Finishing
    queue.enqueueReadBuffer(clVectorC, CL_FALSE, 0, size * sizeof(cl_int), pVectorC);
    queue.finish();

}
unsigned int ms = getTimeDifference_ms(start);

std::cout << "\n--> OpenCL test time is " <<((double)ms/numberOfTests) << "ms"
          << "\t\t DEVICE: " << (std::string)(device.getInfo< CL_DEVICE_NAME >());
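
Note that the loop above times buffer creation, argument setup, the kernel launch and the read-back together, and that `t_millisecs` stores whole milliseconds. As a sketch only, the same averaging could be done with a floating-point duration so that sub-millisecond totals are not truncated. `averageMilliseconds` is a hypothetical helper name, and what goes into the `work` callable (the whole loop body, or just the launch and `queue.finish()`) is up to the caller.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: calls `work` `iterations` times and returns the
// average wall-clock time per call in milliseconds, as a double.
template <typename F>
double averageMilliseconds(F&& work, int iterations)
{
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;

    const clock::time_point start = clock::now();
    for (int i = 0; i < iterations; ++i)
        work();

    // duration<double, std::milli> keeps the fractional part, so no
    // duration_cast (and no truncation to whole milliseconds) is needed.
    const std::chrono::duration<double, std::milli> elapsed =
        clock::now() - start;
    return elapsed.count() / iterations;
}
```

For instance, the body of the test loop could be wrapped in a lambda and passed as `work`, with `numberOfTests` as `iterations`.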


 

This is the kernel:

 

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void test_kernel(
    __global int* pVectorA,
    __global int* pVectorB,
    __global int* pVectorC,
    int size,
    __local  int* pLocalTmp)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int gsize = get_global_size(0);
    int lsize = get_local_size(0);
    int grum = get_num_groups(0);

//    if(gid == 0)printf("gsize=%d\tlsize=%d\tgrum=%d\n", gsize, lsize, grum);

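    // Note: pLocalTmp is indexed by the global id, so this relies on the host
    // allocating size * sizeof(cl_int) bytes of local memory per work group.
    pLocalTmp[gid] = pVectorA[gid] * pVectorB[gid] + 4;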

    barrier(CLK_LOCAL_MEM_FENCE);

    pVectorC[(size-1)-gid] = pLocalTmp[gid];
}
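
To check the output on either device, the kernel's result can be compared against a plain host-side loop. The following is a minimal sketch of such a reference, assuming the input vectors have equal length. `referenceTestKernel` is a hypothetical name and is not part of the attached project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for test_kernel: for every index gid,
// it writes a[gid] * b[gid] + 4 to position (size - 1) - gid, i.e. the
// products plus 4, stored in reverse order.
std::vector<int> referenceTestKernel(const std::vector<int>& a,
                                     const std::vector<int>& b)
{
    assert(a.size() == b.size());
    const std::size_t size = a.size();
    std::vector<int> c(size);
    for (std::size_t gid = 0; gid < size; ++gid)
        c[(size - 1) - gid] = a[gid] * b[gid] + 4;
    return c;
}
```

Comparing this vector element by element with the contents of `pVectorC` after the read-back would show whether both devices compute the same result.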


 

The full sources are attached (this is a Qt Creator project).
