Archives Discussions

daxh
Adept I

Is the APU always faster than the GPU?

Hello to everybody.

I am new to OpenCL and I've run into the following problem: all my kernels run up to 10 times faster on the APU. I had always expected the GPU to be faster.

My platform is an HP Envy laptop running openSUSE 12.3, with an AMD A10-4600M APU at 2.3 GHz (3.2 GHz in Turbo mode, 4 cores) with integrated AMD Radeon HD 7660G graphics, 8 GB of RAM, and a discrete AMD Radeon HD 7670M GPU with 2048 MB.

The size of the test array is 1024. I am using

krnlTest.getWorkGroupInfo< CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE >(device)

to determine the wavefront size, and I got the following results:


Preferred wg size multiple for Devastator is 64.

Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
So I decided to split 1024 into at least 4 work groups (and 256 is a multiple of 64). But I got the following results:

work_group_size = 256

--> OpenCL test time is 0.6948 ms    DEVICE: Devastator
--> OpenCL test time is 0.0802 ms    DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics

work_group_size = 128

--> OpenCL test time is 0.801 ms     DEVICE: Devastator
--> OpenCL test time is 0.0912 ms    DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics

work_group_size = 64

--> OpenCL test time is 0.6846 ms    DEVICE: Devastator
--> OpenCL test time is 0.0946 ms    DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


Also, I've just realized that there should probably be three OpenCL devices (the CPU, the integrated GPU and the discrete GPU). Am I missing something here? When I use aticonfig to switch device priority (between the power-saving GPU and the high-performance GPU), I always get just 2 OpenCL devices: one of them is always the AMD A10-4600M APU with Radeon(tm) HD Graphics, and the other is called Devastator or Tahiti (or something like this).


This is how I am measuring time:




typedef std::chrono::steady_clock::time_point t_time_point;
typedef std::chrono::duration<int,std::milli> t_millisecs;

inline t_time_point getTimePoint()
{
    return std::chrono::steady_clock::now();
}

inline unsigned int getTimeDifference_ms(t_time_point start)
{
    t_time_point end = getTimePoint();
    t_millisecs duration(std::chrono::duration_cast<t_millisecs>(end - start));
    return duration.count();
}

// Performing tests
t_time_point start = getTimePoint();
for(int iTest = 0; iTest < numberOfTests; iTest++)
{
    // Creating buffers
    cl::Buffer clVectorA = cl::Buffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorA);
    cl::Buffer clVectorB = cl::Buffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorB);
    cl::Buffer clVectorC = cl::Buffer(context,
            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorC);

    // Set kernel args (arg 4 is the __local scratch buffer)
    SET_KRNL_ARG_4(krnlTest, clVectorA, clVectorB, clVectorC, size)
    krnlTest.setArg(4, size * sizeof(cl_int), NULL);

    // Launching
    const int global_size = size;
    const int local_size = size / 16;
    queue.enqueueNDRangeKernel(
        krnlTest,
        cl::NullRange,
        cl::NDRange(global_size),
        cl::NDRange(local_size));

    // Finishing (non-blocking read, then wait for the whole queue)
    queue.enqueueReadBuffer(clVectorC, CL_FALSE, 0, size * sizeof(cl_int), pVectorC);
    queue.finish();
}

unsigned int ms = getTimeDifference_ms(start);

std::cout << "\n--> OpenCL test time is " << ((double)ms / numberOfTests) << "ms"
          << "\t\t DEVICE: " << (std::string)(device.getInfo< CL_DEVICE_NAME >());




This is the kernel:


#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void test_kernel(
    __global int* pVectorA,
    __global int* pVectorB,
    __global int* pVectorC,
    int size,
    __local  int* pLocalTmp)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int gsize = get_global_size(0);
    int lsize = get_local_size(0);
    int grum = get_num_groups(0);

//    if(gid == 0)printf("gsize=%d\tlsize=%d\tgrum=%d\n", gsize, lsize, grum);

    // Note: pLocalTmp is indexed by the global id, which only works because
    // the host allocates size ints of local memory per work group; the more
    // usual pattern is to index by lid and allocate local_size ints.
    pLocalTmp[gid] = pVectorA[gid] * pVectorB[gid] + 4;

    barrier(CLK_LOCAL_MEM_FENCE);

    pVectorC[(size-1)-gid] = pLocalTmp[gid];
}




In the attachment you can find the full sources (a QtCreator project).

1 Solution
daxh
Adept I

It looks like my whole time-measurement methodology was wrong. I am still reading the "AMD Accelerated Parallel Processing OpenCL Programming Guide", and I found a great example there: the "Parallel Min() function" (the second example in the book, page 41, section 1.10.2). The example also includes a simple benchmarking approach. I ported it to the OpenCL C++ wrapper (a QtCreator project has been added as an attachment to the original post), and now I can see interesting results.

If I enable power-saving mode in amdcccle, I get:

DEVICE: GPU, Devastator

COMPUT_UNITS: 6

GLOBAL_WORK_SIZE: 4096

LOCAL_WORK_SIZE: 64

NUM_GROUPS: 64

TIME: 5.29497

B/W 6.34 GB/sec, 64 groups, 4096 threads, count 1024, stride 1

result IS correct

DEVICE: CPU, AMD A10-4600M APU with Radeon(tm) HD Graphics

COMPUT_UNITS: 4

GLOBAL_WORK_SIZE: 4

LOCAL_WORK_SIZE: 1

NUM_GROUPS: 4

TIME: 9.10732

B/W 3.68 GB/sec, 4 groups, 4 threads, count 1048576, stride 4

result IS correct

And if I enable high-performance mode, I get the following:

DEVICE: GPU, Turks

COMPUT_UNITS: 6

GLOBAL_WORK_SIZE: 4096

LOCAL_WORK_SIZE: 64

NUM_GROUPS: 64

TIME: 15.6707

B/W 2.14 GB/sec, 64 groups, 4096 threads, count 1024, stride 1

result IS correct

DEVICE: CPU, AMD A10-4600M APU with Radeon(tm) HD Graphics

COMPUT_UNITS: 4

GLOBAL_WORK_SIZE: 4

LOCAL_WORK_SIZE: 1

NUM_GROUPS: 4

TIME: 8.82985

B/W 3.80 GB/sec, 4 groups, 4 threads, count 1048576, stride 4

result IS correct

So now I see that I really have 3 OpenCL devices, but due to how the drivers work (or for some reason I don't know), I can only use two of them at once.

And the performance is distributed between them as follows:

1) GPU AMD Radeon HD 7660G {Devastator} - the fastest

2) CPU AMD A10 - in the middle

3) GPU AMD Radeon HD 7670M {Turks} - the slowest


6 Replies
himanshu_gautam
Grandmaster

Hi,

I don't know much about the hardware, but I came across a forum thread where CPUs, GPUs and APUs are discussed. I hope it helps you understand the concept:

http://forums.anandtech.com/showthread.php?t=2141007

Thank you for this info.


You are always welcome. Keep doing more experiments and share your feedback here.

moozoo
Adept III

>Size of test array is 1024, I am using

Try this with a test array size of 100,000.

I believe that with only 1024 elements, the GPU overheads make up the bulk of the time.

daxh
Adept I

I have one more question here. What is the "fair" way to measure bandwidth or algorithm running time? I mean, there are two approaches:

1) Approach 1 - "fair". Allow me to use some pseudo-code:

int numberOfTests = 1000;

t_time_point start = getTimePoint();
for(int iTest = 0; iTest < numberOfTests; iTest++)
{
     // 1 Creating buffers
     {...}
     // 2 Setting kernel arguments
     {...}
     // 3 EnqueueNDRangeKernel
     {...}
     // 4 EnqueueReadBuffer
     {...}
     queue.finish();
}
unsigned int ms = getTimeDifference_ms(start); // algorithm running time in milliseconds
float secs = (float)ms/1000; // algorithm running time in seconds

But according to the Parallel Min() function example I mentioned earlier, we can use another approach.

2) Approach 2 - "unfair":

int numberOfTests = 1000;

t_time_point start = getTimePoint();

// 1 Creating buffers
{...}
// 2 Setting kernel arguments
{...}

for(int iTest = 0; iTest < numberOfTests; iTest++)
{
     // 3 EnqueueNDRangeKernel
     {...}
}

// 4 EnqueueReadBuffer
{...}
queue.finish();

unsigned int ms = getTimeDifference_ms(start); // algorithm running time in milliseconds
float secs = (float)ms/1000; // algorithm running time in seconds

And of course the results are very different:

FAIR

Preferred wg size multiple for Devastator is 64.
DEVICE: Devastator
TIME: 0.572ms
B/W: 0.0214825 GB/s

Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics
TIME: 0.088ms
B/W: 0.139636 GB/s

NOT FAIR

Preferred wg size multiple for Devastator is 64.
DEVICE: Devastator
TIME: 0.01ms
B/W: 1.2288 GB/s

Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics
TIME: 0.044ms
B/W: 0.279273 GB/s

In the attachment to the original post I've added a new version of MatricesTest (a QtCreator project) that contains the related code. I hope you can help me understand this, because the results differ completely depending on the testing approach.
