Archives Discussions

daxh
Adept I

Is the APU always faster than the GPU?

Hello to everybody.

I am new to OpenCL and I've run into the following problem: all my kernels run up to 10 times faster on the APU. I had always expected the GPU to be faster.

My platform is an HP Envy laptop running openSUSE 12.3, with an AMD A10-4600M APU at 2.3 GHz (3.2 GHz in Turbo mode, 4 cores) with integrated AMD Radeon HD 7660G graphics, 8 GB of RAM, and a discrete AMD Radeon HD 7670M GPU with 2048 MB.

The size of the test array is 1024. I am using

krnlTest.getWorkGroupInfo< CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE >(device)

to determine the wavefront size, and I got the following results:


Preferred wg size multiple for Devastator is 64.

Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
So I decided to split 1024 into at least 4 work groups (and 256 is a multiple of 64). But I got the following results:

work_group_size = 256

--> OpenCL test time is 0.6948 ms    DEVICE: Devastator
--> OpenCL test time is 0.0802 ms    DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics

work_group_size = 128

--> OpenCL test time is 0.801 ms     DEVICE: Devastator
--> OpenCL test time is 0.0912 ms    DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics

work_group_size = 64

--> OpenCL test time is 0.6846 ms    DEVICE: Devastator
--> OpenCL test time is 0.0946 ms    DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics


Also, I've just realized that there should probably be three OpenCL devices (the CPU, the integrated GPU and the discrete GPU). Am I missing something here? When I use aticonfig to switch device priority (between the power-saving GPU and the high-performance GPU), I always get just 2 OpenCL devices: one of them is always the AMD A10-4600M APU with Radeon(tm) HD Graphics, and the other is called Devastator or Tahiti (or something like this).


This is how I am measuring time:




typedef std::chrono::steady_clock::time_point t_time_point;
typedef std::chrono::duration<int,std::milli> t_millisecs;

inline t_time_point getTimePoint()
{
    return std::chrono::steady_clock::now();
}

inline unsigned int getTimeDifference_ms(t_time_point start)
{
    t_time_point end = getTimePoint();
    t_millisecs duration(std::chrono::duration_cast<t_millisecs>(end - start));
    return duration.count();
}

// Performing tests
t_time_point start = getTimePoint();
for(int iTest = 0; iTest < numberOfTests; iTest++)
{
    // Creating buffers
    cl::Buffer clVectorA = cl::Buffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorA);
    cl::Buffer clVectorB = cl::Buffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorB);
    cl::Buffer clVectorC = cl::Buffer(context,
            CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size * sizeof(cl_int),
            pVectorC);

    // Set kernel args (arg 4 is the __local scratch buffer)
    SET_KRNL_ARG_4(krnlTest, clVectorA, clVectorB, clVectorC, size)
    krnlTest.setArg(4, size * sizeof(cl_int), NULL);

    // Launching
    const int global_size = size;
    const int local_size = size / 16;
    queue.enqueueNDRangeKernel(
        krnlTest,
        cl::NullRange,
        cl::NDRange(global_size),
        cl::NDRange(local_size));

    // Finishing (non-blocking read, then wait for the whole queue)
    queue.enqueueReadBuffer(clVectorC, CL_FALSE, 0, size * sizeof(cl_int), pVectorC);
    queue.finish();
}

unsigned int ms = getTimeDifference_ms(start);

std::cout << "\n--> OpenCL test time is " << ((double)ms / numberOfTests) << "ms"
          << "\t\t DEVICE: " << (std::string)(device.getInfo< CL_DEVICE_NAME >());




This is the kernel:


#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void test_kernel(
    __global int* pVectorA,
    __global int* pVectorB,
    __global int* pVectorC,
    int size,
    __local  int* pLocalTmp)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int gsize = get_global_size(0);
    int lsize = get_local_size(0);
    int grum = get_num_groups(0);

//    if(gid == 0)printf("gsize=%d\tlsize=%d\tgrum=%d\n", gsize, lsize, grum);

    // Note: pLocalTmp is indexed by the global id, which only works because
    // the host allocates size ints of local memory per work group; the more
    // usual pattern is to index by lid and allocate local_size ints.
    pLocalTmp[gid] = pVectorA[gid] * pVectorB[gid] + 4;

    barrier(CLK_LOCAL_MEM_FENCE);

    pVectorC[(size-1)-gid] = pLocalTmp[gid];
}




In the attachment you can find the full sources (a QtCreator project).

1 Solution
daxh
Adept I

It looks like my whole time-measurement methodology was wrong. I am still reading the "AMD Accelerated Parallel Processing OpenCL Programming Guide", and I found a great example there: the "Parallel Min() function" (the second example in the book, page 41, section 1.10.2). The example also includes a simple benchmarking approach. I ported it to the OpenCL C++ wrapper (a QtCreator project has been added as an attachment to the original post), and now I can see interesting results.

If I enable power-saving mode in amdcccle, I get:

DEVICE: GPU, Devastator

COMPUT_UNITS: 6

GLOBAL_WORK_SIZE: 4096

LOCAL_WORK_SIZE: 64

NUM_GROUPS: 64

TIME: 5.29497

B/W 6.34 GB/sec, 64 groups, 4096 threads, count 1024, stride 1

result IS correct

DEVICE: CPU, AMD A10-4600M APU with Radeon(tm) HD Graphics

COMPUT_UNITS: 4

GLOBAL_WORK_SIZE: 4

LOCAL_WORK_SIZE: 1

NUM_GROUPS: 4

TIME: 9.10732

B/W 3.68 GB/sec, 4 groups, 4 threads, count 1048576, stride 4

result IS correct

And if I enable high-performance mode, I get the following:

DEVICE: GPU, Turks

COMPUT_UNITS: 6

GLOBAL_WORK_SIZE: 4096

LOCAL_WORK_SIZE: 64

NUM_GROUPS: 64

TIME: 15.6707

B/W 2.14 GB/sec, 64 groups, 4096 threads, count 1024, stride 1

result IS correct

DEVICE: CPU, AMD A10-4600M APU with Radeon(tm) HD Graphics

COMPUT_UNITS: 4

GLOBAL_WORK_SIZE: 4

LOCAL_WORK_SIZE: 1

NUM_GROUPS: 4

TIME: 8.82985

B/W 3.80 GB/sec, 4 groups, 4 threads, count 1048576, stride 4

result IS correct

So now I see that I really have 3 OpenCL devices, but due to how the drivers work (or for some reason I don't know), I can only use two of them at once.

And the performance is distributed between them as follows:

1) GPU AMD Radeon HD 7660G {Devastator} - the fastest

2) CPU AMD A10 - in the middle

3) GPU AMD Radeon HD 7670M {Turks} - the slowest


6 Replies
himanshu_gautam
Grandmaster

Hi,

I don't know much about the hardware, but I came across a forum thread where CPUs, GPUs and APUs are discussed. I hope it helps you understand the concept:

http://forums.anandtech.com/showthread.php?t=2141007

Thank you for this info.


You are always welcome. Keep doing more experiments and share your feedback here.

moozoo
Adept III

>Size of test array is 1024, I am using

Try this with a test array size of 100,000.

I believe that with only 1024 elements, the GPU overheads make up the bulk of the time.

daxh
Adept I

I have one more question here. What is the "fair" way to measure bandwidth or algorithm running time? I mean, there are two approaches:

1) Approach 1 - "fair". Allow me to use some pseudo-code:

int numberOfTests = 1000;

t_time_point start = getTimePoint();
for(int iTest = 0; iTest < numberOfTests; iTest++)
{
     // 1 Creating buffers
     {...}
     // 2 Setting kernel arguments
     {...}
     // 3 EnqueueNDRangeKernel
     {...}
     // 4 EnqueueReadBuffer
     {...}
     queue.finish();
}
unsigned int ms = getTimeDifference_ms(start); // algorithm running time in milliseconds
float secs = (float)ms/1000; // algorithm running time in seconds

But according to the Parallel Min() function example I mentioned earlier, we can use another approach.

2) Approach 2 - "unfair":

int numberOfTests = 1000;

t_time_point start = getTimePoint();

// 1 Creating buffers
{...}
// 2 Setting kernel arguments
{...}

for(int iTest = 0; iTest < numberOfTests; iTest++)
{
     // 3 EnqueueNDRangeKernel
     {...}
}

// 4 EnqueueReadBuffer
{...}
queue.finish();

unsigned int ms = getTimeDifference_ms(start); // algorithm running time in milliseconds
float secs = (float)ms/1000; // algorithm running time in seconds

And of course the results are very different:

FAIR

Preferred wg size multiple for Devastator is 64.
DEVICE: Devastator
TIME: 0.572ms
B/W: 0.0214825 GB/s

Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics
TIME: 0.088ms
B/W: 0.139636 GB/s

NOT FAIR

Preferred wg size multiple for Devastator is 64.
DEVICE: Devastator
TIME: 0.01ms
B/W: 1.2288 GB/s

Preferred wg size multiple for AMD A10-4600M APU with Radeon(tm) HD Graphics is 1.
DEVICE: AMD A10-4600M APU with Radeon(tm) HD Graphics
TIME: 0.044ms
B/W: 0.279273 GB/s

In the attachment to the original post I've added a new version of MatricesTest (a QtCreator project) that contains the related code. I hope you can help me understand this, because the results differ completely depending on the testing approach.
