
Journeyman III

OpenCL test: ATI FirePro V4800 vs. NVIDIA Ion, the Ion wins. Why?

My new AMD video card seems too slow in OpenCL, so I tested it against an NVIDIA Ion integrated GPU.


I bought an ATI FirePro V4800 for my diploma work, but my new AMD card feels too slow in OpenCL. I compared it against an NVIDIA Ion integrated GPU (the only other card I have is a Radeon HD 2600 Pro, which does not support OpenCL), and the Ion was incredibly faster than my new professional AMD card.

Test config:
- Intel Atom 330 (with Hyper-Threading), Ion motherboard with PCIe 2.0, 2 GB of no-name high-latency 800 MHz DDR2 memory.
- Athlon 64 X2 3800+, nForce 430 with PCIe 1.1, 4 GB of dual-channel low-latency 800 MHz DDR2 (GeIL Ultra).
- ATI FirePro V4800 video card vs. the Atom board's integrated Ion GPU.
  The Ion has 16 stream processors; it is a cheap integrated GPU that uses system memory.
  The FirePro V4800 has 400 stream processors; it is a professional card with dedicated GDDR5 memory.
- In every test I used the same program, which I wrote myself.
- Operating system: Windows 7 with VS2010 (under Linux + gcc it is even slower).

Program runtime test results (6 runs per configuration):
OpenCL FirePro + Athlon X2: 1155-1226 ms
OpenCL FirePro + Atom 330 (Ion GPU disabled in the BIOS): 2744-2822 ms (a delay you can really feel)
OpenCL Ion + Atom 330: 338-412 ms
CUDA Ion + Atom 330: 324-340 ms

Kernel runtimes:
OpenCL FirePro (on both systems): 26-67 ms
OpenCL Ion + Atom 330: 148-163 ms
CUDA Ion + Atom 330: 111-131 ms

What is going on? I ran some more tests and measured the individual calls, and found that the first OpenCL call is the very slow one on the FirePro:
OpenCL FirePro + Athlon X2: 750-804 ms
OpenCL FirePro + Atom 330 (Ion GPU disabled): 1639-1759 ms
OpenCL FirePro + Atom 330 (Ion GPU enabled): 2246-2265 ms
OpenCL Ion + Atom 330 (FirePro NOT in the slot, no AMD drivers): 31-44 ms
OpenCL Ion + Atom 330 (FirePro in the slot, AMD drivers installed): 98-123 ms
In CUDA, all of the initialization is a single call: Ion + Atom 330: 0 ms

The first OpenCL call was:
  status  = clGetPlatformIDs(1, &(csys->platforms), &numPlatforms);
I then tried a different OpenCL initialization order, putting another call first. The runtime of whatever call came first did not change, but the time of clGetPlatformIDs itself dropped considerably. (The cost apparently belongs to whichever OpenCL call runs first.)

I also measured the kernel compile time (the total time of all the compile calls):
OpenCL FirePro + Atom 330: 608-648 ms
OpenCL Ion + Atom 330: 1-2 ms
CUDA Ion + Atom 330: no compile time at runtime (though building the program itself is quite slow)

I did not measure compile time with FirePro + Athlon X2 (I expect it to be faster than FirePro + Atom, but still much slower than Ion + Atom).

Can anybody give me some advice on how to reduce the OpenCL runtime on my ATI FirePro card? (Optimizing the kernel is pointless for now, while the other times look like this...)
(Sorry for my English.)

19 Replies

In every test case my program's results were correct (the program ran properly).


What program are you using? Have you tried running some of our benchmark samples to compare?

I wrote a program that solves the 7-queens problem on a 7x7 chess board (AI), but that part of the program is irrelevant here.
The functions with the very high runtimes are in the attached code.


See below.


Originally posted by: szabi_h I wrote a program that solves the 7-queens problem on a 7x7 chess board (AI), but that part of the program is irrelevant here.




Not irrelevant at all. The hardware is different, so it may need different OpenCL code to achieve high performance; OpenCL does not abstract the hardware out of the equation.

But your post seems to be about start-up time, and that is not a particularly important time to measure, as programs generally keep running once they are started. Use binary OpenCL objects if you want to avoid compilation time; AMD will then be very fast.





Thank you for your answer!

Can you give me a useful link that explains how to use binary OpenCL objects, please?

By "irrelevant" I meant that the slow start-up time is independent of those parts.


"but that is not a particularly important time to measure as programmes generally keep running once they are started"
My program run 3 sec or just 1/3 sec: this is not important? 🙂
I do some top secret codes (not so simpli, that N-queen problem), and it is very important time for me -and i know other people who need the quick start-up time (and runtime). (I know: there are some programs where it is really not important...)
Amd+OpenCl: the kernels run 90 ms, and the program run 3500 ms... 100/3500 = 1/35 (waste).
Nvidia ion+OpenCl: the kernels run 300 ms, and the program run 450 ms... 300/450 = 2/3.
The general programing really is that i can do any programs, that not sure that will continually run.
I don't want to hurt you or Amd, with this, but it is a problem now, that have to resolve. (I know: "it's not a really probelm" is a confortabel answer for you, but for me and the others: it is unacceptable.)


See the binary kernels thread; you can ignore the part about stripping the kernel with readelf.

But even then you should consider another approach to your problem and avoid starting the program repeatedly; starting a process is considerable overhead in itself.


Thank you for your answer; I will try it soon.
The other approach you suggest is not possible, because there are many OpenCL programs that run sequentially (connected to each other with pipes). As separate programs they are very easy to swap, and their number keeps growing. But on AMD, if 10 programs run sequentially, that would take at least about 27000 ms (on Ion at most about 4500 ms); I did not measure this directly.


Originally posted by: szabi_h AMD + OpenCL: the kernels run ~90 ms, but the program runs 3500 ms; only about 1/35 of the runtime is useful work. NVIDIA Ion + OpenCL: the kernels run 300 ms and the program runs 450 ms; 300/450 = 2/3.

I have seen a performance differential like this as well, but it is simply NVIDIA keeping a cache of compiled programs. They seem to check the timestamp and load a cached version if the file hasn't changed. The first time you run something new it is much slower to start; the next run is really fast, at least 15x faster in my tests. If you add a space to the file, it is slow again for one more launch. I confirmed this by changing only the timestamp of a file (not its contents) before compilation: NVIDIA is then slow on every launch.


But they are still faster than AMD's compiler (confirmed by the hours I sit compiling code side by side). This really needs to be looked at by everyone, including Intel, whose compiler is the worst by far.


Thanks for your valuable feedback on start-up times.

I would also suggest offline compilation to speed up start-up.


After I used a binary kernel (if I understand correctly, this is the offline compilation), the OpenCL program sped up:

Kernel compile time (all the compile calls together, without loading the binary kernel from a file):
OpenCL FirePro + Atom 330: 3-4 ms

Loading the binary kernel from a file adds another 2-4 ms, but if someone writes a tool that turns the binary kernel file into a string constant, that drops to 0 ms. Either way, these 2-4 ms don't matter.

It's a big step forward. This solution is not as comfortable as using an NVIDIA card, but for me it solves the compile-time problem perfectly.

The bigger problem remains: the time of the first OpenCL call.


Yes, I know that, but the NVIDIA program's first-run runtime stays under 500 ms.


I can't find any benchmarks that measure these times.


I decided not to unpack the functions. The program (shortened):

//Main.cpp:
#include "MyOpenClSystem.h"
#include "iostream"
#include "ctime"

int MAX_LEVEL;
int MAX_DATA_COUNT;
size_t D_SIZE;
const size_t INT_SIZE = sizeof(int);
const size_t STRUCT_SIZE = sizeof(MY_TYPE);
int MAX_BLOCK_SIZE;
int actLevel = 1;
const char *updateKernelParams = "";

//Some kernels:
const char *fSource =
    "\n"
    "__kernel void updateKernelParams(__global int *level, __global int *N)\n"
    "{\n"
    "}\n"
    "\n"
    "__kernel void mainKernel(__global int p1, __global int p2, __global int p3, __global int p4)\n"
    "{\n"
    "}\n"
    "\n";

MY_TYPE firstNode, target, *targets;
clSystem myCSys;

void initThreadFunc(void *pParams)
{
    cl_ulong maxAllocMemory, maxAllocMemory_s;
    cl_uint computeUnits;    // CL_DEVICE_MAX_COMPUTE_UNITS returns cl_uint
    size_t maxWorkGroupSize; // CL_DEVICE_MAX_WORK_GROUP_SIZE returns size_t
    clGetDeviceInfo(myCSys.devices, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), &maxAllocMemory, NULL);
    clGetDeviceInfo(myCSys.devices, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &computeUnits, NULL);
    clGetDeviceInfo(myCSys.devices, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &maxWorkGroupSize, NULL);
    COMPUTE_UNITS = (int)computeUnits;
    MAX_BLOCK_SIZE = (int)maxWorkGroupSize;
    if (COMPUTE_UNITS < 1)
        COMPUTE_UNITS = 1;
    maxAllocMemory_s = (cl_ulong)(maxAllocMemory / STRUCT_SIZE);
    MAX_DATA_COUNT = 1;
    for (int i = 0; i < MAX_LEVEL - 1; ++i)
    {
        MAX_DATA_COUNT *= NUMBER_OF_OPERATORS;
        if (maxAllocMemory_s < MAX_DATA_COUNT)
        {
            MAX_DATA_COUNT = maxAllocMemory_s;
            printf("Warning: very high tree!\n");
            break;
        }
    }
    D_SIZE = MAX_DATA_COUNT * STRUCT_SIZE;
    targets = (MY_TYPE *)malloc(D_SIZE);
    firstNode.isTarget = false;
    target.isTarget = false;
    mallocSpaceAndDataCopyToGpu(&firstNode, &myCSys);
}

int main(int argc, char **argv)
{
    clock_t mstart1, mend1, mstart, mend;
    double mdif;
    mstart1 = clock();
    init_GPU_System(&myCSys);
    MAX_LEVEL = 8;
    initThreadFunc(NULL);
    //mstart = clock();
    createKernel(&myCSys);
    /*mend = clock(); mdif = (double)(mend - mstart); printf("Create kernel: %f \n\n", mdif);//*/
    //mstart = clock();
    runKernel(&myCSys);
    /*mend = clock(); mdif = (double)(mend - mstart); printf("Run kernel: %f \n\n", mdif);//*/
    dataCopyToHost(&target, &targets, &myCSys);
    writeAndDestroyThreadFunc(NULL);
    free_All_System_Resources(&myCSys);
    mend1 = clock();
    mdif = (double)(mend1 - mstart1);
    printf("Runtime: %f \n\n", mdif);
    return 0;
}

//MyOpenClSystem.h:
#include "stdio.h"
#include "stdlib.h"
#include <CL/cl.h>
#include "ctime"

#define MY_TYPE Kiralyno
#define NUMBER_OF_OPERATORS 7

extern int COMPUTE_UNITS, MAX_BLOCK_SIZE;
extern int MAX_LEVEL;
extern int MAX_DATA_COUNT;
extern size_t D_SIZE;
extern const size_t INT_SIZE;
extern const size_t STRUCT_SIZE;
extern int actLevel;
extern const char *fSource;
extern const char *updateKernelParams;

struct clSystem
{
    cl_platform_id platforms;
    cl_device_id devices;
    cl_context context;
    cl_command_queue cmdQueue;
    cl_kernel kernelMain, kernelUpdateKernelParams;
    cl_program programMain;
    //cl_uint numDevices;
    cl_mem dev_0, s1, s2;
    cl_mem *dev_a, *dev_c, *help_var;
    cl_mem device_target;
    int host_target;
    cl_mem dev_level, dev_n;
};

inline void init_GPU_System(clSystem *csys)
{
    cl_int status;
    cl_uint numPlatforms;
    /*clock_t mstart, mend; double mdif; mstart = clock();//*/
    status = clGetPlatformIDs(1, &(csys->platforms), &numPlatforms);
    /*mend = clock(); mdif = (double)(mend - mstart); printf("First OpenCL function: %f \n\n", mdif);//*/
    if (status != CL_SUCCESS) { printf("clGetPlatformIDs failed\n"); exit(-1); }
    if (numPlatforms == 0) { printf("No platforms detected.\n"); exit(-1); }
    status = clGetDeviceIDs(csys->platforms, CL_DEVICE_TYPE_GPU, 1, &(csys->devices), NULL);
    if (status != CL_SUCCESS) { printf("clGetDeviceIDs failed\n"); exit(-1); }
    csys->context = clCreateContext(NULL, 1, &(csys->devices), NULL, NULL, &status);
    if (status != CL_SUCCESS || csys->context == NULL) { printf("clCreateContext failed\n"); exit(-1); }
}

inline void createKernel(clSystem *csys)
{
    cl_int status;
    csys->programMain = clCreateProgramWithSource(csys->context, 1, (const char **)&fSource, NULL, &status);
    if (status != CL_SUCCESS) { printf("clCreateProgramWithSource failed\n"); exit(-1); }
    cl_int buildErr = clBuildProgram(csys->programMain, 0, NULL, NULL, NULL, NULL);
    if (buildErr != CL_SUCCESS)
    {
        printf("Program failed to build.\n");
        cl_build_status buildStatus;
        clGetProgramBuildInfo(csys->programMain, csys->devices, CL_PROGRAM_BUILD_STATUS, sizeof(cl_build_status), &buildStatus, NULL);
        if (buildStatus != CL_SUCCESS)
        {
            char *buildLog;
            size_t buildLogSize;
            clGetProgramBuildInfo(csys->programMain, csys->devices, CL_PROGRAM_BUILD_LOG, 0, NULL, &buildLogSize);
            buildLog = (char *)malloc(buildLogSize);
            if (buildLog == NULL) { perror("malloc"); exit(-1); }
            clGetProgramBuildInfo(csys->programMain, csys->devices, CL_PROGRAM_BUILD_LOG, buildLogSize, buildLog, NULL);
            buildLog[buildLogSize - 1] = '\0';
            printf("Device %u Build Log:\n%s\n", 0, buildLog);
            free(buildLog);
        }
        exit(0);
    }
    csys->kernelMain = clCreateKernel(csys->programMain, "mainKernel", &status);
    if (status != CL_SUCCESS) { printf("clCreateKernel failed\n"); exit(-1); }
    csys->kernelUpdateKernelParams = clCreateKernel(csys->programMain, "updateKernelParams", &status);
    if (status != CL_SUCCESS) { printf("clCreateKernel failed\n"); exit(-1); }
}


So you have a problem with OpenCL start-up time.

The only thing I can recommend is looking into OpenCL binary kernels; they can speed up creating and compiling the kernels.


Thank you for your advice, but I don't have a problem with OpenCL start-up time in general; I have a problem with AMD's OpenCL start-up time. (NVIDIA's OpenCL start-up time is fine.)


I don't know your OpenCL implementation code, but (perhaps) if there is data that is necessary yet whose values never change, it could live in a separate memory area reachable through a single pointer.
Then a service (server) could run in the background. At operating-system start-up it would allocate a fresh "OpenCL object" area and copy all the variable data into it (plus a link to the constant area). When an OpenCL program starts, it immediately receives a pointer to this prepared area, and the service immediately starts preparing a new area for the next request (before that request arrives, while the previous one is being served).
The constant area is shared through a single pointer (for fast copying and a fast service).
I hope this is helpful.
Also, it would need a waiting queue (a linked list that can grow without limit). When the first OpenCL program makes a request, the service hands it a pointer immediately and begins preparing a new area; if the next OpenCL program arrives too soon, it may have to wait a few ms (but
the user probably cannot start new OpenCL applications that quickly).


Perhaps, on top of the above service, there could be a kernel-compiler kernel on the device that translates the kernel program, one instruction per thread. (The generated code would then already be on the device.)


I tried the "HelloCLVS10" project from the newest (2.5) AMD SDK. Its runtime (with Atom 330 and V4800, Ion disabled in the BIOS):
1968-2200 ms (you can really feel this delay 😞).
My changes (marked //--):


#include "ctime" //--
clock_t mstart, mend; //--
double mdif; //--
mstart = clock(); //--
    cl_int err;

    // Platform info


    err = queue.finish();
    if (err != CL_SUCCESS) {
        std::cerr << "Event::wait() failed (" << err << ")\n";
    }
mend = clock(); //--
mdif = (double)(mend - mstart); //--
printf("Runtime: %f \n\n", mdif); //--

    std::cout<<"Done\nPassed!\n" << std::endl;
    return SDK_SUCCESS;