19 Replies Latest reply on Dec 18, 2011 5:08 PM by szabi_h

    OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?

    szabi_h
      I felt that my new AMD video card is too slow in OpenCL, so I tested it against an Ion integrated graphics card.

      Hello!

      I bought an ATI FirePro V4800 video card for my diploma work, but I feel that this new AMD card is too slow in OpenCL. I compared it against an Ion integrated graphics card (the only other card I have is a Radeon HD 2600 Pro, and the HD 2600 does not support OpenCL), and the Ion was incredibly faster than my new AMD professional card.

      Test config:
      - Intel Atom 330 (with Hyper-Threading), Ion motherboard with PCI-E 2.0, 2 GB of 800 MHz high-latency no-name DDR2 memory.
      - Athlon 64 X2 3800+, nForce 430 with PCI-E 1.1 (dual-channel, low-latency 4 GB 800 MHz DDR2 GeIL Ultra memory).
      - ATI FirePro V4800 video card vs. the Atom's Ion video card:
      The Ion has 16 stream processors; it is a cheap integrated card that uses system memory.
      The FirePro V4800 has 400 stream processors; it is a professional card with GDDR5 memory.
      - In all tests I used the same program, which I wrote myself.
      - Operating system: Windows 7, VS2010 (under Linux + gcc it is slower).

      Program runtime test results (number of test runs: 6):
      OpenCL FirePro + Athlon X2: 1155-1226 ms
      OpenCL FirePro + Atom 330 (Ion video card disabled in the BIOS): 2744-2822 ms (a delay I can feel)
      OpenCL Ion + Atom 330: 338-412 ms
      CUDA Ion + Atom 330: 324-340 ms

      The kernel runtime:
      OpenCL FirePro (both systems): 26-67 ms
      OpenCL Ion + Atom 330: 148-163 ms
      CUDA Ion + Atom 330: 111-131 ms

      What is going on? I thought about it and ran some more tests. I measured the individual calls and found that the first OpenCL call is very slow on the FirePro:
      OpenCL FirePro + Athlon X2: 750-804 ms
      OpenCL FirePro + Atom 330 (Ion video card disabled): 1639-1759 ms
      OpenCL FirePro + Atom 330 (Ion video card enabled): 2246-2265 ms
      OpenCL Ion + Atom 330 (FirePro NOT in the slot, without AMD drivers): 31-44 ms
      OpenCL Ion + Atom 330 (FirePro in the slot, with AMD drivers): 98-123 ms
      CUDA Ion + Atom 330 (in CUDA, all of the initialization is a single call): 0 ms

      The first OpenCL call was:
        status  = clGetPlatformIDs(1, &(csys->platforms), &numPlatforms);
      I then tried a different way of initializing OpenCL, so that another call came first. The runtime of the first call did not change, but the time spent in clGetPlatformIDs dropped sharply. So the cost seems to belong to whichever OpenCL call runs first, not to clGetPlatformIDs itself.
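      The first-call effect described above can be checked by timing the same call twice in a row. The following is a minimal sketch, not the code used for the measurements in this post: `timeStageMs` and `reportFirstVsSecond` are hypothetical helper names, and `std::chrono::steady_clock` is swapped in for `clock()` so that the result is elapsed wall-clock time in milliseconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <utility>

// Runs `stage` once and returns its wall-clock duration in milliseconds.
// steady_clock is monotonic, so system clock adjustments do not affect it.
template <typename Stage>
double timeStageMs(Stage&& stage) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Stage>(stage)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: times the same call twice in a row and prints both.
// If only the first duration is large, the cost is one-time initialization
// rather than the call itself.
template <typename Stage>
void reportFirstVsSecond(const char* label, Stage&& stage) {
    assert(label != nullptr);
    const double first = timeStageMs(stage);
    const double second = timeStageMs(stage);
    std::printf("%s: first call %.3f ms, second call %.3f ms\n",
                label, first, second);
}
```

      A call such as `reportFirstVsSecond("clGetPlatformIDs", [&] { clGetPlatformIDs(1, &platform, &numPlatforms); });` would be one way to use it, where `platform` and `numPlatforms` are assumed local variables. Note that `clock()` returns ticks, which have to be divided by `CLOCKS_PER_SEC` to get seconds, and that on POSIX systems it reports processor time rather than elapsed time.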

      I measured the kernel compile time (the total time of all compile calls):
      OpenCL FirePro + Atom 330: 608-648 ms
      OpenCL Ion + Atom 330: 1-2 ms
      CUDA Ion + Atom 330: there is no compile time at runtime (but building the program itself is very slow)

      I did not measure the FirePro with the Athlon X2 (but I think it is faster than FirePro + Atom and much slower than Ion + Atom).

       Can anybody give me some advice on reducing the OpenCL runtime on my ATI FirePro video card? (Optimizing the kernel is pointless for now, while the other times are like this...)
       (Sorry for my English.)

        • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
          szabi_h

          In all test cases my program's results were correct (the program ran properly).

          • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
            MicahVillmow
            szabi_h,
            What program are you using? Have you tried running some of our benchmark samples to compare?
              • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                szabi_h

                I wrote a program that solves the 7-queens problem on a 7x7 chessboard (AI), but this part of the program is irrelevant.
                The first functions (the ones with very high runtimes) are in the attached code.

                See below.

                  • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                    notzed

                    Originally posted by: szabi_h I wrote a program that solves the 7-queens problem on a 7x7 chessboard (AI), but this part of the program is irrelevant.

                    Not irrelevant at all. The hardware is different, so it may need different OpenCL code in order to achieve high performance. OpenCL does not abstract the hardware out of the equation.

                    But it seems your post is about start-up time, and that is not a particularly important time to measure, as programs generally keep running once they are started. Use binary OpenCL objects if you want to avoid compilation time; AMD will then be very fast.

                  • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                    szabi_h

                    I can't find any benchmarks that measure these times.

                    • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                      szabi_h

                      I decided not to expand the functions. The program (in short):

                      //Main.cpp:
                      #include "MyOpenClSystem.h"
                      #include "iostream"
                      #include "ctime"

                      int MAX_LEVEL;
                      int MAX_DATA_COUNT;
                      size_t D_SIZE;
                      const size_t INT_SIZE = sizeof(int);
                      const size_t STRUCT_SIZE = sizeof(MY_TYPE);
                      int MAX_BLOCK_SIZE;
                      int actLevel = 1;
                      const char *updateKernelParams = "";

                      //Some kernels:
                      const char *fSource =
                      "\n" \
                      "__kernel void updateKernelParams(__global int *level, __global int *N)\n" \
                      "{\n" \
                      "}\n" \
                      "\n" \
                      "__kernel void mainKernel(__global int p1, __global int p2, __global int p3, __global int p4)\n" \
                      "{\n" \
                      "}\n" \
                      "\n";

                      MY_TYPE firstNode, target, *targets;
                      clSystem myCSys;

                      void initThreadFunc(void* pParams)
                      {
                          cl_ulong maxAllocMemory, maxAllocMemory_s;
                          cl_uint computeUnits = 0;
                          size_t maxWorkGroupSize = 0;
                          // Each query uses the type the OpenCL spec defines for that parameter.
                          clGetDeviceInfo(myCSys.devices, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAllocMemory), &maxAllocMemory, NULL);
                          clGetDeviceInfo(myCSys.devices, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, NULL);
                          clGetDeviceInfo(myCSys.devices, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
                          COMPUTE_UNITS = (int)computeUnits;
                          MAX_BLOCK_SIZE = (int)maxWorkGroupSize;
                          if(COMPUTE_UNITS < 1) COMPUTE_UNITS = 1;
                          maxAllocMemory_s = (cl_ulong)(maxAllocMemory / STRUCT_SIZE);
                          MAX_DATA_COUNT = 1;
                          for(int i = 0; i < MAX_LEVEL - 1; ++i)
                          {
                              MAX_DATA_COUNT *= NUMBER_OF_OPERATORS;
                              if(maxAllocMemory_s < MAX_DATA_COUNT)
                              {
                                  MAX_DATA_COUNT = maxAllocMemory_s;
                                  printf("Warning: very high tree!\n");
                                  break;
                              }
                          }
                          D_SIZE = MAX_DATA_COUNT * STRUCT_SIZE;
                          targets = (MY_TYPE*)malloc(D_SIZE);
                          firstNode.isTarget = false;
                          target.isTarget = false;
                          mallocSpaceAndDataCopyToGpu(&firstNode, &myCSys);
                      }

                      int main(int argc, char ** argv)
                      {
                          clock_t mstart1, mend1, mstart, mend;
                          double mdif;
                          mstart1 = clock();
                          init_GPU_System(&myCSys);
                          MAX_LEVEL = 8;
                          initThreadFunc(NULL);
                          //mstart = clock();
                          createKernel(&myCSys);
                          /*mend = clock();
                          mdif = (double)(mend - mstart);
                          printf("Create kernel: %f \n\n", mdif);//*/
                          //mstart = clock();
                          runKernel(&myCSys);
                          /*mend = clock();
                          mdif = (double)(mend - mstart);
                          printf("Run kernel: %f \n\n", mdif);//*/
                          dataCopyToHost(&target, &targets, &myCSys);
                          writeAndDestroyThreadFunc(NULL);
                          free_All_System_Resources(&myCSys);
                          mend1 = clock();
                          mdif = (double)(mend1 - mstart1);
                          printf("Runtime: %f \n\n", mdif);//*/
                          return 0;
                      }

                      //MyOpenClSystem.h:
                      #include "stdio.h"
                      #include "stdlib.h"
                      #include <CL/cl.h>
                      #include "ctime"

                      #define MY_TYPE Kiralyno
                      #define NUMBER_OF_OPERATORS 7

                      extern int COMPUTE_UNITS, MAX_BLOCK_SIZE;
                      extern int MAX_LEVEL;
                      extern int MAX_DATA_COUNT;
                      extern size_t D_SIZE;
                      extern const size_t INT_SIZE;
                      extern const size_t STRUCT_SIZE;
                      extern int actLevel;
                      extern const char *fSource;
                      extern const char *updateKernelParams;

                      struct clSystem
                      {
                          cl_platform_id platforms;
                          cl_device_id devices;
                          cl_context context;
                          cl_command_queue cmdQueue;
                          cl_kernel kernelMain, kernelUpdateKernelParams;
                          cl_program programMain;
                          //cl_uint numDevices;
                          cl_mem dev_0, s1, s2;
                          cl_mem *dev_a, *dev_c, *help_var;
                          cl_mem device_target;
                          int host_target;
                          cl_mem dev_level, dev_n;
                      };

                      inline void init_GPU_System(clSystem *csys)
                      {
                          cl_int status;
                          cl_uint numPlatforms;
                          /*clock_t mstart, mend;
                          double mdif;
                          mstart = clock();//*/
                          status = clGetPlatformIDs(1, &(csys->platforms), &numPlatforms);
                          /*mend = clock();
                          mdif = (double)(mend - mstart);
                          printf("First OpenCL function: %f \n\n", mdif);//*/
                          if(status != CL_SUCCESS)
                          {
                              printf("clGetPlatformIDs failed\n");
                              exit(-1);
                          }
                          if(numPlatforms == 0)
                          {
                              printf("No platforms detected.\n");
                              exit(-1);
                          }
                          status = clGetDeviceIDs(csys->platforms, CL_DEVICE_TYPE_GPU, 1, &(csys->devices), NULL);
                          if(status != CL_SUCCESS)
                          {
                              printf("clGetDeviceIDs failed\n");
                              exit(-1);
                          }
                          csys->context = clCreateContext(NULL, 1, &(csys->devices), NULL, NULL, &status);
                          if(status != CL_SUCCESS || csys->context == NULL)
                          {
                              printf("clCreateContext failed\n");
                              exit(-1);
                          }
                      }

                      inline void createKernel(clSystem * csys)
                      {
                          cl_int status;
                          csys->programMain = clCreateProgramWithSource(csys->context, 1, (const char**)& fSource, NULL, &status);
                          if(status != CL_SUCCESS)
                          {
                              printf("clCreateProgramWithSource failed\n");
                              exit(-1);
                          }
                          cl_int buildErr = clBuildProgram(csys->programMain, 0, NULL, NULL, NULL, NULL);
                          if(buildErr != CL_SUCCESS)
                          {
                              printf("Program failed to build.\n");
                              cl_build_status buildStatus;
                              clGetProgramBuildInfo(csys->programMain, csys->devices, CL_PROGRAM_BUILD_STATUS, sizeof(cl_build_status), &buildStatus, NULL);
                              if(buildStatus != CL_SUCCESS)
                              {
                                  char *buildLog;
                                  size_t buildLogSize;
                                  clGetProgramBuildInfo(csys->programMain, csys->devices, CL_PROGRAM_BUILD_LOG, 0, NULL, &buildLogSize);
                                  buildLog = (char*)malloc(buildLogSize);
                                  if(buildLog == NULL)
                                  {
                                      perror("malloc");
                                      exit(-1);
                                  }
                                  clGetProgramBuildInfo(csys->programMain, csys->devices, CL_PROGRAM_BUILD_LOG, buildLogSize, buildLog, NULL);
                                  buildLog[buildLogSize-1] = '\0';
                                  printf("Device %u Build Log:\n%s\n", 0, buildLog);
                                  free(buildLog);
                              }
                              exit(0);
                          }
                          csys->kernelMain = clCreateKernel(csys->programMain, "mainKernel", &status);
                          if(status != CL_SUCCESS)
                          {
                              printf("clCreateKernel failed\n");
                              exit(-1);
                          }
                          csys->kernelUpdateKernelParams = clCreateKernel(csys->programMain, "updateKernelParams", &status);
                          if(status != CL_SUCCESS)
                          {
                              printf("clCreateKernel failed\n");
                              exit(-1);
                          }
                      }

                        • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                          nou

                          So you have a problem with OpenCL start-up time.

                          The only thing I can recommend is to look into OpenCL binary kernels. They can speed up creating and compiling kernels.
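                          The binary-kernel suggestion usually means building from source once, saving the resulting device binary to disk, and loading it on later runs. In OpenCL the bytes would come from `clGetProgramInfo` with `CL_PROGRAM_BINARY_SIZES` and `CL_PROGRAM_BINARIES`, and would be handed back through `clCreateProgramWithBinary` followed by `clBuildProgram`. The sketch below covers only the disk-cache side so that it compiles without the OpenCL SDK; `saveBinary`, `loadBinary` and `cacheFileName` are hypothetical helper names, and the choice of cache key (device name, driver version, kernel source) is an assumption, not something stated in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Writes a compiled program binary to disk. With OpenCL, `bytes` would be
// filled via clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...) after a build.
bool saveBinary(const std::string& path, const std::vector<unsigned char>& bytes) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}

// Reads a cached binary back. Returns false when no usable cache file exists,
// in which case the caller falls back to building from source.
bool loadBinary(const std::string& path, std::vector<unsigned char>& bytes) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    bytes.assign(std::istreambuf_iterator<char>(in),
                 std::istreambuf_iterator<char>());
    return !bytes.empty();
}

// Hypothetical cache key: a binary is only valid for the device and driver
// that produced it, so both are hashed together with the kernel source.
// FNV-1a is used because it gives the same value on every run.
std::string cacheFileName(const std::string& deviceName,
                          const std::string& driverVersion,
                          const std::string& kernelSource) {
    const std::string key = deviceName + '\n' + driverVersion + '\n' + kernelSource;
    std::uint64_t hash = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : key) {
        hash ^= c;
        hash *= 1099511628211ULL;                  // FNV-1a 64-bit prime
    }
    return "clcache_" + std::to_string(hash) + ".bin";
}
```

                          On start-up the program would call `loadBinary(cacheFileName(...), bytes)` first and only compile from source (then call `saveBinary`) when that returns false. Whether this removes the long first-call delay measured earlier in the thread is a separate question, since that delay was observed before any kernel was compiled.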

                            • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                              szabi_h

                              Thank you for your advice, but my problem is not with OpenCL start-up time in general. My problem is with the AMD OpenCL start-up time... (The Nvidia OpenCL start-up time is fine.)

                                • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                                  szabi_h

                                  I don't know your OpenCL implementation, but perhaps this would help: if some of the data needed at start-up never changes, it should live in a separate region that is reached through a single pointer.
                                  Beyond that, there should be a service (server) running in the background. At operating system start-up, in parallel with everything else, it allocates a new OpenCL object region and copies all the variable data into it, together with a link to the constant region. When an OpenCL program starts, it immediately receives a pointer to this prepared region, and the service at once begins producing a new region for the following request (ahead of that request, while the previous one is still being served).
                                  The constant region is reached through just one pointer, which keeps the copy and the service fast.
                                  I hope this will be helpful.
                                  + It must also handle a waiting queue (a linked list that can grow without limit). When the first OpenCL program makes a request, the service gives it a pointer immediately and begins producing a new OpenCL region; if the next OpenCL program arrives in the meantime, it may have to wait a few ms (though the user probably cannot start a new OpenCL application that quickly).
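                                  The hand-out-and-rebuild idea above can be illustrated inside a single process. This is only a sketch under that assumption: `WarmSlot` is a hypothetical class name, the factory stands in for whatever expensive initialization is being hidden, and nothing here shows that OpenCL objects could be passed between processes by a system service.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <utility>

// Hypothetical pre-warming helper: one object is always being built (or is
// already finished) in the background, so a caller rarely waits for a build.
template <typename T>
class WarmSlot {
public:
    explicit WarmSlot(std::function<T()> factory)
        : factory_(std::move(factory)), next_(launch()) {}

    // Hands out the pre-built object, blocking only if it is not ready yet,
    // and immediately starts building the next one. Concurrent callers are
    // serialized by the mutex, which plays the role of the waiting queue.
    T take() {
        std::lock_guard<std::mutex> lock(mutex_);
        T ready = next_.get();
        next_ = launch();
        return ready;
    }

private:
    // Starts the factory on a separate thread and returns its future.
    std::future<T> launch() {
        return std::async(std::launch::async, factory_);
    }

    std::function<T()> factory_;  // declared before next_, which depends on it
    std::future<T> next_;
    std::mutex mutex_;
};
```

                                  A caller would construct `WarmSlot<SomeState> slot(buildState);` early and call `slot.take()` when the state is needed, where `SomeState` and `buildState` are placeholders. The first build still costs the full time; the benefit only appears from the point where a finished object is already waiting.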

                            • OpenCl test: Ati FirePro v4800 vs Nvidia Ion, The Ion is the winner. Why?
                              szabi_h

                              I tried the "HelloCLVS10" project from the AMD SDK (the newest version, 2.5). Your program's runtime (with the Atom 330 and the V4800, Ion disabled in the BIOS):
                              1968-2200 ms (I can feel this delay, so the measurement is accurate :( ).
                              Changes (marked with //--):

                              ...

                              #include
                              #include "ctime" //--
                              int
                              main()
                              {
                              clock_t mstart, mend; //--
                              double mdif; //--
                              mstart = clock(); //--
                                  cl_int err;

                                  // Platform info
                              ...

                              ...

                              ...
                                  err = queue.finish();
                                  if (err != CL_SUCCESS) {
                                      std::cerr << "Event::wait() failed (" << err << ")\n";
                                  }
                              mend = clock(); //--
                              mdif = (double)(mend - mstart); //--
                              printf("Runtime: %f \n\n", mdif); //--

                                  std::cout<<"Done\nPassed!\n" << std::endl;
                                  return SDK_SUCCESS;
                              }