
    Different Results using GPU and CPU

    Sheeep

I ran into problems with a larger program: I don't get the same results when I switch the context from GPU to CPU.

So I tried a small kernel, and I get the same problem.

 

If I don't use the global ID, I get the correct result on the GPU. But on the CPU I get a result that is much bigger than the correct one.

      __kernel void test1(__global float *a, __global float *b, __global float *c){
          int gis=get_global_size(0);
    for(int j=0;j<gis;j++){
              for(int i=0;i<100;i++){
                  c[j]+=a[j]+b[j];   
              }
          }
      }

       

Then I tried this:

      __kernel void test1(__global float *a, __global float *b, __global float *c){
          int gid=get_global_id(0);
          for(int i=0;i<100;i++){
                  c[gid]+=a[gid]+b[gid];   
          }
      }

and it works on both CPU and GPU.

      Does anyone know why?

        • Different Results using GPU and CPU
          eduardoschardong

The first kernel is buggy; it only works on the GPU by pure coincidence. All work-items read and write the same locations.

On the GPU all items are computed at the same time (if there are too many items, part of them is computed first, then another part, and so on), so the result of a previous work-item is simply overwritten. On the CPU each item is computed one at a time, so instead of overwriting the previous result it adds to it.
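
In other words, the race is in the accumulation itself. Here is the first kernel again with the problem annotated (the comments are mine):

__kernel void test1(__global float *a, __global float *b, __global float *c){
    int gis=get_global_size(0);
    for(int j=0;j<gis;j++){        // every work-item loops over ALL elements
        for(int i=0;i<100;i++){
            c[j]+=a[j]+b[j];       // unsynchronized read-modify-write on c[j]
        }                          // by every work-item: a data race
    }
}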

           

            • Different Results using GPU and CPU
              Sheeep

So to compute the second kernel I need a worksize of 1?

When my worksize is 20, everything will run 20 times, right?

 

Another question:

In OpenMP there is omp_get_num_threads(). Does it return the same as get_global_size() in OpenCL?

And is omp_get_thread_num() like get_global_id()?
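
Roughly, this is the correspondence I have in mind (a sketch, assuming a 1-D NDRange):

__kernel void analogy(__global int *out){
    int id = get_global_id(0);      // like omp_get_thread_num()
    int n  = get_global_size(0);    // like omp_get_num_threads()
    out[id] = n;                    // each work-item writes its own slot
}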

               

               

EDIT: OK, I think I understand this.

But running with a worksize of 4 or 1024, my Phenom only uses one of its four cores. Using the CPU I get 99% activity. Why does the CPU only use one core?

In the Windows Task Manager I see 6 threads.

               

               

                • Different Results using GPU and CPU
                  genaganna

                   

                  Originally posted by: Sheeep

When my worksize is 20, everything will run 20 times, right?

If your worksize is 20, then 20 work-items run the same kernel concurrently.
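
For example (a sketch using the C++ bindings, assuming a kernel object kernel1 and a command queue as usual):

// 20 work-items execute the kernel, each with get_global_id(0) in [0,19];
// cl::NullRange lets the runtime pick the work-group size.
queue.enqueueNDRangeKernel(kernel1, cl::NullRange, cl::NDRange(20), cl::NullRange);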

But running with a worksize of 4 or 1024, my Phenom only uses one of its four cores. Using the CPU I get 99% activity. Why does the CPU only use one core?

                   



I have a quad-core CPU and I never see such an issue. Could you please run the CLInfo sample and check the value of "Max compute units: x".

This value should be the number of cores you have.

                   

                    • Different Results using GPU and CPU
                      Sheeep

                      Running CLInfo I get:

                      Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.0 ATI-Stream-v2.0.0
  Platform Name:                                 ATI Stream
  Platform Vendor:                               Advanced Micro Devices, Inc.


  Platform Name:                                 ATI Stream
                      Number of devices:                               2
                        Device Type:                                   CL_DEVICE_TYPE_CPU
                        Device ID:                                     4098
                        Max compute units:                             4
                        Max work items dimensions:                     3
                          Max work items[0]:                           1024
                          Max work items[1]:                           1024
                          Max work items[2]:                           1024
                        Max work group size:                           1024

                      (...)

                        Name:                                          AMD Phenom(tm) II X4 940 Processor
                        Vendor:                                        AuthenticAMD
                        Driver version:                                1.0
                        Profile:                                       FULL_PROFILE
                        Version:                                       OpenCL 1.0 ATI-Stream-v2.0.0

Max compute units is 4, so I thought OpenCL should run on 4 cores. But it doesn't; it uses only one core. Is this normal?

                        • Different Results using GPU and CPU
                          genaganna

                           

Originally posted by: Sheeep Max compute units is 4, so I thought OpenCL should run on 4 cores. But it doesn't; it uses only one core. Is this normal?


That is not normal. How are you concluding that it is running on a single core?

Please set/export the environment variable CPU_MAX_COMPUTE_UNITS to the number of cores you want to use, and compare the performance for 1, 2 and 4.

As a side note: CPU_MAX_COMPUTE_UNITS is not officially supported.

                            • Different Results using GPU and CPU
                              Sheeep

In the Windows Task Manager I have 25% CPU load; one core is at 100%, the others are at 0-1%.

                              Running

                              int cpun;

                              devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);

I get cpun=4.

                              How can I set CPU_MAX_COMPUTE_UNITS?

                                • Different Results using GPU and CPU
                                  genaganna

                                   

Originally posted by: Sheeep In the Windows Task Manager I have 25% CPU load; one core is at 100%, the others are at 0-1%.

Running

int cpun;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);

I get cpun=4.

How can I set CPU_MAX_COMPUTE_UNITS?

                                   

                                  Sheeep,

Please send your complete sample (kernel code and host code). I will try it on my system and let you know what the issue is.
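
As for setting CPU_MAX_COMPUTE_UNITS: it is an ordinary environment variable, so "set CPU_MAX_COMPUTE_UNITS=2" in the console (or "export CPU_MAX_COMPUTE_UNITS=2" on Linux) before starting the program should do it. Or in code, a sketch (Windows CRT; I am assuming the runtime reads the variable when the platform is initialized):

#include <cstdlib>

int main(){
    // Set the (unofficial) variable before any OpenCL call, so the
    // runtime sees it when it enumerates the CPU device.
    _putenv_s("CPU_MAX_COMPUTE_UNITS", "2");   // try "1", "2", "4" and compare
    // ... create platform, context and queue as usual ...
    return 0;
}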

                                    • Different Results using GPU and CPU
                                      Sheeep

                                      Hi genaganna,

                                      thank you for helping me.

But I can't post the kernel I'm still working on; I'm not allowed to make it public. So here is a very simple kernel instead. It's nonsense, but it shows the same problem. It has a long loop, so it's possible to watch the CPU load.

                                       

                                      Kernel:

__kernel void test1(__global float * a,__global float * b,__global float * c,__global const int *d){
    int gid=get_global_id(0);
    int gsi=get_global_size(0);
    int step=d[0]/gsi;                  // elements per work-item
    for(int i=0;i<step;i++){
        for(int j=0;j<1000000;j++){     // long inner loop to make the load visible
            c[gid+i*gsi]+=a[gid+i*gsi]+b[gid+i*gsi];
        }
    }
}

                                       

                                      Host:

                                       

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
                                      int main(int argc, char** argv){
                                          
                                          cl_int error;
                                          std::string buildlog;
                                          cl::Context context;
                                          cl::Program program;
                                          std::vector<cl::Device> devices;
                                          try{
                                              //get CL platform info
                                              std::vector<cl::Platform> platforms;
                                                  cl::Platform::get(&platforms);
                                                  cl_context_properties platform=NULL;
                                                   std::vector<cl::Platform>::iterator i;
                                                  if(platforms.size() > 0){
                                                      for(i = platforms.begin(); i != platforms.end(); ++i){
                                                          platform=((cl_context_properties)(*i)());
                                                          if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;    
                                                      }
                                                  }

                                              cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
                                              cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
                                              //Creating CL Device;
                                              context=cl::Context(CL_DEVICE_TYPE_CPU,cprops,NULL,NULL,&error);
                                              //getting Device List
                                              devices=context.getInfo<CL_CONTEXT_DEVICES>();
                                              //creating Commandqueue
                                              cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
        //Reading CL program from file
        std::ifstream file("BastelKernel.cl");    // kernel source file
                                              std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
                                              cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
        //Building CL program for device
                                              program=cl::Program(context,source,&error);
                                              program.build(devices);
                                              //finally Kernels:
                                              cl::Kernel kernel1=cl::Kernel(program,"test1",&error);  
                                             
                                              //Hostmemory
                                              cl_int wsize=8192;
                                              cl_int worksize;
                                              int cpun;
                                              devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&worksize);
                                              devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
                                              std::cout<<"Max Compute Units: "<<cpun<<std::endl;
                                              cl_float *a=new cl_float[wsize];
                                              cl_float *b=new cl_float[wsize];
                                              cl_float *c=new cl_float[wsize];
                                              cl_int     *d=new cl_int[1]; d[0]=wsize;

                                              //initialing OpenCL Buffer(MemoryObjects)
                                              cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * wsize,a,&error);
                                              cl::Buffer CL2=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * wsize,b,&error);
                                              cl::Buffer CL3=cl::Buffer(context,CL_MEM_READ_WRITE|CL_MEM_USE_HOST_PTR,sizeof(c[0]) * wsize,c,&error);
        cl::Buffer CL4=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(d[0]),d,&error);  // d is a single int

        //set host memory
        for(int i=0;i<wsize;i++){
            a[i]=i+0.1;
            b[i]=wsize-i-0.1;
            c[i]=0;
        }
                                             
                                              //set Kernel Arguments
                                              kernel1.setArg(0,CL1);
                                              kernel1.setArg(1,CL2);
                                              kernel1.setArg(2,CL3);
                                              kernel1.setArg(3,CL4);

                                              //Running Kernel
                                              clock_t time;
                                              time=clock();
                                                  queue.finish();
                queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(worksize,1,1),cl::NDRange(worksize,1,1),NULL,NULL); //queue.enqueueNDRangeKernel(kernelname,cl::NullRange,cl::NDRange(array_length),cl::NDRange(1,1),NULL,NULL);
                                                      queue.enqueueReadBuffer (CL3,CL_TRUE,0,sizeof(c[0])*wsize,c);
                                                  queue.finish();
                                              time=clock()-time;

        //output
        std::cout<<std::endl<<"OCL result: "<<std::endl;
        for(int i=0;i<wsize;i+=100){
            std::cout<<c[i]<<"    ";
        }
        std::cout<<std::endl;
        std::cout<<std::endl<<"OCL time: "<<time<<"ms"<<std::endl<<std::endl;
        delete[] a; delete[] b; delete[] c; delete[] d;  // the comma operator would only delete a
                                          }catch(cl::Error& error){
                                              std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
                                             
                                          }
                                          std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
                                          std::cout<<"Buildlog:"<<std::endl;
                                          buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
                                          std::cout<<buildlog<<std::endl;
                                          std::cout<<"________________________________________________________________________________"<<std::endl;;
                                          
    return 0;
}

                                       

It just runs on one core at 100% load; the others are at 0%.

On the GPU it runs at 99% GPU load...

                                       

Regards,

                                      Sheeep

                                        • Different Results using GPU and CPU
                                          genaganna

                                           

Originally posted by: Sheeep Hi genaganna,

thank you for helping me.

But I can't post the kernel I'm still working on; I'm not allowed to make it public. So here is a very simple kernel instead. It's nonsense, but it shows the same problem. It has a long loop, so it's possible to watch the CPU load.

                                           

                                           

                                          Sheeep,

You have only one work-group; that is why it is using only one core. You should have workGroups >= computeUnits in order to utilize the available cores fully.

I made the following change in your code and it uses my four cores fully:

queue.enqueueNDRangeKernel(kernel1, cl::NullRange, cl::NDRange(wsize, 1, 1), cl::NDRange(worksize, 1, 1), NULL, NULL);
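
With wsize = 8192 work-items in total and a local size of worksize = 1024, that gives 8192/1024 = 8 work-groups, which the runtime can spread over the 4 compute units. The same call with the arithmetic spelled out (a sketch):

size_t global = wsize;     // 8192 work-items in total
size_t local  = worksize;  // 1024 work-items per group
// number of work-groups = global/local = 8 >= 4 compute units
queue.enqueueNDRangeKernel(kernel1, cl::NullRange, cl::NDRange(global), cl::NDRange(local));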