
Sheeep
Journeyman III

Different Results using GPU and CPU

I had some problems with a large piece of code: I don't get the same results if I change the context from GPU to CPU.

So I tried a small kernel, but I get the same problem.

If I don't use the global ID, I get the correct result on the GPU. But on the CPU I get a result that is much bigger than the correct one.

__kernel void test1(__global float *a, __global float *b, __global float *c){
    int gis=get_global_size(0);
    for(int j=0;j<gis;j++){
        for(int i=0;i<100;i++){
            c[i]+=a[i]+b[i];
        }
    }
}

 

Then I tried this:

__kernel void test1(__global float *a, __global float *b, __global float *c){
    int gid=get_global_id(0);
    for(int i=0;i<100;i++){
        c[gid]+=a[gid]+b[gid];
    }
}

and it works on both CPU and GPU.

Does anyone know why?

0 Likes
10 Replies
eduardoschardong
Journeyman III

The first kernel is buggy; by pure coincidence it works on the GPU, where all threads read and write the same locations.

On the GPU all items are computed at the same time (if there are too many items, part of them are computed first, then another part, and so on), so results from a previous thread are overwritten. On the CPU each item is computed one at a time, so instead of overwriting the previous result it adds to it.
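The difference can be emulated on the host. Below is a plain C++ sketch (not OpenCL; `run_racy` is a made-up name, and the kernel's inner 100-iteration loop is dropped for brevity) of what happens when every work-item executes the same accumulation over the same elements:

```cpp
#include <cstddef>
#include <vector>

// Every "work-item" runs the identical accumulation over the same indices,
// as in the first kernel. Executed serially (CPU-like), each work-item
// adds on top of the previous one's result, so the output ends up
// `threads` times larger than a single pass.
std::vector<float> run_racy(int threads,
                            const std::vector<float>& a,
                            const std::vector<float>& b) {
    std::vector<float> c(a.size(), 0.0f);
    for (int t = 0; t < threads; ++t)            // serial "work-items"
        for (std::size_t i = 0; i < a.size(); ++i)
            c[i] += a[i] + b[i];                 // all of them write every i
    return c;
}
```

With one work-item the result is simply a + b; with four serial work-items it is four times that, which matches the "much bigger" result seen on the CPU.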

 

0 Likes

So to compute the second kernel correctly, do I need work size 1?

When my work size is 20, everything will run 20 times, right?

 

Another question: in OpenMP I have omp_get_num_threads(). Does it return the same as get_global_size() in OpenCL?

And is omp_get_thread_num() like get_global_id()?

 

 

EDIT: OK, I think I understand this now.

But running with work size 4 or 1024, my Phenom only uses one of its four cores. Using the CPU I get 99% activity on one core. Why does the CPU use only one core?

In the Windows Task Manager I see 6 threads.

 

 

0 Likes

Originally posted by: Sheeep

when my worksize is 20, everything will run 20 times, right?

If your work size is 20, then 20 work-items run the same kernel concurrently.

But running with worksize 4 or 1024, my phenom only use one of its four cores. Using CPU I get 99% Activity. Why does the cpu only use one core?



I have a quad-core CPU and have never seen such an issue. Could you please run the CLInfo sample and check the value of "Max compute units: x"?

This value should be the number of cores you have.

 

0 Likes

Running CLInfo I get:

Number of platforms:                             1
  Plaform Profile:                               FULL_PROFILE
  Plaform Version:                               OpenCL 1.0 ATI-Stream-v2.0.0
  Plaform Name:                                  ATI Stream
  Plaform Vendor:                                Advanced Micro Devices, Inc.


  Plaform Name:                                  ATI Stream
Number of devices:                               2
  Device Type:                                   CL_DEVICE_TYPE_CPU
  Device ID:                                     4098
  Max compute units:                             4
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           1024

(...)

  Name:                                          AMD Phenom(tm) II X4 940 Processor
  Vendor:                                        AuthenticAMD
  Driver version:                                1.0
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.0 ATI-Stream-v2.0.0

The number of Max compute units is 4, so I thought OpenCL should run on 4 cores. But it doesn't; it uses only one core. Is this normal?

0 Likes

Originally posted by: Sheeep Number of Max compute units is 4, so I thought OpenCL should run on 4 cores. But it doesn't. It uses only one core. Is this normal?


It is not normal. How are you concluding that it is running on a single core?

Please set/export the environment variable CPU_MAX_COMPUTE_UNITS to the number of cores you want to use, and compare the performance for 1, 2 and 4.

As a side note: CPU_MAX_COMPUTE_UNITS is not officially supported.

0 Likes

In the Windows Task Manager I have 25% CPU load: one core is at 100%, the others are at 0-1%.

Running

int cpun;

devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);

I get cpun=4.

How can I set CPU_MAX_COMPUTE_UNITS?

0 Likes

Originally posted by: Sheeep In windows taskmanger i have 25% cpu load, one core is 100%, other are 0-1%.

Running

int cpun;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);

i get cpun=4.

How can I set CPU_MAX_COMPUTE_UNITS?

 

Sheeep,

            Please send your complete sample (kernel code and host code). I will try it on my system and let you know the issue.

0 Likes

Hi genaganna,

thank you for helping me.

But I can't post the kernel I'm working on; I'm not allowed to make it public. So here is a very simple kernel instead. It's nonsense, but it shows the same problem, and it has a long loop, so it's possible to watch the CPU load.

 

Kernel:

__kernel void test1(__global float * a,__global float * b,__global float * c,__global const int *d){
    int gid=get_global_id(0);
    int gsi=get_global_size(0);
    int step=d[0]/gsi;
    for(int i=0;i<step;i++){
        for(int j=0;j<1000000;j++){
            c[gid+i*gsi]+=a[gid+i*gsi]+b[gid+i*gsi];
        }
    }
}
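A quick host-side check (plain C++ with a made-up helper name, assuming d[0] is a multiple of the global size) shows that the gid + i*gsi indexing visits every element exactly once, with no overlap between work-items:

```cpp
#include <vector>

// Count how often each of the N elements is touched when G work-items
// each run step = N / G iterations over index gid + i*G (the same
// strided pattern as the kernel above).
std::vector<int> visit_counts(int N, int G) {
    std::vector<int> count(N, 0);
    int step = N / G;
    for (int gid = 0; gid < G; ++gid)   // one pass per work-item
        for (int i = 0; i < step; ++i)
            ++count[gid + i * G];
    return count;
}
```

Every count comes out as 1, so unlike the first kernel there is no write conflict between work-items.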

 

Host:

 

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
int main(int argc, char** argv){
    
    cl_int error;
    std::string buildlog;
    cl::Context context;
    cl::Program program;
    std::vector<cl::Device> devices;
    try{
        //get CL platform info
        std::vector<cl::Platform> platforms;
            cl::Platform::get(&platforms);
            cl_context_properties platform=NULL;
             std::vector<cl::Platform>::iterator i;
            if(platforms.size() > 0){
                for(i = platforms.begin(); i != platforms.end(); ++i){
                    platform=((cl_context_properties)(*i)());
                    if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;    
                }
            }

        cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
        cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
        //Creating CL Device;
        context=cl::Context(CL_DEVICE_TYPE_CPU,cprops,NULL,NULL,&error);
        //getting Device List
        devices=context.getInfo<CL_CONTEXT_DEVICES>();
        //creating Commandqueue
        cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
        //Reading CL Programm from file
        std::ifstream file("BastelKernel.cl");    //Kernelname
        std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
        cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
        //Building CL Programm for Device
        program=cl::Program(context,source,&error);
        program.build(devices);
        //finally Kernels:
        cl::Kernel kernel1=cl::Kernel(program,"test1",&error);  
       
        //Hostmemory
        cl_int wsize=8192;
        cl_int worksize;
        int cpun;
        devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&worksize);
        devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
        std::cout<<"Max Compute Units: "<<cpun<<std::endl;
        cl_float *a=new cl_float[wsize];
        cl_float *b=new cl_float[wsize];
        cl_float *c=new cl_float[wsize];
        cl_int     *d=new cl_int[1]; d[0]=wsize;

        //initialing OpenCL Buffer(MemoryObjects)
        cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * wsize,a,&error);
        cl::Buffer CL2=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * wsize,b,&error);
        cl::Buffer CL3=cl::Buffer(context,CL_MEM_READ_WRITE|CL_MEM_USE_HOST_PTR,sizeof(c[0]) * wsize,c,&error);
        cl::Buffer CL4=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(d[0]),d,&error);

        //set Hostmemory
        for(int i=0;i<wsize;i++){
            a[i]=i+0.1;
            b[i]=wsize-i-0.1;
            c[i]=0;
        }
       
        //set Kernel Arguments
        kernel1.setArg(0,CL1);
        kernel1.setArg(1,CL2);
        kernel1.setArg(2,CL3);
        kernel1.setArg(3,CL4);

        //Running Kernel
        clock_t time;
        time=clock();
            queue.finish();
                queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(worksize,1,1),cl::NDRange(worksize,1,1),NULL,NULL); //queue.enqueueNDRangeKernel(kernelname,cl::NullRange,cl::NDRange(array length),cl::NDRange(1,1),NULL,NULL);
                queue.enqueueReadBuffer (CL3,CL_TRUE,0,sizeof(c[0])*wsize,c);
            queue.finish();
        time=clock()-time;

        //Ausgabe
        std::cout<<std::endl<<"Ergebnis OCL: "<<std::endl<<"";
        for(int i=0;i<wsize;i+=100){
            std::cout<<c[i]<<"    ";
        }
        std::cout<<std::endl;
        std::cout<<std::endl<<"Zeit OCL: "<<time<<"ms"<<std::endl<<std::endl;
        delete[] a; delete[] b; delete[] c; delete[] d;
    }catch(cl::Error& error){
        std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
       
    }
    std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
    std::cout<<"Buildlog:"<<std::endl;
    buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
    std::cout<<buildlog<<std::endl;
    std::cout<<"________________________________________________________________________________"<<std::endl;;
    
    return 0;
}
 

It just runs on one core at 100% load; the others are at 0%.

On the GPU it runs with 99% GPU load...

Best regards,

Sheeep

0 Likes

Originally posted by: Sheeep Hi genaganna,

thank you for helping me.

But I can't post the kernel i'm still working. I'm not allowed to make it public. So I have a very simple kernel. its nonsence, but it has the same problem. i have a long loop, so its possible to see the cpu load.

 

Sheeep,

           You have only one work-group; that is why it is using only one core. You should have workGroups >= computeUnits in order to fully utilize the available cores.

      I made the following change in your code and it uses all four of my cores fully:

      queue.enqueueNDRangeKernel(kernel1, cl::NullRange, cl::NDRange(wsize, 1, 1), cl::NDRange(worksize, 1, 1), NULL, NULL);
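In other words, for a 1-D NDRange the work-group count is just the global size divided by the local size, and that count needs to be at least the number of compute units. A trivial sketch (the function name is made up):

```cpp
#include <cstddef>

// 1-D NDRange: number of work-groups = global size / work-group size.
// In OpenCL 1.0 the global size must be a multiple of the local size.
constexpr std::size_t num_workgroups(std::size_t global, std::size_t local) {
    return global / local;
}
```

With the changed call, wsize = 8192 and worksize = 1024 give 8 work-groups, enough to occupy 4 compute units; the original call passed the same value for global and local size, i.e. exactly one work-group.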



0 Likes

Hi,

thank you for the help.

It works, and very fast... faster than before!

0 Likes