I had some problems with a larger piece of code: I don't get the same results if I change the context from GPU to CPU.
So I tried a little kernel, but I get the same problem.
If I don't use the global id, I get the correct result on the GPU. But on the CPU I get a result that is much bigger than the correct one.
__kernel void test1(__global float *a, __global float *b, __global float *c){
    int gis=get_global_size(0);
    for(int j=0;j<gis;j++){
        for(int i=0;i<100;i++){
            c[j]+=a[j]+b[j];
        }
    }
}
Then I tried this:
__kernel void test1(__global float *a, __global float *b, __global float *c){
    int gid=get_global_id(0);
    for(int i=0;i<100;i++){
        c[gid]+=a[gid]+b[gid];
    }
}
and it works on both CPU and GPU.
Does anyone know why?
The first kernel is buggy; it only works on the GPU by pure coincidence, because all threads read and write the same locations.
On the GPU all items are computed at the same time (if there are too many items, part of them is computed first, then another part, and so on), so the result from a previous thread is simply overwritten. On the CPU the items are computed one at a time, so instead of overwriting the previous result, each work-item adds to it.
So to compute the second kernel I need work size 1?
When my work size is 20, everything will run 20 times, right?
Another question:
In OpenMP I have omp_get_num_threads(). Does it return the same as get_global_size() in OpenCL?
And is omp_get_thread_num() like get_global_id()?
EDIT: OK, I think I understand this.
But running with work size 4 or 1024, my Phenom only uses one of its four cores. Using the CPU I get 99% activity. Why does the CPU only use one core?
In the Windows Task Manager I have 6 threads.
Originally posted by: Sheeep
when my work size is 20, everything will run 20 times, right?
If your work-group size is 20, then 20 threads run the same kernel concurrently.
But running with work size 4 or 1024, my Phenom only uses one of its four cores. Using the CPU I get 99% activity. Why does the CPU only use one core?
I have a quad-core CPU and have never seen such an issue. Could you please run the CLInfo sample and check the value of "Max compute units: x"?
This value should be the number of cores you have.
Running CLInfo I get:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.0 ATI-Stream-v2.0.0
Platform Name: ATI Stream
Platform Vendor: Advanced Micro Devices, Inc.
Platform Name: ATI Stream
Number of devices: 2
Device Type: CL_DEVICE_TYPE_CPU
Device ID: 4098
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
(...)
Name: AMD Phenom(tm) II X4 940 Processor
Vendor: AuthenticAMD
Driver version: 1.0
Profile: FULL_PROFILE
Version: OpenCL 1.0 ATI-Stream-v2.0.0
The number of max compute units is 4, so I thought OpenCL should run on 4 cores. But it doesn't; it uses only one core. Is this normal?
Originally posted by: Sheeep The number of max compute units is 4, so I thought OpenCL should run on 4 cores. But it doesn't; it uses only one core. Is this normal?
It is not normal. How are you concluding that it is running on a single core?
Please set/export the environment variable CPU_MAX_COMPUTE_UNITS to the number of cores you want to use and compare the performance for 1, 2 and 4.
As a side note: CPU_MAX_COMPUTE_UNITS is not officially supported.
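In case it helps: an environment variable like this is set in the shell before launching the host program. A sketch for both platforms (the variable is the unofficial one named above; the program name is a hypothetical placeholder):

```shell
# Windows (cmd.exe), before starting the host program:
#   set CPU_MAX_COMPUTE_UNITS=2

# Linux / bash:
export CPU_MAX_COMPUTE_UNITS=2
echo "CPU_MAX_COMPUTE_UNITS=$CPU_MAX_COMPUTE_UNITS"
# ./host_program    (launch your compiled host binary here)
```

The variable must be set in the same environment that launches the program; setting it after the OpenCL runtime has been initialized has no effect.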
In the Windows Task Manager I have 25% CPU load; one core is at 100%, the others are at 0-1%.
Running
int cpun;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
I get cpun=4.
How can I set CPU_MAX_COMPUTE_UNITS?
Originally posted by: Sheeep In the Windows Task Manager I have 25% CPU load; one core is at 100%, the others are at 0-1%.
Running
int cpun;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
I get cpun=4.
How can I set CPU_MAX_COMPUTE_UNITS?
Sheeep,
Please send your complete sample (kernel code and runtime code). I will try it on my system and let you know the issue.
Hi genaganna,
thank you for helping me.
But I can't post the kernel I'm still working on; I'm not allowed to make it public. So here is a very simple kernel instead. It's nonsense, but it shows the same problem. It has a long loop, so it's possible to see the CPU load.
Kernel:
__kernel void test1(__global float * a,__global float * b,__global float * c,__global const int *d){
    int gid=get_global_id(0);
    int gsi=get_global_size(0);
    int step=d[0]/gsi;
    for(int i=0;i<step;i++){
        for(int j=0;j<1000000;j++){
            c[gid+i*gsi]+=a[gid+i*gsi]+b[gid+i*gsi];
        }
    }
}
Host:
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
int main(int argc, char** argv){
cl_int error;
std::string buildlog;
cl::Context context;
cl::Program program;
std::vector<cl::Device> devices;
try{
//get CL platform info
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties platform=NULL;
std::vector<cl::Platform>::iterator i;
if(platforms.size() > 0){
for(i = platforms.begin(); i != platforms.end(); ++i){
platform=((cl_context_properties)(*i)());
if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;
}
}
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
//Creating CL context
context=cl::Context(CL_DEVICE_TYPE_CPU,cprops,NULL,NULL,&error);
//getting Device List
devices=context.getInfo<CL_CONTEXT_DEVICES>();
//creating Commandqueue
cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
//Reading CL program from file
std::ifstream file("BastelKernel.cl"); //kernel file
std::string prog((std::istreambuf_iterator<char>(file)),std::istreambuf_iterator<char>());
cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
//Building CL program for device
program=cl::Program(context,source,&error);
program.build(devices);
//finally Kernels:
cl::Kernel kernel1=cl::Kernel(program,"test1",&error);
//Hostmemory
cl_int wsize=8192;
cl_int worksize;
int cpun;
devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&worksize);
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
std::cout<<"Max Compute Units: "<<cpun<<std::endl;
cl_float *a=new cl_float[wsize];
cl_float *b=new cl_float[wsize];
cl_float *c=new cl_float[wsize];
cl_int *d=new cl_int[1]; d[0]=wsize;
//set host memory (before creating buffers with CL_MEM_USE_HOST_PTR)
for(int i=0;i<wsize;i++){
a[i]=i+0.1f;
b[i]=wsize-i-0.1f;
c[i]=0;
}
//initializing OpenCL buffers (memory objects)
cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * wsize,a,&error);
cl::Buffer CL2=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * wsize,b,&error);
cl::Buffer CL3=cl::Buffer(context,CL_MEM_READ_WRITE|CL_MEM_USE_HOST_PTR,sizeof(c[0]) * wsize,c,&error);
cl::Buffer CL4=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(d[0]),d,&error);
//set Kernel Arguments
kernel1.setArg(0,CL1);
kernel1.setArg(1,CL2);
kernel1.setArg(2,CL3);
kernel1.setArg(3,CL4);
//Running Kernel
clock_t time;
time=clock();
queue.finish();
queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(worksize,1,1),cl::NDRange(worksize,1,1),NULL,NULL); //queue.enqueueNDRangeKernel(kernelname,cl::NullRange,cl::NDRange(array length),cl::NDRange(1,1),NULL,NULL);
queue.enqueueReadBuffer (CL3,CL_TRUE,0,sizeof(c[0])*wsize,c);
queue.finish();
time=clock()-time;
//output
std::cout<<std::endl<<"Result OCL: "<<std::endl;
for(int i=0;i<wsize;i+=100){
std::cout<<c[i]<<" ";
}
std::cout<<std::endl;
std::cout<<std::endl<<"Time OCL: "<<time<<"ms"<<std::endl<<std::endl;
delete[] a; delete[] b; delete[] c; delete[] d;
}catch(cl::Error& error){
std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
}
std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
std::cout<<"Buildlog:"<<std::endl;
buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
std::cout<<buildlog<<std::endl;
std::cout<<"________________________________________________________________________________"<<std::endl;;
return 0;
}
It just runs on one core at 100% load; the others are at 0% load.
On the GPU it runs with 99% GPU load...
Regards,
Sheeep
Originally posted by: Sheeep Hi genaganna,
thank you for helping me.
But I can't post the kernel I'm still working on; I'm not allowed to make it public. So here is a very simple kernel instead. It's nonsense, but it shows the same problem. It has a long loop, so it's possible to see the CPU load.
Sheeep,
You have only one work-group; that is why it is using only one core. You need workGroups >= computeUnits in order to utilize the available cores fully.
I made the following change in your code and it uses my four cores fully:
queue.enqueueNDRangeKernel(kernel1, cl::NullRange, cl::NDRange(wsize, 1, 1), cl::NDRange(worksize, 1, 1), NULL, NULL);
Hi,
thank you for the help...
It works, and very fast... faster than before...