Hi,
I tried to allocate an array inside a kernel.
The array's length is a kernel parameter:
__kernel void array(__global int *length){
int a[length[0]];
(...);
}
When I build the kernel I get an error: clBuildProgram fails with -11 (CL_BUILD_PROGRAM_FAILURE).
I changed the declaration to __local int a[length[0]]; and it works,
but I don't want a __local array in this code.
What can I do?
Sheeep
I think register allocation happens at compile time but the data is only available at kernel execution time, so it won't work.
Also, private-space arrays are currently spilled to global memory, so you are better off with local memory if you want good performance.
Originally posted by: n0thing I think register allocation happens at compile time but the data is only available at kernel execution time, so it won't work.
Also, private-space arrays are currently spilled to global memory, so you are better off with local memory if you want good performance.
Assuming that private variables are still allocated in global memory in the current 2.01 OpenCL implementation, I've found on my 5770 that they don't degrade the performance of my kernels, provided they are small (that is, they hold a single value, not a vector). I think this is because global memory caching works (through the texture caches, if I remember correctly) on 5xxx video cards.
Yes, on 5xxx cards the VFETCH instruction is used, which is cached. But when I read the kernel ISA on a 5xxx card I get LDS_WRITE and LDS_READ instructions, which is IMHO local memory.
Originally posted by: nou Yes, on 5xxx cards the VFETCH instruction is used, which is cached. But when I read the kernel ISA on a 5xxx card I get LDS_WRITE and LDS_READ instructions, which is IMHO local memory.
Are you saying that private variables are emulated in local memory in the current version of the AMD OpenCL implementation?
Originally posted by: Sheeep Hi,
I tried to allocate an array inside a kernel.
The array's length is a kernel parameter:
__kernel void array(__global int *length){
int a[length[0]];
(...);
}
When I build the kernel I get an error: clBuildProgram fails with -11 (CL_BUILD_PROGRAM_FAILURE).
I changed the declaration to __local int a[length[0]]; and it works,
but I don't want a __local array in this code.
What can I do?
Variable-length arrays are not supported, as per the spec.
Could you please give us the kernel that works with a __local array of variable size and gives correct results?
You can pass a variable size to the kernel using #define and compile the kernel with the -D flag.
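For example (a minimal host-side sketch; A_LENGTH and B_LENGTH are illustrative macro names, and the program must be rebuilt whenever the sizes change):

```cpp
#include <cassert>
#include <string>

// Assemble the compiler-options string for clBuildProgram at run time.
// The kernel would then declare e.g. __local int a[A_LENGTH];
std::string buildOptions(int aLength, int bLength) {
    return "-D A_LENGTH=" + std::to_string(aLength) +
           " -D B_LENGTH=" + std::to_string(bLength);
}
```

The resulting string goes into the fourth parameter of clBuildProgram, so the sizes are fixed per program build, not per kernel launch.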
Does a __local array only run on one compute unit? So my local work size should be my global work size?
And when I have two different local arrays, do they run on two different compute units, or do I have to make them run in parallel?
the kernel which works (simplified):
void func(__local int *a,int length){
for(int i=0;i<length;i++){
a[i]+=1;
}
};
__kernel void array(__global int *inarray, __global const int *wlength, __global int *gws){
int gid=get_global_id(0);
gws[0]=get_global_size(0);
__local int a[gws[0]][wlength[0]];
__local int b[gws[0]][wlength[0]];
int lid=get_local_id(0);
for(int i=0;i<wlength[0];i++){
a[lid][i]=inarray[i+gid*wlength[0]];
}
func(a[lid],wlength[0]);
for(int i=0;i<wlength[0];i++){
inarray[i+lid*wlength[0]]=a[lid][i];
}
}
EDIT:
I tried to run the same code on an nvidia device. I get an error: OpenCL does not support variable array lengths.
@genaganna:
You can pass a variable size to the kernel using #define and compile the kernel with the -D flag.
When I #define alength 9, I can't run the same kernel with a different array length. So what can I do to allocate an array with variable length?
I think the only way is using a __global array as kernel parameter, but then I can't use 2D arrays, so I can't run the function in the kernel above.
Does anyone know a way to run a function like this in a parallel way?
What is the -D flag? Can you please explain how I can pass a variable size to the kernel using #define and compile the kernel with the -D flag?
Originally posted by: Sheeep
__local array only runs on one compute unit?
Each compute unit has its own shared (local) memory. Depending on the __local memory requested per work-group, a compute unit is able to run more than one work-group.
You should think of a __local array as per work-group, not per compute unit.
And when i have two different local arrays, do they run on two different compute units or do I have to make them run parallel?
You can have any number of __local arrays per work-group. All of a work-group's __local arrays are allocated on one compute unit, and you don't need to do anything extra if you have more than one __local array. It is exactly the same as having one __local array.
What do you mean by "do I have to make them run parallel"?
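To illustrate the occupancy point with some back-of-the-envelope arithmetic (a sketch; the 32 KiB figure below is an assumption, the real value comes from CL_DEVICE_LOCAL_MEM_SIZE):

```cpp
#include <cassert>
#include <cstddef>

// How many work-groups can reside on one compute unit at once, as far as
// local memory is concerned: the unit's LDS size divided by the total
// __local bytes one work-group declares (all of its __local arrays together).
std::size_t groupsPerComputeUnit(std::size_t ldsBytes, std::size_t bytesPerGroup) {
    assert(bytesPerGroup > 0);
    return ldsBytes / bytesPerGroup; // integer division: whole groups only
}
```

So with an assumed 32 KiB of LDS, a work-group declaring 8 KiB of __local arrays still lets four groups share one compute unit.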
the kernel which works (simplified):
void func(__local int *a,int length){
for(int i=0;i<length;i++){
a[i]+=1;
}
};
__kernel void array(__global int *inarray, __global const int *wlength, __global int *gws){
int gid=get_global_id(0);
gws[0]=get_global_size(0);
__local int a[gws[0]][wlength[0]];
__local int b[gws[0]][wlength[0]];
int lid=get_local_id(0);
for(int i=0;i<wlength[0];i++){
a[lid][i]=inarray[i+gid*wlength[0]];
}
func(a[lid],wlength[0]);
for(int i=0;i<wlength[0];i++){
inarray[i+lid*wlength[0]]=a[lid][i];
}
}
EDIT:
I tried to run the same code on an nvidia device. I get an error: OpenCL does not support variable array lengths.
Thanks for reporting this issue.
I think the only way is using a __global array as kernel parameter, but then I can't use 2D arrays, so I can't run the function in the kernel above.
Does anyone know a way to run a function like this in a parallel way?
You have a pointer, so you can use the same pointer as a 1D array, a 2D array, or an N-dimensional array.
What is the -D flag? Can you please explain how I can pass a variable size to the kernel using #define and compile the kernel with the -D flag?
Define the kernel as follows:
__kernel void func(...)
{
__local float a[A_LENGTH];
__local float b[B_LENGTH];
.....
......
}
Build the above kernel like this:
clBuildProgram(program, 1, devices, "-D A_LENGTH=9 -D B_LENGTH=10", NULL, NULL);
Note: compiler options can be specified through the fourth parameter of clBuildProgram.
Hi,
You have a pointer so you can use same pointer as 1D array, 2D array and also as N-dimensional array.
Can I use a pointer to a 1D array as a pointer to a 2D array?
So sizeof(1D array) = sizeof(outer 2D dimension) * (inner 2D dimension)?
Regards, Sheeep
Originally posted by: Sheeep
Hi,
You have a pointer so you can use same pointer as 1D array, 2D array and also as N-dimensional array.
Can I use a pointer to a 1D array as a pointer to a 2D array?
So sizeof(1D array) = sizeof(outer 2D dimension) * (inner 2D dimension)?
Regards, Sheeep
I have a pointer like this:
float* ptr;
//one-dimensional array access
ptr[i];
//two-dimensional array access
ptr[i * width + j]; //where width is the width of the 2-dimensional array
I hope you are familiar with C/C++ pointers.
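The same indexing, written out as a tiny self-contained sketch (plain C++ here just to show the arithmetic; the names are illustrative):

```cpp
#include <cassert>

// Row-major mapping: element (i, j) of a conceptually 2D array of the
// given width lives at offset i * width + j in the flat 1D buffer.
int flatIndex(int i, int j, int width) {
    return i * width + j;
}
```

So a single __global pointer can carry a 2D layout, as long as host and kernel agree on the width.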
Originally posted by: genaganna
I have a pointer like this:
float* ptr;
//one-dimensional array access
ptr[i];
//two-dimensional array access
ptr[i * width + j]; //where width is the width of the 2-dimensional array
Yes, thanks, that's what I already used above...
I thought you meant a function like malloc or new int[][] in the kernel...
I wondered about this, because I think that is not possible in OpenCL...
Originally posted by: genaganna
I hope you are familiar with C/C++ pointers.
Yes, I am.
Regards, Sheeep
Originally posted by: Sheeep Originally posted by: genaganna
I have a pointer like this:
float* ptr;
//one-dimensional array access
ptr[i];
//two-dimensional array access
ptr[i * width + j]; //where width is the width of the 2-dimensional array
Yes, thanks, that's what I already used above...
I thought you meant a function like malloc or new int[][] in the kernel...
I wondered about this, because I think that is not possible in OpenCL...
The C standard library is not supported in OpenCL.
Dynamic allocation is not possible in a kernel.
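Since malloc is unavailable in a kernel, the usual workaround (a sketch, not something prescribed in this thread) is to allocate one large __global buffer on the host and give every work-item a fixed-size slice of it; the indexing is just:

```cpp
#include <cassert>

// Element i of work-item gid's private slice inside one big preallocated
// buffer, where each work-item owns perItemLen consecutive elements.
int scratchIndex(int gid, int perItemLen, int i) {
    return gid * perItemLen + i;
}
```

The host sizes the buffer as globalSize * perItemLen elements up front, so no allocation ever happens on the device.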
Yes, I know, I have read it in the OpenCL spec.
But I have another question:
I can run this kernel on the GPU device without problems, but on the CPU I get an error if num_group_size is bigger than 6.
__kernel void localWork(__global int *a,__global int *b){
int gid=get_global_id(0);
int lid=get_local_id(0);
__local int la[10]; //10 for example == local_work_size(0)
__local int lb[10];
la[lid]=a[gid];
lb[lid]=la[lid]+1;
b[gid]=lb[lid];
}
And there is no problem running it (on the CPU) with local_size 1 and global_size 100, but with local_size 2 and global_size 200 it crashes. The CPU's max local_work_size is 1024.
Why does it crash on the CPU, but run fine on the GPU?
EDIT:
with local_size 1 and global_size 200 it also runs fine.
Oh no, LDS_READ and LDS_WRITE show up because I am using local memory myself.
Originally posted by: MicahVillmow Fr4nz, arrays declared in the __local address space in a kernel use local memory, arrays not declared in the local address space use emulated private memory.
I knew that; it was nou's post that wasn't clear to me (in fact I was surprised by his sentence).
Using the GPU device I get error -54 if my local work size is bigger than 64.
I think error -54 is a wrong work-group size, but my globalsize = x*localsize...
CL_Info.exe tells me my max work size is 256.
I don't understand why. I can run the kernel with localsize 10 and globalsize 100 fine, but not with localsize 100 and globalsize 1000.
On the CPU it doesn't work correctly if the local work size is not 1. I don't get a CL error, the app crashes.
I think the problem is that I don't understand it completely. But I don't know what I'm doing wrong.
Regards, Sheeep
Kernel:
__kernel void localWork(__global int *a,__global int *b){
int gid=get_global_id(0);
int lid=get_local_id(0);
__local int la[10]; //10 for example == local_work_size(0)
__local int lb[10];
la[lid]=a[gid];
lb[lid]=la[lid]+1;
b[gid]=lb[lid];
}
Originally posted by: Sheeep Using the GPU device I get error -54 if my local work size is bigger than 64.
I think error -54 is a wrong work-group size, but my globalsize = x*localsize...
x must be an integer value.
CL_Info.exe tells me my max work size is 256.
I don't understand why. I can run the kernel with localsize 10 and globalsize 100 fine, but not with localsize 100 and globalsize 1000.
In the kernel below, your local array size is 10 but you are accessing 100 elements with localsize 100. The GPU does not handle exceptions; that is why you do not see a crash.
On the CPU it doesn't work correctly if the local work size is not 1. I don't get a CL error, the app crashes.
I think the problem is that I don't understand it completely. But I don't know what I'm doing wrong.
Regards, Sheeep
Kernel:
__kernel void localWork(__global int *a,__global int *b){
int gid=get_global_id(0);
int lid=get_local_id(0);
__local int la[10]; //10 for example == local_work_size(0)
__local int lb[10];
la[lid]=a[gid];
lb[lid]=la[lid]+1;
b[gid]=lb[lid];
}
Give us your runtime code as well so we can reply quickly.
Originally posted by: genaganna
In the kernel below, your local array size is 10 but you are accessing 100 elements with localsize 100. The GPU does not handle exceptions; that is why you do not see a crash.
Of course, I changed it when I used localsize 100.
My host code:
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
int main(int argc, char** argv){
cl_int error;
std::string buildlog;
cl::Context context;
cl::Program program;
std::vector<cl::Device> devices;
try{
//get CL platform info
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties platform=NULL;
std::vector<cl::Platform>::iterator i;
if(platforms.size() > 0){
for(i = platforms.begin(); i != platforms.end(); ++i){
platform=((cl_context_properties)(*i)());
if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;
}
}
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
//Creating CL Device;
context=cl::Context(CL_DEVICE_TYPE_GPU,cprops,NULL,NULL,&error);
//getting Device List
devices=context.getInfo<CL_CONTEXT_DEVICES>();
//creating Commandqueue
cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
//Reading CL program from file
std::ifstream file("Kernel_localWork2.cl"); //Kernelname
std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
//Building CL program for device
program=cl::Program(context,source,&error);
program.build(devices);
//finally Kernels:
cl::Kernel kernel1=cl::Kernel(program,"localWork2",&error);
//Hostmemory
cl_int local_work_size;
devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&local_work_size); //get max local_work_size
cl_int num_group_size=10;
cl_int global_work_size=num_group_size*local_work_size;
int cpun;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
std::cout<<"MAX_COMPUTE_UNITS: "<<cpun<<std::endl;
cl_int *a=new cl_int[num_group_size];
cl_int *b=new cl_int[num_group_size];
//initializing OpenCL buffers (memory objects)
cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * num_group_size,a,&error);
cl::Buffer CL2=cl::Buffer(context,CL_MEM_WRITE_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * num_group_size,b,&error);
//set host memory
for(int i=0;i<num_group_size;i++){
a[i]=i;
b[i]=0;
}
//set Kernel Arguments
kernel1.setArg(0,CL1);
kernel1.setArg(1,CL2);
//Running Kernel
clock_t time;
time=clock();
queue.finish();
queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(global_work_size,1,1),cl::NDRange(local_work_size,1,1),NULL,NULL);
queue.enqueueReadBuffer (CL2,CL_TRUE,0,sizeof(b[0])*num_group_size,b);
queue.finish();
time=clock()-time;
//output
std::cout<<std::endl<<"OCL result: "<<std::endl<<"";
for(int i=0;i<num_group_size;i++){
std::cout<<b[i]<<" ";
}
std::cout<<std::endl;
std::cout<<std::endl<<"OCL time: "<<time<<"ms"<<std::endl<<std::endl;
delete[] a; delete[] b;
}catch(cl::Error& error){
std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
}
buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
if(buildlog!=""){
std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
std::cout<<"Buildlog:"<<std::endl;
std::cout<<buildlog<<std::endl;
std::cout<<"________________________________________________________________________________"<<std::endl;
}
return 0;
}
and the kernel (same as above):
__kernel void localWork(__global int *a,__global int *b){
int gid=get_global_id(0);
int lid=get_local_id(0);
__local int la[get_local_size(0)];
la[lid]=a[gid];
la[lid]+=1;
b[gid]=la[lid];
}
What is strange:
I tried it on an nvidia device. With globalsize 512 and localsize 256 it works without problems...
I also tried localsize 512 and globalsize 5120 (the nvidia device supports localsize 512). That works, too. But why doesn't it work on the ATI device? Localsize 256 should work, but I can only run it with localsize 64 or lower.
I get the error clEnqueueNDRangeKernel(-54). But globalworksize = 2*localworksize should work, shouldn't it?
When I run the same code on the CPU, it crashes with a windows error (application does not respond) if the local work size is not 1, but it should support localsize 1024.
Edit: There is no problem running localsize 256 if I don't use any local variables. But as soon as one variable is declared as local, it doesn't work. What am I doing wrong?
Edit2: On the CPU it always crashes...
Originally posted by: Sheeep What is strange:
I tried it on an nvidia device. With globalsize 512 and localsize 256 it works without problems...
I also tried localsize 512 and globalsize 5120 (the nvidia device supports localsize 512). That works, too. But why doesn't it work on the ATI device? Localsize 256 should work, but I can only run it with localsize 64 or lower.
I get the error clEnqueueNDRangeKernel(-54). But globalworksize = 2*localworksize should work, shouldn't it?
When I run the same code on the CPU, it crashes with a windows error (application does not respond) if the local work size is not 1, but it should support localsize 1024.
Edit: There is no problem running localsize 256 if I don't use any local variables. But as soon as one variable is declared as local, it doesn't work. What am I doing wrong?
Edit2: On the CPU it always crashes...
Sheeep,
Variable-length arrays are not supported, as per the OpenCL spec. I agree it should give an appropriate error instead of crashing.
Yes, but it's not the problem.
The kernel:
__kernel void localWork(__global int *a,__global int *b){
int gid=get_global_id(0);
int lid=get_local_id(0);
__local int la[256];
la[lid]=a[gid];
la[lid]+=1;
b[gid]=la[lid];
}
is not working either. Same errors: clEnqueueNDRangeKernel(-54) on the GPU and "application does not respond" on the CPU.
Edit:
Even if I just allocate the __local array on the GPU, I get the same error.
Kernels I tried:
__kernel void localWork2(__global int *a,__global int *b,__local int la[256]){
}
or
__kernel void localWork2(__global int *a,__global int *b){
__local int la[256];
}
But what can be wrong with this kernel?
The host code works if I do not allocate a local array...
The max local_size I can use is 64 on the GPU and 1 on the CPU...
Sheeep,
I don't see anything wrong with your kernel. I am able to compile and run it without any issues.
What are your global and local work sizes? Error -54 (CL_INVALID_WORK_GROUP_SIZE) can occur if these sizes are not valid.
Could you post your host side code?
Hi omkaranathan,
I try to run the kernel with local work size 256 and global work size x*256, where x is an integer. It doesn't work. I tried other work sizes: if the local work size is 64 or smaller I can run it; if it is bigger, I can't.
localsize = 64, globalsize = 10*64 works.
If I change localsize to 65, 128, or 256, it does not work. I changed the local array size in the kernel to the local work size. I made sure that localworksize * x = globalworksize, with x an int.
On the nvidia device the same code (and the same binary) does not have any problems. My GPU is a Radeon 4870, and I have a netbook with an nvidia ion.
What is your device?
That's my host code:
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
int main(int argc, char** argv){
cl_int error;
std::string buildlog;
cl::Context context;
cl::Program program;
std::vector<cl::Device> devices;
try{
//get CL platform info
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties platform=NULL;
std::vector<cl::Platform>::iterator i;
if(platforms.size() > 0){
for(i = platforms.begin(); i != platforms.end(); ++i){
platform=((cl_context_properties)(*i)());
if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;
}
}
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
//creating CL device
context=cl::Context(CL_DEVICE_TYPE_GPU,cprops,NULL,NULL,&error);
//getting device list
devices=context.getInfo<CL_CONTEXT_DEVICES>();
//creating command queue
cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
//reading CL program from file
std::ifstream file("Kernel_localWork2.cl"); //kernel file name
std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
//building CL program for device
program=cl::Program(context,source,&error);
program.build(devices);
//finally kernels:
cl::Kernel kernel1=cl::Kernel(program,"localWork2",&error);
//host memory
cl_int local_work_size=64;
//devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&local_work_size); //get max local_work_size
cl_int num_group_size=10;
cl_int global_work_size=num_group_size*local_work_size;
int cpun;
devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
std::cout<<"MAX_COMPUTE_UNITS: "<<cpun<<std::endl;
cl_int *a=new cl_int[num_group_size];
cl_int *b=new cl_int[num_group_size];
//initializing OpenCL buffers (memory objects)
cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * num_group_size,a,&error);
cl::Buffer CL2=cl::Buffer(context,CL_MEM_WRITE_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * num_group_size,b,&error);
//set host memory
for(int i=0;i<num_group_size;i++){
a[i]=i;
b[i]=0;
}
//set kernel arguments
kernel1.setArg(0,CL1);
kernel1.setArg(1,CL2);
kernel1.setArg(2,sizeof(int)*local_work_size,NULL);
//running kernel
clock_t time;
time=clock();
queue.finish();
queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(global_work_size,1,1),cl::NDRange(local_work_size,1,1),NULL,NULL);
queue.enqueueReadBuffer (CL2,CL_TRUE,0,sizeof(b[0])*num_group_size,b);
queue.finish();
time=clock()-time;
//output
std::cout<<std::endl<<"OCL result: "<<std::endl<<"";
for(int i=0;i<num_group_size;i++){
std::cout<<b[i]<<" ";
}
std::cout<<std::endl;
std::cout<<std::endl<<"OCL time: "<<time<<"ms"<<std::endl<<std::endl;
delete[] a; delete[] b;
}catch(cl::Error& error){
std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
}
buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
if(buildlog!=""){
std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
std::cout<<"Buildlog:"<<std::endl;
std::cout<<buildlog<<std::endl;
std::cout<<"________________________________________________________________________________"<<std::endl;
}
return 0;
}
Query the work-group info for your kernel with the CL_KERNEL_WORK_GROUP_SIZE enum and see what it returns. Any work-group size above this number will cause this error.
OK, CL_KERNEL_WORK_GROUP_SIZE = 64,
so it can't run with a local work size greater than 64...
But why is there a difference between CL_KERNEL_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE?
On the CPU they are the same, but on the GPU CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.
EDIT:
Why does it crash on the CPU if num_group_size is greater than 6?
Regards,
Sheeep
CL_KERNEL_WORK_GROUP_SIZE returns the maximum work-group size based on your kernel (resource usage, instructions used), whereas CL_DEVICE_MAX_WORK_GROUP_SIZE is the maximum work-group size allowed for the device.
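Putting the two limits together, a host could pick a legal local size like this (a sketch: kernelMax stands for the value returned by clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE, and OpenCL 1.x requires the global size to be a multiple of the local size):

```cpp
#include <cassert>

// Largest local work size that does not exceed the kernel's limit and
// evenly divides the global size; anything else makes
// clEnqueueNDRangeKernel fail with CL_INVALID_WORK_GROUP_SIZE (-54).
int pickLocalSize(int globalSize, int kernelMax) {
    for (int ls = kernelMax; ls > 1; --ls)
        if (globalSize % ls == 0) return ls;
    return 1; // local size 1 always divides the global size
}
```

For instance, with a kernel limit of 64 and a global size of 1000, the largest usable local size is 50, not 64.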
Thanks for the help, I understand.
What I do not understand is: on the ATI device CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256,
while on the nvidia device CL_KERNEL_WORK_GROUP_SIZE=512 and CL_DEVICE_MAX_WORK_GROUP_SIZE=512.
Why is there a difference between ATI and nvidia? I used the same binary code.
What I do not understand is: on the ATI device CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256,
while on the nvidia device CL_KERNEL_WORK_GROUP_SIZE=512 and CL_DEVICE_MAX_WORK_GROUP_SIZE=512.
Why is there a difference between ATI and nvidia? I used the same binary code.
Different GPUs have different capabilities. Even if you use the same code, the work-group size requirements might differ depending on the device for which your kernel was compiled.
Yes, it's an HD 4870.