
    Allocate Array in kernel

    Sheeep

      Hi,

      I tried to allocate an array in a kernel.

      The array's length is a kernel parameter:

      __kernel void array(__global int *length){

          int a[length[0]];

          (...);

      }

      When I run the kernel I get an error: clBuildProgram fails with -11 (CL_BUILD_PROGRAM_FAILURE).

 

      I changed the declaration to __local int a[length[0]]; and it builds.

      But I don't want a __local array in this code.

      What can I do?

       

      Sheeep

        • Allocate Array in kernel
          n0thing

          I think the register allocation happens at compile time, but the data is only available at kernel execution time, so it won't work.

          Also, private-space arrays are currently spilled to global memory, so you are better off with local memory if you want good performance.
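
          A small sketch of the two fixed-size alternatives mentioned here (the kernel name, sizes, and the assumption of a 1D dispatch with a work-group size of at most 64 are illustrative only):

          // Hypothetical kernel: 'priv' is per-work-item __private scratch (currently
          // emulated in global memory, as noted above); 'tile' is per-work-group
          // __local scratch. Both sizes must be known at compile time.
          __kernel void scratch_example(__global const int *in, __global int *out)
          {
              int priv[16];
              __local int tile[64];

              int gid = get_global_id(0);
              int lid = get_local_id(0);

              priv[0] = in[gid];
              tile[lid] = priv[0] + 1;
              barrier(CLK_LOCAL_MEM_FENCE);   // synchronize the work-group before any cross-item reads of 'tile'
              out[gid] = tile[lid];
          }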

            • Allocate Array in kernel
              Fr4nz

               

              Originally posted by: n0thing I think the register allocation happens at compile time, but the data is only available at kernel execution time, so it won't work.

              Also, private-space arrays are currently spilled to global memory, so you are better off with local memory if you want good performance.


              Assuming that private variables are still allocated in global memory in the current 2.01 OpenCL implementation, I've found on my 5770 that they don't degrade the performance of my kernels, provided they are small (that is, they hold a single value rather than a vector). I think this is because global memory caching works (through the texture caches, if I remember correctly) on 5xxx video cards.

                • Allocate Array in kernel
                  nou

                  Yes, on 5xxx cards the VFETCH instruction is used, which is cached. But when I read the kernel ISA on a 5xxx card I see LDS_WRITE and LDS_READ instructions, which IMHO means local memory.

                    • Allocate Array in kernel
                      Fr4nz

                       

                      Originally posted by: nou Yes, on 5xxx cards the VFETCH instruction is used, which is cached. But when I read the kernel ISA on a 5xxx card I see LDS_WRITE and LDS_READ instructions, which IMHO means local memory.

 

                      Are you saying that private variables are emulated in local memory in the current version of the AMD OpenCL implementation?

                • Allocate Array in kernel
                  genaganna

                   

                  Originally posted by: Sheeep Hi,

                  I tried to allocate an array in a kernel.

                  The array's length is a kernel parameter:

                  __kernel void array(__global int *length){

                      int a[length[0]];

                      (...);

                  }

                  When I run the kernel I get an error: clBuildProgram fails with -11 (CL_BUILD_PROGRAM_FAILURE).

                  I changed the declaration to __local int a[length[0]]; and it builds.

                  But I don't want a __local array in this code.

                  What can I do?

                  Variable-length arrays are not supported, as per the spec.

                  Could you please give us a kernel that works with a variable-size __local array and gives correct results?

                  You can pass the size to the kernel using #define and compile the kernel with the -D flag.
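
                  A related sketch (assuming a 1D dispatch; the kernel and argument names are illustrative): the host can also choose a __local array's size at enqueue time by declaring it as a __local pointer kernel argument and setting only a size, with a NULL data pointer, via clSetKernelArg or cl::Kernel::setArg:

                  // The __local buffer's size is chosen by the host, not hard-coded in the kernel.
                  __kernel void add_one(__global const int *in,
                                        __global int *out,
                                        __local  int *scratch)   // size set with clSetKernelArg
                  {
                      int gid = get_global_id(0);
                      int lid = get_local_id(0);
                      scratch[lid] = in[gid] + 1;
                      barrier(CLK_LOCAL_MEM_FENCE);   // work-group synchronization point
                      out[gid] = scratch[lid];
                  }

                  // Host side: pass only the size in bytes and a NULL pointer, e.g.
                  // clSetKernelArg(kernel, 2, sizeof(cl_int) * local_work_size, NULL);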

                    • Allocate Array in kernel
                      Sheeep

                       

                      A __local array only exists on one compute unit? So my local work size would have to equal my global work size?

                      And when I have two different local arrays, do they end up on two different compute units, or do I have to make them run in parallel?

                      The kernel that works (simplified):

                      void func(__local int *a, int length){
                          for(int i=0; i<length; i++){
                              a[i] += 1;
                          }
                      }
                      __kernel void array(__global int *inarray, __global const int *wlength, __global int *gws){
                          int gid = get_global_id(0);
                          gws[0] = get_global_size(0);
                          __local int a[gws[0]][wlength[0]];
                          __local int b[gws[0]][wlength[0]];
                          int lid = get_local_id(0);
                          for(int i=0; i<wlength[0]; i++){
                              a[lid][i] = inarray[i + gid*wlength[0]];
                          }

                          func(a[lid], wlength[0]);

                          for(int i=0; i<wlength[0]; i++){
                              inarray[i + lid*wlength[0]] = a[lid][i];
                          }
                      }

                       

                       

                      EDIT:

                      I tried to run the same code on an NVIDIA device. I get an error: OpenCL does not support variable array lengths.

                      @genaganna:

                      You can pass the size to the kernel using #define and compile the kernel with the -D flag.

                      When I #define alength 9, I can't run the same kernel with a different array length. So what can I do to allocate an array with variable length?

                      I think the only way is to use a __global array as a kernel parameter, but then I can't use 2D arrays, so I can't run the function in the kernel above.

                      Does anyone know a way to run a function like this in parallel?

                      What is the -D flag? Can you please explain how I can pass a variable size to the kernel using #define and compile the kernel with the -D flag?

                        • Allocate Array in kernel
                          genaganna

                           

                          Originally posted by: Sheeep  

                          A __local array only exists on one compute unit?

                          Each compute unit has shared (local) memory. Depending on how much __local memory each work-group requests, a compute unit can run more than one work-group at a time.

                          You should think of a __local array as per work-group, not per compute unit.

                          And when I have two different local arrays, do they end up on two different compute units, or do I have to make them run in parallel?

                          You can have any number of __local arrays per work-group. All of a work-group's __local arrays are allocated on one compute unit, and you don't need to do anything extra if you have more than one __local array; it is exactly the same as having one __local array.

                          What do you mean by "do I have to make them run in parallel"?

                          The kernel that works (simplified):

                          void func(__local int *a, int length){
                              for(int i=0; i<length; i++){
                                  a[i] += 1;
                              }
                          }
                          __kernel void array(__global int *inarray, __global const int *wlength, __global int *gws){
                              int gid = get_global_id(0);
                              gws[0] = get_global_size(0);
                              __local int a[gws[0]][wlength[0]];
                              __local int b[gws[0]][wlength[0]];
                              int lid = get_local_id(0);
                              for(int i=0; i<wlength[0]; i++){
                                  a[lid][i] = inarray[i + gid*wlength[0]];
                              }
                              func(a[lid], wlength[0]);
                              for(int i=0; i<wlength[0]; i++){
                                  inarray[i + lid*wlength[0]] = a[lid][i];
                              }
                          }

                          EDIT:

                          I tried to run the same code on an NVIDIA device. I get an error: OpenCL does not support variable array lengths.

                          Thanks for reporting this issue.

                          I think the only way is to use a __global array as a kernel parameter, but then I can't use 2D arrays, so I can't run the function in the kernel above.

                          Does anyone know a way to run a function like this in parallel?

                          You have a pointer, so you can use the same pointer as a 1D array, a 2D array, or even an N-dimensional array.

                           



                          What is the -D flag? Can you please explain how I can pass a variable size to the kernel using #define and compile the kernel with the -D flag?

                          Define the kernel as follows:

                          __kernel void func(...)
                          {
                              __local float a[A_LENGTH];
                              __local float b[B_LENGTH];
                              .....
                              ......
                          }

                          Build the above kernel as follows:

                          clBuildProgram(program, 1, devices, "-D A_LENGTH=9 -D B_LENGTH=10", NULL, NULL);

                          Note: compiler options can be specified through the fourth parameter of clBuildProgram.

                            • Allocate Array in kernel
                              Sheeep

                               

                              Hi,

                              You have a pointer, so you can use the same pointer as a 1D array, a 2D array, or even an N-dimensional array.

 

                              I can use a pointer to a 1D array as a pointer to a 2D array?

                              So the size of the 1D array = (outer 2D dimension) * (inner 2D dimension)?

                              Regards, Sheeep

                                • Allocate Array in kernel
                                  genaganna

                                   

                                  Originally posted by: Sheeep  

                                   

                                  Hi,

                                   

                                   

                                   You have a pointer, so you can use the same pointer as a 1D array, a 2D array, or even an N-dimensional array.

                                   I can use a pointer to a 1D array as a pointer to a 2D array?

                                   So the size of the 1D array = (outer 2D dimension) * (inner 2D dimension)?

                                   Regards, Sheeep

                                   I have a pointer like this:

                                   float* ptr;

                                   // One-dimensional array access
                                   ptr[i];

                                   // Two-dimensional array access
                                   ptr[i * width + j];   // where width is the width of the 2D array

                                   

                                  I hope you are familiar with C/C++ pointers.
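
                                   A minimal kernel sketch of that row-major indexing (the kernel name and the assumption that the buffer holds width*height ints are illustrative only):

                                   // Element (row, col) of a flat 1D buffer viewed as a 2D array with
                                   // 'width' columns lives at data[row * width + col].
                                   __kernel void touch_2d(__global int *data, int width, int height)
                                   {
                                       int col = get_global_id(0);   // x index
                                       int row = get_global_id(1);   // y index
                                       if (col < width && row < height) {
                                           data[row * width + col] += 1;
                                       }
                                   }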

                                    • Allocate Array in kernel
                                      Sheeep

                                       

                                      Originally posted by: genaganna 

                                       I have a pointer like this:

                                       float* ptr;

                                       // One-dimensional array access
                                       ptr[i];

                                       // Two-dimensional array access
                                       ptr[i * width + j];   // where width is the width of the 2D array

 

                                       Yes, thanks, that's what I already used above...

                                       I thought you wanted to use a function like malloc or new int[][] in the kernel...

                                       I was surprised, because I think that is not possible in OpenCL...

                                       

                                      Originally posted by: genaganna

                                      I hope you are familiar with C/C++ pointers.

                                       



                                       Yes, I am.


                                       Regards, Sheeep

                                        • Allocate Array in kernel
                                          genaganna

                                           

                                          Originally posted by: Sheeep
                                          Originally posted by: genaganna 

                                           

                                           I have a pointer like this:

                                           float* ptr;

                                           // One-dimensional array access
                                           ptr[i];

                                           // Two-dimensional array access
                                           ptr[i * width + j];   // where width is the width of the 2D array

 

                                           Yes, thanks, that's what I already used above...

                                           I thought you wanted to use a function like malloc or new int[][] in the kernel...

                                           I was surprised, because I think that is not possible in OpenCL...

                                           

                                           The C standard library is not supported in OpenCL.

                                           Dynamic allocation is not possible in a kernel.
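
                                           Since malloc/new are unavailable inside kernels, a common workaround is to allocate one large __global scratch buffer on the host and give each work-item its own slice. A sketch, assuming every work-item needs the same amount of scratch (the kernel and parameter names are illustrative):

                                           // 'scratch' is a host-allocated buffer of at least global_size * per_item ints;
                                           // each work-item touches only its own slice.
                                           __kernel void use_scratch(__global int *scratch, int per_item)
                                           {
                                               int gid = get_global_id(0);
                                               __global int *slice = scratch + (size_t)gid * per_item;
                                               for (int i = 0; i < per_item; ++i) {
                                                   slice[i] = i;   // per-work-item storage, sized by the host
                                               }
                                           }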

                                            • Allocate Array in kernel
                                              Sheeep

                                              Yes, I know, I have read that in the OpenCL spec.

                                              But I have another question:

                                              I can run this kernel on the GPU device without problems, but on the CPU I get an error if num_group_size is bigger than 6.

                                              __kernel void localWork(__global int *a,__global int *b){
                                                  int gid=get_global_id(0);   
                                                  int lid=get_local_id(0);
                                                  __local int la[10];  //10 for example == local_work_size(0)
                                                  __local int lb[10];
                                                  la[lid]=a[gid];
                                                  lb[lid]=la[lid]+1;
                                                  b[gid]=lb[lid];

                                              }

                                              And there is no problem running it (on the CPU) with local_size 1 and global_size 100, but with local_size 2 and global_size 200 it crashes. The CPU's max local work size is 1024.

                                              Why does it crash on the CPU but run fine on the GPU?

                                              EDIT:

                                              With local_size 1 and global_size 200 it also runs fine.

                                               

                                • Allocate Array in kernel
                                  MicahVillmow
                                  Fr4nz,
                                  Arrays declared in the __local address space in a kernel use local memory; arrays not declared in the __local address space use emulated private memory.
                                    • Allocate Array in kernel
                                      nou

                                      Oh no, LDS_READ and LDS_WRITE appear when I use local memory.

                                      • Allocate Array in kernel
                                        Fr4nz

                                         

                                        Originally posted by: MicahVillmow Fr4nz, arrays declared in the __local address space in a kernel use local memory; arrays not declared in the __local address space use emulated private memory.


                                        I knew that; it was nou's post that wasn't clear to me (in fact, I was surprised by his sentence).

                                          • Allocate Array in kernel
                                            Sheeep

                                            Using the GPU device I get error -54 if my local work size is bigger than 64.

                                            I think error -54 means an invalid work-group size, but my global size = x * local size...

                                            CL_Info.exe tells me my max work-group size is 256.

                                            I don't understand why. I can run the kernel with local size 10 and global size 100 fine, but not with local size 100 and global size 1000.

                                            On the CPU it doesn't work correctly if the local work size is not 1. I don't get a CL error; the app crashes.

                                            I think the problem is that I don't understand it completely, but I don't know what I'm doing wrong.

                                            Regards, Sheeep

                                            Kernel:
                                            __kernel void localWork(__global int *a,__global int *b){
                                                int gid=get_global_id(0);   
                                                int lid=get_local_id(0);
                                                __local int la[10];  //10 for example == local_work_size(0)
                                                __local int lb[10];
                                                la[lid]=a[gid];
                                                lb[lid]=la[lid]+1;
                                                b[gid]=lb[lid];

                                            }

                                             

                                              • Allocate Array in kernel
                                                genaganna

                                                 

                                                Originally posted by: Sheeep Using the GPU device I get error -54 if my local work size is bigger than 64.

                                                I think error -54 means an invalid work-group size, but my global size = x * local size...

                                                x must be an integer value.

                                                CL_Info.exe tells me my max work-group size is 256.

                                                I don't understand why. I can run the kernel with local size 10 and global size 100 fine, but not with local size 100 and global size 1000.

                                                    In the kernel below, your local array size is 10 but you are accessing 100 elements if the local size is 100. The GPU does not handle exceptions; that is why you don't see a crash.

                                                On the CPU it doesn't work correctly if the local work size is not 1. I don't get a CL error; the app crashes.

                                                I think the problem is that I don't understand it completely, but I don't know what I'm doing wrong.

                                                Regards, Sheeep

                                                Kernel:
                                                __kernel void localWork(__global int *a, __global int *b){
                                                    int gid = get_global_id(0);
                                                    int lid = get_local_id(0);
                                                    __local int la[10];  // 10 for example == local_work_size(0)
                                                    __local int lb[10];
                                                    la[lid] = a[gid];
                                                    lb[lid] = la[lid] + 1;
                                                    b[gid] = lb[lid];
                                                }

                                                Please also give us your runtime (host) code so we can reply quickly.

                                                  • Allocate Array in kernel
                                                    Sheeep

                                                     

                                                    Originally posted by: genaganna 

                                                       In the kernel below, your local array size is 10 but you are accessing 100 elements if the local size is 100. The GPU does not handle exceptions; that is why you don't see a crash.

                                                     

                                                    Of course, I changed it when I used local size 100.


                                                    My host code:

                                                    #include <cstdio>
                                                    #include <cstdlib>
                                                    #include <fstream>
                                                    #include <iostream>
                                                    #include <vector>
                                                    #include <iterator>
                                                    #define __CL_ENABLE_EXCEPTIONS
                                                    #include <CL\cl.hpp>
                                                    #include <ctime>
                                                    int main(int argc, char** argv){
                                                       
                                                        cl_int error;
                                                        std::string buildlog;
                                                        cl::Context context;
                                                        cl::Program program;
                                                        std::vector<cl::Device> devices;
                                                        try{
                                                            //get CL platform info
                                                            std::vector<cl::Platform> platforms;
                                                                cl::Platform::get(&platforms);
                                                                cl_context_properties platform=NULL;
                                                                 std::vector<cl::Platform>::iterator i;
                                                                if(platforms.size() > 0){
                                                                    for(i = platforms.begin(); i != platforms.end(); ++i){
                                                                        platform=((cl_context_properties)(*i)());
                                                                        if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;   
                                                                    }
                                                                }
                                                            cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
                                                            cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
                                                            //Creating CL Device;
                                                            context=cl::Context(CL_DEVICE_TYPE_GPU,cprops,NULL,NULL,&error);
                                                            //getting Device List
                                                            devices=context.getInfo<CL_CONTEXT_DEVICES>();
                                                            //creating Commandqueue
                                                            cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
                                                            //Reading CL program from file
                                                            std::ifstream file("Kernel_localWork2.cl");    //kernel file name
                                                            std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
                                                            cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
                                                            //Building CL program for device
                                                            program=cl::Program(context,source,&error);
                                                            program.build(devices);
                                                            //finally Kernels:
                                                            cl::Kernel kernel1=cl::Kernel(program,"localWork2",&error);  
                                                           
                                                            //Hostmemory
                                                            cl_int local_work_size;
                                                            devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&local_work_size); //get max local_work_size
                                                            cl_int num_group_size=10;
                                                            cl_int global_work_size=num_group_size*local_work_size;
                                                            int cpun;
                                                            devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
                                                            std::cout<<"MAX_COMPUTE_UNITS: "<<cpun<<std::endl;
                                                            cl_int *a=new cl_int[num_group_size];
                                                            cl_int *b=new cl_int[num_group_size];
                                                            //Initializing OpenCL buffers (memory objects)
                                                            cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * num_group_size,a,&error);
                                                            cl::Buffer CL2=cl::Buffer(context,CL_MEM_WRITE_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * num_group_size,b,&error);
                                                            //Set host memory
                                                            for(int i=0;i<num_group_size;i++){
                                                                a[i]=i;
                                                                b[i]=0;
                                                            }
                                                            //set Kernel Arguments
                                                            kernel1.setArg(0,CL1);
                                                            kernel1.setArg(1,CL2);
                                                            //Running Kernel
                                                            clock_t time;
                                                            time=clock();
                                                                queue.finish();
                                                                    queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(global_work_size,1,1),cl::NDRange(local_work_size,1,1),NULL,NULL);
                                                                    queue.enqueueReadBuffer (CL2,CL_TRUE,0,sizeof(b[0])*num_group_size,b);
                                                                queue.finish();
                                                            time=clock()-time;
                                                            //Output
                                                            std::cout<<std::endl<<"OCL result: "<<std::endl<<"";
                                                            for(int i=0;i<num_group_size;i++){
                                                                std::cout<<b[i]<<"    ";
                                                            }
                                                            std::cout<<std::endl;
                                                            std::cout<<std::endl<<"OCL time: "<<time<<"ms"<<std::endl<<std::endl;
                                                            delete[] a;
                                                            delete[] b;
                                                        }catch(cl::Error& error){
                                                            std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
                                                           
                                                        }
                                                        buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
                                                        if(buildlog!=""){
                                                            std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
                                                            std::cout<<"Buildlog:"<<std::endl;
                                                            std::cout<<buildlog<<std::endl;
                                                            std::cout<<"________________________________________________________________________________"<<std::endl;
                                                        }
                                                        return 0;
                                                    }

                                                     

                                                     And the kernel (same as above):

                                                    __kernel void localWork(__global int *a,__global int *b){
                                                        int gid=get_global_id(0);   
                                                        int lid=get_local_id(0);
                                                        __local int la[get_local_size(0)];
                                                        la[lid]=a[gid];
                                                        la[lid]+=1;
                                                        b[gid]=la[lid];
                                                    }

                                                     


                                                      • Allocate Array in kernel
                                                        Sheeep

                                                        What is strange:

                                                        I tried it on an NVIDIA device. With global size 512 and local size 256 it works without problems...

                                                        I also tried local size 512 and global size 5120 (the NVIDIA device supports local size 512). That works, too. But why doesn't it work on the ATI device? Local size 256 should work, but I can only run it with local size 64 or lower.

                                                        I get the error clEnqueueNDRangeKernel(-54). But global work size = 2 * local work size should work, shouldn't it?

                                                        When I run the same code on the CPU, it crashes with a Windows error ("Application does not respond") if the local work size is not 1, but the CPU should support a local size of 1024.

                                                        Edit: There is no problem running local size 256 if I don't use any local variables. But as soon as one variable is declared as __local, it doesn't work. What am I doing wrong?

                                                        Edit2: On the CPU it always crashes...

                                                         

                                                          • Allocate Array in kernel
                                                            genaganna

                                                             

                                                            Originally posted by: Sheeep What is strange:

                                                            I tried it on an NVIDIA device. With global size 512 and local size 256 it works without problems...

                                                            I also tried local size 512 and global size 5120 (the NVIDIA device supports local size 512). That works, too. But why doesn't it work on the ATI device? Local size 256 should work, but I can only run it with local size 64 or lower.

                                                            I get the error clEnqueueNDRangeKernel(-54). But global work size = 2 * local work size should work, shouldn't it?

                                                            When I run the same code on the CPU, it crashes with a Windows error ("Application does not respond") if the local work size is not 1, but the CPU should support a local size of 1024.

                                                            Edit: There is no problem running local size 256 if I don't use any local variables. But as soon as one variable is declared as __local, it doesn't work. What am I doing wrong?

                                                            Edit2: On the CPU it always crashes...

                                                            Sheeep,

                                                                       Variable-length arrays are not supported, as per the OpenCL spec. I feel it should give an appropriate error instead of crashing.

                                                              • Allocate Array in kernel
                                                                Sheeep

                                                                Yes, but that's not the problem.

                                                                The kernel:

                                                                __kernel void localWork(__global int *a,__global int *b){
                                                                    int gid=get_global_id(0);   
                                                                    int lid=get_local_id(0);
                                                                    __local int la[256];
                                                                    la[lid]=a[gid];
                                                                    la[lid]+=1;
                                                                    b[gid]=la[lid];
                                                                }

                                                                is not working either. Same errors: clEnqueueNDRangeKernel(-54) on the GPU and "Application does not respond" on the CPU.

                                                                Edit:

                                                                Even if I just allocate the __local array on the GPU, I get the same error:

                                                                Kernels I tried:

                                                                __kernel void localWork2(__global int *a,__global int *b,__local int la[256]){

                                                                }

                                                                or

                                                                __kernel void localWork2(__global int *a,__global int *b){
                                                                    __local int la[256];
                                                                }

                                                                But what can be wrong with this kernel?

                                                                The host code works if I do not allocate a local array...

                                                                The max local_size I can use is 64 on the GPU and 1 on the CPU...

                                                                  • Allocate Array in kernel
                                                                    omkaranathan

                                                                    Sheeep,

                                                                    I don't see anything wrong with your kernel. I am able to compile and run it without any issues. 

                                                                    What are your global and local work sizes? Error -54 (CL_INVALID_WORK_GROUP_SIZE) can occur if these sizes are not valid.

                                                                    Could you post your host side code?

                                                                      • Allocate Array in kernel
                                                                        Sheeep

                                                                        Hi omkaranathan,

                                                                        I try to run the kernel with a local work size of 256 and a global work size of x*256, where x is an integer. It doesn't work. I tried other work sizes: with a local work size of 64 or less I can run it; with anything bigger I can't.

                                                                        local size = 64, global size = 10*64 works.

                                                                        If I change the local size to 65, 128, or 256 it does not work. I changed the local array size in the kernel to match the local work size, and I make sure that local work size * x = global work size, where x is an int.

                                                                        On an NVIDIA device the same code (and the same binary) has no problems. My GPU is a Radeon 4870, and I have a netbook with an NVIDIA ION.

                                                                        What is your device?

                                                                        That's my host code:


                                                                         

                                                                        #include <cstdio>
                                                                        #include <cstdlib>
                                                                        #include <fstream>
                                                                        #include <iostream>
                                                                        #include <vector>
                                                                        #include <iterator>
                                                                        #define __CL_ENABLE_EXCEPTIONS
                                                                        #include <CL\cl.hpp>
                                                                        #include <ctime>
                                                                        int main(int argc, char** argv){
                                                                            cl_int error;
                                                                            std::string buildlog;
                                                                            cl::Context context;
                                                                            cl::Program program;
                                                                            std::vector<cl::Device> devices;
                                                                            try{
                                                                                //get CL platform info
                                                                                std::vector<cl::Platform> platforms;
                                                                                cl::Platform::get(&platforms);
                                                                                cl_context_properties platform=NULL;
                                                                                std::vector<cl::Platform>::iterator i;
                                                                                if(platforms.size() > 0){
                                                                                    for(i = platforms.begin(); i != platforms.end(); ++i){
                                                                                        platform=((cl_context_properties)(*i)());
                                                                                        if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;
                                                                                    }
                                                                                }
                                                                                cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
                                                                                cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
                                                                                //Creating CL Device;
                                                                                context=cl::Context(CL_DEVICE_TYPE_GPU,cprops,NULL,NULL,&error);
                                                                                //getting Device List
                                                                                devices=context.getInfo<CL_CONTEXT_DEVICES>();
                                                                                //creating Commandqueue
                                                                                cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
                                                                                //Reading CL program from file
                                                                                std::ifstream file("Kernel_localWork2.cl");    //kernel file name
                                                                                std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
                                                                                cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
                                                                                //Building CL program for device
                                                                                program=cl::Program(context,source,&error);
                                                                                program.build(devices);
                                                                                //finally Kernels:
                                                                                cl::Kernel kernel1=cl::Kernel(program,"localWork2",&error);
                                                                                //Hostmemory
                                                                                cl_int local_work_size=64;
                                                                                //devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&local_work_size); //get max local_work_size
                                                                                cl_int num_group_size=10;
                                                                                cl_int global_work_size=num_group_size*local_work_size;
                                                                                int cpun;
                                                                                devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
                                                                                std::cout<<"MAX_COMPUTE_UNITS: "<<cpun<<std::endl;
                                                                                cl_int *a=new cl_int[num_group_size];
                                                                                cl_int *b=new cl_int[num_group_size];
                                                                                //Initializing OpenCL buffers (memory objects)
                                                                                cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * num_group_size,a,&error);
                                                                                cl::Buffer CL2=cl::Buffer(context,CL_MEM_WRITE_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * num_group_size,b,&error);
                                                                                //Set host memory
                                                                                for(int i=0;i<num_group_size;i++){
                                                                                    a[i]=i;
                                                                                    b[i]=0;
                                                                                }
                                                                                //set Kernel Arguments
                                                                                kernel1.setArg(0,CL1);
                                                                                kernel1.setArg(1,CL2);
                                                                                kernel1.setArg(2,sizeof(int)*local_work_size,NULL);
                                                                                //Running Kernel
                                                                                clock_t time;
                                                                                time=clock();
                                                                                queue.finish();
                                                                                queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(global_work_size,1,1),cl::NDRange(local_work_size,1,1),NULL,NULL);
                                                                                queue.enqueueReadBuffer (CL2,CL_TRUE,0,sizeof(b[0])*num_group_size,b);
                                                                                queue.finish();
                                                                                time=clock()-time;
                                                                                //Output
                                                                                std::cout<<std::endl<<"OCL result: "<<std::endl<<"";
                                                                                for(int i=0;i<num_group_size;i++){
                                                                                    std::cout<<b[i]<<" ";
                                                                                }
                                                                                std::cout<<std::endl;
                                                                                std::cout<<std::endl<<"OCL time: "<<time<<"ms"<<std::endl<<std::endl;
                                                                                delete[] a;
                                                                                delete[] b;
                                                                            }catch(cl::Error& error){
                                                                                std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
                                                                            }
                                                                            buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
                                                                            if(buildlog!=""){
                                                                                std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
                                                                                std::cout<<"Buildlog:"<<std::endl;
                                                                                std::cout<<buildlog<<std::endl;
                                                                                std::cout<<"________________________________________________________________________________"<<std::endl;
                                                                            }
                                                                            return 0;
                                                                        }

                                                                          • Allocate Array in kernel
                                                                            gaurav.garg

                                                                            Query the work-group info for your kernel with the CL_KERNEL_WORK_GROUP_SIZE enum and see what it returns. Any work-group size above this number will cause this error.
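
                                                                            For reference, a minimal sketch of that query using the C API (variable names are illustrative; the cl.hpp wrapper used elsewhere in this thread has an equivalent kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device)):

                                                                            // Ask how large a work-group this particular kernel supports on 'device',
                                                                            // then clamp the requested local size to avoid CL_INVALID_WORK_GROUP_SIZE (-54).
                                                                            size_t kernel_wg_size = 0;
                                                                            clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                                                                                     sizeof(kernel_wg_size), &kernel_wg_size, NULL);
                                                                            if (local_work_size > kernel_wg_size)
                                                                                local_work_size = kernel_wg_size;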

                                                                              • Allocate Array in kernel
                                                                                Sheeep

                                                                                OK, CL_KERNEL_WORK_GROUP_SIZE = 64.

                                                                                So it can't run with a local work size greater than 64...

                                                                                But why is there a difference between CL_KERNEL_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE?

                                                                                On the CPU they are the same, but on the GPU CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.

                                                                                EDIT:

                                                                                Why does it crash on the CPU if num_group_size is greater than 6?

                                                                                Regards, Sheeep

                                                                                  • Allocate Array in kernel
                                                                                    gaurav.garg

                                                                                    CL_KERNEL_WORK_GROUP_SIZE returns the maximum work-group size for your particular kernel (based on its resource usage and the instructions used), whereas CL_DEVICE_MAX_WORK_GROUP_SIZE is the maximum work-group size allowed by the device.

                                                                                      • Allocate Array in kernel
                                                                                        Sheeep

                                                                                        Thanks for the help, I understand.

                                                                                        What I do not understand is: on the ATI device CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.

                                                                                        On the NVIDIA device CL_KERNEL_WORK_GROUP_SIZE=512 and CL_DEVICE_MAX_WORK_GROUP_SIZE=512.

                                                                                        Why is there a difference between ATI and NVIDIA? I used the same binary code.

                                                                                          • Allocate Array in kernel
                                                                                            gaurav.garg

                                                                                             

                                                                                            What I do not understand is: on the ATI device CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.

                                                                                            On the NVIDIA device CL_KERNEL_WORK_GROUP_SIZE=512 and CL_DEVICE_MAX_WORK_GROUP_SIZE=512.

                                                                                            Why is there a difference between ATI and NVIDIA? I used the same binary code.



                                                                                            Different GPUs have different capabilities. Even if you use the same code, work-group size requirements might be different depending on the device on which you have compiled your kernel.

                                                                • Allocate Array in kernel
                                                                  MicahVillmow
                                                                  Sheeep,
                                                                  Is this being run on a 4XXX series card?
                                                                  • Allocate Array in kernel
                                                                    MicahVillmow
                                                                    Sheeep,
                                                                    The problem is that on the 4XXX series, if a barrier is used, we do not run more than a single wavefront per work-group. This is because the 4XXX series barrier is not a hardware barrier but a software barrier, and there are corner cases that can cause it to never work correctly. However, since your kernel does not use a barrier, it should not have triggered the reduction in kernel work-group size. This is something that has been fixed for the next release.