Archives Discussions

Sheeep
Journeyman III

Allocate Array in kernel

Hi,

I tried to allocate an array in a kernel.

The array's length is a kernel parameter:

__kernel void array(__global int *length){

    int a[length[0]];

    (...);

}

When I build the program I get an error: clBuildProgram returns -11 (CL_BUILD_PROGRAM_FAILURE).

 

I changed the declaration of a to __local int a[length[0]], and it works.

But I don't want a __local array in this code.

What can I do?

 

Sheeep

0 Likes
31 Replies
n0thing
Journeyman III

I think the register allocation happens at compile time but the data is available only at kernel execution time, so it won't work.

Also, private-space arrays are currently spilled to global memory, so you are better off with local memory if you want good performance.

0 Likes

Originally posted by: n0thing I think the register allocation is at compile time but the data is available at kernel execution time only, so it won't work.

 

 Also, private space arrays are spilled out to global memory right now so you are better off with local memory if you want good performance.



Assuming that private variables are still allocated in global memory in the current 2.01 OpenCL implementation, I've found on my 5770 that they don't degrade the performance of my kernels, provided they are small (that is, they hold a single value and not a vector). I think this is because global memory caching works (through the texture caches, if I remember correctly) on 5xxx video cards.

0 Likes

Yes, on 5xxx cards the VFETCH instruction is used, which is cached. But when I read the kernel ISA on a 5xxx card I get LDS_WRITE and LDS_READ instructions, which is IMHO local memory.

0 Likes
Fr4nz
Journeyman III

Originally posted by: nou yes on 5xxx card is used VFETCH instruction which is cached. but when i read kernel ISA on 5xxx card i get LDS_WRITE and LDS_READ instruction which is IMHO local memory.

 

 

Are you affirming that private variables are emulated in local memory in the current version of the AMD OpenCL implementation?

0 Likes
genaganna
Journeyman III

Originally posted by: Sheeep Hi,

i tried to allocate an array in the a kernel.

the array's length is a kernel parameter:

__kernel void array(__global int *length){

    int a[length[0]];

                  (...);

}

when I run the kernel I get an error: clBuildProgram returns -11.

i changed type of a to __local int a[length[0]]. and it works.

but i don't want a __local array in this code.

What can i do?

Variable length arrays are not supported, as per the spec.

Could you please give us the kernel that works with a __local array of variable size and gives correct results?

You can pass the variable size to the kernel using #define and compile the kernel with the -D flag.

0 Likes

 

A __local array only exists on one compute unit? So my local worksize should be my global worksize?

And when I have two different local arrays, do they run on two different compute units, or do I have to make them run in parallel?

the kernel which works (simplified):

void func(__local int *a, int length){
    for(int i = 0; i < length; i++){
        a[i] += 1;
    }
}
__kernel void array(__global int *inarray, __global const int *wlength, __global int *gws){
    int gid = get_global_id(0);
    gws[0] = get_global_size(0);
    __local int a[gws[0]][wlength[0]];
    __local int b[gws[0]][wlength[0]];
    int lid = get_local_id(0);
    for(int i = 0; i < wlength[0]; i++){
        a[lid][i] = inarray[i + gid*wlength[0]];
    }

    func(a[lid], wlength[0]);

    for(int i = 0; i < wlength[0]; i++){
        inarray[i + lid*wlength[0]] = a[lid][i];
    }
}

 

 

EDIT:

I tried to run the same code on an nvidia device. I get an error: OpenCL does not support variable array lengths.

@genaganna:

You can pass variable size to kernel using #define and compile kernel with -D flag.

When I #define alength 9, I can't run the same kernel with different array lengths. So what can I do to allocate an array with variable length?

I think the only way is to use a __global array as a kernel parameter, but then I can't use 2D arrays, so I can't run the function in the kernel above.

Does anyone know a way to run a function like this in a parallel way?

What is the -D flag? Can you please explain how I can pass a variable size to the kernel using #define and compile the kernel with the -D flag?

0 Likes

Originally posted by: Sheeep  

__local array only runs on one compute unit?

Each compute unit has shared memory. Based on the __local array request per work-group, a compute unit is able to run more than one workgroup.

You should think of a __local array as per workgroup, not per compute unit.

And when i have two different local arrays, do they run on two different compute units or do I have to make them run parallel?

You can have any number of __local arrays per work-group. All __local arrays are allocated on one compute unit, and you don't need to do anything extra if you have more than one __local array. It is exactly the same as having one __local array.

What do you mean by "do I have to make them run parallel"?

the kernel which works (simplified):

void func(__local int *a, int length){
    for(int i = 0; i < length; i++){
        a[i] += 1;
    }
}
__kernel void array(__global int *inarray, __global const int *wlength, __global int *gws){
    int gid = get_global_id(0);
    gws[0] = get_global_size(0);
    __local int a[gws[0]][wlength[0]];
    __local int b[gws[0]][wlength[0]];
    int lid = get_local_id(0);
    for(int i = 0; i < wlength[0]; i++){
        a[lid][i] = inarray[i + gid*wlength[0]];
    }
    func(a[lid], wlength[0]);

    for(int i = 0; i < wlength[0]; i++){
        inarray[i + lid*wlength[0]] = a[lid][i];
    }
}

EDIT:

I tried to run the same code on nvidia device. I get an error, opencl does not support variable array length.

Thanks for reporting this issue. 

I think the only way is to use a __global array as a kernel parameter, but then I can't use 2D arrays, so I can't run the function in the kernel above.

Does anyone know a way to run a function like this in a parallel way?

You have a pointer, so you can use the same pointer as a 1D array, a 2D array, and also as an N-dimensional array.

 



what is the -D flag? can you please explain me, how I can pass variable size to kernel using #define and compile kernel with -D flag.

Define kernel as follows

  __kernel void func(...)

{

  __local float a[A_LENGTH];

  __local float b[B_LENGTH];

  .....

  ......

}

 

Build above kernel like following

clBuildProgram(program, 1, devices, "-D A_LENGTH=9 -D B_LENGTH=10", NULL, NULL);

Note: compiler options can be specified through the fourth parameter of clBuildProgram.

0 Likes

 

Hi,

You have a pointer so you can use same pointer as 1D array, 2D array and also as N-dimensional array.

 



I can use a pointer to a 1D array as a pointer to a 2D array?

So the 1D array's size = (outer 2D dimension) * (inner 2D dimension)?

MFG Sheeep

0 Likes

Originally posted by: Sheeep

Hi,

You have a pointer so you can use same pointer as 1D array, 2D array and also as N-dimensional array.

I can uses a pointer to a 1d array as a pointer to a 2d array?

so sizeof(1d array)=sizeof (outer 2d array)*(inner 2d array)?

MFG Sheeep

I have pointer like this

  float* ptr;

  //One dimensional array access
  ptr[i];

  //Two dimensional array access
  ptr[i * width + j];  //Where width is the width of the 2 dimensional array

 

I hope you are familiar with C/C++ pointers.

0 Likes

Originally posted by: genaganna 

I have pointer like this

  float* ptr;

  //One dimensional array access
  ptr[i];

  //Two dimensional array access
  ptr[i * width + j];  //Where width is width of 2 dimensional array



Yes, thanks, that's what I already used above...

I thought you wanted to use a function like malloc or new int[][] in the kernel...

I was wondering about this, because I think that is not possible in OpenCL...

Originally posted by: genaganna

I hope you are familiar with C/C++ pointers.



yes i am


MFG Sheeep

0 Likes

Originally posted by: Sheeep
Originally posted by: genaganna

I have pointer like this

  float* ptr

  //One dimensional array access
  ptr[i];

  //Two dimensional array access
  ptr[i * width + j]; //Where width is width of 2 dimensional array

Yes, thanks, thats what i already used above...

I thought you want to use a function like malloc or new int[][] in kernel...

i wonder about this, because i think that is not possible in opencl...

 

The C standard library is not supported in OpenCL.

Dynamic allocation is not possible in kernel.

0 Likes

Yes, I know, I have read it in the OpenCL specs.

But I have another question:

I can run this kernel on the GPU device without problems, but on the CPU I get an error if num_group_size is bigger than 6.

__kernel void localWork(__global int *a,__global int *b){
    int gid=get_global_id(0);   
    int lid=get_local_id(0);
    __local int la[10];  //10 for example == local_work_size(0)
    __local int lb[10];
    la[lid]=a[gid];
    lb[lid]=la[lid]+1;
    b[gid]=lb[lid];

}

And there is no problem running it (on the CPU) with local_size 1 and global_size 100, but with local_size 2 and global_size 200 it crashes. The CPU's max local worksize is 1024.

Why does it crash on the CPU, but run fine on the GPU?

EDIT:

With local_size 1 and global_size 200 it also runs fine.

 

0 Likes

Fr4nz,
arrays declared in the __local address space in a kernel use local memory, arrays not declared in the local address space use emulated private memory.
0 Likes

Oh no, the LDS_READ and LDS_WRITE appear when I use local memory.

0 Likes

Originally posted by: MicahVillmow Fr4nz, arrays declared in the __local address space in a kernel use local memory, arrays not declared in the local address space use emulated private memory.


I knew that; it was nou's post that wasn't clear to me (in fact I was surprised by his sentence).

0 Likes

Using the GPU device I get error -54 if my localworksize is bigger than 64.

I think error -54 means an invalid work-group size, but my globalsize = x*localsize...

CL_Info.exe tells me my max worksize is 256.

I don't understand why. I can run the kernel with localsize 10 and globalsize 100 fine, but not with localsize 100 and globalsize 1000.

On the CPU it doesn't work correctly if the localworksize is not 1. I don't get a CL error; the app crashes.

I think the problem is that I don't understand it completely. But I don't know what I am doing wrong.

MFG Sheeep

Kernel:
__kernel void localWork(__global int *a,__global int *b){
    int gid=get_global_id(0);   
    int lid=get_local_id(0);
    __local int la[10];  //10 for example == local_work_size(0)
    __local int lb[10];
    la[lid]=a[gid];
    lb[lid]=la[lid]+1;
    b[gid]=lb[lid];

}

 

0 Likes

Originally posted by: Sheeep Using GPU device I get error -54 if my localworksize is bigger than 64.

 

I think errer 54 is wrong num_group_size, but my globalsize=x*localsize...

x must be an integer value.

 

CL_Info.exe tells my max worksize is 256.

 

I don't understand why. I can run the kernel with localsize 10 and globalsize 100 fine, but not with localsize 100 and globalsize 1000.

    In the kernel below, your local array size is 10 but you are accessing 100 elements if the localsize is 100. The GPU does not handle exceptions; that is why you do not see a crash.

 

 

On the CPU it doesn't work correctly if the localworksize is not 1. I don't get a CL-error, the app crashes.

 

I think problem is I don't understand it complete. But I don't know what I do wrong.

 

MFG Sheeep

 

Kernel:

__kernel void localWork(__global int *a,__global int *b){
    int gid=get_global_id(0);
    int lid=get_local_id(0);
    __local int la[10];  //10 for example == local_work_size(0)
    __local int lb[10];
    la[lid]=a[gid];
    lb[lid]=la[lid]+1;
    b[gid]=lb[lid];
}

 

 

Give us your run-time (host) code as well, so we can reply quickly.

0 Likes

Originally posted by: genaganna 

   In below kernel, your local array size is 10 but you are accessing 100 elements if localsize 100.  GPU donot handle exceptions that is why you donot see crash.

 

Of course, I changed it when I used localsize 100.


My host code:

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
int main(int argc, char** argv){
   
    cl_int error;
    std::string buildlog;
    cl::Context context;
    cl::Program program;
    std::vector<cl::Device> devices;
    try{
        //get CL platform info
        std::vector<cl::Platform> platforms;
            cl::Platform::get(&platforms);
            cl_context_properties platform=NULL;
             std::vector<cl::Platform>::iterator i;
            if(platforms.size() > 0){
                for(i = platforms.begin(); i != platforms.end(); ++i){
                    platform=((cl_context_properties)(*i)());
                    if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;   
                }
            }
        cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
        cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
        //Creating CL Device;
        context=cl::Context(CL_DEVICE_TYPE_GPU,cprops,NULL,NULL,&error);
        //getting Device List
        devices=context.getInfo<CL_CONTEXT_DEVICES>();
        //creating Commandqueue
        cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
        //Reading CL Programm from file
        std::ifstream file("Kernel_localWork2.cl");    //Kernelname
        std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
        cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
        //Building CL Programm for Device
        program=cl::Program(context,source,&error);
        program.build(devices);
        //finally Kernels:
        cl::Kernel kernel1=cl::Kernel(program,"localWork2",&error);  
       
        //Hostmemory
        cl_int local_work_size;
        devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&local_work_size); //get max local_work_size
        cl_int num_group_size=10;
        cl_int global_work_size=num_group_size*local_work_size;
        int cpun;
        devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
        std::cout<<"MAX_COMPUTE_UNITS: "<<cpun<<std::endl;
        cl_int *a=new cl_int[num_group_size];
        cl_int *b=new cl_int[num_group_size];
        //initialing OpenCL Buffer(MemoryObjects)
        cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * num_group_size,a,&error);
        cl::Buffer CL2=cl::Buffer(context,CL_MEM_WRITE_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * num_group_size,b,&error);
        //set Hostmemory
        for(int i=0;i<num_group_size;i++){
            a[i]=i;
            b[i]=0;
        }
        //set Kernel Arguments
        kernel1.setArg(0,CL1);
        kernel1.setArg(1,CL2);
        //Running Kernel
        clock_t time;
        time=clock();
            queue.finish();
                queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(global_work_size,1,1),cl::NDRange(local_work_size,1,1),NULL,NULL);
                queue.enqueueReadBuffer (CL2,CL_TRUE,0,sizeof(b[0])*num_group_size,b);
            queue.finish();
        time=clock()-time;
        //Ausgabe
        std::cout<<std::endl<<"Ergebnis OCL: "<<std::endl<<"";
        for(int i=0;i<num_group_size;i++){
            std::cout<<b[i]<<"    ";
        }
        std::cout<<std::endl;
        std::cout<<std::endl<<"Zeit OCL: "<<time<<"ms"<<std::endl<<std::endl;
        delete[] a; delete[] b;
    }catch(cl::Error& error){
        std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
       
    }
    buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
    if(buildlog!=""){
        std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
        std::cout<<"Buildlog:"<<std::endl;
        std::cout<<buildlog<<std::endl;
        std::cout<<"________________________________________________________________________________"<<std::endl;
    }
    return 0;
}

and the Kernel (same above):

__kernel void localWork(__global int *a,__global int *b){
    int gid=get_global_id(0);   
    int lid=get_local_id(0);
    __local int la[get_local_size(0)];
    la[lid]=a[gid];
    la[lid]+=1;
    b[gid]=la[lid];
}

 


0 Likes

Strange is:

I tried it on an nvidia device. With globalsize 512 and localsize 256 it works without problems...

I also tried localsize 512 and globalsize 5120 (the nVidia device supports localsize 512). It works, too. But why doesn't it work on the ATI device? Localsize 256 should work, but I can only run it with localsize 64 or lower.

I get the error clEnqueueNDRangeKernel(-54). But globalworksize = 2*localworksize should work, shouldn't it?

When I run the same code on the CPU, it crashes with a Windows error ("Application does not respond") if the localworksize is not 1. But it should support localsize 1024.

Edit: There is no problem running localsize 256 if I don't use any local variables. But when one variable is declared as local, it doesn't work. What am I doing wrong?

Edit2: On the CPU it always crashes...

 

0 Likes

Originally posted by: Sheeep strange is:

I tried it on an nvidia device. I used globalsize 512 and localsize 256  it works without problems...

I also tried localsize 512 and globalsize 5120 (nVidia device supports localsize 512). It works, too. But why doesn't it work on ATI device? Localsize 256 should work, but I can only run it with localsize 64 or lower.

I get the error  clEnqueueNDRangeKernel(-54). But globalworksize=2*localworksize should work, doesn't it?

When I run the same code on CPU, it crashs with windows error (Applicaton does not respond), if localworksize not 1. but it should support localsize 1024.

Edit: There is no problem to run localsize 256, if I don't use any local variables. But when there is one variable declared as local, it doesn't work. But what do I do wrong?

Edit2: On CPU it always crash....

Sheeep,

Variable length arrays are not supported, as per the OpenCL spec. I feel it should give an appropriate error instead of crashing.

0 Likes

Yes, but that's not the problem.

The kernel:

__kernel void localWork(__global int *a,__global int *b){
    int gid=get_global_id(0);   
    int lid=get_local_id(0);
    __local int la[256];
    la[lid]=a[gid];
    la[lid]+=1;
    b[gid]=la[lid];
}

is not working either. Same errors: clEnqueueNDRangeKernel(-54) on the GPU and "Application does not respond" on the CPU.

Edit:

Even if I just allocate the __local array on the GPU I get the same error.

Kernels I tried:

__kernel void localWork2(__global int *a,__global int *b,__local int la[256]){

}

or

__kernel void localWork2(__global int *a,__global int *b){ __local int la[256];
}

But what can be wrong with this kernel?

The host code works if I do not allocate a local array...

The max local_size I can use is 64 on the GPU and 1 on the CPU...

0 Likes

Sheeep,

I don't see anything wrong with your kernel. I am able to compile and run it without any issues.

What are your global and local work sizes? Error -54, CL_INVALID_WORK_GROUP_SIZE, can occur if these sizes are not proper.

Could you post your host side code?

0 Likes

Hi omkaranathan,

I try to run the kernel with localworksize 256 and globalworksize x*256, where x is an integer. It doesn't work. I tried other worksizes: if the localworksize is 64 or smaller I can run it; if the localworksize is bigger, I can't.

localsize = 64, globalsize = 10*64 works.

If I change localsize to 65 or 128 or 256, it does not work. I changed the local array size in the kernel to the localworksize, and I paid attention that localworksize * x = globalworksize, with x an int.

On the nvidia device the same code (and same binary) does not have any problems. My GPU is a Radeon 4870, and I have a netbook with an nvidia ion.

What is your device?

That's my host code:


 

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <iterator>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <ctime>
int main(int argc, char** argv){
    cl_int error;
    std::string buildlog;
    cl::Context context;
    cl::Program program;
    std::vector<cl::Device> devices;
    try{
        //get CL platform info
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        cl_context_properties platform=NULL;
        std::vector<cl::Platform>::iterator i;
        if(platforms.size() > 0){
            for(i = platforms.begin(); i != platforms.end(); ++i){
                platform=((cl_context_properties)(*i)());
                if(!strcmp((*i).getInfo<CL_PLATFORM_VENDOR>().c_str(), "Advanced Micro Devices, Inc."))break;
            }
        }
        cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, platform, 0 };
        cl_context_properties *cprops =(platform==NULL) ? NULL : cps;
        //Creating CL Device
        context=cl::Context(CL_DEVICE_TYPE_GPU,cprops,NULL,NULL,&error);
        //getting Device List
        devices=context.getInfo<CL_CONTEXT_DEVICES>();
        //creating Commandqueue
        cl::CommandQueue queue=cl::CommandQueue(context,devices[0]);
        //Reading CL Programm from file
        std::ifstream file("Kernel_localWork2.cl");    //Kernelname
        std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>()));
        cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()));
        //Building CL Programm for Device
        program=cl::Program(context,source,&error);
        program.build(devices);
        //finally Kernels:
        cl::Kernel kernel1=cl::Kernel(program,"localWork2",&error);
        //Hostmemory
        cl_int local_work_size=64;
        //devices[0].getInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE,&local_work_size); //get max local_work_size
        cl_int num_group_size=10;
        cl_int global_work_size=num_group_size*local_work_size;
        int cpun;
        devices[0].getInfo(CL_DEVICE_MAX_COMPUTE_UNITS,&cpun);
        std::cout<<"MAX_COMPUTE_UNITS: "<<cpun<<std::endl;
        cl_int *a=new cl_int[num_group_size];
        cl_int *b=new cl_int[num_group_size];
        //initialing OpenCL Buffer(MemoryObjects)
        cl::Buffer CL1=cl::Buffer(context,CL_MEM_READ_ONLY |CL_MEM_USE_HOST_PTR,sizeof(a[0]) * num_group_size,a,&error);
        cl::Buffer CL2=cl::Buffer(context,CL_MEM_WRITE_ONLY |CL_MEM_USE_HOST_PTR,sizeof(b[0]) * num_group_size,b,&error);
        //set Hostmemory
        for(int i=0;i<num_group_size;i++){
            a[i]=i;
            b[i]=0;
        }
        //set Kernel Arguments
        kernel1.setArg(0,CL1);
        kernel1.setArg(1,CL2);
        kernel1.setArg(2,sizeof(int)*local_work_size,NULL);
        //Running Kernel
        clock_t time;
        time=clock();
        queue.finish();
        queue.enqueueNDRangeKernel(kernel1,cl::NullRange,cl::NDRange(global_work_size,1,1),cl::NDRange(local_work_size,1,1),NULL,NULL);
        queue.enqueueReadBuffer (CL2,CL_TRUE,0,sizeof(b[0])*num_group_size,b);
        queue.finish();
        time=clock()-time;
        //Ausgabe
        std::cout<<std::endl<<"Ergebnis OCL: "<<std::endl<<"";
        for(int i=0;i<num_group_size;i++){
            std::cout<<b[i]<<"    ";
        }
        std::cout<<std::endl;
        std::cout<<std::endl<<"Zeit OCL: "<<time<<"ms"<<std::endl<<std::endl;
        delete[] a; delete[] b;
    }catch(cl::Error& error){
        std::cout<<"OpenCL-Error: "<<error.what()<<"("<<error.err()<<")"<<std::endl<<std::endl;
    }
    buildlog=program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
    if(buildlog!=""){
        std::cout<<std::endl<<"________________________________________________________________________________"<<std::endl;
        std::cout<<"Buildlog:"<<std::endl;
        std::cout<<buildlog<<std::endl;
        std::cout<<"________________________________________________________________________________"<<std::endl;
    }
    return 0;
}

0 Likes

Query the work-group info for your kernel with the CL_KERNEL_WORK_GROUP_SIZE enum and see what it returns. Any work-group size above this number will cause this error.

0 Likes

OK, CL_KERNEL_WORK_GROUP_SIZE = 64,

so it can't run with a local work size greater than 64...

But why is there a difference between CL_KERNEL_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_GROUP_SIZE?

On the CPU they are the same, but on the GPU CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.

EDIT:

Why does it crash on the CPU if num_group_size is greater than 6?

MFG

SHEEEP

0 Likes

CL_KERNEL_WORK_GROUP_SIZE returns max work-group size based on your kernel (resource usage, instructions used), whereas CL_DEVICE_MAX_WORK_GROUP_SIZE is the maximum work-group size allowed for the device.

0 Likes

Thanks for the help, I understand.

 

What I do not understand is: on the ATI device CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.

On nVidia, CL_KERNEL_WORK_GROUP_SIZE=512 and CL_DEVICE_MAX_WORK_GROUP_SIZE=512.

Why is there a difference between ATI and Nvidia? I used the same binary code.

0 Likes

What I do not understand is: on ATI Device is CL_KERNEL_WORK_GROUP_SIZE=64 and CL_DEVICE_MAX_WORK_GROUP_SIZE=256.

On nVidia is CL_KERNEL_WORK_GROUP_SIZE=512 and CL_DEVICE_MAX_WORK_GROUP_SIZE=512.

Why is there a difference between ATI and Nvidia? I used same Binary Code



Different GPUs have different capabilities. Even if you use the same code, work-group size requirements might be different depending on the device on which you have compiled your kernel.

0 Likes

Sheeep,
Is this being run on a 4XXX series of cards?
0 Likes

Yes, it's an HD 4870.

0 Likes

Sheeep,
The problem is that on the 4XXX series, if a barrier is used, then we do not run more than a single wavefront per work-group. This is because the 4XXX series barrier is not a hardware barrier but a software barrier and there are corner cases that can cause it to never work correctly. However, since your kernel does not use barrier, it should not have triggered the reduction in kernel work group size. This is something that has been fixed for the next release.
0 Likes