Archives Discussions

shreedhar_pawar · ‎03-17-2013

In a simple vector addition program I have set the Array_Size of both arrays a and b to 131072 and declared them as unsigned int. I have also set the local work size to 512 and the global work size equal to Array_Size. The program stops partway through and a dialog box appears saying:

"Unhandled exception at 0x011dbd27 in helloworld1.exe: 0xC00000FD: Stack overflow. "

If Array_Size is 65536 or a smaller power of 2, the code runs. What exactly is the problem? Is it that the number of work-groups for a particular GPU device can't exceed a certain value (in my case 65536/512 = 128)? If not, then what is the problem?

Thanks in Advance.

himanshu_gautam · ‎03-18-2013

Hi,

A local size of 512? That is not supported on AMD GPUs. Are you running it on the CPU?

131072 is 2^17 uints, that is 2^19 bytes (512 KB), which is not huge.

Can you share the code? Can you try comparing your code with Template SDK Example?
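
The size arithmetic above covers a single array. As a rough sketch, the same calculation can be extended to several arrays of the same size and compared against a stack limit. The helper names and the 1 MiB limit (the default stack reserve for programs linked with MSVC) are assumptions for illustration, not details taken from the post above.

```cpp
#include <cstddef>

// Assumption: unsigned int is 4 bytes, as in the 2^17 uints = 2^19 bytes figure above.
static_assert(sizeof(unsigned int) == 4, "sketch assumes 4-byte unsigned int");

// Assumption: 1 MiB stack limit (the MSVC default reserve); not stated in the thread.
constexpr std::size_t kAssumedStackBytes = 1024 * 1024;

// Bytes needed by `count` arrays of `elems` unsigned ints each.
constexpr std::size_t arrays_bytes(std::size_t count, std::size_t elems)
{
    return count * elems * sizeof(unsigned int);
}

// True if the arrays alone already exceed the assumed stack limit.
constexpr bool exceeds_stack(std::size_t count, std::size_t elems)
{
    return arrays_bytes(count, elems) > kAssumedStackBytes;
}
```

Under these assumptions, one array of 131072 elements needs 512 KiB, three such arrays need 1.5 MiB and exceed the limit, and three arrays of 65536 elements need 768 KiB and fit.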

shreedhar_pawar · ‎03-18-2013

Hi Himanshu,

Firstly I want to thank you for solving many of my doubts.

I am using an Nvidia Quadro FX 880. The maximum local work size supported is 512; I verified this using

"clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &workgroup_size, NULL)", and my code works correctly with a local work size of 512.

My question is about the maximum global work size: is there such a thing as a maximum number of work-groups or a maximum global work size for a particular device? If not, why does the error appear? I have posted the host code as well as the kernel code below.

#include <iostream>

#include <fstream>

#include <sstream>

#include <cstdio>

#include <ctime>

#ifdef __APPLE__

#include <OpenCL/cl.h>

#else

#include <CL/cl.h>

#endif

int x;

///

// Constants

//

const unsigned int ARRAY_SIZE =131072;

///

// Create an OpenCL context on the first available platform using

// either a GPU or CPU depending on what is available.

//

cl_context CreateContext()

{

cl_int errNum;

cl_uint numPlatforms;

cl_platform_id firstPlatformId;

cl_context context = NULL;

// First, select an OpenCL platform to run on. For this example, we

// simply choose the first available platform. Normally, you would

// query for all available platforms and select the most appropriate one.

errNum = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);

if (errNum != CL_SUCCESS || numPlatforms <= 0)

{

std::cerr << "Failed to find any OpenCL platforms." << std::endl;

return NULL;

}

// Next, create an OpenCL context on the platform. Attempt to

// create a GPU-based context, and if that fails, try to create

// a CPU-based context.

cl_context_properties contextProperties[] =

{

CL_CONTEXT_PLATFORM,

(cl_context_properties)firstPlatformId,

0

};

context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU,

NULL, NULL, &errNum);

if (errNum != CL_SUCCESS)

{

std::cout << "Could not create GPU context, trying CPU..." << std::endl;

context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_CPU,

NULL, NULL, &errNum);

if (errNum != CL_SUCCESS)

{

std::cerr << "Failed to create an OpenCL GPU or CPU context." << std::endl;

return NULL;

}

}

return context;

}

///

// Create a command queue on the first device available on the

// context

//

cl_command_queue CreateCommandQueue(cl_context context, cl_device_id *device)

{

cl_int errNum;

cl_device_id *devices;

cl_command_queue commandQueue = NULL;

size_t deviceBufferSize = -1;

// First get the size of the devices buffer

errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceBufferSize);

if (errNum != CL_SUCCESS)

{

std::cerr << "Failed call to clGetContextInfo(...,CL_CONTEXT_DEVICES,...)";

return NULL;

}

if (deviceBufferSize <= 0)

{

std::cerr << "No devices available.";

return NULL;

}

// Allocate memory for the devices buffer

devices = new cl_device_id[deviceBufferSize / sizeof(cl_device_id)];

errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceBufferSize, devices, NULL);

if (errNum != CL_SUCCESS)

{

delete [] devices;

std::cerr << "Failed to get device IDs";

return NULL;

}

// In this example, we just choose the first available device. In a

// real program, you would likely use all available devices or choose

// the highest performance device based on OpenCL device queries

commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);

if (commandQueue == NULL)

{

delete [] devices;

std::cerr << "Failed to create commandQueue for device 0";

return NULL;

}

*device = devices[0];

delete [] devices;

return commandQueue;

}

///

// Create an OpenCL program from the kernel source file

//

cl_program CreateProgram(cl_context context, cl_device_id device, const char* fileName)

{

cl_int errNum;

cl_program program;

std::ifstream kernelFile(fileName, std::ios::in);

if (!kernelFile.is_open())

{

std::cerr << "Failed to open file for reading: " << fileName << std::endl;

return NULL;

}

std::ostringstream oss;

oss << kernelFile.rdbuf();

std::string srcStdStr = oss.str();

const char *srcStr = srcStdStr.c_str();

program = clCreateProgramWithSource(context, 1,

(const char**)&srcStr,

NULL, NULL);

if (program == NULL)

{

std::cerr << "Failed to create CL program from source." << std::endl;

return NULL;

}

errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

if (errNum != CL_SUCCESS)

{

// Determine the reason for the error

char buildLog[16384];

clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,

sizeof(buildLog), buildLog, NULL);

std::cerr << "Error in kernel: " << std::endl;

std::cerr << buildLog;

clReleaseProgram(program);

return NULL;

}

return program;

}

///

// Create memory objects used as the arguments to the kernel

// The kernel takes three arguments: result (output), a (input),

// and b (input)

//

bool CreateMemObjects(cl_command_queue commandQueue,cl_context context, cl_mem memObjects[2], cl_mem *memResult,

unsigned int*a, unsigned int*b)

{

cl_mem tempObj;

tempObj=clCreateBuffer(context,CL_MEM_READ_ONLY| CL_MEM_COPY_HOST_PTR,

sizeof( unsigned int) * ARRAY_SIZE, a, NULL);

memObjects[0] = clCreateBuffer(context, CL_MEM_READ_ONLY,

sizeof( unsigned int) * ARRAY_SIZE, NULL, NULL);

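clEnqueueCopyBuffer(commandQueue, tempObj, memObjects[0], 0, 0,

sizeof( unsigned int) * ARRAY_SIZE, 0, NULL, NULL);

// Release the staging buffer; OpenCL defers its deletion until the queued copy has finished

clReleaseMemObject(tempObj);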

memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,

sizeof( unsigned int) * ARRAY_SIZE, b, NULL);

memResult[0] = clCreateBuffer(context, CL_MEM_READ_WRITE,

sizeof( unsigned int) * ARRAY_SIZE, NULL, NULL);

if (memObjects[0] == NULL || memObjects[1] == NULL || *memResult == NULL)

{

std::cerr << "Error creating memory objects." << std::endl;

return false;

}

return true;

}

///

// Cleanup any created OpenCL resources

//

void Cleanup(cl_context context, cl_command_queue commandQueue,

cl_program program, cl_kernel kernel, cl_mem memObjects[2],cl_mem*memResult)

{

for (int i = 0; i < 2; i++)

{

if (memObjects[i] != 0)

clReleaseMemObject(memObjects[i]);

}

if (*memResult != 0)

clReleaseMemObject(*memResult);

if (commandQueue != 0)

clReleaseCommandQueue(commandQueue);

if (kernel != 0)

clReleaseKernel(kernel);

if (program != 0)

clReleaseProgram(program);

if (context != 0)

clReleaseContext(context);

}

///

// main() for HelloWorld example

//

int main(int argc, char** argv)

{

clock_t tStart = clock();

printf("Time: %.2fs\n", (double)(tStart)/CLOCKS_PER_SEC);

cl_context context = 0;

cl_command_queue commandQueue = 0;

cl_program program = 0;

cl_device_id device = 0;

cl_kernel kernel = 0;

cl_mem memObjects[2] = { 0, 0 };

cl_mem memResult = 0;

cl_int errNum;

size_t workgroup_size;

// Create an OpenCL context on first available platform

context = CreateContext();

if (context == NULL)

{

std::cerr << "Failed to create OpenCL context." << std::endl;

return 1;

}

// Create a command-queue on the first device available

// on the created context

commandQueue = CreateCommandQueue(context, &device);

if (commandQueue == NULL)

{

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

// Create OpenCL program from HelloWorld.cl kernel source

program = CreateProgram(context, device, "HelloWorld.cl");

if (program == NULL)

{

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

// Create OpenCL kernel

kernel = clCreateKernel(program, "hello_kernel", NULL);

if (kernel == NULL)

{

std::cerr << "Failed to create kernel" << std::endl;

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

clGetKernelWorkGroupInfo(kernel, device,CL_KERNEL_WORK_GROUP_SIZE,sizeof(size_t), &workgroup_size, NULL);

// Create memory objects that will be used as arguments to

// kernel. First create host memory arrays that will be

// used to store the arguments to the kernel

unsigned int result[ARRAY_SIZE];

unsigned int a[ARRAY_SIZE];

unsigned int b[ARRAY_SIZE];

for ( unsigned int i = 0; i < ARRAY_SIZE; i++)

{

a[i] = i;

b[i] = (i * 2);

}

if (!CreateMemObjects(commandQueue,context, memObjects,&memResult, a, b))

{

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

// Set the kernel arguments (a, b, result)

errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &memObjects[0]);

errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memObjects[1]);

errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &memResult);

if (errNum != CL_SUCCESS)

{

std::cerr << "Error setting kernel arguments." << std::endl;

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

size_t globalWorkSize[1] = {ARRAY_SIZE };

size_t localWorkSize[1] = { 512 };

// Queue the kernel up for execution across the array

errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,

globalWorkSize, localWorkSize,

0, NULL, NULL);

if (errNum != CL_SUCCESS)

{

std::cerr << "Error queuing kernel for execution." << std::endl;

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

// Read the output buffer back to the Host

errNum = clEnqueueReadBuffer(commandQueue, memResult, CL_TRUE,

0, ARRAY_SIZE * sizeof(unsigned int), result,

0, NULL, NULL);

if (errNum != CL_SUCCESS)

{

std::cerr << "Error reading result buffer." << std::endl;

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

return 1;

}

// Output the result buffer

for ( unsigned int i = 0; i < ARRAY_SIZE; i++)

{

std::cout << result[i] << std::endl;

}

std::cout << std::endl;

std::cout << "Executed program successfully." << std::endl;

Cleanup(context, commandQueue, program, kernel, memObjects,&memResult);

printf("WorkGroup_Size is %lu\n", (unsigned long)workgroup_size);

printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);

scanf("%d",& x);

return 0;

}

// Kernel code

__kernel void hello_kernel(__global const uint *a,

__global const uint *b,

__global uint *r)

{

uint gid = get_global_id(0);

r[gid] = a[gid] + b[gid];

}

Waiting for your reply. Thanks in advance
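
On the work-group question above, here is a minimal sketch of the arithmetic involved. In OpenCL 1.x, when a local work size is passed to clEnqueueNDRangeKernel, the global work size must be evenly divisible by it, so the number of work-groups is simply their quotient. The helper names below are hypothetical and are not part of the posted program.

```cpp
#include <cstddef>

// Number of work-groups in one dimension.
// Assumes `global` is already a multiple of `local`.
constexpr std::size_t num_work_groups(std::size_t global, std::size_t local)
{
    return global / local;
}

// Round `global` up to the next multiple of `local`, for problem sizes
// that do not divide evenly. The kernel then needs a bounds check so the
// extra work-items do not touch memory past the end of the buffers.
constexpr std::size_t round_up_global(std::size_t global, std::size_t local)
{
    return ((global + local - 1) / local) * local;
}
```

For the sizes in this thread, 131072 / 512 gives 256 work-groups and 65536 / 512 gives 128. Both divide evenly, so no rounding is needed for either size.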

himanshu_gautam · ‎03-18-2013

You are using:

unsigned int result[ARRAY_SIZE];

unsigned int a[ARRAY_SIZE];

unsigned int b[ARRAY_SIZE];

Do not create such large arrays on the stack. A stack overflow is bound to happen. Use dynamic memory allocation instead. I could run your code after allocating these arrays dynamically.
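
One way to do the dynamic allocation suggested above is with std::vector, which keeps the element storage on the heap and frees it automatically. This is only a sketch: the HostArrays struct and make_host_arrays function are hypothetical names, and the reply does not say which allocation method was actually used.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the three host arrays of the posted program.
struct HostArrays {
    std::vector<unsigned int> a;
    std::vector<unsigned int> b;
    std::vector<unsigned int> result;
};

// Allocate all three arrays on the heap and fill a and b
// the same way the posted initialization loop does.
inline HostArrays make_host_arrays(std::size_t n)
{
    HostArrays h;
    h.a.resize(n);
    h.b.resize(n);
    h.result.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        h.a[i] = static_cast<unsigned int>(i);
        h.b[i] = static_cast<unsigned int>(i * 2);
    }
    return h;
}
```

With this in place, main would call make_host_arrays(ARRAY_SIZE) and pass h.a.data() and h.b.data() to CreateMemObjects and h.result.data() to clEnqueueReadBuffer. On compilers without vector::data(), &h.a[0] gives the same pointer for a non-empty vector.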

shreedhar_pawar · ‎03-20-2013

Thanks Himanshu, your answers always help!

Stack Overflow error in Vector Addition Program