9 Replies Latest reply on Mar 31, 2010 11:29 AM by notyou

    Confused about global/local id

    notyou

      I'm trying to perform matrix multiplication and I have the code correctly running on the CPU when the arrays are flattened.

      When I run it on the GPU however, it only works for every 16th array index, which if I'm not mistaken, means it's not running it on every local_id thread.

      Can someone walk through my thought process below and let me know where I'm wrong.

      I am testing with an int array of size 256 (16x16). The max work group size is 256. Here, when I use row (= get_global_id(0)), it comes back with 0-15 => so it's creating 16 work groups with 16 threads each. Why doesn't it create one work group with 256 threads?

      Then, when I use col (= get_local_id(0)), it only gets the first thread's id and so it only runs for the single column. Can someone explain to me what exactly I'm doing wrong, and why it's not getting the local_id for every thread? Thanks.

      -Matt

      __kernel void global_MM(__global int *A, __global int *B, __global int *C, int dimensions) { int value = 0; int row = get_global_id(0); int col = get_local_id(0); for(int i = 0; i < dimensions; i++) value += A[row * dimensions + i] * B[i * dimensions + col]; C[row * dimensions + col] = value; }

        • Confused about global/local id
          nou

          change get_local_id(0) to get_global_id(1);

            • Confused about global/local id
              notyou

              Thanks for the suggestion. It did not work however.

              Wouldn't using get_global_id(1) only work if the array was 2 dimensional? In my case, it the array is 1D (a 2D array that has been flattened). Would it still count as a 2D array, even if it's been flattened? Going off this, is it possible to pass a 2D array?

              Also, do I need to set the global and local size in the main portion of code (differently than what I have below, or if there is a better way, I'm open to that)? Currently I have

              size_t local;

               size_t global = NUM_ELEMENTS;

              error = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(int), &local, NULL);

              error = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

                • Confused about global/local id
                  notyou

                  I managed to get the program working (for the most part). But when I increase the x and y dimensions to 256, the display driver crashes and the visual studio will often give me an error about access memory at location ??? (seemingly random each time). I thought it was because I was using an int array and that I was overstepping the max number allowed, but changing it to long int did not make a difference.

                  Attached is my entire code, can anyone see anything that looks off?

                  //main.cpp #include <CL/cl.hpp> #include <iostream> #include <iomanip> #include <omp.h> #include <fstream> #include <string> using namespace std; #define DIMENSIONS 256 #define NUM_RUNS 5 #define oned //using a 1D array //minimum local size must be 256, so NUM_ELEMENTS must be at least that #define NUM_ELEMENTS DIMENSIONS*DIMENSIONS //16777216 - max size of int array, otherwise we're trying to allocate more memory than we have void printDeviceInfo(cl_device_id); bool errorCheck(cl_int, string); //returns true if there is an error (and prints appropriate message) void cpu_MM(); void verify(); #ifdef oned int A[NUM_ELEMENTS]; int B[NUM_ELEMENTS]; int C[NUM_ELEMENTS]; int CPU[NUM_ELEMENTS]; #endif int dimensions = DIMENSIONS; double gpu_avg = 0.0; double cpu_avg = 0.0; #ifndef oned int AA[DIMENSIONS][DIMENSIONS]; int BB[DIMENSIONS][DIMENSIONS]; int CC[DIMENSIONS][DIMENSIONS]; int CPUCPU[DIMENSIONS][DIMENSIONS]; #endif int main(int argc, char *argv[]) { double start, end, cpu_elapsed, gpu_elapsed; cout<<setprecision(5)<<fixed; //space to allocate = DIMENSIONS * DIMENSIONS = NUM_ELEMENTS #ifdef oned #pragma omp parallel for schedule(static, NUM_ELEMENTS/omp_get_num_threads()) for(int i = 0; i < NUM_ELEMENTS; i++) { A[i] = i; B[i] = i; C[i] = 0; CPU[i] = 0; } #endif #ifndef oned int count = 0; for(int i = 0; i < DIMENSIONS; i++) { for(int j = 0; j < DIMENSIONS; j++) { AA[i][j] = count; BB[i][j] = count; CC[i][j] = 0; CPUCPU[i][j] = 0; count++; } } #endif //initialization cl_int error; //check to see if we error our during most steps cl_platform_id platform; cl_uint numPlatforms; cl_uint num_devices_returned = 0; //get a list of all the platforms error = clGetPlatformIDs(0, NULL, &numPlatforms); if (0 < numPlatforms) { cl_platform_id* platforms = new cl_platform_id[numPlatforms]; error = clGetPlatformIDs(numPlatforms, platforms, NULL); for (unsigned i = 0; i < numPlatforms; ++i) { //char pbuf[100]; //error = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(pbuf), pbuf, NULL); platform = platforms[i]; } delete[] platforms; } //get our GPU device cl_device_id device; error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); //error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL); if(!errorCheck(error, "Getting device info")) printDeviceInfo(device); //create a context with our devices cl_context context; context = clCreateContext(NULL, 1, &device, NULL, NULL, &error); errorCheck(error, "Creating context"); //create a command queue cl_command_queue cmd_queue; cmd_queue = clCreateCommandQueue(context, device, 0, &error); errorCheck(error, "Creating command queue"); //create our program char *fileName = "F:\\Shared\\Documents\\Comp 4560\\OpenCL\\Matrix Multiplication\\Matrix Multiplication\\Matrix Multiplication\\test.cl"; const char *source = ""; string line; string sourceFile = ""; ifstream file (fileName); if(file.is_open()) { while(!file.eof()) { getline(file, line); sourceFile.append(line); } } source = sourceFile.c_str(); file.close(); cl_program program; program = clCreateProgramWithSource(context, 1, &source, NULL, &error); errorCheck(error, "Creating program with source"); cout<<"Building Program"<<endl; //build our program error = clBuildProgram(program, 1, &device, "", NULL, NULL); errorCheck(error, "Building program"); char logFile[2048]; error = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(logFile), &logFile, 0); errorCheck(error, "Getting Program Info"); cout<<"\n--------------------Build Log--------------------\n\n"<<logFile<<"\n--------------------End Build Log--------------------\n\n"<<endl; //create kernel objects for all kernel functions in the program object cl_kernel kernel; cl_uint numKernels; error = clCreateKernelsInProgram(program, 1, &kernel, &numKernels); errorCheck(error, "Creating kernel in program"); #ifdef oned cl_mem input_matrix_A = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(A), NULL, &error); errorCheck(error, "Creating input matrix A"); cl_mem input_matrix_B = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(B), NULL, &error); errorCheck(error, "Creating input matrix B"); cl_mem output_matrix_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(C), NULL, &error); errorCheck(error, "Creating output matrix C"); cl_event cmd_event; error = clEnqueueWriteBuffer(cmd_queue, input_matrix_A, CL_TRUE, 0, sizeof(A), &A, 0, NULL, &cmd_event); errorCheck(error, "Enqueue input matrix A"); error = clEnqueueWriteBuffer(cmd_queue, input_matrix_B, CL_TRUE, 0, sizeof(B), &B, 0, NULL, &cmd_event); errorCheck(error, "Enqueue input matrix B"); error = clEnqueueWriteBuffer(cmd_queue, output_matrix_C, CL_TRUE, 0, sizeof(C), &C, 0, NULL, &cmd_event); error = clSetKernelArg(kernel, 0, sizeof(input_matrix_A), &input_matrix_A); errorCheck(error, "Setting kernel arg [0]"); error = clSetKernelArg(kernel, 1, sizeof(input_matrix_B), &input_matrix_B); errorCheck(error, "Setting kernel arg [1]"); error = clSetKernelArg(kernel, 2, sizeof(output_matrix_C), &output_matrix_C); errorCheck(error, "Setting kernel arg [2]"); error = clSetKernelArg(kernel, 3, sizeof(dimensions), &dimensions); #endif #ifndef oned cl_mem input_matrix_A = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(AA), NULL, &error); errorCheck(error, "Creating input matrix A"); cl_mem input_matrix_B = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(BB), NULL, &error); errorCheck(error, "Creating input matrix B"); cl_mem output_matrix_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(CC), NULL, &error); errorCheck(error, "Creating input matrix C"); cl_event cmd_event; error = clEnqueueWriteBuffer(cmd_queue, input_matrix_A, CL_TRUE, 0, sizeof(AA), &AA, 0, NULL, &cmd_event); errorCheck(error, "Enqueue input matrix A"); error = clEnqueueWriteBuffer(cmd_queue, input_matrix_B, CL_TRUE, 0, sizeof(BB), &BB, 0, NULL, &cmd_event); errorCheck(error, "Enqueue input matrix B"); error = clEnqueueWriteBuffer(cmd_queue, output_matrix_C, CL_TRUE, 0, sizeof(CC), &CC, 0, NULL, &cmd_event); error = clSetKernelArg(kernel, 0, sizeof(input_matrix_A), &input_matrix_A); errorCheck(error, "Setting kernel arg [0]"); error = clSetKernelArg(kernel, 1, sizeof(input_matrix_B), &input_matrix_B); errorCheck(error, "Setting kernel arg [1]"); error = clSetKernelArg(kernel, 2, sizeof(output_matrix_C), &output_matrix_C); errorCheck(error, "Setting kernel arg [2]"); error = clSetKernelArg(kernel, 3, sizeof(dimensions), &dimensions); #endif //get the maximum work group size for executing the kernel on the device //size_t local; size_t global = NUM_ELEMENTS; //error = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(int), &local, NULL); errorCheck(error, "Getting kernel work group info"); for(int x = 0; x < NUM_RUNS; x++) { start = omp_get_wtime(); //enqueue our kernel to execute on the device error = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL); errorCheck(error, "Enqueuing ND Range Kernel"); //wait for execution to finish clFinish(cmd_queue); end = omp_get_wtime(); cout<<"GPU Finished run: "<<(x+1)<<endl; gpu_elapsed = end - start; gpu_avg += gpu_elapsed; } #ifdef oned //read the results from the device cl_event resultsEvent; error = clEnqueueReadBuffer(cmd_queue, output_matrix_C, CL_TRUE, 0, sizeof(C), &C, 0, NULL, &resultsEvent); errorCheck(error, "Reading Results Buffer1"); #endif #ifndef oned cl_event resultsEvent; error = clEnqueueReadBuffer(cmd_queue, output_matrix_C, CL_TRUE, 0, sizeof(CC), &CC, 0, NULL, &resultsEvent); errorCheck(error, "Reading Results Buffer1"); #endif for(int x = 0; x < NUM_RUNS; x++) { start = omp_get_wtime(); cpu_MM(); end = omp_get_wtime(); cout<<"CPU Finished run: "<<(x+1)<<endl; cpu_elapsed = end - start; cpu_avg += cpu_elapsed; } cout<<"Avg Parallel CPU Time elapsed: "<<(cpu_avg/NUM_RUNS)<<" for "<<NUM_RUNS<<" runs of num elements: "<<NUM_ELEMENTS<<endl; cout<<"Avg GPU Time elapsed: "<<(gpu_avg/NUM_RUNS)<<" for "<<NUM_RUNS<<" runs of num elements: "<<NUM_ELEMENTS<<endl; verify(); system("pause"); clReleaseMemObject(input_matrix_A); clReleaseMemObject(input_matrix_B); clReleaseMemObject(output_matrix_C); clReleaseProgram(program); clReleaseKernel(kernel); clReleaseCommandQueue(cmd_queue); clReleaseContext(context); return 0; } void cpu_MM() { #ifdef oned int k, j; #pragma omp parallel for private(k, j) schedule(static, NUM_ELEMENTS/omp_get_num_procs()) for(int i = 0; i < DIMENSIONS; i++) { for(k = 0; k < DIMENSIONS; k++) { CPU[i * DIMENSIONS + k] = 0; for(j = 0; j < DIMENSIONS; j++) CPU[i * DIMENSIONS + k] += A[i * DIMENSIONS + j] * B[j * DIMENSIONS + k]; } } #endif #ifndef oned for(int i = 0; i < DIMENSIONS; i++) { for(int k = 0; k < DIMENSIONS; k++) { CPUCPU[i][k] = 0; for(int j = 0; j < DIMENSIONS; j++) CPUCPU[i][k] += AA[i][j] * BB[j][k]; } } #endif } void verify() { bool error = false; #ifdef oned for(int i = 0; i < NUM_ELEMENTS; i++) if(CPU[i] != C[i]) { cout<<"CPU[i]="<<CPU[i]<<"\tC[i]="<<C[i]<<"\ti="<<i<<endl; error = true; cout<<"The output arrays do not match at i="<<i<<endl; break; } //else // cout<<"CPU[i]="<<CPU[i]<<"\tC[i]="<<C[i]<<"\ti="<<i<<endl; if(!error) cout<<"No difference between CPU and GPU calculations."<<endl; #endif #ifndef oned for(int i = 0; i < NUM_ELEMENTS; i++) if(CPUCPU[i] != CC[i]) { cout<<"CPUCPU[i]="<<CPUCPU[i]<<"\tCC[i]="<<CC[i]<<"\ti="<<i<<endl; error = true; //cout<<"The output arrays do not match at i="<<i<<endl; //break; } //else // cout<<"CPU[i]="<<CPU[i]<<"\tC[i]="<<C[i]<<"\ti="<<i<<endl; if(!error) cout<<"No difference between CPU and GPU calculations."<<endl; #endif } bool errorCheck(cl_int error, string dataPoint) { bool errorOccurred = true; if(error == CL_BUILD_PROGRAM_FAILURE) cout<<"CL_BUILD_PROGRAM_FAILURE"<<endl; else if(error == CL_COMPILER_NOT_AVAILABLE) cout<<"CL_COMPILER_NOT_AVAILABLE"<<endl; else if(error == CL_DEVICE_NOT_AVAILABLE) cout<<"CL_DEVICE_NOT_AVAILABLE"<<endl; else if(error == CL_DEVICE_NOT_FOUND) cout<<"CL_DEVICE_NOT_FOUND"<<endl; else if(error == CL_INVALID_ARG_INDEX) cout<<"CL_INVALID_ARG_INDEX"<<endl; else if(error == CL_INVALID_ARG_SIZE) cout<<"CL_INVALID_ARG_SIZE"<<endl; else if(error == CL_INVALID_ARG_VALUE) cout<<"CL_INVALID_ARG_VALUE"<<endl; else if(error == CL_INVALID_BINARY) cout<<"CL_INVALID_BINARY"<<endl; else if(error == CL_INVALID_BUFFER_SIZE) cout<<"CL_INVALID_BUFFER_SIZE"<<endl; else if(error == CL_INVALID_BUILD_OPTIONS) cout<<"CL_INVALID_BUILD_OPTIONS"<<endl; else if(error == CL_INVALID_COMMAND_QUEUE) cout<<"CL_INVALID_COMMAND_QUEUE"<<endl; else if(error == CL_INVALID_CONTEXT) cout<<"CL_INVALID_CONTEXT"<<endl; else if(error == CL_INVALID_DEVICE) cout<<"CL_INVALID_DEVICE"<<endl; else if(error == CL_INVALID_DEVICE_TYPE) cout<<"CL_INVALID_DEVICE_TYPE"<<endl; else if(error == CL_INVALID_EVENT) cout<<"CL_INVALID_EVENT"<<endl; else if(error == CL_INVALID_EVENT_WAIT_LIST) cout<<"CL_INVALID_EVENT_WAIT_LIST"<<endl; else if(error == CL_INVALID_GLOBAL_OFFSET) cout<<"CL_INVALID_GLOBAL_OFFSET"<<endl; else if(error == CL_INVALID_HOST_PTR) cout<<"CL_INVALID_HOST_PTR"<<endl; else if(error == CL_INVALID_KERNEL) cout<<"CL_INVALID_KERNEL"<<endl; else if(error == CL_INVALID_KERNEL_ARGS) cout<<"CL_INVALID_KERNEL_ARGS"<<endl; else if(error == CL_INVALID_MEM_OBJECT) cout<<"CL_INVALID_MEM_OBJECT"<<endl; else if(error == CL_INVALID_OPERATION) cout<<"CL_INVALID_OPERATION"<<endl; else if(error == CL_INVALID_PLATFORM) cout<<"CL_INVALID_PLATFORM"<<endl; else if(error == CL_INVALID_PROGRAM) cout<<"CL_INVALID_PROGRAM"<<endl; else if(error == CL_INVALID_PROGRAM_EXECUTABLE) cout<<"CL_INVALID_PROGRAM_EXECUTABLE"<<endl; else if(error == CL_INVALID_QUEUE_PROPERTIES) cout<<"CL_INVALID_QUEUE_PROPERTIES"<<endl; else if(error == CL_INVALID_SAMPLER) cout<<"CL_INVALID_SAMPLER"<<endl; else if(error == CL_INVALID_VALUE) cout<<"CL_INVALID_VALUE"<<endl; else if(error == CL_INVALID_WORK_DIMENSION) cout<<"CL_INVALID_WORK_DIMENSION"<<endl; else if(error == CL_INVALID_WORK_GROUP_SIZE) cout<<"CL_INVALID_WORK_GROUP_SIZE"<<endl; else if(error == CL_MEM_COPY_HOST_PTR) cout<<"CL_MEM_COPY_HOST_PTR"<<endl; else if(error == CL_MEM_OBJECT_ALLOCATION_FAILURE) cout<<"CL_MEM_OBJECT_ALLOCATION_FAILURE"<<endl; else if(error == CL_MEM_USE_HOST_PTR) cout<<"CL_MEM_USE_HOST_PTR"<<endl; else if(error == CL_OUT_OF_HOST_MEMORY) cout<<"CL_OUT_OF_HOST_MEMORY"<<endl; else if(error == CL_OUT_OF_RESOURCES) cout<<"CL_OUT_OF_RESOURCES"<<endl; else { //cout<<"No error at: "+dataPoint<<endl<<endl; return false; } cout<<"Error at: "+dataPoint<<endl<<endl; return errorOccurred; } void printDeviceInfo(cl_device_id device) { cl_uint error; size_t size; char deviceName[512] = {0}; char vendor[512] = {0}; char driverVersion[512] = {0}; char deviceVersion[512] = {0}; cl_uint cacheSize = 0; cl_ulong globalMemSize = 0; cl_uint maxClockFrequency = 0; cl_uint maxComputeUnits = 0; cl_platform_id platformID = 0; size_t maxWorkGroupSize; cl_uint maxWorkItemDimensions = 0; size_t maxWorkItemSizes[3]; //get the device name error = clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(deviceName), deviceName, &size); if(error == CL_SUCCESS) cout<<"Device Name: "<<deviceName<<endl; else cout<<"Error getting device name"<<endl; //get the vendor error = clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor), vendor, &size); if(error == CL_SUCCESS) cout<<"Vendor: "<<vendor<<endl; else cout<<"Error getting vendor"<<endl; //get the driver version error = clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, &size); if(error == CL_SUCCESS) cout<<"Driver Version: "<<driverVersion<<endl; else cout<<"Error getting driver version"<<endl; //get the device version error = clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(deviceVersion), deviceVersion, &size); if(error == CL_SUCCESS) cout<<"Device Version: "<<deviceVersion<<endl; else cout<<"Error getting device version"<<endl; //get the global memory size error = clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMemSize), &globalMemSize, &size); if(error == CL_SUCCESS) cout<<"Global memory size: "<<globalMemSize<<endl; else cout<<"Error getting global memory size"<<endl; //get the max clock frequency error = clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(maxClockFrequency), &maxClockFrequency, NULL); if(error == CL_SUCCESS) { cout<<"Max clock frequency: "; if(maxClockFrequency > 1000) { maxClockFrequency /= 1000; cout<<maxClockFrequency<<" GHz"<<endl; } else cout<<maxClockFrequency<<" MHz"<<endl; } else cout<<"Error getting max clock frequency"<<endl; //get the number of compute units error = clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(maxComputeUnits), &maxComputeUnits, NULL); if(error == CL_SUCCESS) cout<<"Max compute units: "<<maxComputeUnits<<endl; else cout<<"Error getting max compute units"<<endl; //get the platform error = clGetDeviceInfo(device, CL_DEVICE_PLATFORM, sizeof(platformID), &platformID, NULL); if(error == CL_SUCCESS) cout<<"Platform: "<<platformID<<endl; else cout<<"Error getting platform"<<endl; error = clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL); if(error == CL_SUCCESS) cout<<"Max Work Group Size: "<<maxWorkGroupSize<<endl; else cout<<"Error getting max work group size"<<endl; error = clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(maxWorkItemDimensions), &maxWorkItemDimensions, NULL); if(error == CL_SUCCESS) cout<<"Max Work Item Dimensions: "<<maxWorkItemDimensions<<endl; else cout<<"Error getting max work item dimensions"<<endl; error = clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(maxWorkItemSizes), &maxWorkItemSizes, NULL); if(error == CL_SUCCESS) { for(int i = 0; i < 3; i++) cout<<"Max Work Item Sizes: "<<maxWorkItemSizes[i]<<endl; } else cout<<"Error getting max work item sizes"<<endl; cout<<endl<<endl; } //in a file called test.cl, the location in the main program needs to be changed int getRow(int globalId, int dimensions) { return globalId/dimensions; } int getCol(int globalId, int dimensions) { return globalId%dimensions; } __kernel void globalMM(__global int *A, __global int *B, __global int *C, int dimensions) { int row = get_global_id(0); int rowNum = getRow(row, dimensions); int colNum = getCol(row, dimensions); /*printf("RowNum: %i\tColNum: %i\n", rowNum, colNum);*/ for(int k = 0; k < dimensions; k++) { C[rowNum * dimensions + colNum] = 0; for(int j = 0; j < dimensions; j++) C[rowNum * dimensions + colNum] += A[rowNum * dimensions + j] * B[j * dimensions + colNum]; } }

                    • Confused about global/local id
                      dravisher

                      Are you sure using static array allocation isn't the problem? For the 1D case you are trying to allocate 1 MiB in total (4 arrays of 65536 4 byte elements each). That might be pusing it, though I'm not sure. Also long int is usually the same size as an int nowadays (32-bit), though perhaps it isn't for your compiler.

                    • Confused about global/local id
                      nou

                      in this

                      error = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(int), &local, NULL);

                      you should use size_t because it return size_t so on 64bit system this will crash.

                      get_global_id(1) will return 0 if you execute only 1D kernel. get_global_size(1) will return 1.

                      you can run kernel on 2D domain. it is second parameter for clEnqueueNDRange()

                      and then work with flatened array like this array[y*get_global_size(0)+x]

                        • Confused about global/local id
                          notyou

                           

                          Originally posted by: dravisher Are you sure using static array allocation isn't the problem? For the 1D case you are trying to allocate 1 MiB in total (4 arrays of 65536 4 byte elements each). That might be pusing it, though I'm not sure. Also long int is usually the same size as an int nowadays (32-bit), though perhaps it isn't for your compiler.

                          That's what I was wondering, but in my openMP programs, there has been no issue with much larger arrays.

                           

                           

                          Originally posted by: nou in this

                           

                          error = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(int), &local, NULL);

                           

                          you should use size_t because it return size_t so on 64bit system this will crash.

                           

                          get_global_id(1) will return 0 if you execute only 1D kernel. get_global_size(1) will return 1.

                           

                          you can run kernel on 2D domain. it is second parameter for clEnqueueNDRange()

                           

                          and then work with flatened array like this array[y*get_global_size(0)+x]

                           

                          You mean it should be: error = clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &local, NULL); right? That makes sense based on what you've said.

                          As for the 2D arrays, I'm currently working with them flattened. But let's say I still want to access them like they are 2D. You're saying that if I put 2 in the dim field ( error = clEnqueueNDRangeKernel(cmd_queue, kernel, 2, NULL, &global, NULL, 0, NULL, NULL) then I should be able to access the "row" of the matrix using get_global_id(0) and then the columns with get_global_id(1). Then I can access them with array[get_global_size(0) * width+get_global_id(1)/*column*/], correct?

                           

                          I'll give this a shot and let you know how it goes.

                            • Confused about global/local id
                              nou

                              yes. if you pass 2 as dimension into clEnqueueNDRange() it will run kernels in matrix order.of course you must pass global as two item arrray

                              size_t global = {x_size, y_size};

                                • Confused about global/local id
                                  notyou

                                   

                                  Originally posted by: nou yes. if you pass 2 as dimension into clEnqueueNDRange() it will run kernels in matrix order.of course you must pass global as two item arrray

                                   

                                  size_t global = {x_size, y_size};

                                   

                                  Now that I changed the code, it does correctly give me the row and column (based on the get_local_id(0/1)) and run on the CPU. However, since the change, it seems that now when I make the dimension 256, my PC hard locks.

                                    • Confused about global/local id
                                      notyou

                                      While testing, I also noticed that when running the code on the GPU, it will run correctly for sizes up to 128, but if I pass NULL instead of local (in clEnqueueNDRangeKernel) it will take much longer (many orders of magnitude) than if I choose a local size before this. Could someone with a working machine test out the GPU code to see if the code crashes on them (dimensions = 256 should do it). I'd like to see if this is a driver issue (Mobility 5870) or if there is something wrong with the code. Thanks.