17 Replies Latest reply on Aug 20, 2010 10:54 PM by notyou

    Differing CPU and GPU Results

    notyou

      Hello,

      I've been working on the 0-1 knapsack problem recently and I've extended my code to work with input sizes > 256. When running the code on the CPU, everything works as it should, even for inputs > 1024 (max workgroup size), but when I move to the GPU, anything > 256 gives me an error when verifying the results. Does anyone see anything wrong with my code as for what could be causing this?

      -Matt

      gpu_knapsack.cl __kernel void knapsack(__global int *gpu_results, __global int *weight, __global int *profit, int NUM_ROWS, int NUM_COLS, __local int *gpu_subset, int WORKGROUP_SIZE) { int i = 0; int j = 0; int local_id = get_local_id(0); int num_groups = NUM_COLS / WORKGROUP_SIZE; int group_offset = 0; int weightval = 0; int profitval = 0; int gpu_subset1 = 0; int gpu_subset2 = 0; int boolcheck = 0; int val = 0; int offset = 0; /*start with a row of 0's, set to 0 so we don't need to read from global memory*/ for(i = 0; i < num_groups; i++) gpu_subset[(WORKGROUP_SIZE * i) + local_id] = 0; for(i = 1; i < NUM_ROWS; i++) { weightval = weight[i - 1]; profitval = profit[i - 1]; for(j = 0; j < num_groups; j++) { group_offset = WORKGROUP_SIZE * j; gpu_subset1 = gpu_subset[group_offset + local_id]; /*results will be undefined if local_id < weightval but that is fine since they will be filtered out by the boolcheck later*/ gpu_subset2 = gpu_subset[group_offset + local_id - weightval] + profitval; boolcheck = (group_offset + local_id >= weightval) && (gpu_subset2 > gpu_subset1); val = (boolcheck) ? gpu_subset2 : gpu_subset1; /*offset for where in global memory we'll write to and perform global write*/ offset = (i * NUM_COLS) + group_offset + local_id; gpu_results[offset] = val; /*force all threads to finish execution before writing our results back into the local array since a thread may be delayed and thus get an updated value instead of the "old" one*/ barrier(CLK_LOCAL_MEM_FENCE); /*now that all threads have finished execution update local memory with our new values for the next iteration*/ gpu_subset[group_offset + local_id] = val; } } } Main.cpp #include <CL/cl.hpp> #include <iostream> #include <iomanip> #include <omp.h> #include <fstream> #include <string> #include <sstream> using namespace std; //#define cpu //run on CPU, comment out to run on GPU, if running on GPU, change BLOCK_SIZE to 16 #define NUM_RUNS 1 #define NUM_ITEMS 512 //# rows, add 1 for empty row #define MAX_WEIGHT (NUM_ITEMS - 1) //# cols, add 1 for empty column #define NUM_COLS (MAX_WEIGHT + 1) #define NUM_ROWS (NUM_ITEMS + 1) #define NUM_ELEMENTS (NUM_COLS * NUM_ROWS) #define CPU_WORKGROUP_SIZE 1024 #define GPU_WORKGROUP_SIZE 256 int weight[NUM_ITEMS]; int profit[NUM_ITEMS]; bool errorCheck(cl_int, string); //returns true if there is an error (and prints appropriate message) void seq_knapsack(int[], int[]); void generateRandomFile(int, string); void loadFromFile(string, int[]); void verify(); int seq_results[NUM_ELEMENTS]; int gpu_results[NUM_ELEMENTS]; int main(int argc, char *argv[]) { cout<<setprecision(10); stringstream val; val << NUM_COLS; string s = val.str(); generateRandomFile(NUM_ITEMS, "weight" + s + ".txt"); generateRandomFile(NUM_ITEMS, "profit" + s + ".txt"); loadFromFile("weight" + s + ".txt", weight); loadFromFile("profit" + s + ".txt", profit); //initialization cl_int error; //check to see if we error our during most steps cl_platform_id platform; cl_uint numPlatforms; cl_uint num_devices_returned = 0; //get a list of all the platforms error = clGetPlatformIDs(0, NULL, &numPlatforms); if (0 < numPlatforms) { cl_platform_id* platforms = new cl_platform_id[numPlatforms]; error = clGetPlatformIDs(numPlatforms, platforms, NULL); for (unsigned i = 0; i < numPlatforms; ++i) { //char pbuf[100]; //error = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(pbuf), pbuf, NULL); platform = platforms[i]; } delete[] platforms; } //get our GPU device cl_device_id device; #ifndef cpu error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); #endif #ifdef cpu error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL); #endif //if(!errorCheck(error, "Getting device info")) // printDeviceInfo(device); //create a context with our devices cl_context context; context = clCreateContext(NULL, 1, &device, NULL, NULL, &error); errorCheck(error, "Creating context"); //create a command queue cl_command_queue cmd_queue; cmd_queue = clCreateCommandQueue(context, device, 0, &error); errorCheck(error, "Creating command queue"); //create our program char *fileName = "F:\\University\\comp4520\\0-1 knapsack\\knapsack\\knapsack\\gpu_knapsack.cl"; const char *source = ""; string line; string sourceFile = ""; ifstream file (fileName); if(file.is_open()) { while(!file.eof()) { getline(file, line); sourceFile.append(line); } } source = sourceFile.c_str(); file.close(); cl_program program; program = clCreateProgramWithSource(context, 1, &source, NULL, &error); errorCheck(error, "Creating program with source"); cout<<"Building Program"<<endl; //build our program error = clBuildProgram(program, 1, &device, "", NULL, NULL); errorCheck(error, "Building program"); char logFile[16384]; error = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(logFile), &logFile, 0); errorCheck(error, "Getting Program Info"); cout<<"\n--------------------Build Log--------------------\n\n"<<logFile<<"\n--------------------End Build Log--------------------\n\n"<<endl; //create kernel objects for all kernel functions in the program object cl_kernel kernel; cl_uint numKernels; error = clCreateKernelsInProgram(program, 1, &kernel, &numKernels); errorCheck(error, "Creating kernel in program"); #ifdef cpu size_t local[1] = {CPU_WORKGROUP_SIZE}; size_t global[1] = {CPU_WORKGROUP_SIZE}; if(NUM_COLS < CPU_WORKGROUP_SIZE) { local[0] = NUM_COLS; global[0] = NUM_COLS; } #else size_t local[1] = {GPU_WORKGROUP_SIZE}; size_t global[1] = {GPU_WORKGROUP_SIZE}; if(NUM_COLS < GPU_WORKGROUP_SIZE) { local[0] = NUM_COLS; global[0] = NUM_COLS; } #endif //zero the first row and column for(int j = 0; j < NUM_COLS; j++) gpu_results[j] = 0; for(int i = 0; i < NUM_ROWS; i++) gpu_results[i * NUM_COLS] = 0; cl_mem gpu_results_buffer; cl_mem gpu_subset_buffer; cl_mem weight_buffer; cl_mem profit_buffer; cl_event cmd_event; seq_knapsack(weight, profit); int rows = NUM_ROWS; int cols = NUM_COLS; int workgroup_size; #ifdef cpu workgroup_size = CPU_WORKGROUP_SIZE; if(NUM_COLS < CPU_WORKGROUP_SIZE) workgroup_size = NUM_COLS; #else workgroup_size = GPU_WORKGROUP_SIZE; if(NUM_COLS < GPU_WORKGROUP_SIZE) workgroup_size = NUM_COLS; #endif //current working row of the problem int gpu_subset[NUM_COLS]; //these can be set here since they won't be changing during execution weight_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(weight), NULL, &error); errorCheck(error, "Creating weight buffer"); profit_buffer = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(profit), NULL, &error); errorCheck(error, "Creating profit buffer"); gpu_results_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(gpu_results), NULL, &error); errorCheck(error, "Creating parallel buffer"); gpu_subset_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(gpu_subset), NULL, &error); errorCheck(error, "Creating subset buffer"); error = clEnqueueWriteBuffer(cmd_queue, gpu_results_buffer, CL_TRUE, 0, sizeof(gpu_results), &gpu_results, 0, NULL, &cmd_event); errorCheck(error, "Enqueue gpu_results_buffer"); error = clEnqueueWriteBuffer(cmd_queue, weight_buffer, CL_TRUE, 0, sizeof(weight), &weight, 0, NULL, &cmd_event); errorCheck(error, "Enqueue weight buffer"); error = clEnqueueWriteBuffer(cmd_queue, profit_buffer, CL_TRUE, 0, sizeof(profit), &profit, 0, NULL, &cmd_event); errorCheck(error, "Enqueue profit buffer"); error = clEnqueueWriteBuffer(cmd_queue, gpu_subset_buffer, CL_TRUE, 0, sizeof(gpu_subset), &gpu_subset, 0, NULL, &cmd_event); errorCheck(error, "Enqueue profit buffer"); error = clSetKernelArg(kernel, 0, sizeof(gpu_results_buffer), &gpu_results_buffer); errorCheck(error, "Setting kernel arg [0]"); error = clSetKernelArg(kernel, 1, sizeof(weight_buffer), &weight_buffer); errorCheck(error, "Setting kernel arg [1]"); error = clSetKernelArg(kernel, 2, sizeof(profit_buffer), &profit_buffer); errorCheck(error, "Setting kernel arg [2]"); error = clSetKernelArg(kernel, 3, sizeof(int), &rows); errorCheck(error, "Setting kernel arg [3]"); error = clSetKernelArg(kernel, 4, sizeof(int), &cols); errorCheck(error, "Setting kernel arg [4]"); error = clSetKernelArg(kernel, 5, sizeof(gpu_subset_buffer), NULL); errorCheck(error, "Setting kernel arg [5]"); error = clSetKernelArg(kernel, 6, sizeof(int), &workgroup_size); //enqueue our kernel to execute on the device error = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global, local, 0, NULL, NULL); errorCheck(error, "Enqueuing ND Range Kernel"); //wait for execution to finish clFinish(cmd_queue); //read the results from the device cl_event resultsEvent; error = clEnqueueReadBuffer(cmd_queue, gpu_results_buffer, CL_TRUE, 0, sizeof(gpu_results), &gpu_results, 0, NULL, &resultsEvent); errorCheck(error, "Reading Results Buffer"); clReleaseMemObject(gpu_subset_buffer); clReleaseMemObject(gpu_results_buffer); clReleaseMemObject(weight_buffer); clReleaseMemObject(profit_buffer); verify(); clReleaseProgram(program); clReleaseKernel(kernel); clReleaseCommandQueue(cmd_queue); clReleaseContext(context); system("pause"); return 0; } void seq_knapsack(int weight[], int profit[]) { int i, j; //zero the first row and column for(j = 0; j < NUM_COLS; j++) seq_results[j] = 0; for(i = 0; i < NUM_ROWS; i++) seq_results[i * NUM_COLS] = 0; for(i = 1; i < NUM_ROWS; i++) for(j = 1; j < NUM_COLS; j++) { if(j >= weight[i - 1] && (seq_results[(i - 1) * NUM_COLS + j - weight[i - 1]] + profit[i - 1]) > seq_results[(i - 1) * NUM_COLS + j]) seq_results[(i * NUM_COLS) + j] = (seq_results[(i - 1) * NUM_COLS + j - weight[i - 1]] + profit[i - 1]); else seq_results[(i * NUM_COLS) + j] = (seq_results[(i - 1) * NUM_COLS + j]); } #ifdef verbose //print array j = 0; for(i = 0; i < NUM_ELEMENTS; i++) { cout<<seq_results[i]<<" "; if(j == MAX_WEIGHT) { cout<<endl; j = 0; } else j++; } cout<<endl; #endif } void loadFromFile(string loadFile, int myArray[NUM_ITEMS]) { ifstream myfile (loadFile); int num = 0; int i = 0; if (myfile.is_open()) { while (!myfile.eof()) { myfile >> num; myArray[i] = num; i++; if(i == NUM_ITEMS) break; } } myfile.close(); } void generateRandomFile(int num_vars, string outFile) { int num; ifstream inputFile; ofstream outputFile; inputFile.open(outFile); if(inputFile.is_open()) { inputFile.close(); return; } else outputFile.open(outFile); srand((unsigned int)(time(NULL))); for(int i = 0; i < num_vars; i++) { num = rand() % 26; //num from 0-25 outputFile<<num<<" "; } outputFile.close(); } void verify() { bool error = false; cout<<endl; for(int i = 0; i < NUM_ELEMENTS; i++) if(seq_results[i] != gpu_results[i]) { cout<<"Sequential[i]="<<seq_results[i]<<"\tGPU[i]="<<gpu_results[i]<<".\t Error at i="<<i<<". Stopping."<<endl; error = true; break; } if(!error) cout<<"No difference between calculations."<<endl<<endl<<endl; } bool errorCheck(cl_int error, string dataPoint) { bool errorOccurred = true; if(error == CL_BUILD_PROGRAM_FAILURE) cout<<"CL_BUILD_PROGRAM_FAILURE"<<endl; else if(error == CL_COMPILER_NOT_AVAILABLE) cout<<"CL_COMPILER_NOT_AVAILABLE"<<endl; else if(error == CL_DEVICE_NOT_AVAILABLE) cout<<"CL_DEVICE_NOT_AVAILABLE"<<endl; else if(error == CL_DEVICE_NOT_FOUND) cout<<"CL_DEVICE_NOT_FOUND"<<endl; else if(error == CL_INVALID_ARG_INDEX) cout<<"CL_INVALID_ARG_INDEX"<<endl; else if(error == CL_INVALID_ARG_SIZE) cout<<"CL_INVALID_ARG_SIZE"<<endl; else if(error == CL_INVALID_ARG_VALUE) cout<<"CL_INVALID_ARG_VALUE"<<endl; else if(error == CL_INVALID_BINARY) cout<<"CL_INVALID_BINARY"<<endl; else if(error == CL_INVALID_BUFFER_SIZE) cout<<"CL_INVALID_BUFFER_SIZE"<<endl; else if(error == CL_INVALID_BUILD_OPTIONS) cout<<"CL_INVALID_BUILD_OPTIONS"<<endl; else if(error == CL_INVALID_COMMAND_QUEUE) cout<<"CL_INVALID_COMMAND_QUEUE"<<endl; else if(error == CL_INVALID_CONTEXT) cout<<"CL_INVALID_CONTEXT"<<endl; else if(error == CL_INVALID_DEVICE) cout<<"CL_INVALID_DEVICE"<<endl; else if(error == CL_INVALID_DEVICE_TYPE) cout<<"CL_INVALID_DEVICE_TYPE"<<endl; else if(error == CL_INVALID_EVENT) cout<<"CL_INVALID_EVENT"<<endl; else if(error == CL_INVALID_EVENT_WAIT_LIST) cout<<"CL_INVALID_EVENT_WAIT_LIST"<<endl; else if(error == CL_INVALID_GLOBAL_OFFSET) cout<<"CL_INVALID_GLOBAL_OFFSET"<<endl; else if(error == CL_INVALID_HOST_PTR) cout<<"CL_INVALID_HOST_PTR"<<endl; else if(error == CL_INVALID_KERNEL) cout<<"CL_INVALID_KERNEL"<<endl; else if(error == CL_INVALID_KERNEL_ARGS) cout<<"CL_INVALID_KERNEL_ARGS"<<endl; else if(error == CL_INVALID_MEM_OBJECT) cout<<"CL_INVALID_MEM_OBJECT"<<endl; else if(error == CL_INVALID_OPERATION) cout<<"CL_INVALID_OPERATION"<<endl; else if(error == CL_INVALID_PLATFORM) cout<<"CL_INVALID_PLATFORM"<<endl; else if(error == CL_INVALID_PROGRAM) cout<<"CL_INVALID_PROGRAM"<<endl; else if(error == CL_INVALID_PROGRAM_EXECUTABLE) cout<<"CL_INVALID_PROGRAM_EXECUTABLE"<<endl; else if(error == CL_INVALID_QUEUE_PROPERTIES) cout<<"CL_INVALID_QUEUE_PROPERTIES"<<endl; else if(error == CL_INVALID_SAMPLER) cout<<"CL_INVALID_SAMPLER"<<endl; else if(error == CL_INVALID_VALUE) cout<<"CL_INVALID_VALUE"<<endl; else if(error == CL_INVALID_WORK_DIMENSION) cout<<"CL_INVALID_WORK_DIMENSION"<<endl; else if(error == CL_INVALID_WORK_GROUP_SIZE) cout<<"CL_INVALID_WORK_GROUP_SIZE"<<endl; else if(error == CL_MEM_COPY_HOST_PTR) cout<<"CL_MEM_COPY_HOST_PTR"<<endl; else if(error == CL_MEM_OBJECT_ALLOCATION_FAILURE) cout<<"CL_MEM_OBJECT_ALLOCATION_FAILURE"<<endl; else if(error == CL_MEM_USE_HOST_PTR) cout<<"CL_MEM_USE_HOST_PTR"<<endl; else if(error == CL_OUT_OF_HOST_MEMORY) cout<<"CL_OUT_OF_HOST_MEMORY"<<endl; else if(error == CL_OUT_OF_RESOURCES) cout<<"CL_OUT_OF_RESOURCES"<<endl; else if(error != 0) cout<<"Unknown error"<<endl; else if(error == 0) { //cout<<"No error at: "+dataPoint<<endl<<endl; return false; } cout<<"Error at: "+dataPoint<<endl<<endl; return errorOccurred; }

        • Differing CPU and GPU Results
          Illusio

          Yes. Max workgroup size being 256 on your GPU.

          Are you checking the return value from clEnqueueNDRangeKernel? I'm pretty sure you get an error when you attempt to exceed the max.

           

            • Differing CPU and GPU Results
              notyou

               

              Originally posted by: Illusio Yes. Max workgroup size being 256 on your GPU.

               

              Are you checking the return value from clEnqueueNDRangeKernel? I'm pretty sure you get an error when you attempt to exceed the max.

               

               

               

              I am aware that the max workgrop size is 256 for the GPU and 1024 for the CPU (I will explain later what I am trying to do). I am also checking the result from clEnqueueNDRangeKernel and it doesn't return an error (because I'm setting the local and global size to 256).

              Now, the 0-1 knapsack problem has a good amount of DLP, provided each row is completed before the next is started. Now, for 256 columns or less, this works perfectly on the GPU (as expected), but when I move up to 512 (just to make it split evenly), I get errors when the final results are returned.

              The initial problem is that if the global size is > 256, it wants to create multiple work groups, and since I can't synchronize between workgroups, there is inconsistency when updating the array.

              So, to try and work around this, I pass the # of columns and the workgroup size (here, 512 and 256). Then, in the actual CL code, I do for each row, split the row into j groups (#columns/workgroup size) = 2 of size (workgroup size) each. Then have each thread performs (2 = j) loop iterations where the 1st iteration will calculate the results for [0, 255] while the next iteration will go from [256, 511] (in this case).

              I've worked through the code and I can't seem to find anything wrong that would cause the results to be off. What I don't understand is that the results are correct when run on my i7 (4 core, 8 thread) CPU (even if I use inputs of size > 1024, which should put me in the same conditions as the GPU error).

              Hopefully this explains better what my issue is.

                • Differing CPU and GPU Results
                  Illusio

                  Hehe, sorry about that. Guess I misunderstood what you wrote, I took it as you having hardcoded 1024 in the local work group size via a define because of the caps in your code. Gotta love long argument lists.

                   

                    • Differing CPU and GPU Results
                      notyou

                      No problem, it happens. But now that you're understanding it correctly, can you see anything that would cause different results? I just want to have a second brain evaluate this so I don't go crazy.

                        • Differing CPU and GPU Results
                          n0thing

                          CPU and GPU don't work in the same way. All threads in a work-group are executed serially on a CPU (a work-group is one software thread) but 64 (high end GPUs) threads are executed parallely on a GPU in a work-group.

                            • Differing CPU and GPU Results
                              notyou

                               

                              Originally posted by: n0thing CPU and GPU don't work in the same way. All threads in a work-group are executed serially on a CPU (a work-group is one software thread) but 64 (high end GPUs) threads are executed parallely on a GPU in a work-group.


                              That doesn't make sense. It's running on a multi-core CPU, so shouldn't the items in the group get split up and executed by any of the cores in parallel? Also, in my previous projects, CPU utilization has been 100% when running the OpenCL version, meaning all cores are being used, how is this possible if everything is still executed sequentially? Also, I 100% realize that the GPU does everything in parallel, but in my code, I can't see any problem with where values are being written since the local id + offset should always be a unique location.

                                • Differing CPU and GPU Results
                                  n0thing

                                  There can be multiple groups so that CPU utilization is 100%.

                                  Can you try with work-group size of 64 (wavefront size) with GPU and see if you are getting correct results?

                                   

                                    • Differing CPU and GPU Results
                                      notyou

                                       

                                      Originally posted by: n0thing There can be multiple groups so that CPU utilization is 100%.

                                      Can you try with work-group size of 64 (wavefront size) with GPU and see if you are getting correct results?



                                      I should have been more clear. The 100% utilization occurred when there were only 1024 items (1 workgroup) = 128/CPU core.

                                      Now for testing the workgroup size, I've done a few things with my code.

                                      1) If the workgroup size (256) >= num items (64 or a second case, 256), the workgroup size is set to the number of items, 64 (or 256). The results end up correct.

                                      2) If the workgroup size (256) < num items (512 for simplicity). Then what happens is I set the local and global workgroup size to 256, giving me [0, 255] threads. I've done this because I need to have each row completed before the next is done because of data dependencies. Now, if I set global = 512, then it wants to create 2 workgroups, but we can't synchronize between workgroups so we come back to my workaround of in the CL code, making each thread do it's local id + offset (based on what iteration we're in, 256 * 0 + local and 256 * 1 + local for 512 items). Hope this makes sense.

                                        • Differing CPU and GPU Results
                                          n0thing

                                          Are you assuming that all 256 threads in your work-group running in parallel?

                                          This isn't the case as only 64 threads run parallely at a time so you should use synchronization when writing to local memory and reading again from it. The array gpu_subset was set to zero initially but there is no barrier between the write and following read in the loop.

                                          Or if you don't want to use a barrier - use work-group size of 64.

                                            • Differing CPU and GPU Results
                                              notyou

                                               

                                              Originally posted by: n0thing Are you assuming that all 256 threads in your work-group running in parallel?

                                               

                                              This isn't the case as only 64 threads run parallely at a time so you should use synchronization when writing to local memory and reading again from it. The array gpu_subset was set to zero initially but there is no barrier between the write and following read in the loop.

                                               

                                              Or if you don't want to use a barrier - use work-group size of 64.



                                              Yes, I was assuming all 256 threads were running in parallel, but I now realize that was a mistake. I will try both using a workgroup size of 64 and barriers to see which combination gives me the best performance with correct results. I will post back late today whether or not your suggestion works.

                                                • Differing CPU and GPU Results
                                                  notyou

                                                  As before, with a workgroup size of 64 (for 64 columns) the results are correct, but once I move to 128 columns, I get errors. I should also note that I put barriers in between every operation just to check that wasn't the problem and that didn't fix it. I'm going to roll back to 2.01 since that's the only one that printf seems to work with (unrecognized extension during CL file compilation otherwise) and try debugging it that way.

                                                  If anyone else could also test the code it would be appreciated (the sample I provided earlier should be everything you need), just to see if anyone else encounters the same issue.

                                                    • Differing CPU and GPU Results
                                                      notyou

                                                      Just updating everyone on my roll back to 2.01. Now the opposite is happening, the results are correct on the GPU <edited>. I will try moving back to 2.2 to see if the situation changes.

                                                      Edit @ Micah, I had used barriers there as well during debugging but it never seemed to make a difference. None the less, I have added them in while I bounce back and forth between SDK versions.

                                                      Edited out my mistake of stating the CPU version wasn't working when it actually was.

                                                        • Differing CPU and GPU Results
                                                          notyou

                                                          Final update. After much installing/uninstalling I've got it figured out. With SDK 2.01, the GPU results are correct. <edited> Moving to SDK 2.2, the CPU results are correct, but the GPU is now incorrect.

                                                          Conclusion: somewhere between 2.01 and 2.1 something was broken. I can provide the exact code I'm using for debugging purposes since there's nothing that needs to be hidden. Please let me know if this is the case. Thanks.

                                                          Edited out my mistake of stating the CPU version wasn't working when it actually was.

                                  • Differing CPU and GPU Results
                                    MicahVillmow
                                    notyou,
                                    you need barriers after here:
                                    /*start with a row of 0's, set to 0 so we don't need to read from global memory*/
                                    for(i = 0; i < num_groups; i++)
                                    gpu_subset[(WORKGROUP_SIZE * i) + local_id] = 0;

                                    and here:
                                    /*now that all threads have finished execution update local memory with our new values for the next iteration*/
                                    gpu_subset[group_offset + local_id] = val;
                                    • Differing CPU and GPU Results
                                      MicahVillmow
                                      notyou,
                                      Does the problem show up in SDK 2.2? If so, lets focus on that as 2.01 and 2.1 won't receive any backported fixes.
                                        • Differing CPU and GPU Results
                                          notyou

                                           

                                          Originally posted by: MicahVillmow notyou, Does the problem show up in SDK 2.2? If so, lets focus on that as 2.01 and 2.1 won't receive any backported fixes.


                                          Yes. The problem still shows up in 2.2, and at this time, I'll correct a mistake I made earlier, the CPU version is working (I accidentally commented out a section of code which was why it looked like it wasn't). So, to summarize, GPU works with 2.01 (the last version it works with) and the CPU works with 2.2 (no issues here). I'd also like to point out that cl_amd_printf seems to have issues when I try to use it as upon compilation of the CL program, it flags it as an unrecognized extension (even though my device info says otherwise) and ignores it, causing a segmentation fault later.

                                          Edit: I also noticed one other very interesting thing. When running on the CPU, the two barriers that you said were necessary (and I agree), cause errors when the number of columns is greater than the number of items in the work group. When they are removed, there are no issues.

                                          As for my system, it's an Asus G73JH, Intel Core i7-720QM (4C, 8T), 5870M (based on 5770).

                                          If there's anything else I can do to help track down this issue, please let me know.

                                          -Matt