15 Replies Latest reply on Aug 17, 2011 11:47 AM by genaganna

    clEnqueueNDRangeKernel returns -45 on 2nd GPU

    jbrussell
      Works OK on primary HD 5870 but not on secondary FirePro v8750

      Greetings,

      I am running SDK 2.5 on a Windows 7 box with an HD 5870 (the primary adapter) and a FirePro V8750 (secondary).  clGetDeviceInfo() on the 5870 shows OpenCL 1.1 while the V8750 shows version 1.0 (in code box below).  The V8750 also returns "ATI RV770" as its name, which is a little unsettling but the V8750 has an RV770 as its core so I guess that's OK.  Also, in order to activate the secondary card, I did the 3 resistors in the DVI-VGA adaptor trick that I saw somewhere in this forum. 

      My kernel builds and runs fine when I target the 5870.  When targeting the the V8750, the build is OK but clEnqueueNDRangeKernel() returns -45, which is  CL_INVALID_PROGRAM_EXECUTABLE.

      I suspect this has something to do with the V8750 as being version 1.0 instead of 1.1.  Is there a way to force the build to be a 1.0 build?  According to the OpenCL spec the only valid version option for clBuildProgram() is "-cl-std=CL1.1", so there doesn't appear to be a way to force it to build a 1.0-compatible executable.  It seems to imply that it will build to whatever the device's CL_DEVICE_OPENCL_C_VERSION is but I don't really know if that's the case.

      Is there a way to upgrade the V8750 to 1.1-compatible?  I've loaded the latest driver, am running the latest SDK, etc.

      Thanks,

      JR

       

      Num available OpenCl devices = 16 Device 0 settings: Device type = GPU Vendor = Advanced Micro Devices, Inc. Name = Cypress Global memory size = 800 MB Local memory size = 32 KB Max workgroup size = 256 Max dimensions = 3 Max workitem sizes = 256,256,256 Version = OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213) OpenCL C Version = OpenCL C 1.1 Device 1 settings: Device type = GPU Vendor = Advanced Micro Devices, Inc. Name = ATI RV770 Global memory size = 1024 MB Local memory size = 16 KB Max workgroup size = 256 Max dimensions = 3 Max workitem sizes = 256,256,256 Version = OpenCL 1.0 AMD-APP-SDK-v2.5 (684.213) OpenCL C Version = OpenCL C 1.0 Error: Enqueueing my_kernel onto command queue. Status = -45

        • clEnqueueNDRangeKernel returns -45 on 2nd GPU
          MicahVillmow
          jbrussell,
          You cannot run binaries across different OpenCL versions. You will need to compile for both your 1.1 and 1.0 devices seperately. The R7XX series of chips will never support OpenCL 1.1 as they do not have the necessary hardware requirements.
            • clEnqueueNDRangeKernel returns -45 on 2nd GPU
              jbrussell

              Thanks, but is there a way with SDK 2.5 to force a compile to OpenCL 1.0?

                • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                  genaganna

                   

                  Originally posted by: jbrussell Thanks, but is there a way with SDK 2.5 to force a compile to OpenCL 1.0?

                   

                  Just remove -cl-std option from build options which allows compiler to select compiler version based on device.

                    • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                      jbrussell

                      The -cl-std option is not there.  I have the buildoptions parameter to clBuildProgram() set to NULL. 

                        • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                          genaganna

                           

                          Originally posted by: jbrussell The -cl-std option is not there.  I have the buildoptions parameter to clBuildProgram() set to NULL. 

                          Have you built kernel for all devices or first device?  Can you copy your runtime code and kernel code here which allows us to answer quickly?

                            • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                              jbrussell

                              In looking at the original code it looks like it builds just for device 0 since the num_devices parameter to clBuildProgram() is set to 1.  So now when I force it to only build the kernel for the device I intend to use, via a #define, it works fine for  devices[0], the 5870.  For devices[1], the V8750, clBuildProgram returns -11, CL_BUILD_PROGRAM_FAILURE.  When ported to an NVIDIA OpenCL 1.0 system this kernel builds and runs fine, so maybe there's a problem with the SDK setup?  The NVIDIA device happens to be device 0, so maybe the problem is for secondary devices?

                              I can't show the kernel code because it's proprietary but it's really only doing some fairly simple math on a ushort image in global memory...converts to float, does some math, converts back to ushort for output.

                              I've attached the portions of the runtime code that init and do the build...pretty standard, glommed from the template example.

                              So, to restate the problem(s):

                              When building only for the single device on which the kernel is to be run, clBuildProgram() returns -11  for the OpenCL 1.0 V8750 device (device 1), and returns CL_SUCCESS for the OPENCL 1.1 5870 device (device 0).

                               

                                • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                  jbrussell

                                  Looks like the code attachment failed.  2nd try:

                                   

                                  cl_int initializeCL(void) { cl_int status = 0; size_t deviceListSize; /* * Have a look at the available platforms and pick either * the AMD one if available or a reasonable default. */ cl_uint numPlatforms; cl_platform_id platform = NULL; status = clGetPlatformIDs(0, NULL, &numPlatforms); if (status != CL_SUCCESS) { std::cout << "Error: Getting Platforms. (clGetPlatformsIDs)\n"; return 1; } if (numPlatforms > 0) { cl_platform_id* platforms = new cl_platform_id[numPlatforms]; status = clGetPlatformIDs(numPlatforms, platforms, NULL); if (status != CL_SUCCESS) { std::cout << "Error: Getting Platform Ids. (clGetPlatformsIDs)\n"; return 1; } // Print platform stats query_and_print_platform_info( platforms, numPlatforms ); for (cl_uint i=0; i < numPlatforms; ++i) { char pbuff[100]; status = clGetPlatformInfo( platforms[i], CL_PLATFORM_VENDOR, sizeof(pbuff), pbuff, NULL ); if (status != CL_SUCCESS) { std::cout << "Error: Getting Platform Info.(clGetPlatformInfo)\n"; return 1; } platform = platforms[i]; if (!strcmp(pbuff, "Advanced Micro Devices, Inc.")) { break; } } delete platforms; } if (NULL == platform) { std::cout << "NULL platform found so Exiting Application." << std::endl; return 1; } /* * If we could find our platform, use it. Otherwise use just available platform. */ cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 }; ///////////////////////////////////////////////////////////////// // Create an OpenCL context ///////////////////////////////////////////////////////////////// context = clCreateContextFromType( cps, CL_DEVICE_TYPE_GPU, NULL, NULL, &status ); if(status != CL_SUCCESS) { std::cout<<"Error: Creating Context. (clCreateContextFromType)\n"; return 1; } /* First, get the size of device list data */ status = clGetContextInfo( context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize ); if (status != CL_SUCCESS) { std::cout << "Error: Getting Context Info (device list size, clGetContextInfo)\n"; return 1; } else { std::cout << "Num available OpenCl devices = " << deviceListSize << std::endl; } if (deviceListSize == 0) { std::cout << "Error: No devices found.\n"; return 1; } ///////////////////////////////////////////////////////////////// // Detect OpenCL devices ///////////////////////////////////////////////////////////////// devices = (cl_device_id *)malloc(deviceListSize); if (devices == 0) { std::cout << "Error allocating mem for devices.\n"; return 1; } /* Now, get the device list data */ status = clGetContextInfo( context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL ); if (status != CL_SUCCESS) { std::cout << "Error: Getting Context Info (device list, clGetContextInfo)\n"; return 1; } // Print stats about the OpenCL devices // if ( query_and_print_device_info( devices, deviceListSize ) != CL_SUCCESS ) if ( query_and_print_device_info( devices, 2 ) != CL_SUCCESS ) // craps out after 2 on PC { return 1; } ///////////////////////////////////////////////////////////////// // Create an OpenCL command queue ///////////////////////////////////////////////////////////////// commandQueue = clCreateCommandQueue( context, #ifdef RUN_ON_CYPRESS devices[0], #else #ifdef RUN_ON_FIREPRO devices[1], #else devices[0], #endif #endif CL_QUEUE_PROFILING_ENABLE, // enable profiling, in-order execution &status ); if (status != CL_SUCCESS) { std::cout << "Creating Command Queue. (clCreateCommandQueue)\n"; return 1; } ///////////////////////////////////////////////////////////////// // Create OpenCL device memory buffers ///////////////////////////////////////////////////////////////// // Create image buffers on device if ( create_image_device_buffers(context) != CL_SUCCESS ) { std::cout << "Error in create_image_device_buffers" << std::endl; return 1; } // Create func 0 buffers on device if ( create_func0_device_buffers(context) != CL_SUCCESS ) { std::cout << "Error in create_func0_device_buffers" << std::endl; return 1; } // Create func1 buffers on device if ( create_func1_device_buffers(context) != CL_SUCCESS ) { std::cout << "Error in create_func1_device_buffers" << std::endl; return 1; } ///////////////////////////////////////////////////////////////// // Load CL file, build CL program object, create CL kernel object ///////////////////////////////////////////////////////////////// const char * filename = "my_Kernels.cl"; std::string sourceStr = convertFileToString(filename); const char * source = sourceStr.c_str(); size_t sourceSize[] = { strlen(source) }; const char * buildOptions = NULL; program = clCreateProgramWithSource( context, 1, &source, sourceSize, &status ); if (status != CL_SUCCESS) { std::cout << "Error: Loading Binary into cl_program (clCreateProgramWithBinary)\n"; return 1; } /* create a cl program executable for all the devices specified */ #ifdef RUN_ON_CYPRESS status = clBuildProgram(program, 1, &devices[0], buildOptions, NULL, NULL); #else #ifdef RUN_ON_FIREPRO status = clBuildProgram(program, 1, &devices[1], buildOptions, NULL, NULL); #else // Original, builds across all devices status = clBuildProgram(program, 1, devices, NULL, NULL, NULL); #endif #endif if (status != CL_SUCCESS) { std::cout << "Error: Building Program (clBuildProgram) = " << status << std::endl; return 1; } /* get a kernel object handle for a kernel with the given name */ func0_kernel = clCreateKernel(program, "func0_kernel", &status); if (status != CL_SUCCESS) { std::cout << "Error: Creating func0_kernel from program. (clCreateKernel) = " << status << std::endl; return 1; } func1_kernel = clCreateKernel(program, "func1_kernel", &status); if (status != CL_SUCCESS) { std::cout << "Error: Creating func1_kernel from program. (clCreateKernel) = " << status << std::endl; return 1; } return CL_SUCCESS; } // The enqueue is pretty standard...here's the snippet size_t globThreads[] = {imageWidth, imageHeight}; size_t locThreads[] = {16, 16}; status = clEnqueueNDRangeKernel( commandQueue, func0_kernel, 2, // num dimensions NULL, // global work offset globThreads, // global work size NULL, //locThreads, // local work size 0, // num events in wait list NULL, // ptr to event wait list &events[4] // event ); if (status != CL_SUCCESS) { std::cout << "Error: Enqueueing func0_kernel onto command queue. Status = " << status << std::endl; return 1; }

                                  • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                    genaganna

                                     

                                    Originally posted by: jbrussell In looking at the original code it looks like it builds just for device 0 since the num_devices parameter to clBuildProgram() is set to 1.  So now when I force it to only build the kernel for the device I intend to use, via a #define, it works fine for  devices[0], the 5870.  For devices[1], the V8750, clBuildProgram returns -11, CL_BUILD_PROGRAM_FAILURE.  When ported to an NVIDIA OpenCL 1.0 system this kernel builds and runs fine, so maybe there's a problem with the SDK setup?  The NVIDIA device happens to be device 0, so maybe the problem is for secondary devices?

                                     

                                    I can't show the kernel code because it's proprietary but it's really only doing some fairly simple math on a ushort image in global memory...converts to float, does some math, converts back to ushort for output.

                                     

                                    I've attached the portions of the runtime code that init and do the build...pretty standard, glommed from the template example.

                                     

                                    So, to restate the problem(s):

                                     

                                    When building only for the single device on which the kernel is to be run, clBuildProgram() returns -11  for the OpenCL 1.0 V8750 device (device 1), and returns CL_SUCCESS for the OPENCL 1.1 5870 device (device 0).

                                     



                                    When you get CL_BUILD_PROGRAM_FAILURE,  use clGetProgramBuildInfo(CL_PROGRAM_BUILD_LOG) to get build log. which should solve your problem.

                                    Paste program build log here which helps us to answer quickly.

                                    It looks like you have written code which does not support on V8750 device.

                                      • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                        jbrussell

                                        OK, now we're getting somewhere.  clGetProgramBuildInfo() returned this: 

                                        C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl(40): error: write to < 32
                                                  bits via pointer not allowed unless cl_khr_byte_addressable_store is
                                                  enabled
                                              outImg[r*dimX + c] = (ushort)temp;
                                              ^

                                        So I added this pragma line to the kernel code:

                                        #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

                                        and got this message:

                                        C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl(7): error: can't enable all OpenCL extensions or unrecognized OpenCL extension
                                          #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
                                        ^

                                        So it appears I can't enable the cl_khr_byte_addressable_store extension that it says I need.  That extension is mentioned in the 1.0 spec but not the 1.1 spec.  So it appears that the compiler either doesn't recognize the V8750 as a 1.0 device, or the compiler doesn't do 1.0 builds?

                                        Another interesting error message is this:

                                        4 errors detected in the compilation of "C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl".
                                        Internal error: compiler frontend invocation failed. Make sure ATISTREAMSDKROOT is set

                                        Looks like a leftover from the stream days.  I added a new environment variable ATISTREAMSDKROOT to point to the same location as AMDSDKROOT but still got the same error.

                                         

                                          • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                            genaganna

                                             

                                            Originally posted by: jbrussell OK, now we're getting somewhere.  clGetProgramBuildInfo() returned this: 

                                            C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl(40): error: write to < 32           bits via pointer not allowed unless cl_khr_byte_addressable_store is           enabled       outImg[r*dimX + c] = (ushort)temp;       ^

                                            So I added this pragma line to the kernel code:

                                            #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

                                            and got this message:

                                             

                                            C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl(7): error: can't enable all OpenCL extensions or unrecognized OpenCL extension   #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable ^

                                            So it appears I can't enable the cl_khr_byte_addressable_store extension that it says I need.  That extension is mentioned in the 1.0 spec but not the 1.1 spec.  So it appears that the compiler either doesn't recognize the V8750 as a 1.0 device, or the compiler doesn't do 1.0 builds?

                                            Another interesting error message is this:

                                            4 errors detected in the compilation of "C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl". Internal error: compiler frontend invocation failed. Make sure ATISTREAMSDKROOT is set

                                            Looks like a leftover from the stream days.  I added a new environment variable ATISTREAMSDKROOT to point to the same location as AMDSDKROOT but still got the same error.

                                             





                                            V8750 supports only OpenCL 1.0 and does not support cl_khr_byte_addressable_store.  Compiler is giving appropriate error.

                                            Regarding ATISTREAMSDKROOT,  Please check whether you have more than one OpenCL libraries on your system.

                                            AMDAPPSDKROOT is the right one. AMDSDKROOT is not correct one.

                                              • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                                jbrussell

                                                 

                                                "V8750 supports only OpenCL 1.0 and does not support cl_khr_byte_addressable_store.  Compiler is giving appropriate error."

                                                According to the online documentation cl_khr_byte_addressable_store is only an OpenCL 1.0 extension, i.e., not available for 1.1.  That tells me that the compiler is not recognizing the V8750 as a 1.0 device or, if it does see it as 1.0, the cl_khr_byte_addressable_store extension is not available.

                                                 

                                                 



                                                 

                                                 

                                                 

                                                 

                                                 

                                                 

                                                 

                                                 



                                                 

                                                  • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                                    genaganna

                                                     

                                                    Originally posted by: jbrussell "V8750 supports only OpenCL 1.0 and does not support cl_khr_byte_addressable_store.  Compiler is giving appropriate error."

                                                     

                                                    According to the online documentation cl_khr_byte_addressable_store is only an OpenCL 1.0 extension, i.e., not available for 1.1.  That tells me that the compiler is not recognizing the V8750 as a 1.0 device or, if it does see it as 1.0, the cl_khr_byte_addressable_store extension is not available.

                                                     



                                                     



                                                    I should reframe my statement. V8750 supports only OpenCL 1.0 core spec and does not support few extension likes cl_khr_byte_addressable_store.

                                                    In OpenCL 1.1, cl_khr-byte_addressable_store became core spec.

                                                • clEnqueueNDRangeKernel returns -45 on 2nd GPU
                                                  genaganna

                                                   

                                                  Originally posted by: jbrussell  

                                                   

                                                  Another interesting error message is this:

                                                   

                                                  4 errors detected in the compilation of "C:\Users\JBR\AppData\Local\Temp\OCL5183.tmp.cl". Internal error: compiler frontend invocation failed. Make sure ATISTREAMSDKROOT is set

                                                   





                                                  Thank you for reporting this issue. I am able to reproduce this issue at my end.  Reported to developers and will be fixed in future releases.