14 Replies Latest reply on Sep 5, 2012 2:30 AM by torandi

    OpenGL / OpenCL interop problems

    Nico83

      Hi everyone.

      I'm currently working on a project in which I create a VBO with OpenGL and then fill it with OpenCL. Finally, I render this VBO with glDrawArrays. I create the CL buffer with clCreateFromGLBuffer. I'm working on XP 64 SP2 with VS2008 and a Radeon HD 5850. I've downloaded the latest graphics drivers, the latest OpenCL drivers and the latest Stream framework.

      My program runs, but with ridiculous performance. I suspect a problem with the CL/GL interop.

      - I have carefully read the examples in the Stream SDK: the only difference is that I create my GL context using MFC, while the SimpleGL example creates it using GLUT. Is there any known issue with using MFC and OpenCL?

      - I should mention that I first developed my code on the NVIDIA framework: it works perfectly there, without any interop problem. Since OpenCL is supposed to be portable, I really don't understand why this difference occurs. Is OpenCL on ATI still in beta?

      - A last question: after installing the drivers, there is no OpenCL.dll, while at runtime every program (mine and those of the SDK) looks for opencl.dll. Surprisingly, there is an atiocl64.dll in C:\Program Files (x86)\ATI Stream\bin\x86_64: when I rename this DLL to opencl.dll, the examples in the Stream SDK work perfectly. Is that normal?

      Thanks in advance. If it helps, I can post some code ...

          genaganna

           

          Originally posted by: Nico83

          Hi everyone.

          I'm currently working on a project in which I create a VBO with OpenGL and then fill it with OpenCL. Finally, I render this VBO with glDrawArrays. I create the CL buffer with clCreateFromGLBuffer. I'm working on XP 64 SP2 with VS2008 and a Radeon HD 5850. I've downloaded the latest graphics drivers, the latest OpenCL drivers and the latest Stream framework.

          My program runs, but with ridiculous performance. I suspect a problem with the CL/GL interop.

          - I have carefully read the examples in the Stream SDK: the only difference is that I create my GL context using MFC, while the SimpleGL example creates it using GLUT. Is there any known issue with using MFC and OpenCL?

          Please send your code to streamdeveloper@amd.com

          - I should mention that I first developed my code on the NVIDIA framework: it works perfectly there, without any interop problem. Since OpenCL is supposed to be portable, I really don't understand why this difference occurs. Is OpenCL on ATI still in beta?

          - A last question: after installing the drivers, there is no OpenCL.dll, while at runtime every program (mine and those of the SDK) looks for opencl.dll. Surprisingly, there is an atiocl64.dll in C:\Program Files (x86)\ATI Stream\bin\x86_64: when I rename this DLL to opencl.dll, the examples in the Stream SDK work perfectly. Is that normal?

          OpenCL.dll is in your system32 or SysWOW64 folder. How is it working when you rename atiocl64.dll to opencl.dll?

          Which driver and which SDK are you using?

              nou

              Do you create the VBO after the OpenCL context?

              Do you call clCreateFromGLBuffer() only once?

                  Nico83

                  I'm sorry, but I can't publish the whole code. I can, however, give you the OpenGL init and the creation of the VBO.

                  On NVIDIA CL, I first created the VBO with OpenGL, then created the CL context, and afterwards created the CL buffer from the VBO. Since I saw that the examples in the Stream SDK first create the CL context and then create the GL VBO and the CL VBO, I changed the order of creation, but it made no difference. Is the order important (it is not on NVIDIA)?

                  Moreover, there is no opencl.dll in system32 (in fact, there is no opencl.dll anywhere). OpenCL.lib resolves its functions from opencl.dll, which is searched for in the directories listed in the PATH variable: since C:\Program Files (x86)\ATI Stream\bin\x86_64 contains the atiocl64.dll that I renamed to opencl.dll, it seems to work. Do you think the issue comes from the CL installation?

                  // TODO: Add your specialized creation code here
                  CClientDC dc(this);

                  //
                  // Fill in the pixel format descriptor.
                  //
                  // for performance purpose
                  _Timer.Start();

                  static PIXELFORMATDESCRIPTOR pfd;
                  memset(&pfd, 0, sizeof(PIXELFORMATDESCRIPTOR));
                  pfd.nSize = sizeof(PIXELFORMATDESCRIPTOR);
                  pfd.nVersion = 1;
                  pfd.dwFlags = PFD_DOUBLEBUFFER | PFD_SUPPORT_OPENGL | PFD_DRAW_TO_WINDOW;
                  pfd.iPixelType = PFD_TYPE_RGBA;
                  pfd.cColorBits = 24;
                  pfd.cDepthBits = 32;
                  pfd.iLayerType = PFD_MAIN_PLANE;
                  pfd.cStencilBits = 1;

                  int nPixelFormat = ChoosePixelFormat(dc.GetSafeHdc(), &pfd);
                  if (nPixelFormat == 0)
                  {
                      TRACE("ChoosePixelFormat Failed %d\r\n", GetLastError());
                      return -1;
                  }
                  TRACE("Pixel Format %d\r\n", nPixelFormat);

                  BOOL bResult = SetPixelFormat(dc.GetSafeHdc(), nPixelFormat, &pfd);
                  if (!bResult)
                  {
                      TRACE("SetPixelFormat Failed %d\r\n", GetLastError());
                      return -1;
                  }

                  //
                  // Create a rendering context.
                  //
                  m_hrc = wglCreateContext(dc.GetSafeHdc());
                  if (!m_hrc)
                  {
                      TRACE("wglCreateContext Failed %x\r\n", GetLastError());
                      return -1;
                  }
                  wglMakeCurrent(dc.GetSafeHdc(), m_hrc);

                  glGenBuffers(1, &myVBO);
                  glBindBuffer(GL_ARRAY_BUFFER, myVBO);
                  glBufferData(GL_ARRAY_BUFFER, SIZE, 0, GL_DYNAMIC_DRAW);
                  glBindBuffer(GL_ARRAY_BUFFER, 0);
                  glFinish();

                  vbo_cl = clCreateFromGLBuffer(_CLcontext, CL_MEM_READ_WRITE, myVBO, &errcode_ret);

                      genaganna

                       

                      Originally posted by: Nico83

                      I'm sorry, but I can't publish the whole code. I can, however, give you the OpenGL init and the creation of the VBO.

                      On NVIDIA CL, I first created the VBO with OpenGL, then created the CL context, and afterwards created the CL buffer from the VBO. Since I saw that the examples in the Stream SDK first create the CL context and then create the GL VBO and the CL VBO, I changed the order of creation, but it made no difference. Is the order important (it is not on NVIDIA)?

                      Yes, order is very important in the ATI SDK.

                      Moreover, there is no opencl.dll in system32 (in fact, there is no opencl.dll anywhere). OpenCL.lib resolves its functions from opencl.dll, which is searched for in the directories listed in the PATH variable: since C:\Program Files (x86)\ATI Stream\bin\x86_64 contains the atiocl64.dll that I renamed to opencl.dll, it seems to work. Do you think the issue comes from the CL installation?

                      Which SDK are you using? Could you please run CLInfo and send the log it prints on the command line?

                          Nico83

                          OK, thanks. Can you tell me the right order for creating a VBO with interop? I tried:

                          init cl context -> generate gl vbo -> generate cl vbo

                          generate gl vbo -> init cl context -> generate cl vbo

                          But neither works any better.

                          clinfo gave me the following:

                          Number of platforms: 1
                          Platform Profile: FULL_PROFILE
                          Platform Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
                          Platform Name: ATI Stream
                          Platform Vendor: Advanced Micro Devices, Inc.
                          Platform Extensions: cl_khr_icd cl_amd_event_callback

                          Platform Name: ATI Stream
                          Number of devices: 2

                          Device Type: CL_DEVICE_TYPE_CPU
                          Device ID: 4098
                          Max compute units: 8
                          Max work items dimensions: 3
                            Max work items[0]: 1024
                            Max work items[1]: 1024
                            Max work items[2]: 1024
                          Max work group size: 1024
                          Preferred vector width char: 16
                          Preferred vector width short: 8
                          Preferred vector width int: 4
                          Preferred vector width long: 2
                          Preferred vector width float: 4
                          Preferred vector width double: 0
                          Max clock frequency: 2993Mhz
                          Address bits: 64
                          Max memory allocation: 1073741824
                          Image support: No
                          Max size of kernel argument: 4096
                          Alignment (bits) of base address: 1024
                          Minimum alignment (bytes) for any datatype: 128
                          Single precision floating point capability
                            Denorms: Yes
                            Quiet NaNs: Yes
                            Round to nearest even: Yes
                            Round to zero: Yes
                            Round to +ve and infinity: Yes
                            IEEE754-2008 fused multiply-add: No
                          Cache type: Read/Write
                          Cache line size: 64
                          Cache size: 32768
                          Global memory size: 3221225472
                          Constant buffer size: 65536
                          Max number of constant args: 8
                          Local memory type: Global
                          Local memory size: 32768
                          Profiling timer resolution: 0
                          Device endianess: Little
                          Available: Yes
                          Compiler available: Yes
                          Execution capabilities:
                            Execute OpenCL kernels: Yes
                            Execute native function: Yes
                          Queue properties:
                            Out-of-Order: No
                            Profiling : Yes
                          Platform ID: 00000001808E3568
                          Name: Intel(R) Xeon(R) CPU X5450 @ 3.00GHz
                          Vendor: GenuineIntel
                          Driver version: 2.0
                          Profile: FULL_PROFILE
                          Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
                          Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_printf

                          Device Type: CL_DEVICE_TYPE_GPU
                          Device ID: 4098
                          Max compute units: 18
                          Max work items dimensions: 3
                            Max work items[0]: 256
                            Max work items[1]: 256
                            Max work items[2]: 256
                          Max work group size: 256
                          Preferred vector width char: 16
                          Preferred vector width short: 8
                          Preferred vector width int: 4
                          Preferred vector width long: 2
                          Preferred vector width float: 4
                          Preferred vector width double: 0
                          Max clock frequency: 725Mhz
                          Address bits: 32
                          Max memory allocation: 134217728
                          Image support: Yes
                          Max number of images read arguments: 128
                          Max number of images write arguments: 8
                          Max image 2D width: 8192
                          Max image 2D height: 8192
                          Max image 3D width: 2048
                          Max image 3D height: 2048
                          Max image 3D depth: 2048
                          Max samplers within kernel: 16
                          Max size of kernel argument: 1024
                          Alignment (bits) of base address: 32768
                          Minimum alignment (bytes) for any datatype: 128
                          Single precision floating point capability
                            Denorms: No
                            Quiet NaNs: Yes
                            Round to nearest even: Yes
                            Round to zero: Yes
                            Round to +ve and infinity: Yes
                            IEEE754-2008 fused multiply-add: Yes
                          Cache type: None
                          Cache line size: 0
                          Cache size: 0
                          Global memory size: 536870912
                          Constant buffer size: 65536
                          Max number of constant args: 8
                          Local memory type: Scratchpad
                          Local memory size: 32768
                          Profiling timer resolution: 1
                          Device endianess: Little
                          Available: Yes
                          Compiler available: Yes
                          Execution capabilities:
                            Execute OpenCL kernels: Yes
                            Execute native function: No
                          Queue properties:
                            Out-of-Order: No
                            Profiling : Yes
                          Platform ID: 00000001808E3568
                          Name: Cypress
                          Vendor: Advanced Micro Devices, Inc.
                          Driver version: CAL 1.4.739
                          Profile: FULL_PROFILE
                          Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
                          Extensions: cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops

                          Passed!

                              laobrasuca

                              May I ask how large your OpenGL buffer objects are? Do they take more than 256MB together? I'm asking because I have a GL/CL application myself with extremely low performance, and I seriously suspect it is related to the amount of data I keep in VRAM. As soon as I use more than 256MB of data (not necessarily in one single buffer), my performance drops badly. Note, though, that my kernels still run as fast as usual; it is the time CL takes to create buffers (from GL or not) that is awfully long, and only on the AMD card, not on NVIDIA's.

                                  laobrasuca

                                  Nico, how about your buffer sizes?

                                      laobrasuca

                                      About performance: I did some GL/CL buffer creation/acquisition tests and noticed that the bigger the buffer, the slower the clCreateFromGLBuffer/clEnqueueAcquireGLObjects process. Example: 2MB takes 6ms, 20MB takes 41ms, 200MB takes 450ms, 1GB takes 2540ms... (HD5770 card). It seems that a data copy is being done somehow/somewhere during this process, instead of a simple "permission change" on buffer ownership. The same test on an NVIDIA card always results in 0ms, no matter the buffer size.
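To put the numbers quoted above in perspective, here is a small sketch (plain C++, no OpenCL calls) that converts each size/time pair into an effective rate. The struct and function names are made up for illustration, and "1GB" is assumed to mean 1024MB. The four rates all come out between roughly 330 and 490 MB/s, so the time grows about linearly with the size. That is what a copy would look like, although it does not prove that one is happening.

```cpp
#include <cassert>
#include <cstdio>

// One measurement from the post above: buffer size in MB and the time
// reported for the create/acquire step in milliseconds.
struct Sample {
    double size_mb;
    double time_ms;
};

// Effective rate in MB/s if the whole buffer were touched exactly once.
double effective_mb_per_s(const Sample& s) {
    assert(s.time_ms > 0.0);
    return s.size_mb / (s.time_ms / 1000.0);
}

// Figures quoted for the HD5770; "1GB" is taken as 1024 MB (assumption).
const Sample kSamples[] = {
    {2.0, 6.0}, {20.0, 41.0}, {200.0, 450.0}, {1024.0, 2540.0}};

// Prints one line per sample: size, time and the derived rate.
void print_rates() {
    for (const Sample& s : kSamples) {
        std::printf("%6.0f MB in %5.0f ms -> %4.0f MB/s\n",
                    s.size_mb, s.time_ms, effective_mb_per_s(s));
    }
}
```

Calling `print_rates()` from `main` lists the four rates; a true ownership hand-off would instead show a time that stays flat as the size grows.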

                                          nou

                                          laobrasuca: do you measure the clEnqueueAcquireGLObjects() time alone? Maybe clCreateFromGLBuffer() does some copy to a shared area.

                                          One more thing: make sure that you create all OpenGL resources AFTER creating the OpenCL context.

                                              laobrasuca

                                              My bad, I did create the GL resources BEFORE the CL context. I corrected it and the times are less scary, but still not perfect. I do the following:

                                              create gl context (glut/glew)

                                              create cl context (platform, context, device, commandqueue, in this order. no programs/kernels whatsoever)

                                              create gl buffers (genbuffer/bindbuffer/bufferdata, allocate 200MB for ibo + 100MB for vbo)

                                              create/acquire cl buffers. Let me detail this last one:

                                              - clCreateFromGLBuffer for ibo

                                              - clEnqueueAcquireGLObjects for ibo

                                              - clCreateFromGLBuffer for vbo

                                              - clEnqueueAcquireGLObjects for vbo

                                              - clFinish.

                                              Now, running this and measuring elapsed time (using time.h), I narrow it down to:

                                              clCreateFromGLBuffer (ibo) :0 secs.
                                              clEnqueueAcquireGLObjects (ibo) :0.001 secs.
                                              clCreateFromGLBuffer (vbo) :0 secs.
                                              clEnqueueAcquireGLObjects (vbo) :0.387 secs.
                                              clFinish :0.073 secs.

                                              Tomorrow I'll change the buffer sizes to see what happens (I'm not able to do it now).
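A note on measuring: the clEnqueue* calls are allowed to return before the queued work has finished, so a per-call breakdown like the one above may charge the cost to whichever later call happens to wait. A minimal sketch of a timer helper, assuming C++11; `time_step_seconds` is a made-up name, and the OpenCL lines in the trailing comment use hypothetical `queue` and `vbo_cl` handles that are not taken from the posts.

```cpp
#include <chrono>
#include <functional>

// Runs `step` once and returns the elapsed wall-clock time in seconds.
// steady_clock is monotonic, so the result is never negative even if
// the system time is adjusted during the measurement.
double time_step_seconds(const std::function<void()>& step) {
    const auto start = std::chrono::steady_clock::now();
    step();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical usage (not compiled here; `queue` and `vbo_cl` are assumed
// to be a valid cl_command_queue and cl_mem):
//
//   double t = time_step_seconds([&] {
//       clEnqueueAcquireGLObjects(queue, 1, &vbo_cl, 0, NULL, NULL);
//       clFinish(queue);  // wait here so the cost is charged to this step
//   });
```

Putting a clFinish inside each timed step makes the steps comparable, at the price of serializing work that would otherwise overlap.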

                                                  nou

                                                  Hmm, strange. What size does OpenCL report as global memory? I ask because the second buffer takes 387ms. IIRC someone said he observed a slowdown when he crossed the OpenCL global memory limit.

                                                      himanshu.gautam

                                                      hi all,

                                                      The best way is to query the device using the device info flags, and then allocate buffers and select the workgroup size accordingly.

                                                          laobrasuca

                                                           

                                                          Originally posted by: himanshu.gautam

                                                          hi all,

                                                          The best way is to query the device using the device info flags, and then allocate buffers and select the workgroup size accordingly.

                                                          In my case I don't even select a workgroup size, since the buffer problem occurs independently of any program build or kernel execution.

                                                          Originally posted by: nou

                                                          Hmm, strange. What size does OpenCL report as global memory? I ask because the second buffer takes 387ms. IIRC someone said he observed a slowdown when he crossed the OpenCL global memory limit.

                                                          The size reported is 1073741824 for CL_DEVICE_GLOBAL_MEM_SIZE and 268435456 for CL_DEVICE_MAX_MEM_ALLOC_SIZE (I use GPU_MAX_HEAP_SIZE 100 and GPU_STAGING_BUFFER_SIZE 2048). A few more examples:

                                                          IBO size: 200400000
                                                          VBO size: 200400000
                                                          clCreateFromGLBuffer (ibo) :0 secs.
                                                          clEnqueueAcquireGLObjects (ibo) :0 secs.
                                                          clCreateFromGLBuffer (vbo) :0.001 secs.
                                                          clEnqueueAcquireGLObjects (vbo) :0.497 secs.
                                                          clFinish :0.097 secs.

                                                          --------------------------------------------------------------

                                                          IBO size: 320400000
                                                          VBO size: 100200000
                                                          clCreateFromGLBuffer (ibo) :0.001 secs.
                                                          clEnqueueAcquireGLObjects (ibo) :0 secs.
                                                          clCreateFromGLBuffer (vbo) :0 secs.
                                                          clEnqueueAcquireGLObjects (vbo) :0.55 secs.
                                                          clFinish :0.101 secs.

                                                          --------------------------------------------------------------

                                                          IBO size: 100200000
                                                          VBO size: 320400000
                                                          clCreateFromGLBuffer (ibo) :0.001 secs.
                                                          clEnqueueAcquireGLObjects (ibo) :0.001 secs.
                                                          clCreateFromGLBuffer (vbo) :0 secs.
                                                          clEnqueueAcquireGLObjects (vbo) :0.229 secs.
                                                          clFinish :0.101 secs.

                                                          --------------------------------------------------------------

                                                          IBO size: 320400000
                                                          VBO size: 10020000
                                                          clCreateFromGLBuffer (ibo) :0 secs.
                                                          clEnqueueAcquireGLObjects (ibo) :0 secs.
                                                          clCreateFromGLBuffer (vbo) :0.001 secs.
                                                          clEnqueueAcquireGLObjects (vbo) :0.237 secs.
                                                          clFinish :0.08 secs.

                                                          --------------------------------------------------------------

                                                          IBO size: 10020000
                                                          VBO size: 320400000
                                                          clCreateFromGLBuffer (ibo) :0 secs.
                                                          clEnqueueAcquireGLObjects (ibo) :0 secs.
                                                          clCreateFromGLBuffer (vbo) :0 secs.
                                                          clEnqueueAcquireGLObjects (vbo) :0.024 secs.
                                                          clFinish :0.201 secs.

                                                           

                                                          Any clue as to why this is so, or how to overcome it? In addition, if I run the kernels in this context, the kernel execution time increases severely compared to when manipulating only the IBO or only the VBO.
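As a sanity check on the figures in the post above, the sketch below compares each IBO/VBO pair against the two reported limits. The names are hypothetical and the limits are hard-coded from the post rather than queried from a device. By these numbers, none of the five runs crosses the reported global memory size, the 320400000-byte buffers are larger than the reported CL_DEVICE_MAX_MEM_ALLOC_SIZE, and the first run (two 200400000-byte buffers) exceeds neither limit yet still shows the slow acquire. So the two limits alone do not seem to explain the timings. Whether the max-alloc limit applies to buffers created with clCreateFromGLBuffer at all is not something this thread establishes.

```cpp
#include <cstdint>

// Limits as reported in the post above, in bytes (hard-coded, not queried).
const std::uint64_t kGlobalMemSize   = 1073741824ULL;  // CL_DEVICE_GLOBAL_MEM_SIZE
const std::uint64_t kMaxMemAllocSize = 268435456ULL;   // CL_DEVICE_MAX_MEM_ALLOC_SIZE

// Which of the reported limits a given IBO/VBO pair exceeds.
struct BudgetReport {
    bool ibo_over_max_alloc;
    bool vbo_over_max_alloc;
    bool total_over_global;
};

// Compares each buffer against the max-alloc limit and their sum
// against the global memory size.
BudgetReport check_budget(std::uint64_t ibo_bytes, std::uint64_t vbo_bytes) {
    BudgetReport r;
    r.ibo_over_max_alloc = ibo_bytes > kMaxMemAllocSize;
    r.vbo_over_max_alloc = vbo_bytes > kMaxMemAllocSize;
    r.total_over_global  = ibo_bytes + vbo_bytes > kGlobalMemSize;
    return r;
}
```

For example, `check_budget(320400000, 100200000)` flags only the IBO, while `check_budget(200400000, 200400000)` flags nothing.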

                                    torandi

                                    Reviving this old thread since I'm having this exact same problem. clEnqueueAcquireGLObjects is extremely slow on ATI cards for buffers larger than a couple of KB. I don't have that problem on NVIDIA cards.

                                    I don't have exact timings, but I have tested running only the acquire/release of GL objects (without running the kernel at all), and then without them too. The difference is a jump from 40 fps (with acquire/release) to 80 fps without.

                                    The shared buffer is 500kb and the total size of the cl buffers is 1MB.
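In frame-time terms, the 40 fps versus 80 fps figures can be turned into a per-frame cost. A tiny sketch with made-up helper names, assuming nothing else changed between the two runs:

```cpp
// Converts a frame rate into the duration of one frame, in milliseconds.
double frame_time_ms(double fps) {
    return 1000.0 / fps;
}

// Extra time per frame when the rate drops from `fps_without` to `fps_with`.
double added_cost_ms(double fps_with, double fps_without) {
    return frame_time_ms(fps_with) - frame_time_ms(fps_without);
}
```

`added_cost_ms(40.0, 80.0)` gives 12.5, i.e. about 12.5 ms per frame attributable to the acquire/release pair under that assumption.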

                                     

                                    Edit: Okay, this was weird. I rebooted, and now I don't seem to have this problem. Might have been leaked memory on the card or something, maybe?