Archives Discussions

Kieren · ‎03-01-2011

GL_INTEROP is not working if OpenCL and OpenGL contexts are on differed devices

Hi there

I'm trying to get the best performance out of my multi-GPU system.

The idea was to have both GPUs work at the same time

There for I tired to modify the oclSimpleGl example form the NVIDIA developer page in the way that one GPU is doing the OpenGL load an the other GPU is modifying the VBO via OpenCL.

As long as I'm not using GL_INTEROP everything is working fine.

The first GPU (OpenGL) is on 50% load and the second GPU(OpenCL) 25%.

Without GL_ INTEROP the Data has to be copied in each update cycle.

It is really slow only ~280fps.

With only one GPU I get also 280fps with 70% GPU load.
(GPU laod checkt with GPU-Z)

So no the fun stuff.

Enabling GL_INTEROP and using only the first GPU leads to round about 1200fps (!) with 54% load. Yes really 1,2k fps.

And now the problem.

Using GL_INTEROP with two GPUs sounds like a good idea.

But it seams as the Buffer is not transferred between the GPUs.

Both GPUs ware working. GPU0(OpenGL) 29%, GPU1(OpenCL) 25% and still the windows says 1,2 fps but stays black.

If I check the buffer via

glBindBufferARB(GL_ARRAY_BUFFER, vbo);

float* ptr = (float*)glMapBufferARB(GL_ARRAY_BUFFER, GL_WRITE_ONLY_ARB);

glUnmapBuffer(GL_ARRAY_BUFFER);

the content stays at 0.0 all over the array.

Neither OpenGL nor OpenCL are reporting any errors. The OCLcontext is created with no problems clEnqueueAcquireGLObjects works fine.

Am I doing anything wrong? Or is the driver not aware that the data it modifies is on the other device.

It looks a little like clCreateFromGLBuffer() does not check if the OpenGl and OpenCL contexts are on the same device. And just a pointer to the actual position of the OpenGL-Buffer is promoted not the device. As there are two identical GPUs the pointer is reachable and some random memory on GPU1(OpenCL) is modified.

Thank you for any suggestions how to fix this problem.

Regards

Kieren

-----------------------------
System Info:
Alienware M17XR2
Two ATI Mobility Radeon HD 5870 (CrossFireX enabled)
Intel(R) i7 CPU M620

(I'm german so sorry for the bad english )

laobrasuca · ‎03-01-2011

it should be easy for you to set interop when using different devices of a same platform. After creating the GL context using a given device, you should be able to create a CL context from the GL one for all GPU devices of the same platform:

    cl_context_properties cpsGL[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
                                    CL_WGL_HDC_KHR,                            glCurrentDC,
                                    CL_GL_CONTEXT_KHR,                        glCtx,
                                    0};

context = clCreateContextFromType(cprops, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);

i can't see why you're facing any problem.

However, bear in mind that while CL has the hands on the vbo's, you can't use GL to manipulate them. You've always got to release GL buffers before using gl commands. By the way, that's the reason why i can't see why you plan to use 2 cards, one for GL and another for CL in parallel, since you can't use GL buffers while CL is using them. It will certainly be slower than using one single card (due to memory exchanges through pci-e bus), unless you have enough things to do on the GL side while you are modifying buffers on the CL side.

Anyways, do not forget to use glFinish or clFinish in order to synchronize things depending on what you're doing.

Kieren · ‎03-02-2011

Hi

thank you for the option list. I already had the exact same parameters.

Why I try to use 2 GUPs. Simple because we can and we should do so.
There will be enough parallel work for both GPUs in my project.
The oclSimpleGL is just a test case.
Later OpenCL will be used to improve parts of the created frames via “Ray Tracing”.

I'll attach the imported parts of the oclSimpleGl code that lead too my problem.

I'm still not sure if it is a bug in the ATI/AMD OpenCL implementation, that there is no error output when trying to use clEnqueueAcquireGLObjects() on different devices. Or, that if is allowed to do so, no data is transferred between the two GPUs.

The code is nearly unchanged and form the oclSimpleGL example as linked on the NVIDIA page.

// Initialize GL //***************************************************************************** void InitGL(int* argc, char** argv) { // initialize GLUT glutInit(argc, argv); glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE); glutInitWindowPosition (glutGet(GLUT_SCREEN_WIDTH)/2 - window_width/2, glutGet(GLUT_SCREEN_HEIGHT)/2 - window_height/2); glutInitWindowSize(window_width, window_height); iGLUTWindowHandle = glutCreateWindow("OpenCL/GL Interop (VBO)"); // register GLUT callback functions glutDisplayFunc(DisplayGL); glutKeyboardFunc(KeyboardGL); glutMouseFunc(mouse); glutMotionFunc(motion); // initialize necessary OpenGL extensions glewInit(); GLboolean bGLEW = glewIsSupported("GL_VERSION_2_0 GL_ARB_pixel_buffer_object"); shrCheckErrorEX(bGLEW, shrTRUE, pCleanup); // default initialization glClearColor(0.0, 0.0, 0.0, 1.0); glDisable(GL_DEPTH_TEST); // viewport glViewport(0, 0, window_width, window_height); // projection glMatrixMode(GL_PROJECTION); glLoadIdentity(); gluPerspective(60.0, (GLfloat)window_width / (GLfloat) window_height, 0.1, 10.0); // set view matrix glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); glMatrixMode(GL_MODELVIEW); glLoadIdentity(); glTranslatef(0.0, 0.0, translate_z); glRotatef(rotate_x, 1.0, 0.0, 0.0); glRotatef(rotate_y, 0.0, 1.0, 0.0); return; } // Create Context //***************************************************************************** ... cl_context_properties props[] = { CL_GL_CONTEXT_KHR, (cl_context_properties)wglGetCurrentContext(), CL_WGL_HDC_KHR, (cl_context_properties)wglGetCurrentDC(), CL_CONTEXT_PLATFORM, (cl_context_properties)cpPlatform, 0 }; cxGPUContext = clCreateContext(props, 1, &cdDevices[uiDeviceUsed], NULL, NULL, &ciErrNum); ... // Create VBO //***************************************************************************** void createVBO(GLuint* vbo) { // create VBO unsigned int size = mesh_width * mesh_height * 4 * sizeof(float); // create buffer object glGenBuffers(1, vbo); glBindBuffer(GL_ARRAY_BUFFER, *vbo); // initialize buffer object glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW); // create OpenCL buffer from GL VBO vbo_cl = clCreateFromGLBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, *vbo, &ciErrNum); } // Run the OpenCL part of the computation //***************************************************************************** void runKernel() { ciErrNum = CL_SUCCESS; // map OpenGL buffer object for writing from OpenCL glFinish(); ciErrNum = clEnqueueAcquireGLObjects(cqCommandQueue, 1, &vbo_cl, 0,0,0); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // Set arg 3 and execute the kernel ciErrNum = clSetKernelArg(ckKernel, 3, sizeof(float), &anim); ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0,0,0 ); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // unmap buffer object ciErrNum = clEnqueueReleaseGLObjects(cqCommandQueue, 1, &vbo_cl, 0,0,0); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); clFinish(cqCommandQueue); } // Display callback //***************************************************************************** void DisplayGL() { ... // run OpenCL kernel to generate vertex positions runKernel(); // clear graphics then render from the vbo glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); glBindBuffer(GL_ARRAY_BUFFER, vbo); glVertexPointer(4, GL_FLOAT, 0, 0); glEnableClientState(GL_VERTEX_ARRAY); glColor3f(1.0, 0.0, 0.0); glDrawArrays(GL_POINTS, 0, mesh_width * mesh_height); glDisableClientState(GL_VERTEX_ARRAY); // flip backbuffer to screen glutSwapBuffers(); glutPostRedisplay(); ... }

laobrasuca · ‎03-02-2011

well, i don't see any problem with your partial code, and the fact that it works when using CL/GL iterop for the main GPU seems to confirm it. Did you try to use the CPU as the second device (the CL device)? Does it work (even if very slowly)? What about the crossfire, do you have it enabled or not? Try both cases.

Can you try the following: download GPUCapsViewer 1.10 http://www.geeks3d.com/20110208/gpu-caps-viewer-1-10-0-geforce-gt-440-improved-opencl-support/ and run any of the CL samples in the second GPU? It should work (at least it did for me). Can you also run it using the primary HD5870 and see if you have any difference in performance?

By the way, at this matter, I wrote an e-mail to Jerome (JeGX) asking how they set CL/GL iterop with the second card, but no answers as of now. Me, I haven't succeed on using GL/CL on my second GPU, NVIDIA one, while the main GPU is a Radeon. But GPUCapsViewer 1.10 does it, and very well! Like, setting the NVIDIA card as the main one (and taking out the Radeon from my system), the Julia CL GPU demo runs at 67.80 fps average. (several runs of 1 min bench using Fraps) But when I let Radeon on my system as primary card, and use the same NVIDIA card, I get 75.50fps average! (several runs of 1 min bench using Fraps) The GeForce is at 70%-75% load in both cases while the Radeon is at 20%-25% when used together (GPU-Z). This is pretty much what you expect as result. It's pretty obvious that the Radeon takes the GL part of it (drawing and stuff) while the NVIDIA process the CL part, but I ignore how the GL context of the Radeon is shared with the GeForce. They maybe use some sort of wglShareLists, but it's just a guess. If anyone here knows how GPUCapsViewer does the job, please let us know!

nou · ‎03-02-2011

try look at section 9.7.5

you should be able to query which device is belong to some OpenGL context.

laobrasuca · ‎03-03-2011

hehey, thx nou! I finally have an answers concerning the -1000 error code I have when creating CL context for the NV card from the GL context (which is likely from the Radeon). There's this new token now CL_INVALID_GL_SHAREGROUP_REFERENCE_KHR with value -1000. I'll read it an try to see if I can do something about it. Thx for the tip

edit: it is also present on the 1.0 CL version, but since it is not in the cl.h file, but in the cl_gl.h instead, I had never found this value. Besides, it does not appear in the error list of the clCerateContext/clCerateContextFromType on section 4.3 (it should maybe be referenced there somehow).

nou · ‎03-03-2011

i can say that if you use OpenGL interoperability AMD implementation create additional shared OpenGL context. most likely it use this shared context to upload/download data between OpenGL context and OpenCL CPU device.

and nVidia most likely don't implement such workaround.

laobrasuca · ‎03-03-2011

Originally posted by: nou i can say that if you use OpenGL interoperability AMD implementation create additional shared OpenGL context. most likely it use this shared context to upload/download data between OpenGL context and OpenCL CPU device.

not impossible indeed. But why the OP could not use the second GPU? Have you already tried to use a second AMD GPU to do CL processing while the GL part is on the primary AMD card?

and nVidia most likely don't implement such workaround.

well, in the case where 2 of their GPUs are mounted in a same system this kind of solution would be necessary. Or a more flexible solution, as they try to do with the WGL_nv_gpu_affinity extension.

Still, these are intra platform solutions. I'd like to know what GPUCapsViewer uses as solution, given that it is for Windows platform, my second card is not of the same platform as the primary one and that the display to which it is connected is not activated. I'll try to create a second DC for the secondary card (probably need to play with EnumDisplayDevices and CreateDC) and use the same Rendering Context of the primary card to make it the current context (hoping that the pixel format of the primary DC will also be supported by the second one). Or I'll maybe have to create a second Rendering context and share the first one with the second one (wglsharelists), but I don't if it will be possible with no enabled display connected to the card.

Kieren · ‎03-07-2011

Hi again,

thank you for the help.

I tried GL_INTEROP with my CPU and it works fine.

On the 2*Nvidia-GPU system at university it worked fine, too.

Can someone pleas try to check my test case on a dual AMD-GPU system.

If it is possible to check it on a multi “Mobility Radeon HD” system it will help me even more.

I extended the oclSimpleGL a bit.

You shout be able to select the device the OpenCL kernel runs on. Pleas try to find the GPU the OpenGL context is not bound to. If the sin-wave is moving everything is OK if not you have the same problem I have

http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclSimpleGL

Just download the example form Nvidia an replace the content of the cpp file with my code attached.

Originally posted by: nou try look at section 9.7.5

you should be able to query which device is belong to some OpenGL context.

Thx very helpful tip.

Edit: is clGetGLContextInfoKHR() still not in the .lib file?

I found this as i faced a linker problem with this function.

/* * Copyright 1993-2010 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ /* This example demonstrates how to use the OpenCL/OpenGL interoperability to dynamically modify a vertex buffer using a OpenCL kernel. The steps are: 1. Create an empty vertex buffer object (VBO) 2. Create an OpenCL memory object from the vertex buffer object 3. Acquire the VBO for writing from OpenCL 4. Run OpenCL kernel to modify the vertex positions 5. Release the VBO for returning ownership to OpenGL 6. Render the results using OpenGL Host code */ #include <iostream> // Utilities, OpenCL and system includes #include <oclUtils.h> // GLEW and GLUT includes #include <GL/glew.h> #if defined (__APPLE__) || defined(MACOSX) #include <GLUT/glut.h> #else #include <GL/glut.h> #endif #ifdef UNIX #if defined (__APPLE__) || defined(MACOSX) #include <OpenGL/OpenGL.h> #include <GLUT/glut.h> #else #include <GL/glx.h> #endif #endif #if defined (__APPLE__) || defined(MACOSX) #define GL_SHARING_EXTENSION "cl_APPLE_gl_sharing" #else #define GL_SHARING_EXTENSION "cl_khr_gl_sharing" #endif // Constants, defines, typedefs and global declarations //***************************************************************************** // Rendering window vars const unsigned int window_width = 512; const unsigned int window_height = 512; const unsigned int mesh_width = 256; const unsigned int mesh_height = 256; // OpenCL vars cl_platform_id cpPlatform; cl_context cxGPUContext; cl_device_id* cdDevices; cl_uint uiDevCount; cl_command_queue cqCommandQueue; cl_kernel ckKernel; cl_mem vbo_cl; cl_program cpProgram; cl_int ciErrNum; char* cPathAndName = NULL; // var for full paths to data, src, etc. char* cSourceCL = NULL; // Buffer to hold source for compilation size_t szGlobalWorkSize[] = {mesh_width, mesh_height}; // vbo variables GLuint vbo; int iGLUTWindowHandle = 0; // handle to the GLUT window // mouse controls int mouse_old_x, mouse_old_y; int mouse_buttons = 0; float rotate_x = 0.0, rotate_y = 0.0; float translate_z = -3.0; // Sim and Auto-Verification parameters float anim = 0.0; int iFrameCount = 0; // FPS count for averaging int iFrameTrigger = 90; // FPS trigger for sampling int iFramesPerSec = 0; // frames per second int iTestSets = 3; int g_Index = 0; shrBOOL bQATest = shrFALSE; shrBOOL bNoPrompt = shrFALSE; // Forward Function declarations //***************************************************************************** // OpenCL functionality void runKernel(); void saveResultOpenCL(int argc, const char** argv, const GLuint& vbo); // GL functionality void InitGL(int* argc, char** argv); void createVBO(GLuint* vbo); void DisplayGL(); void KeyboardGL(unsigned char key, int x, int y); void mouse(int button, int state, int x, int y); void motion(int x, int y); // Helpers void TestNoGL(); void Cleanup(int iExitCode); void (*pCleanup)(int) = &Cleanup; // Main program //***************************************************************************** int main(int argc, char** argv) { // start logs shrSetLogFileName ("oclSimpleGL.txt"); shrLog("%s Starting...\n\n", argv[0]); // check command line args if (argc > 1) { bQATest = shrCheckCmdLineFlag(argc, (const char**)argv, "qatest"); bNoPrompt = shrCheckCmdLineFlag(argc, (const char**)argv, "noprompt"); } //bQATest = shrTRUE; // Initialize OpenGL items (if not No-GL QA test) shrLog("%sInitGL...\n\n", bQATest ? "Skipping " : "Calling "); if(!bQATest) { InitGL(&argc, argv); } //Get the NVIDIA platform ciErrNum = oclGetPlatformID(&cpPlatform); oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // Get the number of GPU devices available to the platform ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, 0, NULL, &uiDevCount); oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // Create the device list cdDevices = new cl_device_id [uiDevCount]; ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, uiDevCount, cdDevices, NULL); oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); for(unsigned i =0; i < uiDevCount; ++i) { shrLog("Device # %u, ", i); oclPrintDevName(LOGBOTH, cdDevices); shrLog("\n"); } shrLog("\n"); // Get device requested on command line, if any #define GPU_PROFILING #define GL_INTEROP unsigned int uiDeviceUsed = 0; //hier kann man auswählen welches der oCL divices gewählt wrerden soll unsigned int uiEndDev = uiDevCount - 1; shrLog("Pleas select device: "); std::cin >> uiDeviceUsed; if(uiDeviceUsed >= uiDevCount) { shrLog("Device ID %u not valid\n ", uiDeviceUsed); return -1; } if(shrGetCmdLineArgumentu(argc, (const char**)argv, "device", &uiDeviceUsed )) { uiDeviceUsed = CLAMP(uiDeviceUsed, 0, uiEndDev); uiEndDev = uiDeviceUsed; } // Check if the requested device (or any of the devices if none requested) supports context sharing with OpenGL if(!bQATest) { bool bSharingSupported = false; for(unsigned int i = uiDeviceUsed; (!bSharingSupported && (i <= uiEndDev)); ++i) { size_t extensionSize; ciErrNum = clGetDeviceInfo(cdDevices, CL_DEVICE_EXTENSIONS, 0, NULL, &extensionSize ); oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); if(extensionSize > 0) { char* extensions = (char*)malloc(extensionSize); ciErrNum = clGetDeviceInfo(cdDevices, CL_DEVICE_EXTENSIONS, extensionSize, extensions, &extensionSize); oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); std::string stdDevString(extensions); free(extensions); size_t szOldPos = 0; size_t szSpacePos = stdDevString.find(' ', szOldPos); // extensions string is space delimited while (szSpacePos != stdDevString.npos) { if( strcmp(GL_SHARING_EXTENSION, stdDevString.substr(szOldPos, szSpacePos - szOldPos).c_str()) == 0 ) { // Device supports context sharing with OpenGL uiDeviceUsed = i; bSharingSupported = true; break; } do { szOldPos = szSpacePos + 1; szSpacePos = stdDevString.find(' ', szOldPos); } while (szSpacePos == szOldPos); } } } shrLog("%s...\n\n", bSharingSupported ? "Using CL-GL Interop" : "No device found that supports CL/GL context sharing"); oclCheckErrorEX(bSharingSupported, true, pCleanup); // Define OS-specific context properties and create the OpenCL context #if defined (__APPLE__) CGLContextObj kCGLContext = CGLGetCurrentContext(); CGLShareGroupObj kCGLShareGroup = CGLGetShareGroup(kCGLContext); cl_context_properties props[] = { CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, (cl_context_properties)kCGLShareGroup, 0 }; cxGPUContext = clCreateContext(props, 0,0, NULL, NULL, &ciErrNum); #else #ifdef UNIX cl_context_properties props[] = { CL_GL_CONTEXT_KHR, (cl_context_properties)glXGetCurrentContext(), CL_GLX_DISPLAY_KHR, (cl_context_properties)glXGetCurrentDisplay(), CL_CONTEXT_PLATFORM, (cl_context_properties)cpPlatform, 0 }; cxGPUContext = clCreateContext(props, 1, &cdDevices[uiDeviceUsed], NULL, NULL, &ciErrNum); #else // Win32 cl_context_properties props[] = { CL_GL_CONTEXT_KHR, (cl_context_properties)wglGetCurrentContext(), CL_WGL_HDC_KHR, (cl_context_properties)wglGetCurrentDC(), CL_CONTEXT_PLATFORM, (cl_context_properties)cpPlatform, 0 }; cxGPUContext = clCreateContext(props, 1, &cdDevices[uiDeviceUsed], NULL, NULL, &ciErrNum); #endif #endif } else { cl_context_properties props[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)cpPlatform, 0}; cxGPUContext = clCreateContext(props, 1, &cdDevices[uiDeviceUsed], NULL, NULL, &ciErrNum); } shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); //TODO: ?? //clGetGLContextInfoKHR( // Log device used (reconciled for requested requested and/or CL-GL interop capable devices, as applies) shrLog("Device # %u, ", uiDeviceUsed); oclPrintDevName(LOGBOTH, cdDevices[uiDeviceUsed]); shrLog("\n"); // create a command-queue cqCommandQueue = clCreateCommandQueue(cxGPUContext, cdDevices[uiDeviceUsed], 0, &ciErrNum); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // Program Setup size_t program_length; cPathAndName = shrFindFilePath("simpleGL.cl", argv[0]); shrCheckErrorEX(cPathAndName != NULL, shrTRUE, pCleanup); cSourceCL = oclLoadProgSource(cPathAndName, "", &program_length); shrCheckErrorEX(cSourceCL != NULL, shrTRUE, pCleanup); // create the program cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **) &cSourceCL, &program_length, &ciErrNum); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // build the program ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL); if (ciErrNum != CL_SUCCESS) { // write out standard error, Build Log and PTX, then cleanup and exit shrLogEx(LOGBOTH | ERRORMSG, ciErrNum, STDERROR); oclLogBuildInfo(cpProgram, oclGetFirstDev(cxGPUContext)); oclLogPtx(cpProgram, oclGetFirstDev(cxGPUContext), "oclSimpleGL.ptx"); Cleanup(EXIT_FAILURE); } // create the kernel ckKernel = clCreateKernel(cpProgram, "sine_wave", &ciErrNum); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // create VBO (if using standard GL or CL-GL interop), otherwise create Cl buffer createVBO(&vbo); // set the args values ciErrNum = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void *) &vbo_cl); ciErrNum |= clSetKernelArg(ckKernel, 1, sizeof(unsigned int), &mesh_width); ciErrNum |= clSetKernelArg(ckKernel, 2, sizeof(unsigned int), &mesh_height); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); // If specified, compute and save off data for regression tests if(shrCheckCmdLineFlag(argc, (const char**) argv, "regression")) { // run OpenCL kernel once to generate vertex positions, then save results runKernel(); saveResultOpenCL(argc, (const char**)argv, vbo); } // init timer 1 for fps measurement shrDeltaT(1); // Start main GLUT rendering loop for processing and rendering, // or otherwise run No-GL Q/A test sequence shrLog("\n%s...\n", bQATest ? "No-GL test sequence" : "Standard GL Loop"); if(!bQATest) { glutMainLoop(); } else { TestNoGL(); } // Normally unused return path Cleanup(EXIT_FAILURE); } // Initialize GL //***************************************************************************** void InitGL(int* argc, char** argv) { // initialize GLUT glutInit(argc, argv); glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE); glutInitWindowPosition (glutGet(GLUT_SCREEN_WIDTH)/2 - window_width/2, glutGet(GLUT_SCREEN_HEIGHT)/2 - window_height/2); glutInitWindowSize(window_width, window_height); iGLUTWindowHandle = glutCreateWindow("OpenCL/GL Interop (VBO)"); // register GLUT callback functions glutDisplayFunc(DisplayGL); glutKeyboardFunc(KeyboardGL); glutMouseFunc(mouse); glutMotionFunc(motion); // initialize necessary OpenGL extensions glewInit(); GLboolean bGLEW = glewIsSupported("GL_VERSION_2_0 GL_ARB_pixel_buffer_object"); shrCheckErrorEX(bGLEW, shrTRUE, pCleanup); // default initialization glClearColor(0.0, 0.0, 0.0, 1.0); glDisable(GL_DEPTH_TEST); // viewport glViewport(0, 0, window_width, window_height); // projection glMatrixMode(GL_PROJECTION); glLoadIdentity(); gluPerspective(60.0, (GLfloat)window_width / (GLfloat) window_height, 0.1, 10.0); // set view matrix glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); glMatrixMode(GL_MODELVIEW); glLoadIdentity(); glTranslatef(0.0, 0.0, translate_z); glRotatef(rotate_x, 1.0, 0.0, 0.0); glRotatef(rotate_y, 0.0, 1.0, 0.0); return; } // Run the OpenCL part of the computation //***************************************************************************** void runKernel() { ciErrNum = CL_SUCCESS; #ifdef GL_INTEROP // map OpenGL buffer object for writing from OpenCL glFinish(); ciErrNum = clEnqueueAcquireGLObjects(cqCommandQueue, 1, &vbo_cl, 0,0,0); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); #endif // Set arg 3 and execute the kernel ciErrNum = clSetKernelArg(ckKernel, 3, sizeof(float), &anim); ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0,0,0 ); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); #ifdef GL_INTEROP // unmap buffer object ciErrNum = clEnqueueReleaseGLObjects(cqCommandQueue, 1, &vbo_cl, 0,0,0); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); clFinish(cqCommandQueue); //GLenum err = glGetError(); // //glBindBufferARB(GL_ARRAY_BUFFER, vbo); //float* ptr = (float*)glMapBufferARB(GL_ARRAY_BUFFER, GL_WRITE_ONLY_ARB); //glUnmapBuffer(GL_ARRAY_BUFFER); //err = glGetError(); #else // Explicit Copy // map the PBO to copy data from the CL buffer via host glBindBufferARB(GL_ARRAY_BUFFER, vbo); // map the buffer object into client's memory void* ptr = glMapBufferARB(GL_ARRAY_BUFFER, GL_WRITE_ONLY_ARB); ciErrNum = clEnqueueReadBuffer(cqCommandQueue, vbo_cl, CL_TRUE, 0, sizeof(float) * 4 * mesh_height * mesh_width, ptr, 0, NULL, NULL); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); glUnmapBufferARB(GL_ARRAY_BUFFER); #endif } // Create VBO //***************************************************************************** void createVBO(GLuint* vbo) { // create VBO unsigned int size = mesh_width * mesh_height * 4 * sizeof(float); if(!bQATest) { // create buffer object glGenBuffers(1, vbo); glBindBuffer(GL_ARRAY_BUFFER, *vbo); // create inizial data //float data[mesh_height][mesh_width][4]; float *data = new float[mesh_width * mesh_height * 4]; for(int y=0; y< mesh_height; ++y) { for (int x=0; x < mesh_width; ++x) { float u = x / (float) mesh_width; float v = y / (float) mesh_height; u = u*2.0f - 1.0f; v = v*2.0f - 1.0f; float freq = 4.0f; data[y*mesh_width*4+ x*4 + 0] = u; //u data[y*mesh_width*4+ x*4 + 1] = sin(u*freq ) * cos(v*freq ) * 0.5f; //w data[y*mesh_width*4+ x*4 + 2] = v; //v data[y*mesh_width*4+ x*4 + 3] = 1.f; //alpha ? } } // initialize buffer object glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW); #ifdef GL_INTEROP // create OpenCL buffer from GL VBO vbo_cl = clCreateFromGLBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, *vbo, &ciErrNum); #else // create standard OpenCL mem buffer vbo_cl = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, size, NULL, &ciErrNum); #endif shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); } else { // create standard OpenCL mem buffer vbo_cl = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, size, NULL, &ciErrNum); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); } } // Display callback //***************************************************************************** void DisplayGL() { // increment the geometry computation parameter (or set to reference for Q/A check) if (iFrameCount < iFrameTrigger) { anim += 0.01f; } // start timer 0 if it's update time double dProcessingTime = 0.0; if (iFrameCount >= iFrameTrigger) { shrDeltaT(0); } // run OpenCL kernel to generate vertex positions runKernel(); // get processing time from timer 0, if it's update time if (iFrameCount >= iFrameTrigger) { dProcessingTime = shrDeltaT(0); } // clear graphics then render from the vbo glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); glColor3f(0.0,1.0, 0.0); glutWireTeapot(0.5); glBindBuffer(GL_ARRAY_BUFFER, vbo); glVertexPointer(4, GL_FLOAT, 0, 0); glEnableClientState(GL_VERTEX_ARRAY); glColor3f(1.0, 0.0, 0.0); glDrawArrays(GL_POINTS, 0, mesh_width * mesh_height); glDisableClientState(GL_VERTEX_ARRAY); // flip backbuffer to screen glutSwapBuffers(); glutPostRedisplay(); // Increment the frame counter, and do fps if it's time if (iFrameCount++ > iFrameTrigger) { // set GLUT Window Title char cTitle[256]; iFramesPerSec = (int)((double)iFrameCount/shrDeltaT(1)); #ifdef GPU_PROFILING #ifdef _WIN32 sprintf_s(cTitle, 256, "OpenCL Simple GL (VBO) | %u x %u | %i fps | Proc. t = %.5f s", mesh_width, mesh_height, iFramesPerSec, dProcessingTime); #else sprintf(cTitle, "OpenCL Simple GL (VBO) | %u x %u | %i fps | Proc. t = %.5f s", mesh_width, mesh_height, iFramesPerSec, dProcessingTime); #endif #else #ifdef _WIN32 sprintf_s(cTitle, 256, "OpenCL Simple GL (VBO) | W: %u H: %u", mesh_width, mesh_height ); #else sprintf(cTitle, "OpenCL Simple GL (VBO) | W: %u H: %u", mesh_width, mesh_height); #endif #endif glutSetWindowTitle(cTitle); // Log fps and processing info to console and file shrLog(" %s\n", cTitle); // Cleanup up and quit if requested and counter is up iTestSets--; if (bNoPrompt && (!iTestSets)) { Cleanup(EXIT_SUCCESS); } // reset framecount, trigger and timer iFrameCount = 0; iFrameTrigger = (iFramesPerSec > 1) ? iFramesPerSec * 2 : 1; } } // Keyboard events handler //***************************************************************************** void KeyboardGL(unsigned char key, int x, int y) { switch(key) { case '\033': // escape quits case '\015': // Enter quits case 'Q': // Q quits case 'q': // q (or escape) quits // Cleanup up and quit Cleanup(EXIT_SUCCESS); break; } } // Mouse event handlers //***************************************************************************** void mouse(int button, int state, int x, int y) { if (state == GLUT_DOWN) { mouse_buttons |= 1<<button; } else if (state == GLUT_UP) { mouse_buttons = 0; } mouse_old_x = x; mouse_old_y = y; glutPostRedisplay(); } void motion(int x, int y) { float dx, dy; dx = x - mouse_old_x; dy = y - mouse_old_y; if (mouse_buttons & 1) { rotate_x += dy * 0.2; rotate_y += dx * 0.2; } else if (mouse_buttons & 4) { translate_z += dy * 0.01; } mouse_old_x = x; mouse_old_y = y; // set view matrix glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); glMatrixMode(GL_MODELVIEW); glLoadIdentity(); glTranslatef(0.0, 0.0, translate_z); glRotatef(rotate_x, 1.0, 0.0, 0.0); glRotatef(rotate_y, 0.0, 1.0, 0.0); glutPostRedisplay(); } // If specified, write data to file for external regression testing //***************************************************************************** void saveResultOpenCL(int argc, const char** argv, const GLuint& vbo) { // map buffer object glBindBuffer(GL_ARRAY_BUFFER_ARB, vbo); float* data = (float*)glMapBuffer(GL_ARRAY_BUFFER, GL_READ_ONLY); // save data for regression testing result shrWriteFilef("./data/regression.dat", data, mesh_width * mesh_height * 3, 0.0); // unmap GL buffer object if(!glUnmapBuffer(GL_ARRAY_BUFFER)) { shrLog("Unmap buffer failed !\n"); } } // Run a test sequence without any GL //***************************************************************************** void TestNoGL() { // Set arg 3 and Warmup call to assure OpenCL driver is awake ciErrNum = clSetKernelArg(ckKernel, 3, sizeof(float), &anim); ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, 0, 0 ); shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); clFinish(cqCommandQueue); // Start timer 0 and process n loops on the GPU const int iCycles = 250*1000; shrLog("Running clEnqueueNDRangeKernel for %d cycles...\n\n", iCycles); shrDeltaT(0); for (int i = 0; i < iCycles ; i++) { ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, 0, 0 ); } shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup); clFinish(cqCommandQueue); // Get elapsed time and throughput, then log to sample and master logs double dAvgTime = shrDeltaT(0)/(double)iCycles; shrLogEx(LOGBOTH | MASTER, 0, "oclSimpleGL, Throughput = %.4f MPixels/s, Time = %.5f s, Size = %u Pixels, NumDevsUsed = %u, Workgroup = %u\n", (1.0e-6 * mesh_width * mesh_height)/dAvgTime, dAvgTime, (mesh_width * mesh_height), 1, 0); // Cleanup and exit Cleanup(EXIT_SUCCESS); } // Function to clean up and exit //***************************************************************************** void Cleanup(int iExitCode) { // Cleanup allocated objects shrLog("\nStarting Cleanup...\n\n"); if(ckKernel)clReleaseKernel(ckKernel); if(cpProgram)clReleaseProgram(cpProgram); if(cqCommandQueue)clReleaseCommandQueue(cqCommandQueue); if(vbo) { glBindBuffer(1, vbo); glDeleteBuffers(1, &vbo); vbo = 0; } if(vbo_cl)clReleaseMemObject(vbo_cl); if(cxGPUContext)clReleaseContext(cxGPUContext); if(iGLUTWindowHandle)glutDestroyWindow(iGLUTWindowHandle); if(cPathAndName)free(cPathAndName); if(cSourceCL)free(cSourceCL); if(cdDevices)delete(cdDevices); // finalize logs and leave shrLog("%s\n\n", iExitCode == 0 ? "PASSED" : "FAILED"); if (bQATest || bNoPrompt) { shrLogEx(LOGBOTH | CLOSELOG, 0, "oclSimpleGL.exe Exiting...\n"); } else { shrLogEx(LOGBOTH | CLOSELOG, 0, "oclSimpleGL.exe Exiting...\nPress <Enter> to Quit\n"); #ifdef WIN32 getchar(); #endif } exit (iExitCode); }

Kieren · ‎03-10-2011

Pleas can someone, take the time und test this GL_INTEROP option with 2 ADM-GPUs.

Thank you very much. Woud help a lot.

himanshu_gautam · ‎03-13-2011

Hi kieren,

I will try to check it once such config is available to me.

I hope you are trying your code with crossfirex disabled. One reason i can guess for slow performance is transfer of buffers between GPUs.

You also mentioned that the buffers don't seem to be transferrred between GPUs. I hope you are handling the events properly?

Thanks for sharing your experience.

Kieren · ‎03-15-2011

Hi

so again what the problem is.

I try to use GL_INTEROP means i create OpenCL context and OpenCL buffers froum the exsisting OpenGL things.

CONTEXT:

cl_context_properties props[] =
{

CL_GL_CONTEXT_KHR, (cl_context_properties)wglGetCurrentContext(),
CL_WGL_HDC_KHR, (cl_context_properties)wglGetCurrentDC(),
CL_CONTEXT_PLATFORM, (cl_context_properties)cpPlatform,
0
};
cxGPUContext = clCreateContext(props, 1, &cdDevices[uiDeviceUsed], NULL, NULL, &ciErrNum);

BUFFER:

vbo_cl = clCreateFromGLBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, *vbo, &ciErrNum);

As long as the device i use vor OpenCL is the GPU OpenGL is running on, or the CPU.

Even on a 2 Nvidia GPU system it does work no metter on which of the both cards my OpenCL code is running

BUT on my 2* Mobility Radeon HD 5870 System, if I use the seconed GPU the one OpenGL is not running on the Buffer object is not moidfied any more.

I do not have the handle the buffer transferr by my self as long as I'm using GL_INTEROP. It works for CPU und NVIDA-GPUs without aditional code.

BUT not with the AMD-GPUs.

Sadly I do not have the possibility to check the code on two desktop AMD-GPUs...

CrossFireX:

The CCC seams to have a bug with the checkbox to enable/disable X-Fire. So i use the config where OpenCL finds 2 GPU. The funny thing ist the CCC shows this config as X-Fire aktive. But no matter whoat seeing both GPUs seams to be the right config

Kieren · ‎03-29-2011

Is there no one out there having 2 or morge GPU-System to check the GL_INTEROP issue?

Woud be very very helpful.

ttvdrad · ‎09-06-2011

Hi, did you ever figure out this issue? I am pretty much stuck with the same problem.

Kieren · ‎09-06-2011

I have still proplems with this.

The projekt I'm working at was rearaned in this way that I don't need to use dual GPU any more.

I realised that it works some how better if you realy do the stuff very very accurate.

Check if you put glFinish() in your code.

Acquire an release the GL Objketcs.

Do not forget to que clFinish() after releasing.

I tryed it recently because I've to write the paper about the projekt right now and some howe this time the screen stayed not black, but it wasn't the right result either only somhow something related to what I expected.

Maybee you'll be able to make it work. If so, please give a feed back here

Regrads,

Kieren

The attached code shows the important parts I hope...

cl::Image2DGL m_specTexOCL; cl::Image2DGL m_posTexOCL; buffs.push_back(*hybrid_specTexOCL); buffs.push_back(*hybrid_posTexOCL); glFinish(); oclQueue.enqueueAcquireGLObjects(&buffs); oclQueue.enqueueNDRangeKernel(*Kernel, cl::NullRange, cl::NDRange(1234), cl::NDRange(initWorkGroupSize)); oclQueue.enqueueReleaseGLObjects(&buffs); oclQueue.finish();

Archives Discussions

OpenGL OpenCL interop on Multy GPUs