4 Replies Latest reply on Jul 11, 2016 2:37 PM by lonelygoat

    debug / release version of host code behave differently

    lonelygoat

      Hi there,

       

      I am trying to figure out why the debug and release builds of my app, written in C++ using the OpenCL C++ API, behave differently. Basically, what my app does is:

       

      1. create a few kernels

      2. create a set of buffers

      3. set kernel arguments

       

      It then executes these kernels in a loop by repeatedly calling clEnqueueNDRangeKernel on a single in-order command queue. (Oddly enough, if I use a new command queue for each command, the outcome changes; the commands are properly synchronized, of course.) There are some read/write commands interspersed in the queue.

       

      I understand that debug and release builds of C++ code can behave differently, but the code on the C++ side seems pretty simple. I think I may have missed something in the host code...

       

      Could anyone give me some hints, please? Thanks in advance.

        • Re: debug / release version of host code behave differently
          dipak

          Hi,

          From your description, it's difficult to suggest anything without knowing what kind of difference you're observing. It would be great if you could share your code (host + kernel) and point out the exact problem. Please also specify your setup details (OS, GPU, driver, etc.).

           

          Regards,

            • Re: debug / release version of host code behave differently
              lonelygoat

              Hi dipak,

               

              Thanks for the reply. Below is some system info:

               

                  OS: Windows 10 64bit

                  GPU: AMD Radeon R5 230

                  SDK: AMD APP SDK 2.9

                  OpenCL: 1.2

               

              There are 4 kernels. Simple ones, like the one that initializes a buffer, work fine. The code of the 2 main kernels is pretty lengthy and was adapted from C code. I have carefully checked them to make sure there are no pointers in the structs (I will double-check them later)...

               

              I think the difference between the debug and release versions lies in the host code and should have nothing to do with the kernel code, since the latter is compiled by the OpenCL runtime. Therefore, I suspect that hidden bugs in my host code are causing the problem. In the debug version I see an approximately correct image (though not exactly what I expected), while in the release version I only see some dots on a blank background. One kernel runs in a loop many times depending on the input (from thousands to millions of iterations). The kernel code is too long to post here; below is the host code that executes the problematic kernel:

               

              It first writes to a buffer to update its content for the kernel to process, then executes the kernel, which will update data in another buffer.

               

              err = pClObj->cmdQ.enqueueWriteBuffer(pClObj->clBuffer, CL_TRUE, 0, numOfItems * sizeof(CL_DATA_INFO), pBufferData);

              //error check and etc.

              err = pClObj->cmdQ.enqueueNDRangeKernel(Kernel, cl::NullRange, cl::NDRange(numOfItems));

              // error check and etc.

              err = pClObj->cmdQ.finish();

              // error check and etc.

               

              The above host code may be executed in a loop thousands or millions of times. Each time, the content of clBuffer is updated first, then the kernel is executed. pClObj is an instance of a class that compiles the kernel source, creates the kernels, sets the kernel arguments, and so on. numOfItems is a configurable number that determines how many data items are passed to the GPU for processing at a time; it is set to 256.

               

              I use finish() to wait for completion only so that I can update a progress bar; removing it doesn't change anything. I am wondering whether one command queue can be used like this: put some commands in it, call finish() to wait until the queued commands complete, then put similar commands in it again and wait for them to complete, and repeat this thousands or millions of times? Is there a limit on the number of commands that can be queued in the same command queue?
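
              The manual check of the structs can be backed up by letting the host compiler verify the layout of whatever is copied with enqueueWriteBuffer. The following is a minimal sketch, assuming a hypothetical three-field layout for CL_DATA_INFO (the real definition is not shown in this thread); the field names, the expected size of 16 bytes, and the helper writeSizeBytes are illustrative only. Note that these checks catch size and padding surprises, not pointer members, which still have to be reviewed by eye.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for the struct passed to enqueueWriteBuffer.
// Fixed-width types are used so the host layout does not depend on the compiler.
struct CL_DATA_INFO {
    std::int32_t index;
    float        weight;
    float        pos[2];
};

// A raw byte copy to the device only makes sense for plain data types.
static_assert(std::is_trivially_copyable<CL_DATA_INFO>::value,
              "CL_DATA_INFO must be trivially copyable");
static_assert(std::is_standard_layout<CL_DATA_INFO>::value,
              "CL_DATA_INFO must be standard-layout");

// Assumed to match the size and offsets of the kernel-side struct.
static_assert(sizeof(CL_DATA_INFO) == 16, "unexpected padding in CL_DATA_INFO");
static_assert(offsetof(CL_DATA_INFO, pos) == 8, "unexpected offset of pos");

// Byte count for one batch; a zero-sized batch is treated here as a caller error.
std::size_t writeSizeBytes(std::size_t numOfItems) {
    assert(numOfItems > 0);
    return numOfItems * sizeof(CL_DATA_INFO);
}
```

              In real host code, the cl_int and cl_float typedefs from the OpenCL headers serve the same purpose as the fixed-width types used here.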

                • Re: debug / release version of host code behave differently
                  dipak

                  Hi,

                  Put some commands in it and call finish() to wait until queued commands finish, then put similar commands in it again and wait for queued commands to finish, and repeat like thousands or millions times ?

                  Yes, you can. However, in that case the host thread is blocked in clFinish() until all previously enqueued commands have completed, so the commands for the next iteration cannot be enqueued until the commands from the previous iteration are done.

                  For example:

                  for(int i = 0; i < N; i++)
                  {
                    enqueue(q, command1);
                    ...
                    clFinish(q);
                  }
                  

                  In this case, the host thread is blocked at the end of each iteration, so the commands for the 2nd iteration cannot be enqueued until the commands from the 1st iteration have completed.

                   

                  Alternatively, you may use event objects to enqueue the next series of commands without blocking the host thread. You have to describe the dependencies accordingly.

                  [As you know, commands are executed in order if the command queue is an in-order queue.]

                  For example:

                  for(int i = 0; i < N; i++)
                  {
                    enqueue(q, command1, event[i - 1]); // for i > 0: depends on the event generated in the previous iteration
                    ... 
                    // may use clFlush() at this point or after a certain number of commands
                  }
                  clFinish(q);
                  

                   

                  Is there a limit in the number of commands to be queued in the same command queue ?

                  Yes, there is a runtime limit. It depends on the runtime command queue size and the type of command (i.e. the command packet size).

                   

                  Now, coming to your mismatch problem: since you suspect a host-side bug, it would be good if you could narrow down the problematic region or the suspected APIs. At this point, it's not clear whether the issue is in the OpenCL APIs, in your own code, or in some other C/C++ build issue completely unrelated to OpenCL. Until that information is available, it's difficult to suggest anything. You may try different host-side compilers (say, GCC, VC++) to cross-check the mismatch.
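
                  One host-side cause that may be worth ruling out while narrowing this down: MSVC debug builds fill uninitialized stack and heap memory with fixed patterns (0xCC and 0xCD), whereas release builds leave whatever bytes were already there, so a host buffer that is only partly written before being passed to enqueueWriteBuffer can carry different data in the two builds. Whether this applies here is unknown. The sketch below assumes a hypothetical CL_DATA_INFO layout and hypothetical helpers makeZeroedBatch and fillIndices (none of them from this thread); it value-initializes the staging buffer so that every field starts from zero in both builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the struct uploaded to the device.
struct CL_DATA_INFO {
    std::int32_t index;
    float        weight;
    float        pos[2];
};

// Builds a staging buffer whose elements are value-initialized,
// so all fields are zero regardless of the build configuration.
std::vector<CL_DATA_INFO> makeZeroedBatch(std::size_t numOfItems) {
    assert(numOfItems > 0);
    return std::vector<CL_DATA_INFO>(numOfItems);
}

// Writes only the field the host knows; the other fields keep their zero value
// instead of holding leftover memory contents.
void fillIndices(std::vector<CL_DATA_INFO>& batch) {
    for (std::size_t i = 0; i < batch.size(); ++i) {
        batch[i].index = static_cast<std::int32_t>(i);
    }
}
```

                  In the host loop, batch.data() would then take the place of pBufferData, and batch.size() * sizeof(CL_DATA_INFO) the place of the byte count.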

                   

                  Note: SDK 2.9 is quite old. Please use the latest SDK and driver compatible with your setup.

                   

                  Regards,

                    • Re: debug / release version of host code behave differently
                      lonelygoat

                      Hi dipak,

                       

                      Many thanks for the detailed clarifications. I am updating the SDK to the latest version...

                       

                      Both the debug and release versions of the Windows executable, compiled with VS2010, generate a completely empty image on an NVIDIA graphics card (MacBook Pro with a GT 650M). The original app was a Windows Service. Since CodeXL won't attach to a running process for debugging, the service was temporarily rewritten as a normal executable so that CodeXL can load it and debug its kernels. So far I have had no luck with CodeXL: when I try to step into the suspected kernel, CodeXL pops up a message box saying "IL instruction uav_read_cmp_xchg is not yet supported", but stepping into the other kernels works fine...

                       

                      Thanks again for your hints. I will post here when I solve it...

                       

                      Best regards,