6 Replies Latest reply on Jun 7, 2013 2:13 AM by ash

    Code XL : How to debug Kernel on CPU?

    ash

      Hi everybody,

      I've just started to learn OpenCL and wanted to use CodeXL to debug more easily.

      My system :

      - NVIDIA GPU GeForce GTX 650

      - Intel(R) Xeon(R) CPU E5430 2.66GHz

      - Linux RedHat 6

      When I start CodeXL it requires an AMD GPU. I'd like to know why I can't debug the kernel even when it's launched on the CPU.

      Is there any way to debug with CodeXL on my setup?

       

      I also tried GDB, and I can step into the kernel, but I have to pass -O0 in addition to -g in the program's compiler options.

      It's not really easy to debug that way, since I don't know how to isolate a particular thread and watch its values.

       

      I hope you'll be able to help.

       

      Best regards,

      ash

        • Re: Code XL : How to debug Kernel on CPU?
          dorono

          Hi ash,

          CodeXL does not support kernel debugging on the CPU. Kernel debugging is only available for kernels running on AMD GPU devices.

          CodeXL does provide API level debugging on device brands other than AMD. You can set API breakpoints and observe debug information of the code running on the host.

            • Re: Code XL : How to debug Kernel on CPU?
              ash

              Hi,

              Thanks for your answer. Do you know which API I could use to debug the kernel on an NVIDIA GPU device? I've been searching for some time now but I can't find an answer. I'd like to see the data passed to the kernel and how it's computed. I tried GDB, but I can't see all the work-items and I don't know how to isolate one in particular to see what's going on.

               

              Best regards,

              ash

              • Re: Code XL : How to debug Kernel on CPU?
                ash

                Hi,

                 

                I got an AMD GPU (HD 6450) for testing CodeXL. My own program gave some weird results, so I tried one of the samples, the Reduction sample, and I get weird results there too:

                the input buffer shows the same value everywhere, 11508480, and it doesn't change even when stepping through the kernel. I see the same thing when I watch output: it's 11508224 everywhere.

                What's wrong? Do I need to do something in particular to debug with CodeXL?

                By default it's running on the GPU, so kernel debugging should be OK, and the sample itself works just fine. Please give me a hand.

                 

                Best regards,

                ash

                  • Re: Code XL : How to debug Kernel on CPU?
                    himanshu.gautam

                    Asked around for some help... So, soon somebody will be helping you out here.

                    Thanks for your patience.

                    • Re: Code XL : How to debug Kernel on CPU?
                      urishomroni

                      Hi ash,

                       

                      You might have noticed that the values you are seeing are 0x00AF9B00 and 0x00AF9A00 - which are fairly round numbers in hexadecimal. This is because they are pointers (namely, pointers to the GPU's global address space).
                      Here's what's going on:

                       

                      1. Let's say your kernel has a parameter or variable __global int4* input.

                      1a. The value you are looking at in the Locals or Multi-Watch view is the value of "input", which is the pointer variable.

                      1b. Since it is a global pointer, its value is constant across all work items - it stays as chosen by the call to clSetKernelArg.

                      1c. As a side note, it's still a pointer to the GPU's global memory space, and is not a pointer in the host application's memory space.

                      1d. If it were a __local pointer, it would be constant across the whole work group, and most likely across the entire global work size as well: WG 1's __local 0x00001000 is separate from WG 2's __local 0x00001000 (the local memory space is separate for each work group), and the compiler is likely to take advantage of that to make its work simpler.
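To make point 1 concrete, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled without an OpenCL runtime. It only emulates the idea: every "work-item" receives the same pointer value, while the element it indexes differs. The names `WatchSample`, `sample_work_item`, `pointer_is_uniform`, and `format_as_hex` are hypothetical helpers made up for this illustration; they are not part of CodeXL or OpenCL.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record of what a watch on one work-item would show. */
typedef struct {
    uintptr_t pointer_value; /* value of "input" itself (the address) */
    int element_value;       /* value of input[gid] (the data) */
} WatchSample;

/* Stand-in for a kernel body; gid plays the role of get_global_id(0). */
static WatchSample sample_work_item(const int *input, size_t gid)
{
    WatchSample s;
    s.pointer_value = (uintptr_t)input;
    s.element_value = input[gid];
    return s;
}

/* Returns 1 if every emulated work-item sees the same pointer value. */
static int pointer_is_uniform(const int *input, size_t global_size)
{
    if (global_size == 0)
        return 1;
    uintptr_t first = sample_work_item(input, 0).pointer_value;
    for (size_t gid = 1; gid < global_size; ++gid) {
        if (sample_work_item(input, gid).pointer_value != first)
            return 0;
    }
    return 1;
}

/* Formats a decimal watch value as 8-digit hexadecimal with a 0x prefix. */
static void format_as_hex(unsigned long value, char *out, size_t out_size)
{
    snprintf(out, out_size, "0x%08lX", value);
}
```

Calling `format_as_hex` on the two values from the question, 11508480 and 11508224, yields the hexadecimal forms quoted above, and `pointer_is_uniform` returns 1 for any buffer, which mirrors why the watch shows one identical number for every work item.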

                       

                      2. Since it is a pointer / array, access to the values stored is done via an index - "input[0]", "*input" (same value as "input[0]"), "input[1000]", or "input[ind]" (where ind is an index variable).

                      2a. This is not currently supported by the Multi-Watch view, since it currently only supports simple, "flat" expressions.

                      2b. The first three items can be seen and watched in CodeXL's Watch view. Just type in the variable name, and the constant numerical index in square brackets with no spaces, and you'll get the dereferenced value.

                      2c. CodeXL does not yet have full parsing abilities in the Watch view, such as support for dereferencing via a variable or expression value.

                      2d. Both 2a and 2c are in our feature want-list, and will hopefully be implemented in a CodeXL version in the near future.

                       

                      3. As a workaround, you can use a __private (unqualified) variable, which can be watched and Multi-Watched freely.

                      3a. For example:

                      __kernel void myKern(__global int4* input, __global int4* output)
                      {
                           int x = get_global_id(0);
                           int y = get_global_id(1);
                           int ind = x + (y * get_global_size(0) );

                           int4 inputVal = input[ind];
                           int4 outputVal = (int4)0;

                           // Do some stuff here to calculate outputVal from inputVal
                           // ...
                           // ...

                           output[ind] = outputVal;
                      }

                      3b. As a side note, the above example is also better practice, since global memory access is

                      * More "expensive" (i.e. slower, consumes more power) than local memory access, which is in turn more "expensive" than private memory access.

                      * Easily parallelizable (two work items accessing different indices / addresses) but not easily serializable (the same work item accessing the same index / address over and over again).

                      so moving to this model will not only make your kernel easier to debug (by watching inputVal and outputVal), but also faster.
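The pattern in point 3 can also be tried on the host. The sketch below is plain C, not OpenCL C: it emulates a 2D range sequentially, with `x`, `y`, and `width` standing in for `get_global_id(0)`, `get_global_id(1)`, and `get_global_size(0)`. The function names `work_item` and `run_range` and the `+ 1` computation are placeholders invented for this illustration.

```c
#include <stddef.h>

/* Plain-C stand-in for one work-item of a 2D kernel. */
static void work_item(const int *input, int *output,
                      size_t x, size_t y, size_t width)
{
    size_t ind = x + y * width;   /* flattened row-major index */

    int inputVal = input[ind];    /* one read from the "global" buffer */
    int outputVal = inputVal + 1; /* placeholder work on private copies */

    output[ind] = outputVal;      /* one write back to the "global" buffer */
}

/* Runs all work-items one after another; a real device runs them in parallel. */
static void run_range(const int *input, int *output,
                      size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            work_item(input, output, x, y, width);
        }
    }
}
```

Each emulated work item touches the shared buffers exactly twice, once to read and once to write, and all intermediate work happens on `inputVal` and `outputVal`, which are the values you would watch in the debugger.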

                       

                      Hope this helps,