1 Reply Latest reply on Apr 9, 2013 12:02 PM by helloworld922

    CodeXL GPU Performance Counter Profiling Crashes

    helloworld922

      Hi,

       

      I have a program I'm trying to profile with CodeXL, but for some reason the GPU Performance Counter profiler crashes it. I have verified that the program runs correctly on its own and under the other CodeXL profilers (CPU sampling, GPU Application Trace).

       

      System configuration:

       

      Windows 7 x64

      Visual Studio 2012 (built targeting x64 release)

      Catalyst 13.1 driver

      CodeXL v. 1.1.2885.0

      AMD APP SDK v.2.8 for Windows x64

      ATI HD5650 mobile video card

       

      These are the kernels I'm trying to profile:

       

      kernel void VectorAdd01(global float* vecA, global float* vecB, global float* results)
      {
                size_t gid = get_global_id(0);
                results[gid] = vecA[gid] + vecB[gid];
      }
      
      
      kernel void VectorAdd02(global float2* vecA, global float2* vecB, global float2* results)
      {
                size_t gid = get_global_id(0);
                results[gid] = vecA[gid] + vecB[gid];
      }
      
      
      kernel void VectorAdd04(global float4* vecA, global float4* vecB, global float4* results)
      {
                size_t gid = get_global_id(0);
                results[gid] = vecA[gid] + vecB[gid];
      }
      
      
      kernel void VectorAdd08(global float8* vecA, global float8* vecB, global float8* results)
      {
                size_t gid = get_global_id(0);
                results[gid] = vecA[gid] + vecB[gid];
      }
      
      
      kernel void VectorAdd16(global float16* vecA, global float16* vecB, global float16* results)
      {
                size_t gid = get_global_id(0);
                results[gid] = vecA[gid] + vecB[gid];
      }
      

       

      They're basically the same VectorAdd kernel for different vector float sizes.

       

      This is the main program code:

       

      #include "ocl_utils.hpp"
      #include "clProgs.hpp"
      #include <CL/cl.hpp>
      #include <iostream>
      #include <vector>
      
      
      #define SCALE 16
      
      
      int main(void)
      {
                std::vector<cl::Platform> platforms;
                cl_int err;
                err = cl::Platform::get(&platforms);
                if(!cl_err_code(err))
                {
                          std::cout << "found " << platforms.size() << " platform(s)" << std::endl;
                          cl::Platform platform(platforms[0]);
                          std::vector<cl::Device> devices;
                          err = platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
                          if(!cl_err_code(err))
                          {
                                    std::cout << "found " << devices.size() << " GPU device(s)" << std::endl;
                                    unsigned int compute_units = devices[0].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(&err);
                                    std::cout << "devices[0] compute units: " << compute_units << std::endl;
                                    cl::Context gpuContext(devices, nullptr, pfn_notify, nullptr, &err);
                                    unsigned long long max_work_group = devices[0].getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>(&err);
                                    std::cout << "devices[0] max work group size: " << max_work_group << std::endl;
                                    auto max_work_item_size = devices[0].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>(&err);
                                    std::cout << "devices[0] max work item size: " << max_work_item_size[0] << ", " << max_work_item_size[1] << ", " << max_work_item_size[2] << std::endl;
                                    std::cout << "devices[0] max memory allocation: " << devices[0].getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>(&err) << std::endl;
      
      
                                    // create work queue
                                    cl::CommandQueue queue(gpuContext, devices[0], 0, &err);
                                    if(!cl_err_code(err))
                                    {
                                              cl::Program program(gpuContext, test1_src, false, &err);
                                              if(!cl_err_code(err))
                                              {
                                                        // try to build the program
                                                        err = program.build(nullptr, nullptr, nullptr);
                                                        if(!cl_err_code(err))
                                                        {
                                                                  std::vector<cl::Kernel> kernels;
                                                                  kernels.push_back(cl::Kernel(program, "VectorAdd01", &err));
                                                                  kernels.push_back(cl::Kernel(program, "VectorAdd02", &err));
                                                                  kernels.push_back(cl::Kernel(program, "VectorAdd04", &err));
                                                                  kernels.push_back(cl::Kernel(program, "VectorAdd08", &err));
                                                                  kernels.push_back(cl::Kernel(program, "VectorAdd16", &err));
                                                                  //err = program.createKernels(&kernels);
                                                                  for(auto i = kernels.begin(); i != kernels.end(); ++i)
                                                                  {
                                                                            std::cout << (*i).getInfo<CL_KERNEL_FUNCTION_NAME>() << std::endl;
                                                                  }
                                                                  if(!cl_err_code(err))
                                                                  {
                                                                            unsigned long long pref_work_size = kernels[0].getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(devices[0], &err);
      
      
                                                                            std::cout << "devices[0] preferred workgroup size multiple: " << pref_work_size << std::endl;
                                                                            // create the dataset
                                                                            size_t length = compute_units * 16 * max_work_item_size[0] * SCALE;
                                                                            float *vecA = new float[length];
                                                                            float *vecB = new float[length];
                                                                            float *results = new float[length];
                                                                            for(size_t i = 0; i < length; ++i)
                                                                            {
                                                                                      vecA[i] = static_cast<float>(i);
                                                                                      vecB[i] = static_cast<float>(i);
                                                                            }
                                                                            std::cout << "allocation vecA" << std::endl;
                                                                            cl::Buffer vecABuffer(gpuContext, CL_MEM_COPY_HOST_PTR | CL_MEM_READ_ONLY, length * sizeof(float), vecA, &err);
                                                                            if(!cl_err_code(err))
                                                                            {
                                                                                      std::cout << "allocation vecB" << std::endl;
                                                                                      cl::Buffer vecBBuffer(gpuContext, CL_MEM_COPY_HOST_PTR | CL_MEM_READ_ONLY, length * sizeof(float), vecB, &err);
                                                                                      if(!cl_err_code(err))
                                                                                      {
                                                                                                std::cout << "allocation result buffer" << std::endl;
                                                                                                cl::Buffer resultBuffer(gpuContext, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY, length * sizeof(float), nullptr, &err);
                                                                                                if(!cl_err_code(err))
                                                                                                {
                                                                                                          for(size_t i = 0; i < 5; ++i)
                                                                                                          {
                                                                                                                    err = kernels[i].setArg(0, vecABuffer);
                                                                                                                    if(!cl_err_code(err))
                                                                                                                    {
                                                                                                                              err = kernels[i].setArg(1, vecBBuffer);
                                                                                                                              if(!cl_err_code(err))
                                                                                                                              {
                                                                                                                                        err = kernels[i].setArg(2, resultBuffer);
                                                                                                                                        if(!cl_err_code(err))
                                                                                                                                        {
                                                                                                                                                  for(size_t j = 1; j <= max_work_item_size[0]; j <<= 1)
                                                                                                                                                  {
                                                                                                                                                            std::cout << "running " << kernels[i].getInfo<CL_KERNEL_FUNCTION_NAME>() << " with " << j << " work item size" << std::endl;
                                                                                                                                                            err = queue.enqueueNDRangeKernel(kernels[i], cl::NDRange(0), cl::NDRange(length / (1 << i)), cl::NDRange(j), nullptr, nullptr);
                                                                                                                                                            if(!cl_err_code(err))
                                                                                                                                                            {
                                                                                                                                                                      err = queue.enqueueReadBuffer(resultBuffer, true, 0, length * sizeof(float), results, nullptr, nullptr);
                                                                                                                                                                      cl_err_code(err);
                                                                                                                                                            }
                                                                                                                                                  }
                                                                                                                                        }
                                                                                                                              }
                                                                                                                    }
                                                                                                          }
                                                                                                }
                                                                                      }
                                                                            }
                                                                            delete[] vecA;
                                                                            delete[] vecB;
                                                                            delete[] results;
                                                                  }
                                                        }
                                                        else
                                                        {
                                                                  // get build error
                                                                  std::cout << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0], &err).c_str() << std::endl;
                                                        }
                                              }
                                    }
                          }
                }
                std::cout << "done" << std::endl;
                return 0;
      }
      
      

       

      clProgs.hpp holds the string with the CL program source, and ocl_utils.hpp contains a function for displaying error messages (it's quite lengthy, so I didn't post it).

       

      I'm not entirely sure what is causing the issue, but the crash depends on the problem SCALE. Above 4 the profiler crashes pretty much 100% of the time, at 4 it crashes ~90% of the time, and below 4 I haven't gotten it to crash.

       

      I suspect that there might be some sort of memory issue, but the largest buffer I have is 1.25MB, well below the 512MB CL_DEVICE_MAX_MEM_ALLOC_SIZE limit. Even the total allocation for all three buffers is only 3.75MB, still well below the limit.

       

      Any idea what's going on here? I can try posting a zip of my project if needed.

       

      edit:

       

      I tried using the OpenCL built-in profiling features and that runs without any issues as well.