
CodeXL GPU Performance Counter Profiling Crashes

Question asked by helloworld922 on Apr 7, 2013
Latest reply on Apr 9, 2013 by helloworld922

Hi,

 

I have a program I'm trying to profile with CodeXL, but for some reason the GPU Performance Counter profiler crashes my program. I have verified that the program runs correctly on its own and under the other CodeXL profilers (CPU sampling, GPU Application Trace).

 

System configuration:

 

Windows 7 x64

Visual Studio 2012 (built targeting x64 release)

Catalyst 13.1 driver

CodeXL v. 1.1.2885.0

AMD APP SDK v.2.8 for Windows x64

ATI HD5650 mobile video card

 

These are the kernels I'm trying to profile:

 

kernel void VectorAdd01(global float* vecA, global float* vecB, global float* results)
{
          size_t gid = get_global_id(0);
          results[gid] = vecA[gid] + vecB[gid];
}


kernel void VectorAdd02(global float2* vecA, global float2* vecB, global float2* results)
{
          size_t gid = get_global_id(0);
          results[gid] = vecA[gid] + vecB[gid];
}


kernel void VectorAdd04(global float4* vecA, global float4* vecB, global float4* results)
{
          size_t gid = get_global_id(0);
          results[gid] = vecA[gid] + vecB[gid];
}


kernel void VectorAdd08(global float8* vecA, global float8* vecB, global float8* results)
{
          size_t gid = get_global_id(0);
          results[gid] = vecA[gid] + vecB[gid];
}


kernel void VectorAdd16(global float16* vecA, global float16* vecB, global float16* results)
{
          size_t gid = get_global_id(0);
          results[gid] = vecA[gid] + vecB[gid];
}

 

They're all the same VectorAdd kernel, written for the different float vector widths (float, float2, float4, float8, float16).

 

This is the main program code:

 

#include "ocl_utils.hpp"
#include "clProgs.hpp"
#include <CL/cl.hpp>
#include <iostream>


#define SCALE 16


int main(void)
{
          std::vector<cl::Platform> platforms;
          cl_int err;
          err = cl::Platform::get(&platforms);
          if(!cl_err_code(err))
          {
                    std::cout << "found " << platforms.size() << " platform(s)" << std::endl;
                    cl::Platform platform(platforms[0]);
                    std::vector<cl::Device> devices;
                    err = platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
                    if(!cl_err_code(err))
                    {
                              std::cout << "found " << devices.size() << " GPU device(s)" << std::endl;
                              unsigned int compute_units = devices[0].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(&err);
                              std::cout << "devices[0] compute units: " << compute_units << std::endl;
                              cl::Context gpuContext(devices, nullptr, pfn_notify, nullptr, &err);
                              unsigned long long max_work_group = devices[0].getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>(&err);
                              std::cout << "devices[0] max work group size: " << max_work_group << std::endl;
                              auto max_work_item_size = devices[0].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>(&err);
                              std::cout << "devices[0] max work item size: " << max_work_item_size[0] << ", " << max_work_item_size[1] << ", " << max_work_item_size[2] << std::endl;
                              std::cout << "devices[0] max memory allocation: " << devices[0].getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>(&err) << std::endl;


                              // create work queue
                              cl::CommandQueue queue(gpuContext, devices[0], 0, &err);
                              if(!cl_err_code(err))
                              {
                                        cl::Program program(gpuContext, test1_src, false, &err);
                                        if(!cl_err_code(err))
                                        {
                                                  // try to build the program
                                                  err = program.build(nullptr, nullptr, nullptr);
                                                  if(!cl_err_code(err))
                                                  {
                                                            std::vector<cl::Kernel> kernels;
                                                            kernels.push_back(cl::Kernel(program, "VectorAdd01", &err));
                                                            kernels.push_back(cl::Kernel(program, "VectorAdd02", &err));
                                                            kernels.push_back(cl::Kernel(program, "VectorAdd04", &err));
                                                            kernels.push_back(cl::Kernel(program, "VectorAdd08", &err));
                                                            kernels.push_back(cl::Kernel(program, "VectorAdd16", &err));
                                                            //err = program.createKernels(&kernels);
                                                            for(auto i = kernels.begin(); i != kernels.end(); ++i)
                                                            {
                                                                      std::cout << (*i).getInfo<CL_KERNEL_FUNCTION_NAME>() << std::endl;
                                                            }
                                                            if(!cl_err_code(err))
                                                            {
                                                                      unsigned long long pref_work_size = kernels[0].getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(devices[0], &err);


                                                                      std::cout << "devices[0] preferred workgroup size multiple: " << pref_work_size << std::endl;
                                                                      // create the dataset
                                                                      size_t length = compute_units * 16 * max_work_item_size[0] * SCALE;
                                                                      float *vecA = new float[length];
                                                                      float *vecB = new float[length];
                                                                      float *results = new float[length];
                                                                      for(size_t i = 0; i < length; ++i)
                                                                      {
                                                                                vecA[i] = i;
                                                                                vecB[i] = i;
                                                                      }
                                                                      std::cout << "allocation vecA" << std::endl;
                                                                      cl::Buffer vecABuffer(gpuContext, CL_MEM_COPY_HOST_PTR | CL_MEM_READ_ONLY, length * sizeof(float), vecA, &err);
                                                                      if(!cl_err_code(err))
                                                                      {
                                                                                std::cout << "allocation vecB" << std::endl;
                                                                                cl::Buffer vecBBuffer(gpuContext, CL_MEM_COPY_HOST_PTR | CL_MEM_READ_ONLY, length * sizeof(float), vecB, &err);
                                                                                if(!cl_err_code(err))
                                                                                {
                                                                                          std::cout << "allocation result buffer" << std::endl;
                                                                                          cl::Buffer resultBuffer(gpuContext, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY, length * sizeof(float), nullptr, &err);
                                                                                          if(!cl_err_code(err))
                                                                                          {
                                                                                                    for(size_t i = 0; i < 5; ++i)
                                                                                                    {
                                                                                                              err = kernels[i].setArg(0, vecABuffer);
                                                                                                              if(!cl_err_code(err))
                                                                                                              {
                                                                                                                        err = kernels[i].setArg(1, vecBBuffer);
                                                                                                                        if(!cl_err_code(err))
                                                                                                                        {
                                                                                                                                  err = kernels[i].setArg(2, resultBuffer);
                                                                                                                                  if(!cl_err_code(err))
                                                                                                                                  {
                                                                                                                                            for(size_t j = 1; j <= max_work_item_size[0]; j <<= 1)
                                                                                                                                            {
                                                                                                                                                      std::cout << "running " << kernels[i].getInfo<CL_KERNEL_FUNCTION_NAME>() << " with " << j << " work item size" << std::endl;
                                                                                                                                                      err = queue.enqueueNDRangeKernel(kernels[i], cl::NDRange(0), cl::NDRange(length / (1 << i)), cl::NDRange(j), nullptr, nullptr);
                                                                                                                                                      if(!cl_err_code(err))
                                                                                                                                                      {
                                                                                                                                                                err = queue.enqueueReadBuffer(resultBuffer, true, 0, length * sizeof(float), results, nullptr, nullptr);
                                                                                                                                                                cl_err_code(err);
                                                                                                                                                      }
                                                                                                                                            }
                                                                                                                                  }
                                                                                                                        }
                                                                                                              }
                                                                                                    }
                                                                                          }
                                                                                }
                                                                      }
                                                                      delete[] vecA;
                                                                      delete[] vecB;
                                                                      delete[] results;
                                                            }
                                                  }
                                                  else
                                                  {
                                                            // get build error
                                                            std::cout << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0], &err).c_str() << std::endl;
                                                  }
                                        }
                              }
                    }
          }
          std::cout << "done" << std::endl;
          return 0;
}

 

clProgs.hpp holds the string with the CL program source, and ocl_utils.hpp contains a function for displaying error messages (it's quite lengthy, so I didn't want to post it).

 

I'm not entirely sure what's causing the issue, but it is tied to the problem SCALE: above 4 the profiler crashes pretty much 100% of the time, at 4 it crashes ~90% of the time, and below that I haven't gotten it to crash at all.

 

I suspect some sort of memory issue, but the largest buffer I allocate is 1.25 MB, well below the 512 MB CL_DEVICE_MAX_MEM_ALLOC_SIZE limit. Even the total allocation for all three buffers is only 3.75 MB, still well below the limit.

 

Any idea what's going on here? I can try posting a zip of my project if needed.

 

edit:

 

I tried using OpenCL's built-in event profiling features, and that also runs without any issues.
