5 Replies Latest reply on Mar 19, 2011 5:50 AM by himanshu.gautam

    CLC compilation times growing (exponentially?)

    broxvall
      Problem with OpenCL 1.1 compiler requiring too large (> 30 sec) time to compile small test example

      Hi,

      I've encountered a strange behaviour of the OpenCL compiler for OpenCL 1.1 (64bit Linux). The initial symptoms was that during the kernel compilation of a program that uses the C++ bindings to OpenCL , my program just hangs during the call to the "build(...)" function - but this is dependent on the actual code that is compiled by CL.

      I've stripped down the code to the most extreme case I could find and got the weird situation that depending on if two variables where initialized with a constant or with a trigonometric expression i got compilation times ranging from 3.9 seconds (both are constant), 5.4 seconds (one is constant) to 25 seconds (both are trigonometric).

      I've marked the places as A,B respectively C,D to show where the problem seems to occur.

      For the larger program, the compilation doesn't termnate within one hour. (I can attach this too if you ask me, but I think this first case already shows the problem better). Also, the problem only occurs when compiling for the GPU (not for the CPU).  I belive that the same jump in compilation time occurs for a few more "tricky" floating point operations that (eg. the ones that require the T-processing element) and that this creates eg. a scheduling problem that grows with exponential time (?)

      Can anyone comment on this? Give any hints for work-arounds?

      / Mathias

      -------------

      I've done a simple " while [ 1 ]; do ps aux | grep 'clc' | grep -v grep ; done" which gave me:

      mbl      13803  0.0  0.0   4148   576 pts/1    S+   19:03   0:00 sh -c /media/disk/opencl/ati-stream-sdk-v2.2-lnx64/bin/x86_64/clc --emit=llvmbc -D__IMAGE_SUPPORT__=1 -D__Juniper__=1 -D__GPU__=1 -Dcl_khr_global_int32_base_atomics=1 -Dcl_khr_global_int32_extended_atomics=1 -Dcl_khr_local_int32_base_atomics=1 -Dcl_khr_local_int32_extended_atomics=1 -Dcl_khr_3d_image_writes=1 -Dcl_khr_byte_addressable_store=1 -Dcl_khr_gl_sharing=1 -Dcl_amd_device_attribute_query=1 -Dcl_amd_printf=1 -Dcl_amd_media_ops=1 -I./ -o "/tmp/OCLBH0vhZ1.bc" "/tmp/OCLBH0vhZ.cl"  2> "/tmp/OCLBH0vhZ.log"

       

       

       

       

       

       

      loat4 mycubicSolver(float a,float b,float c,float d) { float4 res; float2 u,v; u = (float2)(0.0); v = (float2)(0.0); float Phi1=atan2(u.s1,u.s0)/3.f; u.s0 = 0.0f; // A //u.s0 = cos(Phi1); // B u.s1 = sin(Phi1); float Phi2=atan2(v.s1,v.s0)/3.f; //v.s0 = 0.0f; // C v.s0 = cos(Phi2); // D v.s1 = sin(Phi2); float2 x1, x2; x1.s0 = u.s0 + v.s0; x1.s1 = 0.f; float2 uv; uv.s0 = 0.f; uv.s1 = u.s1 - v.s1; x2.s0 = -x1.s0/2 - sqrt(3.f)/2.f * uv.s1; x2.s1 = + sqrt(3.f)/2.f * uv.s0; res.s2 = x2.s0; res.s3 = x2.s1; return res; } __kernel void filtering(float4 offset,__global unsigned char *rawIn,__global float *tensorsIn,__global float *filteredOut) { float4 res = mycubicSolver(-1, 1., 1, 1); }

        • CLC compilation times growing (exponentially?)
          himanshu.gautam

          There has been many optimizatios in the internal compiler and i do not see any such issue with your kernel. It compiles well with SKA. I also tried a similar kernel and it runs fine.

          Any ways you can send in your kernel so that we can verify the issue does not exist anymore.

            • CLC compilation times growing (exponentially?)
              broxvall

              Hi,

              I cannot add the whole program since it realies on 100 MB large data files (with slightly confidential patient data). However, I append a kernel + sourcecode that I just verified to give a bad behaviour when running on a 5870 Mobility Radeon card under Ubuntu 10.10 64bit version (laptop is Asus G73Jh).

              I append a kernel as well as the C++ code snipped used to load and compile it. Running with release 2.1 the compilation succeeds after ~1 seconds and the program stops since it needs some more kernels not included here (as it should).

              Using release 2.2 the call to Program:build() never terminates, and the whole process starts consuming roughly 1MB/second of CPU RAM memory. After leaving it for a while (> 10 mins) the computer becomes practically crashed due to intense swapping.

              If this kernel works with your setup i would suspect that it is either a platform related problem (I've also tried it under Ubuntu 10.04 64bit with radeon 5870 card (not mobility)) - or a problem with how the call to the compiler is made (ie. the C++ bindings?).

              Anyway, for now I've just fallen back to using the old 2.1 release and all of the works fine with that one.


              / Mathias

               

              float8 cubicSolver(float a,float b,float c,float d) { float8 res; res.s67 = (float2)(0.f); if(b == 0.f && c == 0.f && d == 0.f) // Special case I: z^3 = 0 res = (float8)(0.f); else if(b == 0.f && c == 0.f) { // Special case II: z^3 = -d/a res.s0 = pow(-d/a, 1.f/3.f); res.s1 = 0.f; res.s2 = res.s0 * -.5f; res.s3 = + res.s0 * sqrt(3.f)/2.f; res.s4 = res.s0 * -.5f; res.s5 = - res.s0 * sqrt(3.f)/2.f; } else if(b == 0 && d == 0) { // Special case III: z(z^2 + c/a)) = 0 res.s0 = 0.f; res.s1 = 0.f; res.s2 = sqrt(-c/a); res.s3 = 0.f; res.s4 = -sqrt(-c/a); res.s5 = 0.f; } /* TODO: should we not have a case when A == 0 ?? */ else { // Generic case float p = c/a - b*b/(3*a*a); float q = d/a - (b*c)/(3*a*a) + b*b*b/(9*a*a*a) - b*b*b/(27*a*a*a*a); float D = pow((p/3),3.f) + pow((q/2),2.f);; float2 u,v; if(D >= 0.f) { // u,v are real numbers. // Calculate a real root u.s0 = pow(-q/2 + sqrt(D), 1/3.f); u.s1 = 0.f; v.s0 = pow(-q/2 - sqrt(D), 1/3.f); v.s1 = 0.f; } else { float R,Phi; // Need to compute a complex root // U* = (-q/2 + sqrt(D))^(1/3) // ---> // Ur = -q/2 // Ui = sqrt(-D) // R = |<Ur,Ui>| // Phi = atan2(Ui,Ur) // R* = R^(1/3.) // Phi* = 1/3. * Phi // U* = R* (cos(Phi*) + i sin(Phi*)) u.s0=-q/2; u.s1=sqrt(-D); R=pow(length(u),1/3.f); Phi=atan2(u.s1,u.s0)*1/3.f; u.s0 = R*cos(Phi); u.s1 = R*sin(Phi); // Same for: V* = (-q/2 - sqrt(D))^(1/3); v.s0=-q/2; v.s1=-sqrt(-D); R=pow(length(v),1/3.f); Phi=atan2(v.s1,v.s0)*1/3.f; v.s0 = R*cos(Phi); v.s1 = R*sin(Phi); } // u+v should never be complex // u-v is complex only if D>=0 // If D<0: u and v are complex and conjugates => u+v is real // To ensure a least one non-imaginary part we use the function real, // otherwise a truncation part may give an imaginary part float2 x1,x2,x3; x1.s0 = u.s0 + v.s0; x1.s1 = 0.f; float2 uv; uv.s0 = u.s0 - v.s0; uv.s1 = u.s1 - v.s1; // x2 = -(u.r+v.r)/2 + i*sqrt(3)*(u-v)/2 // x3 = -(u.r+v.r)/2 i*sqrt(3)*(u-v)/2 x2.s0 = -x1.s0/2 - sqrt(3.f)/2.f * uv.s1; x2.s1 = + sqrt(3.f)/2.f * uv.s0; x3.s0 = -x1.s0/2 + sqrt(3.f)/2.f * uv.s1; x3.s1 = - sqrt(3.f)/2.f * uv.s0; res.s0 = x1.s0; res.s1 = x1.s1; res.s2 = x2.s0; res.s3 = x2.s1; res.s4 = x3.s0; res.s5 = x3.s1; } return res; } __kernel void filtering(float4 offset,__global unsigned char *rawIn,__global float *tensorsIn,__global float *filteredOut) { int4 ts2pos = (int4)(get_global_id(0), get_global_id(1), get_global_id(2), 0); int4 id = ts2pos + (int4)(offset); if(id.s0 > DS0 || id.s1 > DS1 || id.s2 > DS2) return; /* Index in output array */ int rawIndex=id.s0+id.s1*DS0+id.s2*DS0*DS1+KERN_RADIUS*DS0*DS1*DS2; int outIndex=id.s0+id.s1*DS0+id.s2*DS0*DS1; const int framesize=DS0*DS1*DS2; int tensorIndex = (ts2pos.s0 + ts2pos.s1*SUBSIZE + ts2pos.s2*SUBSIZE*SUBSIZE)*1; /* Get the raw reading here for visualization, and possibly to quit early */ int raw = rawIn[rawIndex]; if(raw < 2) { filteredOut[outIndex] = 0.f; return; } float4 tmp2; float16 tensor; tensor.s0123 = vload4(tensorIndex*4+0, tensorsIn); tensor.s4567 = vload4(tensorIndex*4+1, tensorsIn); tensor.s89AB = vload4(tensorIndex*4+2, tensorsIn); tensor.sCDEF = vload4(tensorIndex*4+3, tensorsIn); /* Compute the eigen values for this tensor. */ /* Variables for the interesting parts of the tensor */ float t11 = tensor.s0, t12 = tensor.s1, t13 = tensor.s2; float t22 = tensor.s5, t23 = tensor.s6; float t33 = tensor.sA; float8 res = cubicSolver(-1, t11 + t22 + t33, t23*t23 + t13*t13 + t12*t12 - t11*t22 - t11*t33 - t22*t33, t11*t22*t33 + 2*t12*t23*t13 - t11*t23*t23 - t22*t13*t13 - t33*t12*t12); /* NOTE that the results of the cubicSolver above isn't actually used here! Still the compilation crashes... */ filteredOut[outIndex] = raw/256.f; } ------------------ #include "general.h" #include "shaders.h" #include "vector.h" #include "loadDicom.h" #include "volumetric.h" #include "visualizer.h" #include "opencl.h" /* Stuff for OpenCL's C++ bindings */ #include <utility> #include <cstdio> #include <cstdlib> #include <fstream> #include <iostream> #include <string> #include <iterator> #include <sys/time.h> double gettimef() { struct timeval tv; gettimeofday(&tv,NULL); return tv.tv_sec + 1e-6*tv.tv_usec; } /** Subsize is size of the tensor cubes generated at each pass over the volume. Ie. a larger size of data may be read from the raw data, and generated in the intermediate steps. */ const int OpenCL::subsize = 120; //96 /** Size of the low-pass gaussian filters applied to the output of the quadrature filters */ const int OpenCL::lpradius = 4; /** Kernradius is the size of the quadrature filters used for generating the tensors. */ const int OpenCL::kernradius = 3; /** Size of the tensor window generated in the first step of the passes over the volume. To handle the borders during the lowpass filtering of the tensors this need be larger than the final end result */ const int OpenCL::ts1size = subsize+2*lpradius; /* Simple error checking function for all OpenCL error management. Taken C++ bindings source/examples. */ inline void checkErr(cl_int err, const char * name) { if (err != CL_SUCCESS) { std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl; exit(EXIT_FAILURE); } } /* Singleton instance used for all computations */ OpenCL *OpenCL::openCL; OpenCL::OpenCL(int dims, int raw, int ko, int cpu, int gpu) :filterDimensions(dims), useRawSignal(raw), convolution_unroll(ko), useCPU(cpu), useGPU(gpu) { cl_int err; int dev; Visualizer *visualizer = Visualizer::visualizer; filterDimensions=dims; useRawSignal=raw; cl::vector< cl::Platform > platformList; cl::Platform::get(&platformList); checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get"); // std::string platformVendor; cl::string platformVendor; platformList[0].getInfo(CL_PLATFORM_VENDOR, &platformVendor); std::cerr << "Platform is by: " << platformVendor.c_str() << "\n"; // Iterate over the requested devices (CPU/GPU) for(dev=0;dev<2;dev++) { if(dev == CPU && !useCPU) continue; if(dev == GPU && !useGPU) continue; printf("Preparing context for %s\n",dev==0?"CPU":"GPU"); cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0}; context[dev] = new cl::Context(dev==CPU ? CL_DEVICE_TYPE_CPU : CL_DEVICE_TYPE_GPU, cprops, NULL, NULL, &err); checkErr(err, "Context::Context()"); cl::vector<cl::Device> devices; devices = context[dev]->getInfo<CL_CONTEXT_DEVICES>(); checkErr(devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0"); std::ifstream file("volumetric.cl"); checkErr(file.is_open() ? CL_SUCCESS:-1, "opening volumetric.cl"); std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>())); //cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()+1)); cl::Program::Sources source(1); source[0] = std::make_pair(prog.c_str(), prog.length()+1); //source[1] = std::make_pair(prog2.c_str(), prog2.length()+1); cl::Program program(*context[dev], source); char defines[1024]; sprintf(defines,"-I./ -DDIMENSIONS=%d -DDS0=%d -DDS1=%d -DDS2=%d -DSUBSIZE=%d -DSUBSIZE2=%d -DLP_RADIUS=%d -DKERN_RADIUS=%d -DJUST_RAW=%d -DCONV_UNROLL=%d", filterDimensions, visualizer->dataSize[0],visualizer->dataSize[1],visualizer->dataSize[2], subsize, subsize+lpradius*2, lpradius, kernradius,useRawSignal,convolution_unroll); /* */ /* This is where the build() is called */ /* */ printf("DEFINES: %s\n",defines); fflush(stdout); double t0 = gettimef(); err = program.build(devices,defines); printf("BUILD DONE! (%.3fs)\n",gettimef()-t0); fflush(stdout); printf("ERR = %d\n",err); if(err != CL_SUCCESS) { printf("Building failed:\n"); cl::string buildLog((char*)""); cl::string buildOptions((char*)""); err = program.getBuildInfo<cl::string>(devices[0],CL_PROGRAM_BUILD_LOG,&buildLog); err = program.getBuildInfo<cl::string>(devices[0],CL_PROGRAM_BUILD_OPTIONS,&buildOptions); printf("Build Options: %s\n",buildOptions.c_str()); printf("Build Log:\n%s\n",buildLog.c_str()); //exit(0); } checkErr(file.is_open() ? CL_SUCCESS : -1, "Program::build()"); queue[dev] = new cl::CommandQueue(*context[dev], devices[0], CL_QUEUE_PROFILING_ENABLE | 0, &err); checkErr(err, "CommandQueue::CommandQueue()"); //queue[dev]->setProperty(CL_QUEUE_PROFILING_ENABLE,true,NULL); kern_genTensor[dev] = new cl::Kernel(program,"generateTensor",&err); checkErr(err, "Kernel::Kernel() - generateTensor"); kern_lpTensor[dev] = new cl::Kernel(program,"lpTensor",&err); checkErr(err, "Kernel::Kernel() - lpTensor"); kern_filtering[dev] = new cl::Kernel(program,"filtering",&err); checkErr(err, "Kernel::Kernel() - filtering"); } //printf("All contexts created\n"); fflush(stdout); }