Archives Discussions

broxvall · ‎11-22-2010

Problem with OpenCL 1.1 compiler requiring too large (> 30 sec) time to compile small test example

Hi,

I've encountered a strange behaviour of the OpenCL compiler for OpenCL 1.1 (64bit Linux). The initial symptoms was that during the kernel compilation of a program that uses the C++ bindings to OpenCL , my program just hangs during the call to the "build(...)" function - but this is dependent on the actual code that is compiled by CL.

I've stripped down the code to the most extreme case I could find and got the weird situation that depending on if two variables where initialized with a constant or with a trigonometric expression i got compilation times ranging from 3.9 seconds (both are constant), 5.4 seconds (one is constant) to 25 seconds (both are trigonometric).

I've marked the places as A,B respectively C,D to show where the problem seems to occur.

For the larger program, the compilation doesn't termnate within one hour. (I can attach this too if you ask me, but I think this first case already shows the problem better). Also, the problem only occurs when compiling for the GPU (not for the CPU). I belive that the same jump in compilation time occurs for a few more "tricky" floating point operations that (eg. the ones that require the T-processing element) and that this creates eg. a scheduling problem that grows with exponential time (?)

Can anyone comment on this? Give any hints for work-arounds?

/ Mathias

-------------

I've done a simple " while [ 1 ]; do ps aux | grep 'clc' | grep -v grep ; done" which gave me:

mbl 13803 0.0 0.0 4148 576 pts/1 S+ 19:03 0:00 sh -c /media/disk/opencl/ati-stream-sdk-v2.2-lnx64/bin/x86_64/clc --emit=llvmbc -D__IMAGE_SUPPORT__=1 -D__Juniper__=1 -D__GPU__=1 -Dcl_khr_global_int32_base_atomics=1 -Dcl_khr_global_int32_extended_atomics=1 -Dcl_khr_local_int32_base_atomics=1 -Dcl_khr_local_int32_extended_atomics=1 -Dcl_khr_3d_image_writes=1 -Dcl_khr_byte_addressable_store=1 -Dcl_khr_gl_sharing=1 -Dcl_amd_device_attribute_query=1 -Dcl_amd_printf=1 -Dcl_amd_media_ops=1 -I./ -o "/tmp/OCLBH0vhZ1.bc" "/tmp/OCLBH0vhZ.cl" 2> "/tmp/OCLBH0vhZ.log"

loat4 mycubicSolver(float a,float b,float c,float d) { float4 res; float2 u,v; u = (float2)(0.0); v = (float2)(0.0); float Phi1=atan2(u.s1,u.s0)/3.f; u.s0 = 0.0f; // A //u.s0 = cos(Phi1); // B u.s1 = sin(Phi1); float Phi2=atan2(v.s1,v.s0)/3.f; //v.s0 = 0.0f; // C v.s0 = cos(Phi2); // D v.s1 = sin(Phi2); float2 x1, x2; x1.s0 = u.s0 + v.s0; x1.s1 = 0.f; float2 uv; uv.s0 = 0.f; uv.s1 = u.s1 - v.s1; x2.s0 = -x1.s0/2 - sqrt(3.f)/2.f * uv.s1; x2.s1 = + sqrt(3.f)/2.f * uv.s0; res.s2 = x2.s0; res.s3 = x2.s1; return res; } __kernel void filtering(float4 offset,__global unsigned char *rawIn,__global float *tensorsIn,__global float *filteredOut) { float4 res = mycubicSolver(-1, 1., 1, 1); }

himanshu_gautam · ‎11-28-2010

There has been many optimizatios in the internal compiler and i do not see any such issue with your kernel. It compiles well with SKA. I also tried a similar kernel and it runs fine.

Any ways you can send in your kernel so that we can verify the issue does not exist anymore.

broxvall · ‎11-29-2010

Hi,

I cannot add the whole program since it realies on 100 MB large data files (with slightly confidential patient data). However, I append a kernel + sourcecode that I just verified to give a bad behaviour when running on a 5870 Mobility Radeon card under Ubuntu 10.10 64bit version (laptop is Asus G73Jh).

I append a kernel as well as the C++ code snipped used to load and compile it. Running with release 2.1 the compilation succeeds after ~1 seconds and the program stops since it needs some more kernels not included here (as it should).

Using release 2.2 the call to Program:build() never terminates, and the whole process starts consuming roughly 1MB/second of CPU RAM memory. After leaving it for a while (> 10 mins) the computer becomes practically crashed due to intense swapping.

If this kernel works with your setup i would suspect that it is either a platform related problem (I've also tried it under Ubuntu 10.04 64bit with radeon 5870 card (not mobility)) - or a problem with how the call to the compiler is made (ie. the C++ bindings?).

Anyway, for now I've just fallen back to using the old 2.1 release and all of the works fine with that one.

/ Mathias

float8 cubicSolver(float a,float b,float c,float d) { float8 res; res.s67 = (float2)(0.f); if(b == 0.f && c == 0.f && d == 0.f) // Special case I: z^3 = 0 res = (float8)(0.f); else if(b == 0.f && c == 0.f) { // Special case II: z^3 = -d/a res.s0 = pow(-d/a, 1.f/3.f); res.s1 = 0.f; res.s2 = res.s0 * -.5f; res.s3 = + res.s0 * sqrt(3.f)/2.f; res.s4 = res.s0 * -.5f; res.s5 = - res.s0 * sqrt(3.f)/2.f; } else if(b == 0 && d == 0) { // Special case III: z(z^2 + c/a)) = 0 res.s0 = 0.f; res.s1 = 0.f; res.s2 = sqrt(-c/a); res.s3 = 0.f; res.s4 = -sqrt(-c/a); res.s5 = 0.f; } /* TODO: should we not have a case when A == 0 ?? */ else { // Generic case float p = c/a - b*b/(3*a*a); float q = d/a - (b*c)/(3*a*a) + b*b*b/(9*a*a*a) - b*b*b/(27*a*a*a*a); float D = pow((p/3),3.f) + pow((q/2),2.f);; float2 u,v; if(D >= 0.f) { // u,v are real numbers. // Calculate a real root u.s0 = pow(-q/2 + sqrt(D), 1/3.f); u.s1 = 0.f; v.s0 = pow(-q/2 - sqrt(D), 1/3.f); v.s1 = 0.f; } else { float R,Phi; // Need to compute a complex root // U* = (-q/2 + sqrt(D))^(1/3) // ---> // Ur = -q/2 // Ui = sqrt(-D) // R = |<Ur,Ui>| // Phi = atan2(Ui,Ur) // R* = R^(1/3.) // Phi* = 1/3. * Phi // U* = R* (cos(Phi*) + i sin(Phi*)) u.s0=-q/2; u.s1=sqrt(-D); R=pow(length(u),1/3.f); Phi=atan2(u.s1,u.s0)*1/3.f; u.s0 = R*cos(Phi); u.s1 = R*sin(Phi); // Same for: V* = (-q/2 - sqrt(D))^(1/3); v.s0=-q/2; v.s1=-sqrt(-D); R=pow(length(v),1/3.f); Phi=atan2(v.s1,v.s0)*1/3.f; v.s0 = R*cos(Phi); v.s1 = R*sin(Phi); } // u+v should never be complex // u-v is complex only if D>=0 // If D<0: u and v are complex and conjugates => u+v is real // To ensure a least one non-imaginary part we use the function real, // otherwise a truncation part may give an imaginary part float2 x1,x2,x3; x1.s0 = u.s0 + v.s0; x1.s1 = 0.f; float2 uv; uv.s0 = u.s0 - v.s0; uv.s1 = u.s1 - v.s1; // x2 = -(u.r+v.r)/2 + i*sqrt(3)*(u-v)/2 // x3 = -(u.r+v.r)/2 i*sqrt(3)*(u-v)/2 x2.s0 = -x1.s0/2 - sqrt(3.f)/2.f * uv.s1; x2.s1 = + sqrt(3.f)/2.f * uv.s0; x3.s0 = -x1.s0/2 + sqrt(3.f)/2.f * uv.s1; x3.s1 = - sqrt(3.f)/2.f * uv.s0; res.s0 = x1.s0; res.s1 = x1.s1; res.s2 = x2.s0; res.s3 = x2.s1; res.s4 = x3.s0; res.s5 = x3.s1; } return res; } __kernel void filtering(float4 offset,__global unsigned char *rawIn,__global float *tensorsIn,__global float *filteredOut) { int4 ts2pos = (int4)(get_global_id(0), get_global_id(1), get_global_id(2), 0); int4 id = ts2pos + (int4)(offset); if(id.s0 > DS0 || id.s1 > DS1 || id.s2 > DS2) return; /* Index in output array */ int rawIndex=id.s0+id.s1*DS0+id.s2*DS0*DS1+KERN_RADIUS*DS0*DS1*DS2; int outIndex=id.s0+id.s1*DS0+id.s2*DS0*DS1; const int framesize=DS0*DS1*DS2; int tensorIndex = (ts2pos.s0 + ts2pos.s1*SUBSIZE + ts2pos.s2*SUBSIZE*SUBSIZE)*1; /* Get the raw reading here for visualization, and possibly to quit early */ int raw = rawIn[rawIndex]; if(raw < 2) { filteredOut[outIndex] = 0.f; return; } float4 tmp2; float16 tensor; tensor.s0123 = vload4(tensorIndex*4+0, tensorsIn); tensor.s4567 = vload4(tensorIndex*4+1, tensorsIn); tensor.s89AB = vload4(tensorIndex*4+2, tensorsIn); tensor.sCDEF = vload4(tensorIndex*4+3, tensorsIn); /* Compute the eigen values for this tensor. */ /* Variables for the interesting parts of the tensor */ float t11 = tensor.s0, t12 = tensor.s1, t13 = tensor.s2; float t22 = tensor.s5, t23 = tensor.s6; float t33 = tensor.sA; float8 res = cubicSolver(-1, t11 + t22 + t33, t23*t23 + t13*t13 + t12*t12 - t11*t22 - t11*t33 - t22*t33, t11*t22*t33 + 2*t12*t23*t13 - t11*t23*t23 - t22*t13*t13 - t33*t12*t12); /* NOTE that the results of the cubicSolver above isn't actually used here! Still the compilation crashes... */ filteredOut[outIndex] = raw/256.f; } ------------------ #include "general.h" #include "shaders.h" #include "vector.h" #include "loadDicom.h" #include "volumetric.h" #include "visualizer.h" #include "opencl.h" /* Stuff for OpenCL's C++ bindings */ #include <utility> #include <cstdio> #include <cstdlib> #include <fstream> #include <iostream> #include <string> #include <iterator> #include <sys/time.h> double gettimef() { struct timeval tv; gettimeofday(&tv,NULL); return tv.tv_sec + 1e-6*tv.tv_usec; } /** Subsize is size of the tensor cubes generated at each pass over the volume. Ie. a larger size of data may be read from the raw data, and generated in the intermediate steps. */ const int OpenCL::subsize = 120; //96 /** Size of the low-pass gaussian filters applied to the output of the quadrature filters */ const int OpenCL::lpradius = 4; /** Kernradius is the size of the quadrature filters used for generating the tensors. */ const int OpenCL::kernradius = 3; /** Size of the tensor window generated in the first step of the passes over the volume. To handle the borders during the lowpass filtering of the tensors this need be larger than the final end result */ const int OpenCL::ts1size = subsize+2*lpradius; /* Simple error checking function for all OpenCL error management. Taken C++ bindings source/examples. */ inline void checkErr(cl_int err, const char * name) { if (err != CL_SUCCESS) { std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl; exit(EXIT_FAILURE); } } /* Singleton instance used for all computations */ OpenCL *OpenCL::openCL; OpenCL::OpenCL(int dims, int raw, int ko, int cpu, int gpu) :filterDimensions(dims), useRawSignal(raw), convolution_unroll(ko), useCPU(cpu), useGPU(gpu) { cl_int err; int dev; Visualizer *visualizer = Visualizer::visualizer; filterDimensions=dims; useRawSignal=raw; cl::vector< cl::Platform > platformList; cl::Platform::get(&platformList); checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get"); // std::string platformVendor; cl::string platformVendor; platformList[0].getInfo(CL_PLATFORM_VENDOR, &platformVendor); std::cerr << "Platform is by: " << platformVendor.c_str() << "\n"; // Iterate over the requested devices (CPU/GPU) for(dev=0;dev<2;dev++) { if(dev == CPU && !useCPU) continue; if(dev == GPU && !useGPU) continue; printf("Preparing context for %s\n",dev==0?"CPU":"GPU"); cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0}; context[dev] = new cl::Context(dev==CPU ? CL_DEVICE_TYPE_CPU : CL_DEVICE_TYPE_GPU, cprops, NULL, NULL, &err); checkErr(err, "Context::Context()"); cl::vector<cl::Device> devices; devices = context[dev]->getInfo<CL_CONTEXT_DEVICES>(); checkErr(devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0"); std::ifstream file("volumetric.cl"); checkErr(file.is_open() ? CL_SUCCESS:-1, "opening volumetric.cl"); std::string prog(std::istreambuf_iterator<char>(file),(std::istreambuf_iterator<char>())); //cl::Program::Sources source(1,std::make_pair(prog.c_str(), prog.length()+1)); cl::Program::Sources source(1); source[0] = std::make_pair(prog.c_str(), prog.length()+1); //source[1] = std::make_pair(prog2.c_str(), prog2.length()+1); cl::Program program(*context[dev], source); char defines[1024]; sprintf(defines,"-I./ -DDIMENSIONS=%d -DDS0=%d -DDS1=%d -DDS2=%d -DSUBSIZE=%d -DSUBSIZE2=%d -DLP_RADIUS=%d -DKERN_RADIUS=%d -DJUST_RAW=%d -DCONV_UNROLL=%d", filterDimensions, visualizer->dataSize[0],visualizer->dataSize[1],visualizer->dataSize[2], subsize, subsize+lpradius*2, lpradius, kernradius,useRawSignal,convolution_unroll); /* */ /* This is where the build() is called */ /* */ printf("DEFINES: %s\n",defines); fflush(stdout); double t0 = gettimef(); err = program.build(devices,defines); printf("BUILD DONE! (%.3fs)\n",gettimef()-t0); fflush(stdout); printf("ERR = %d\n",err); if(err != CL_SUCCESS) { printf("Building failed:\n"); cl::string buildLog((char*)""); cl::string buildOptions((char*)""); err = program.getBuildInfo<cl::string>(devices[0],CL_PROGRAM_BUILD_LOG,&buildLog); err = program.getBuildInfo<cl::string>(devices[0],CL_PROGRAM_BUILD_OPTIONS,&buildOptions); printf("Build Options: %s\n",buildOptions.c_str()); printf("Build Log:\n%s\n",buildLog.c_str()); //exit(0); } checkErr(file.is_open() ? CL_SUCCESS : -1, "Program::build()"); queue[dev] = new cl::CommandQueue(*context[dev], devices[0], CL_QUEUE_PROFILING_ENABLE | 0, &err); checkErr(err, "CommandQueue::CommandQueue()"); //queue[dev]->setProperty(CL_QUEUE_PROFILING_ENABLE,true,NULL); kern_genTensor[dev] = new cl::Kernel(program,"generateTensor",&err); checkErr(err, "Kernel::Kernel() - generateTensor"); kern_lpTensor[dev] = new cl::Kernel(program,"lpTensor",&err); checkErr(err, "Kernel::Kernel() - lpTensor"); kern_filtering[dev] = new cl::Kernel(program,"filtering",&err); checkErr(err, "Kernel::Kernel() - filtering"); } //printf("All contexts created\n"); fflush(stdout); }

himanshu_gautam · ‎12-04-2010

broxvall,

i have not gone in the code till now.

but i see a lot of header files will it be possible to compile it without them.

inside kernel i checked it with SKA and it gives some variables as not defined.eg DS0,DS1 etc.

I suggest you to send a testcase at streamdeveloper@amd.com

broxvall · ‎03-18-2011

Hi,

sorry for not following up by creating a smaller test case. For those interested I can report that the bug disappeared with the 2.3 release (ie. only version 2.2 had the bug since 2.1 worked). Afaik there was some comment in the release notes relating to build times for 2.3 which may be why this now works.

himanshu_gautam · ‎03-19-2011

Nice to hear that your was resolved.

Archives Discussions

CLC compilation times growing (exponentially?)