6 Replies Latest reply on Apr 19, 2010 8:41 PM by MicahVillmow

    Is the Stream KernelAnalyzer up to date?

    redditisgreat

      I ran some of my OpenCL kernel code through the Stream KernelAnalyzer (newest available version), but the code generated for Juniper/Cypress is a mess.

      Compared with the RV790 code, the compiler generates 20% more ALU, 200% more control-flow, and 300% more fetch and write operations to global memory for RV8XX GPUs.

      The problem seems to be the unrolling of small for loops with a constant trip count.

      Does that reflect the newest compilers shipped in the ICD?

      What can we do about it?

      Here is the relevant code:

       

      #define ADIM 3

      typedef struct _qem {
          float4 C[3];
          float4 b1_c;
          float4 B[ADIM];
      } QEM;

      //############################################################################################
      inline void qemScaledAdd( QEM* dst, QEM const* src, float const s )
      {
          dst->C[0] += s*src->C[0];
          dst->C[1] += s*src->C[1];
          dst->C[2] += s*src->C[2];
          //dst->C[3] += s*src->C[3];
          dst->b1_c += s*src->b1_c;
          for(size_t i=0; i<ADIM; ++i)
              dst->B[i] += s*src->B[i];
      }

      //############################################################################################
      #define ATTR_GLOBAL_SCALE .80f

      __kernel void initWedgeQEM( __global QEM* tqem
                                , __global uint* arrays
                                , __global uint* qindex
                                , __global QEM* resultq )
      {
          unsigned int const gid = get_global_id(0);
          size_t const firsti = arrays[ 2*gid ];
          size_t const numi = arrays[(2*gid)+1 ];

          QEM tmpq;
          tmpq.C[0] = tmpq.C[1] = tmpq.C[2] = tmpq.b1_c = (float4)(0.f);
          for(size_t i=0; i<ADIM; ++i)
              tmpq.B[i] = tmpq.b1_c;

          for( size_t qi=0; qi<numi; ++qi ){
              size_t const tqi = qindex[firsti+qi]>>2;
              size_t const corneri = qindex[firsti+qi]&0x03;
              QEM tq = tqem[tqi];
              float const scale = (corneri<3)? tq.C[corneri].w: tq.C[2].w * 400.f;
              tq.C[0].w = (corneri<3)? 1.f : 0.f ;
              qemScaledAdd( &tmpq, &tq, scale );
          }

          resultq[gid] = tmpq;
      }
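
      To check whether the constant-sized loop really is the cause, one option is to compare it against a hand-unrolled variant. The following is only a host-side sketch in plain C, not the kernel itself: `float4_t`, `madd4`, `fill`, `same_qem` and `variants_agree` are hypothetical stand-ins introduced here for illustration (the kernel uses OpenCL's built-in float4), and whether the unrolled form actually changes the analyzer output is untested.

```c
#include <stddef.h>

#define ADIM 3

/* Host-side stand-in for OpenCL's built-in float4 (assumption for this sketch). */
typedef struct { float x, y, z, w; } float4_t;

typedef struct {
    float4_t C[3];
    float4_t b1_c;
    float4_t B[ADIM];
} QEM;

/* dst += s * src, component-wise (stand-in for the float4 operators). */
static void madd4(float4_t *dst, const float4_t *src, float s)
{
    dst->x += s * src->x;
    dst->y += s * src->y;
    dst->z += s * src->z;
    dst->w += s * src->w;
}

/* Form used in the post: B is accumulated in a constant-trip-count loop. */
static void qemScaledAdd_loop(QEM *dst, const QEM *src, float s)
{
    madd4(&dst->C[0], &src->C[0], s);
    madd4(&dst->C[1], &src->C[1], s);
    madd4(&dst->C[2], &src->C[2], s);
    madd4(&dst->b1_c, &src->b1_c, s);
    for (size_t i = 0; i < ADIM; ++i)
        madd4(&dst->B[i], &src->B[i], s);
}

#if ADIM != 3
#error "qemScaledAdd_unrolled is written for ADIM == 3"
#endif

/* Hand-unrolled variant: same operations, no loop left for the compiler. */
static void qemScaledAdd_unrolled(QEM *dst, const QEM *src, float s)
{
    madd4(&dst->C[0], &src->C[0], s);
    madd4(&dst->C[1], &src->C[1], s);
    madd4(&dst->C[2], &src->C[2], s);
    madd4(&dst->b1_c, &src->b1_c, s);
    madd4(&dst->B[0], &src->B[0], s);
    madd4(&dst->B[1], &src->B[1], s);
    madd4(&dst->B[2], &src->B[2], s);
}

/* Fill every component of q with distinct values starting at base. */
static void fill(QEM *q, float base)
{
    float4_t *v = q->C;               /* C[0..2] first */
    for (int i = 0; i < 3; ++i, base += 4.0f)
        v[i] = (float4_t){ base, base + 1.0f, base + 2.0f, base + 3.0f };
    q->b1_c = (float4_t){ base, base + 1.0f, base + 2.0f, base + 3.0f };
    base += 4.0f;
    for (int i = 0; i < ADIM; ++i, base += 4.0f)
        q->B[i] = (float4_t){ base, base + 1.0f, base + 2.0f, base + 3.0f };
}

static int same4(const float4_t *a, const float4_t *b)
{
    return a->x == b->x && a->y == b->y && a->z == b->z && a->w == b->w;
}

static int same_qem(const QEM *a, const QEM *b)
{
    int ok = same4(&a->b1_c, &b->b1_c);
    for (int i = 0; i < 3; ++i)    ok = ok && same4(&a->C[i], &b->C[i]);
    for (int i = 0; i < ADIM; ++i) ok = ok && same4(&a->B[i], &b->B[i]);
    return ok;
}

/* Returns 1 if the loop and unrolled variants give identical results. */
int variants_agree(void)
{
    QEM src, a, b;
    fill(&src, 1.0f);
    fill(&a, 100.0f);
    fill(&b, 100.0f);
    qemScaledAdd_loop(&a, &src, 0.5f);
    qemScaledAdd_unrolled(&b, &src, 0.5f);
    return same_qem(&a, &b);
}
```

      Both variants perform the same per-component operations in the same order, so they should agree exactly; the point of the comparison would be the ALU, control-flow and fetch counts reported for each form, not the results.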