cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

redditisgreat
Journeyman III

Is the Stream KernelAnalyzer up to date?

I have run some of my OpenCL kernel code through the Stream KernelAnalyzer (newest available version) but the code generated for Juniper/Cypress is a mess.

Compared with RV790 code the compiler generates 20% more ALU 200% more Controlflow and 300% more fetch and write operations to global memory for RV8XX GPUs.

The problem seems to be with loop unrolling of small constant sized for loop.

Does that reflect the newest compilers shipped in the ICD?

What can we do about it?

Here the relevant code:

 

#define ADIM 3 typedef struct _qem { float4 C[3]; float4 b1_c; float4 B[ADIM]; } QEM; //############################################################################################ inline void qemScaledAdd( QEM* dst, QEM const* src, float const s ) { dst->C[0] += s*src->C[0]; dst->C[1] += s*src->C[1]; dst->C[2] += s*src->C[2]; //dst->C[3] += s*src->C[3]; dst->b1_c += s*src->b1_c; for(size_t i=0; i<ADIM; ++i) dst->B += s*src->B; } //############################################################################################ #define ATTR_GLOBAL_SCALE .80f __kernel void initWedgeQEM( __global QEM* tqem , __global uint* arrays , __global uint* qindex , __global QEM* resultq ) { unsigned int const gid = get_global_id(0); size_t const firsti = arrays[ 2*gid ]; size_t const numi = arrays[(2*gid)+1 ]; QEM tmpq; tmpq.C[0] = tmpq.C[1] = tmpq.C[2] = tmpq.b1_c = (float4)(0.f); for(size_t i=0; i<ADIM; ++i) tmpq.B = tmpq.b1_c; for( size_t qi=0; qi<numi; ++qi ){ size_t const tqi = qindex[firsti+qi]>>2; size_t const corneri = qindex[firsti+qi]&0x03; QEM tq = tqem[tqi]; float const scale = (corneri<3)? tq.C[corneri].w: tq.C[2].w * 400.f; tq.C[0].w = (corneri<3)? 1.f : 0.f ; qemScaledAdd( &tmpq, &tq, scale ); } resultq[gid] = tmpq; }

0 Likes
6 Replies
bpurnomo
Staff

Stream KernelAnalyzer uses the latest OpenCL ICD in your system (to compile from CL source to IL) and the CAL modules (to compile from IL to ISA) included in the tool.  The latest CAL module included in the tool is from Catalyst 10.1.  We should have another update of the tool soon.

0 Likes

Can any of the developers maybe look over my code and tell me how to avoid the performance penalty on the RV8XX series of GPUs?

 

thanks.

0 Likes

Looking at the statistics generated I don't see much difference between them.

 

 

 

Name,GPR,Scratch Reg,Min,Max,Avg,ALU,Fetch,Write,Est Cycles,ALU:Fetch,BottleNeck,%s\Clock,Throughput Radeon HD 4890,33,0,7.00,1143.52,121.49,98,10,7,121.49,1.36,ALU Ops,0.13,112 M Threads\Sec Radeon HD 5870,29,0,7.00,1143.52,111.55,79,10,7,111.55,1.24,ALU Ops,0.29,244 M Threads\Sec

0 Likes

Those are my values, using CAL 9.12.

Branch coherency 90%, Average and Max loop count set to 6.

Name, Code, GPR,Scratch Reg,ALU,CF,Interp,Fetch,Write,Min,Max,Avg,Est Cycles,ALU:Fetch,BottleNeck,%s\Clock,Throughput Radeon HD 4890, RV790, 35, 0, 81, 47, 0, 14, 22, 14.00, 35.84, 20.50, 20.50, 1.55, Global Write, 0.78, 663 M Threads\Sec Radeon HD 5870, Cypress, 51, 0, 105, 128, 0, 52, 78, 21.00, 111.50, 66.25, 66.25, 1.96, Global Write, 0.24, 205 M Threads\Sec

0 Likes

I think the discrepancy between the two results (n0thing and redditisgreat) is due to: (1) the OpenCL run-time installed in the system, and/or (2) the CAL version set in the tool.

0 Likes

redditisgreat,
The main difference between the codegen is hardware differences between the 7XX and 8XX IO paths. The 8XX IO path is more flexible, but also requires more instructions to occur in software that would normally occur in hardware. Also, the flow control is caused by the ?: operator. This should improve in our upcoming release.
0 Likes