cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Raistmer
Adept II

Different fetch amount on 5870 and 4870 GPUs - why?

I am trying to profile the same app with the same workload on different GPUs,
and received strange results.
HD5870:

Method ExecutionOrder GlobalWorkSize GroupWorkSize Time GPRs FCStacks Wavefronts ALUInsts FetchInsts WriteInsts ALUBusy ALUFetchRatio ALUPacking FetchSize FetchUnitBusy FetchUnitStalled WriteUnitStalled FastPath CompletePath PathUtilization ALUStalledByLDS LDSBankConflict
PC_single_pulse_kernel_FFA_update_reduce0_05A76B60 15 { 32 12 256} { 1 1 256} 0,39392 34 2 1536,00 100,34 16,00 1,06 8,71 6,27 86,37 12288,00 92,62 69,47 0,00 1631,00 0,00 100,00 0,00 0,00

HD4870:

Method ExecutionOrder GlobalWorkSize GroupWorkSize Time GPRs FCStacks Wavefronts ALUInsts FetchInsts WriteInsts ALUBusy ALUFetchRatio ALUPacking FetchSize FetchUnitBusy FetchUnitStalled WriteUnitStalled
PC_single_pulse_kernel_FFA_update_reduce0_05C6DFA8 15 { 32 12 256} { 1 1 256} 1,12628 40 2 1536,00 126,34 17,00 1,06 9,37 7,43 79,05 98400,00 77,49 57,53 0,00

While the 4870 issues a bigger number of fetch instructions (17 vs 16 on the 5870 — why, btw?), it also fetched a larger memory size (98400 vs 12288).

The kernel (in OpenCL source) is exactly the same.

Please explain these differences: fewer fetch instructions for the HD5870, yet also a much smaller fetched size.

Kernel:

// Single-pulse detection kernel.
// Each work-item loads 16 float4 power samples (64 scalar samples) from
// gpu_power, tests them against the coadd-level-0 threshold, then performs
// 4 successive pairwise-sum ("coadd") reductions, testing each level against
// its own threshold from thresh[0].{x,y,z,w}. The level-4 coadded data is
// written back to the region just past the initial power array; work-items
// 0 and 1 additionally store it for the small-FFA stage. Any threshold
// crossing raises the single flag results[0].
//
// NOTE(review): the original paste had the whole kernel collapsed onto one
// line, which made the `//` comments swallow the small_neg/small_pos
// parameters and the following statements; the line structure below restores
// the evidently intended code without changing any token.
__kernel void PC_single_pulse_kernel_FFA_update_reduce0(
    __global float4* gpu_power,      // in/out: power array + coadd output area after it
    __constant float4* thresh,       // thresh[0].{x,y,z,w}: thresholds for coadd levels 0..3
    __global uint* results,          // results[0] set to 1 if any pulse found
    const int num_dchunks,
    const int need_small_FFA_data,   // -1 disables small-FFA storage; else base offset
    //const int need_large_FFA_data,
    __global float4* small_neg,      // small-FFA store, negative DM sign (written by tid==0)
    __global float4* small_pos       // small-FFA store, positive DM sign (written by tid==1)
    //, __global float4* large_neg, __global float4* large_pos
    )
{
    uint tid = get_global_id(0);
    uint dchunk = get_global_id(1);
    int mchunk = get_global_id(2);
    uint next_begin, next_offset;
    float4 t = thresh[0]; // thresholds for first 4 coadds
    float4 tt0 = (float4)(t.x); // R: will keep current threshold for vector operations
    float4 tt1 = (float4)(t.y);
    float4 tt2 = (float4)(t.z);
    float4 tt3 = (float4)(t.w);
    // R: simple yes/no flag: was a single pulse found somewhere or not.
    // Vector relational ops below yield -1 per true lane, so |= accumulates
    // "any sample exceeded its threshold" across all levels and lanes.
    int4 was_pulse = (int4)(0);
    // registers for vectorized ops
    float4 d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15;
    next_begin = num_dchunks*32*(32768>>3); // R: area just after initial power array
    next_offset = next_begin + (dchunk*32+tid)*(32768>>7); // R: don't account for mchunk offset
    { // R: loading data piece of 16*4 samples
        // R: reading corresponding mchunk (16*4 samples here) from power array - coadd level 0
        int m1 = (dchunk*32+tid)*(32768>>3) + (mchunk<<4);
        d0=gpu_power[m1];     d1=gpu_power[m1+1];   d2=gpu_power[m1+2];   d3=gpu_power[m1+3];
        d4=gpu_power[m1+4];   d5=gpu_power[m1+5];   d6=gpu_power[m1+6];   d7=gpu_power[m1+7];
        d8=gpu_power[m1+8];   d9=gpu_power[m1+9];   d10=gpu_power[m1+10]; d11=gpu_power[m1+11];
        d12=gpu_power[m1+12]; d13=gpu_power[m1+13]; d14=gpu_power[m1+14]; d15=gpu_power[m1+15];
        was_pulse|=(d0>tt0)|(d1>tt0)|(d2>tt0)|(d3>tt0)|(d4>tt0)|(d5>tt0)|(d6>tt0)|(d7>tt0)|
                   (d8>tt0)|(d9>tt0)|(d10>tt0)|(d11>tt0)|(d12>tt0)|(d13>tt0)|(d14>tt0)|(d15>tt0);
        // R: now do coadd (sum adjacent pairs) and keep the coadded piece in registers
        d0.xy=d0.xz+d0.yw; d0.zw=d1.xz+d1.yw; d1.xy=d2.xz+d2.yw; d1.zw=d3.xz+d3.yw;
        d2.xy=d4.xz+d4.yw; d2.zw=d5.xz+d5.yw; d3.xy=d6.xz+d6.yw; d3.zw=d7.xz+d7.yw;
        d4.xy=d8.xz+d8.yw; d4.zw=d9.xz+d9.yw; d5.xy=d10.xz+d10.yw; d5.zw=d11.xz+d11.yw;
        d6.xy=d12.xz+d12.yw; d6.zw=d13.xz+d13.yw; d7.xy=d14.xz+d14.yw; d7.zw=d15.xz+d15.yw;
        // R: next coadd level ready to check for pulses
        was_pulse|=(d0>tt1)|(d1>tt1)|(d2>tt1)|(d3>tt1)|(d4>tt1)|(d5>tt1)|(d6>tt1)|(d7>tt1);
        // R: coadd level 2 starts
        d0.xy=d0.xz+d0.yw; d0.zw=d1.xz+d1.yw; d1.xy=d2.xz+d2.yw; d1.zw=d3.xz+d3.yw;
        d2.xy=d4.xz+d4.yw; d2.zw=d5.xz+d5.yw; d3.xy=d6.xz+d6.yw; d3.zw=d7.xz+d7.yw;
        was_pulse|=(d0>tt2)|(d1>tt2)|(d2>tt2)|(d3>tt2);
        // R: coadd level 3 starts
        d0.xy=d0.xz+d0.yw; d0.zw=d1.xz+d1.yw; d1.xy=d2.xz+d2.yw; d1.zw=d3.xz+d3.yw;
        was_pulse|=(d0>tt3)|(d1>tt3);
        // R: now only coadd and store; a later pass will need this data, so memory write required
        d0.xy=d0.xz+d0.yw; d0.zw=d1.xz+d1.yw;
        m1 = mchunk + next_offset;
        gpu_power[m1] = d0;
        // R: level-4 coadd saves data for small FFA sometimes
        if(need_small_FFA_data!=-1){ // R: after some data chunk we will not store data for small FFA
            if(tid==0){
                // R: store negative-DM-sign power data for small FFA
                small_neg[need_small_FFA_data+mchunk+dchunk*(32768>>7)]=d0;
            }else if(tid==1){ // R: positive DM sign for small FFA
                small_pos[need_small_FFA_data+mchunk+dchunk*(32768>>7)]=d0;
            }
        }
    }
    was_pulse.x = was_pulse.x||was_pulse.y||was_pulse.z||was_pulse.w; // result reduction
    // Unsynchronized write is benign: every writer stores the same value (1).
    if(was_pulse.x) results[0]=1;
}

0 Likes
1 Reply
himanshu_gautam
Grandmaster

 I hope you are using the latest Catalyst drivers and SDK. Do you see similar behaviour with other samples also? Can you send a simple test case to streamdeveloper@amd.com?

0 Likes