Raistmer

Huge drop in performance on large buffers

Discussion created by Raistmer on Oct 7, 2010
Latest reply on Jan 18, 2011 by MicahVillmow
Looks like some bug for HD4xxx GPUs

When I gradually increased the buffer size (to enlarge the execution domain) I hit a limit on the HD4870 GPU that is lower than the maximum possible 128MB single-block allocation of GPU memory.

At that size many of the program's kernels hugely increase their execution time.
While some of them show higher Fetch-unit-stalled values under the profiler (so their runtime increase could perhaps be explained by memory conflicts), there is at least one kernel whose Fetch-unit-stalled % actually drops, yet its execution time still increases hugely.

Could someone explain this behavior? And what is the "proper" way to check whether a memory bank conflict exists on an HD4870 GPU? It seems to lack many of the useful counters that are present in Evergreen GPUs.

And another problem with that buffer size: the application not only runs too long, it also starts to produce invalid results.

And, finally, why I feel this is some kind of bug: there is no such drop in speed on an HD5870 GPU. It handles that buffer size just fine, and the app's execution time is in line with neighboring sizes.
And it produces valid results (!), so I think the program logic is above suspicion.

Now the profiler data:

Counter            smaller buffer                    larger buffer (problematic)
Method             half_temp_range1_kernel_05AA3B60  half_temp_range1_kernel_05A53B60
ExecutionOrder     13                                13
GlobalWorkSize     {4096 12 1}                       {4096 13 1}
GroupWorkSize      NULL                              NULL
Time               0.97183                           30.45218
LDSSize            0                                 0
DataTransferSize   -                                 -
GPRs               10                                10
ScratchRegs        0                                 0
FCStacks           0                                 0
Wavefronts         768.00                            832.00
ALUInsts           174.00                            174.00
FetchInsts         64.00                             64.00
WriteInsts         32.00                             32.00
ALUBusy            7.53                              0.25
ALUFetchRatio      2.72                              2.72
ALUPacking         88.05                             88.05
FetchSize          49152.00                          53248.00
CacheHit           0.00                              0.00
FetchUnitBusy      71.53                             4.46
FetchUnitStalled   25.35                             2.96
WriteUnitStalled   0.00                              0.00

And the kernel itself (a very simple one):

__kernel void half_temp_range1_kernel(__global float4* gpu_dechirped,
                                      __global float4* gpu_power,
                                      const float ravg_pow)
{
    uint tid = get_global_id(0);
    uint dchunk = get_global_id(1);
    float4 temp;
    float4 power; //R: processing 4 elements per workitem
    for (uint i = 0; i < 2*16; i++) {
        temp = gpu_dechirped[32*(32768/2)*dchunk + i*(32768/2) + 2*tid];
        power.x = (temp.x*temp.x + temp.y*temp.y)*ravg_pow;
        power.y = (temp.z*temp.z + temp.w*temp.w)*ravg_pow;
        temp = gpu_dechirped[32*(32768/2)*dchunk + i*(32768/2) + 2*tid + 1];
        power.z = (temp.x*temp.x + temp.y*temp.y)*ravg_pow;
        power.w = (temp.z*temp.z + temp.w*temp.w)*ravg_pow;
        gpu_power[tid + i*(32768/4/2) + 32*dchunk*(32768/4/2)] = power;
    }
}
