18 Replies Latest reply on Jan 18, 2011 2:32 PM by MicahVillmow

    Huge drop in performance on large buffers

    Raistmer
      Looks like some bug specific to HD4xxx GPUs

      When I gradually increased the buffer size to enlarge the execution domain, I hit a limit on the HD4870 GPU that is lower than the maximum possible 128MB for a single block of GPU memory.

      At that size many of the program's kernels hugely increase their execution time.
      While some of them show bigger "Fetch unit stalled" values under the profiler (so the runtime increase could perhaps be explained by memory conflicts), there is at least one kernel whose "Fetch unit stalled" % goes down while its execution time still increases hugely.

      Could someone explain this behavior? And what is the "proper" way to check whether a memory bank conflict exists on the HD4870 GPU? It seems to lack many useful counters that are present in Evergreen GPUs.

      And another problem at that buffer size: the application not only runs too long, it also starts to produce invalid results.

      And, finally, why I think it's some kind of bug: there is no such drop in speed on the HD5870 GPU. It handles that buffer size just fine; the app's execution time is in line with other nearby sizes.
      And it produces valid results (!), so I think the program logic is above suspicion.

      Now profiler data:
      smaller buffer:
      Method ExecutionOrder GlobalWorkSize GroupWorkSize Time LDSSize DataTransferSize GPRs ScratchRegs FCStacks Wavefronts ALUInsts FetchInsts WriteInsts ALUBusy ALUFetchRatio ALUPacking FetchSize CacheHit FetchUnitBusy FetchUnitStalled WriteUnitStalled
      half_temp_range1_kernel_05AA3B60 13 { 4096 12 1} NULL 0,97183 0 10 0 0 768,00 174,00 64,00 32,00 7,53 2,72 88,05 49152,00 0,00 71,53 25,35 0,00

      Larger buffer (the problematic one):
      Method ExecutionOrder GlobalWorkSize GroupWorkSize Time LDSSize DataTransferSize GPRs ScratchRegs FCStacks Wavefronts ALUInsts FetchInsts WriteInsts ALUBusy ALUFetchRatio ALUPacking FetchSize CacheHit FetchUnitBusy FetchUnitStalled WriteUnitStalled
      half_temp_range1_kernel_05A53B60 13 { 4096 13 1} NULL 30,45218 0 10 0 0 832,00 174,00 64,00 32,00 0,25 2,72 88,05 53248,00 0,00 4,46 2,96 0,00

      And the kernel itself (a very simple one):

      __kernel void half_temp_range1_kernel(__global float4* gpu_dechirped,
                                            __global float4* gpu_power,
                                            const float ravg_pow)
      {
          uint tid = get_global_id(0);
          uint dchunk = get_global_id(1);
          float4 temp;
          float4 power; //R: processing 4 elements per workitem
          for (uint i = 0; i < 2*16; i++) {
              temp = gpu_dechirped[32*(32768/2)*dchunk + i*(32768/2) + 2*tid];
              power.x = (temp.x*temp.x + temp.y*temp.y)*ravg_pow;
              power.y = (temp.z*temp.z + temp.w*temp.w)*ravg_pow;
              temp = gpu_dechirped[32*(32768/2)*dchunk + i*(32768/2) + 2*tid + 1];
              power.z = (temp.x*temp.x + temp.y*temp.y)*ravg_pow;
              power.w = (temp.z*temp.z + temp.w*temp.w)*ravg_pow;
              gpu_power[tid + i*(32768/4/2) + 32*dchunk*(32768/4/2)] = power;
          }
      }

        • Huge drop in performance on large buffers
          himanshu.gautam

          Hi raistmer,

          Are you saying that the kernel with the smaller buffer takes 0.97183 ms and the kernel with the larger buffer takes 30.45218 ms? Can you provide a simple test case? You can also send the test case to streamdeveloper@amd.com in case your code is proprietary.

           

          • Huge drop in performance on large buffers
            Raistmer
            Yes, the smaller one takes <1 ms, the larger one >30 ms.
            Take into account that the buffer increased by only 1/12 of its initial size.
            You can derive the buffer size from the posted profiler data: each element is fetched only once, so buffer size == amount of fetched memory.
            I will try to reproduce this situation in a sample. Actually, one just needs to add host code that allocates the corresponding buffers and runs the posted kernel (I don't know if that will reproduce the original problem though; I will try).

            • Huge drop in performance on large buffers
              Raistmer
              BTW, I have an additional question about the posted kernel.
              To keep fetches and stores in float4 operations, I fetch two float4 elements and store one float4 element.
              But how will this influence possible fetch bank conflicts? Would it be better to fetch one float4 element per workitem and write a float2 element instead of a float4 (this would double the execution domain, of course, but it changes the memory access pattern)? Which memory access pattern would be preferable in this case?
                • Huge drop in performance on large buffers
                  himanshu.gautam

                  Hi raistmer,

                   I tried to reproduce your problem on my Juniper system but did not see the values you reported. I will try to reproduce it on 7xx cards.

                   It would be really kind if you could provide a test case.

                   As for your global-write access pattern question:

                   Global reads/writes can either be coalesced (all workitems access consecutive memory elements) or issued one by one. If a coalesced write is not possible, try writing the maximum number of elements per workitem.

                   Also try to use the same memory channel for all workitems in a workgroup (one-by-one writes). This might seem to serialize things when coalescing is not used, but it is a good access pattern because many workgroups can then write concurrently.

                   I hope it helps.

                • Huge drop in performance on large buffers
                  Raistmer
                  Hi.
                  Yes, most probably the issue is specific to HD4xxx GPUs. As I wrote before, there is no such speed degradation and no invalid results when the app runs on an HD5870, for example.
                  Also, on the HD4870 I have no way to know whether the listed kernel will use coalesced writes/reads (or can reads not be coalesced on ATI GPUs at all?). The manual contains this info only for 5xxx GPUs. Where could I find information similar to chapter 4 and appendix D of the manual for HD4xxx GPUs?

                  I used Cat 10.7 for testing; now I have upgraded to 10.9 and will report whether the problem still exists, and if so, I will try to provide a small test case.
                    • Huge drop in performance on large buffers
                      himanshu.gautam

                      Raistmer,

                       The counters provided in the profiler are hardware dependent, so we cannot have all the counters that the 5xxx series has.

                       Regarding the manual, I think all AMD GPUs have similar architectures, so the same principles apply, although you can expect 5xxx GPUs to perform much better as the implementation has improved.

                       I recommend you register for the webinar series AMD is organizing. I think most of the optimization techniques and hardware properties will be explained well there by AMD professionals.

                    • Huge drop in performance on large buffers
                      Raistmer
                      Thanks for the suggestion. I'll try. Due to the big time difference it's sometimes hard to attend live webcasts. I hope that after the live broadcast they will be made available for download as recorded events?

                      About this thread's issue: I can now confirm that I see the same problem under the 10.9 Cat driver too.
                      So I will try to create a test case.
                      • Huge drop in performance on large buffers
                        Raistmer
                        Can't reproduce this performance drop in a test case with only this one kernel included.
                        I could send the whole app (it's GPL AFAIK, so the sources can be shared).

                        With my latest experiment on Win7 (a huge performance drop there too, even with a buffer size that is fine on Vista x86), I'm starting to think it comes from a memory shortage. It looks as if, past some point, the runtime starts to recreate the buffers on the GPU for each kernel call. That would involve host-to-device memory transfers, hence the performance drop.
                        So I have a new question: is such an explanation possible? And would such memory transfers be invisible to the profiler?
                        • Huge drop in performance on large buffers
                          Raistmer
                          Well, now I can confirm that the problem still exists with SDK 2.3 and Cat 10.12 installed.
                          A small buffer increase past the same specific value as with the previous SDK causes a 10-fold (!!!!) performance drop.
                          Moreover, the buffer size threshold differs between the HD4870 and the HD4550, for example, but the behavior is the very same: once the threshold is crossed, app performance drops hugely.
                          I can send a compiled binary with instructions on how to run it under the profiler. I can't reproduce this with a simple kernel in a separate test case.
                          • Huge drop in performance on large buffers
                            Raistmer
                            I sent an e-mail to streamdeveloper.
                            In case someone is interested in this issue, here is the link to the binaries: http://files.mail.ru/5G5WAN



                            How to run:
                            Just unpack the whole archive into a separate directory (or into the directory with sprofile.exe) and run the executable with the command-line parameter -unroll N, where N is an integer.
                            On my HD4870 it runs reasonably fast for N up to 12 (2 is the minimum allowed value). Theoretically the max N that stays inside the 128MB boundary for a single buffer is 16.
                            But with N=13 app performance decreases 10 times on the HD4870.
                            A big performance decrease also shows on the HD4550, but there N=10 is good and N=12 is already bad (my HD4870 tolerates N=12 well).

                            Looks like HD5xxx GPUs have no such problems, so use HD4xxx ones to reproduce.


                            P.S. You can confirm the parameter was set correctly in the stderr.txt file; it will contain the string "DATA_CHUNK_UNROLL setted to:N" (where N is the number you set on the command line).
                            • Huge drop in performance on large buffers
                              Raistmer
                              "Thank you for contacting AMD. Beginning on December 1, 2010, all ATI Stream-related support requests will be handled through our online helpdesk. After that time, AMD will no longer respond to email requests."
                              Well, I'll try the new way then...
                              • Huge drop in performance on large buffers
                                MicahVillmow
                                Raistmer,
                                Are you changing the kernel at all between runs? If you are changing the kernel, most likely you are either spilling registers to scratch or causing really bad memory access patterns.
                                  • Huge drop in performance on large buffers
                                    Raistmer
                                    Originally posted by: MicahVillmow

                                    Raistmer,

                                    Are you changing the kernel at all between runs? If you are changing the kernel, most likely you are either spilling registers to scratch or causing really bad memory access patterns.


                                    1) No, I don't change the kernel; only a few (not even all) input params change from run to run.
                                    2) If I got bad performance "everywhere" I would think something was wrong with the app, but take these facts into account:

                                    a) it works fine with smaller buffers on all hardware;
                                    b) it works fine with large buffers on HD5xxx hardware, while it works very poorly on HD4xxx hardware;
                                    c) on low-end HD4xxx hardware it already works poorly with smaller buffers than on high-end HD4xxx hardware (like the HD4870);
                                    d) it works just fine at all possible buffer sizes on NV hardware (is their memory access really that much better?).

                                    So, could you point out some memory-access patterns that could cause such behavior across different generations of ATi cards (I know, NV just seems better)? Or is it just another "simple" bug in the OpenCL implementation for the poor old HD4xxx family? (I say it again: there is NO SUCH BIG SLOWDOWN for the HD5xxx family; what is so different in memory access for HD4xxx?) ...

                                    P.S. Did you try the binary I sent under the profiler? Is your answer based on what you (or any tech support person) have actually seen, or just on "generic guesses"?
                                  • Huge drop in performance on large buffers
                                    MicahVillmow
                                    nou,
                                    4XXX might only support a single UAV buffer, but all of the pointers are virtualized in this UAV buffer. The same thing occurs on the 5XXX/6XXX chips.

                                    Raistmer,
                                    The reason the threshold buffer size is different is that the memory layout and sizes differ between chips of the same generation. I haven't run your code, as I don't have any binary, and I'm not a tech support guy but an engineer. But this bit of address calculation points to issues with the write/read access pattern ((32768/4/2) and (32768/2)). Our hardware has memory channels that interleave every N bits depending on the chip, and it looks like with this calculation you are making everything go to the same memory channel. Please see section 4.4.2 on how this can affect your program.