Please run the "BufferBandwidth" sample from the AMD APP SDK. One of its options lets you use the persistent AMD flag.
That will tell you the story of your machine.
However, I don't think the "Persistent AMD" flag is part of an extension... but I am not 100% sure.
The persistent AMD flag is usually used for writing from the CPU side and for reading/writing from the GPU side.
Is your usage in accordance with the statement above?
This may seem a very naive question, but I thought this was the best place to ask. I am trying to improve the performance of my code, so I decided to vary the flags used for buffer creation. I expected CL_MEM_USE_PERSISTENT_MEM_AMD to give the lowest execution time, but the results didn't show that. I ran the BufferBandwidth example as you suggested and it passed, but for all 4 tests the Gbps values were really low when I set the input, output and copy flags to 6 (CL_MEM_USE_PERSISTENT_MEM_AMD), compared to the default settings.
Can you explain the difference between CL_MEM_USE_PERSISTENT_MEM_AMD, CL_MEM_USE_HOST_PTR, CL_MEM_COPY_HOST_PTR and CL_MEM_ALLOC_HOST_PTR? The online resources are really confusing.
Currently my code works in the following way. Maybe someone can tell me which flag I should use, or how I can change my code for better performance.
I initialize an input and an output array, wrap them in cl buffers using CL_MEM_USE_HOST_PTR, pass them as arguments to 3 kernels, and do some work on them in the kernels. After the 3 kernels are done, I don't have to read back the output buffer on the APU; on the discrete GPU, I have to. Then I verify the results and print the execution time. The discrete GPU is giving much better performance (lower execution time) than the fused one. (Kernel time measured by event profiling can be lower for discrete, but the overall time of the 3 kernels, including launch time, should be lower for fused. Shouldn't that be the case?)
P.S. I am using Linux (Ubuntu 12.04); the APU is Devastator and the discrete GPU is Cypress.
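One way to reason about the fused-versus-discrete question above is to keep kernel-only time (what event profiling reports) separate from end-to-end time (transfers plus launch overhead plus kernels). The following is only a minimal sketch of that arithmetic in plain C. The struct, the helper names and every number in it are hypothetical placeholders, not measurements from Devastator or Cypress, and the model ignores any overlap between transfers and execution.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost breakdown for one run of a multi-kernel pipeline (ms). */
typedef struct {
    double transfer_in_ms;   /* host -> device copy; 0 if the device reads host memory in place */
    double launch_ms;        /* assumed fixed overhead per kernel launch */
    const double *kernel_ms; /* per-kernel execution times, e.g. from event profiling */
    size_t n_kernels;
    double transfer_out_ms;  /* device -> host read-back; 0 if not needed */
} run_cost;

/* Sum of kernel execution times only (what event profiling would show). */
static double kernel_only_ms(const run_cost *c)
{
    double sum = 0.0;
    assert(c != NULL && c->n_kernels > 0);
    for (size_t i = 0; i < c->n_kernels; ++i)
        sum += c->kernel_ms[i];
    return sum;
}

/* Transfers + launch overhead + kernels, with no overlap assumed. */
static double end_to_end_ms(const run_cost *c)
{
    return c->transfer_in_ms
         + c->launch_ms * (double)c->n_kernels
         + kernel_only_ms(c)
         + c->transfer_out_ms;
}

/* Made-up example inputs: the discrete GPU has faster kernels but pays for copies. */
static const double discrete_kernels[3] = { 1.0, 1.5, 0.5 };
static const double fused_kernels[3]    = { 3.0, 4.5, 1.5 };
static const run_cost discrete_run = { 4.0, 0.125, discrete_kernels, 3, 4.0 };
static const run_cost fused_run    = { 0.0, 0.125, fused_kernels,    3, 0.0 };
```

With these made-up inputs the discrete run has the lower kernel-only time (3.0 ms against 9.0 ms) but the higher end-to-end time (11.375 ms against 9.375 ms). Shrink the transfer costs or grow the kernel times and the discrete GPU wins end to end as well, so which case applies can only be settled by measuring wall-clock time around the whole pipeline on each device.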
Which explains the flags in detail. Hope this helps.
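As a quick reference, here is a small plain-C sketch of how the three host-pointer flags may be combined when calling clCreateBuffer, as I understand the OpenCL specification. It deliberately does not include the OpenCL headers: the flag values are copied by hand from cl.h and AMD's cl_ext.h, so please verify them against your installed headers. `mem_flags_t` and `host_ptr_flags_ok` are hypothetical names for illustration, not part of any API, and the combination rules for CL_MEM_USE_PERSISTENT_MEM_AMD are not checked here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values copied by hand from cl.h / cl_ext.h (assumption: check your headers). */
#ifndef CL_MEM_USE_HOST_PTR
#define CL_MEM_USE_HOST_PTR           (1u << 3) /* buffer uses the application's own memory */
#endif
#ifndef CL_MEM_ALLOC_HOST_PTR
#define CL_MEM_ALLOC_HOST_PTR         (1u << 4) /* runtime allocates host-accessible memory */
#endif
#ifndef CL_MEM_COPY_HOST_PTR
#define CL_MEM_COPY_HOST_PTR          (1u << 5) /* runtime allocates, then copies host_ptr data once */
#endif
#ifndef CL_MEM_USE_PERSISTENT_MEM_AMD
#define CL_MEM_USE_PERSISTENT_MEM_AMD (1u << 6) /* AMD-specific: host-accessible device memory */
#endif

typedef unsigned long long mem_flags_t; /* hypothetical stand-in for cl_mem_flags */

/* Hypothetical helper: returns true if the host-pointer flags and host_ptr
 * are a combination that clCreateBuffer should accept. */
static bool host_ptr_flags_ok(mem_flags_t flags, const void *host_ptr)
{
    bool use   = (flags & CL_MEM_USE_HOST_PTR) != 0;
    bool alloc = (flags & CL_MEM_ALLOC_HOST_PTR) != 0;
    bool copy  = (flags & CL_MEM_COPY_HOST_PTR) != 0;

    /* USE_HOST_PTR is mutually exclusive with ALLOC_HOST_PTR and COPY_HOST_PTR. */
    if (use && (alloc || copy))
        return false;
    /* host_ptr must be non-NULL exactly when USE_HOST_PTR or COPY_HOST_PTR is set. */
    if ((use || copy) != (host_ptr != NULL))
        return false;
    return true;
}

/* Small self-check that doubles as a usage example. */
static void host_ptr_flags_selfcheck(void)
{
    int data = 0;
    assert(host_ptr_flags_ok(CL_MEM_USE_HOST_PTR, &data));
    assert(host_ptr_flags_ok(CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, &data));
    assert(!host_ptr_flags_ok(CL_MEM_USE_HOST_PTR | CL_MEM_COPY_HOST_PTR, &data));
}
```

Note that the sketch only covers which combinations are legal. Which one is fastest depends on the device and on who reads and writes the buffer, which is what the BufferBandwidth runs are meant to show.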