4 Replies Latest reply on Sep 25, 2013 2:13 AM by himanshu.gautam



      I am trying to use the CL_MEM_USE_PERSISTENT_MEM_AMD flag for buffer creation with a 7970 GPU. However, it appears to behave more like CL_MEM_ALLOC_HOST_PTR, i.e. it seems to transfer data from host to device on every clEnqueueNDRangeKernel call. clinfo shows that the driver version supports virtual memory, but it does not list the 'cl_amd_device_persistent_memory' extension. Is there anything extra I have to do to enable that extension, beyond installing the 13.1 driver? I have attached my clinfo output in case that helps.

        • Re: CL_MEM_USE_PERSISTENT_MEM_AMD and Linux

          Please run the "BufferBandwidth" sample from the AMD APP SDK. One of its options is the PERSISTENT AMD flag.

          That will tell you the story of your machine.


          However, I don't think the "Persistent AMD" flag is part of an extension... but I am not 100% sure.


          The PersistentAMD flag is usually used for "writing" from the CPU side, and "reading/writing" from the GPU side.

          Is your usage in accordance with the statement above?

            • Re: CL_MEM_USE_PERSISTENT_MEM_AMD and Linux



              It may seem a very naive question, but I thought this is the best platform to ask. I am trying to improve the performance of my code, so I thought I would vary the flags for buffer creation. I was of the view that CL_MEM_USE_PERSISTENT_MEM_AMD would give the lowest execution time, but the results didn't show that. I ran the BufferBandwidth example as you suggested and it passed, but the GB/s values were really low when I set the input, output, and copy flags to 6 (CL_MEM_USE_PERSISTENT_MEM_AMD), compared to the default settings, for all four tests.

              Can you explain the difference between CL_MEM_USE_PERSISTENT_MEM_AMD, CL_MEM_USE_HOST_PTR, CL_MEM_COPY_HOST_PTR, and CL_MEM_ALLOC_HOST_PTR? The online resources are really confusing.


              Currently my code works in the following way. Maybe someone can guide me on which flag I should use, or how I can change my code for better performance.

              I initialize an input and an output array, put them in cl buffers using CL_MEM_USE_HOST_PTR, pass them as arguments to 3 kernels, and do some work on them in the kernels. After the 3 kernels are done, on the APU I don't have to read back the output buffer; on the discrete GPU I do. Then I verify the results and print the execution time. The discrete GPU is giving much better performance (lower execution time) than the fused one. The kernel time measured by event profiling can be lower on the discrete GPU, but shouldn't the overall time of the 3 kernels, once launch and transfer overheads are added, favour the fused one?


              P.S. I am using Linux (Ubuntu 12.04); the APU is Devastator and the discrete GPU is Cypress.