It is probably related to buffer initialization on the first call. There is some overhead the first time you use a "cl_mem" object, so skip the first call and measure the second and third calls instead. Better still, loop and average.
Thanks for your reply. I looped 100 times and averaged: from the second call onward, the transfer speed of the malloc memory is twice that of the mmap memory. I also checked the alignment: the malloc memory is aligned to 16 bytes and the mmap memory to 4 KB.
Okay. Do you know the significance of the VM_IO flag? Does it make the memory uncached? Does it also prevent speculative reads or prefetches on those locations?
Try just "memset"ing both memory regions, and also do some "memcpy"s within your process address space. That will tell you whether there are extra read/write latencies associated with your virtual addresses.
Hi, sorry for the late reply; I was away on a short holiday.
I think the cause of the problem is the way I share memory between kernel and user space.
I tried the approach described in the book Linux Device Drivers, but ran into some problems.
When sharing memory through a file mapping, remap_pfn_range automatically sets the VM_IO flag on the VMA struct, even though I never set it myself in my my_mmap method.
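For context, a my_mmap handler of the kind described might look roughly like this (an illustrative sketch, not my actual code; `buf` and `BUF_SIZE` stand for a kernel buffer allocated elsewhere, and the exact flag behavior depends on kernel version):

```c
/* Sketch of a char-device mmap handler using remap_pfn_range. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = virt_to_phys(buf) >> PAGE_SHIFT;
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > BUF_SIZE)
        return -EINVAL;

    /* remap_pfn_range itself marks the VMA with VM_IO | VM_PFNMAP,
     * regardless of what flags the handler sets beforehand. */
    return remap_pfn_range(vma, vma->vm_start, pfn, len,
                           vma->vm_page_prot);
}
```

This is why the VM_IO flag appears on the VMA even when the handler never sets it explicitly.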
The interesting thing is that when I tried a trick before passing the memory pointer to the OpenCL API, clearing the VM_IO flag in the VMA struct to make the region look like ordinary malloc'd memory, the Catalyst driver complained in dmesg that an error had occurred because of the VMA flags. So the driver evidently relies on those flags when transferring data between host and device, even for ordinary memory. The dmesg output mentioned get_user_pages and vmap as the methods the driver uses to access user memory during transfers. That set me straight: I could use those two methods myself to share memory. I ran the experiment, and the transfer speed dropped by only about 10%, which is an acceptable result. I think my problem is solved.
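For anyone following the same path, the get_user_pages + vmap combination looks roughly like the sketch below: pin the user pages, then map them contiguously into kernel virtual address space. This is illustrative only; the get_user_pages signature shown is the old-style one from kernels of that era and has changed several times since, and error handling is trimmed:

```c
/* Sketch: pin a user buffer and map it into kernel space.
 * uaddr/len come from userspace; caller must later vunmap(),
 * set_page_dirty_lock(), and put_page() each page. */
static void *map_user_buf(unsigned long uaddr, size_t len,
                          struct page ***pages_out, int *npages_out)
{
    int npages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK), PAGE_SIZE);
    struct page **pages = kmalloc(npages * sizeof(*pages), GFP_KERNEL);
    void *kaddr;

    if (!pages)
        return NULL;

    down_read(&current->mm->mmap_sem);
    npages = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                            npages, 1 /* write */, 0 /* force */,
                            pages, NULL);
    up_read(&current->mm->mmap_sem);
    if (npages <= 0) {
        kfree(pages);
        return NULL;
    }

    kaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
    *pages_out = pages;
    *npages_out = npages;
    return kaddr ? (char *)kaddr + (uaddr & ~PAGE_MASK) : NULL;
}
```

The modest slowdown versus plain malloc'd memory is plausible here, since the pages are regular pagecache/anonymous pages rather than a VM_IO mapping the driver has to treat specially.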
But I am still wondering whether there is any official AMD material to help developers handle shared memory, or to speed up data transfer from kernel space. I think that would be a big advantage over NVIDIA.