Hi,
Here is the situation:
kernel side:
1. We allocate a large area of physical memory at Linux startup using alloc_boot_mem.
2. We write a simple char device driver that implements the mmap function in its file_operations struct to map that physical memory into user space.
user space side:
1. We open the char device and then use the mmap API to get a user-space virtual address for the physical memory we allocated in kernel space.
2. We allocate a buffer of the same size using malloc.
the test:
We used the OpenCL API clEnqueueWriteBuffer to transfer both memory areas to the device, but the transfer speeds differed. If we transfer the malloc'ed memory first and the mmap'ed memory second, the speeds are 1.5 GB/s vs 500 MB/s. If we transfer the mmap'ed memory first and the malloc'ed memory second, the speeds are 700 MB/s vs 900 MB/s.
Can anyone tell me why? And how can I make the mmap'ed memory transfer as fast as the malloc'ed memory?
Thanks.
Here is the mmap function implemented in the char device driver (note: it should map the size the caller requested, vma->vm_end - vma->vm_start, rather than passing boot_mem_size directly):

int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* Refuse mappings larger than the reserved boot memory. */
	if (size > boot_mem_size)
		return -EINVAL;

	vma->vm_flags |= VM_RESERVED | VM_IO;

	if (remap_pfn_range(vma, vma->vm_start,
			    virt_to_phys((void *)my_boot_mem) >> PAGE_SHIFT,
			    size, vma->vm_page_prot))
		return -EAGAIN;

	return 0;
}
Hi, due to a short holiday, I'm sorry for the late reply.
I think the cause of the problem is the way I tried to share memory between kernel and user space.
I followed the approach described in the book Linux Device Drivers, but ran into some problems.
When sharing memory through a file mapping, remap_pfn_range automatically sets the VM_IO flag on the VMA, even if I don't set it myself in my_mmap as shown above.
The interesting thing is that when I played a trick before passing the memory pointer to the OpenCL API and cleared the VM_IO flag in the VMA, to make the region look like normal malloc'ed memory, the Catalyst driver complained in dmesg that an error had occurred because of the VMA flags. I realized the Catalyst driver relies on those flags when transferring data between host and device, even though it is just normal memory. The dmesg output mentioned get_user_pages and vmap, the functions the driver uses to access user memory when transferring data. That set me straight: I could use those two functions to share memory instead. I ran the experiment, and the transfer speed was reduced by only about 10%, which is an acceptable result. I think my problem is solved.
But I am still wondering whether there is any official AMD material to help developers handle shared memory, or to speed up data transfers from kernel space. I think that would be a big advantage over NVIDIA.
It is probably to do with buffer initialization on the first call. There is some overhead the first time a cl_mem object is used, so skip the first call and measure the second and third calls. Better yet, loop and take an average.
Hi,
Thanks for your reply. I looped 100 times and averaged: even from the second call onward, the transfer speed of the malloc'ed memory is twice that of the mmap'ed memory. I also checked the alignment: the malloc'ed memory is aligned to 16 bytes and the mmap'ed memory to 4 KB.
OK. Do you know the significance of the VM_IO flag? Does it make the memory uncached? Does it also stop speculative reads or prefetches on those locations?
Try memset'ing both memory regions, and also do some memcpy's inside your process address space.
That will tell you whether there is extra read/write latency associated with your virtual address.
-
Bruha...