Data transfer using streamRead and streamWrite is very slow: the maximum is around 160 MB/s on
my machine (CentOS 5 x86-64, i.e. RHEL 5-like, with an AMD/ATI 4870 and SDK 1.2 beta).
Several function calls are related to this:
calResAllocLocal1D, calResAllocLocal2D, calResAllocRemote1D, calResAllocRemote2D, calMemCopy
(which is a DMA transfer according to the programming guide), calResMap and calResUnmap.
As I understand from the programming guide, I can allocate either GPU memory (local) or CPU memory (remote). There are two ways to transfer data between the CPU and the GPU:
1>>> Before I transfer data to an allocated memory resource, I have to map the resource to get a CPU pointer. Using that pointer, I can transfer the data on the CPU side with regular indexed array reads and writes. What, then, is the difference between local memory and remote memory when used this way? My guess is that for remote memory the read/write is like a regular CPU memory access, and the actual transfer to the GPU happens only during kernel execution, when the data is demanded. For local memory,
does the actual data transfer happen right after calResUnmap, using DMA or perhaps some slower path?
2>>> Before I transfer the data, I need two memory resources of the same size allocated, one local and one remote. On the CPU side, I write the data to the remote memory, and then I use calMemCopy (DMA) to copy from remote to local. I would guess this is similar to cudaMemcpy. However, if I time the whole sequence, the bandwidth is around 280 MB/s. If I exclude the CPU-side copy to/from the remote memory, the bandwidth is above 1 GB/s, which is similar to cudaMemcpy using pageable memory. BTW: there also seems to be no "CUDA pinned memory" concept in CAL.
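As a rough sanity check on the numbers in the second method, here is a minimal sketch of the arithmetic. It assumes the CPU-side copy and the DMA copy run strictly one after the other with no overlap, and it takes the DMA rate as exactly 1000 MB/s (I only measured "above 1 GB/s"). The function names are hypothetical helpers, not part of CAL.

```c
/* Effective bandwidth of two copy stages executed back to back:
 * moving N bytes takes N/b_host + N/b_dma seconds in total.
 * All rates are in MB/s. Assumes the stages do not overlap. */
double serial_bandwidth(double b_host, double b_dma)
{
    return 1.0 / (1.0 / b_host + 1.0 / b_dma);
}

/* Inverse of the above: given the measured end-to-end rate and the
 * DMA-only rate, recover the rate of the CPU-side copy stage.
 * Requires b_total < b_dma. */
double implied_host_bandwidth(double b_total, double b_dma)
{
    return 1.0 / (1.0 / b_total - 1.0 / b_dma);
}
```

With my figures, implied_host_bandwidth(280.0, 1000.0) comes out at about 389 MB/s, and lower still if the DMA stage is really faster than 1 GB/s. So under these assumptions the CPU-side write into the mapped remote memory, not the DMA, accounts for most of the time.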
It seems to me that the second method would be the fastest way. I would appreciate any further explanation and discussion.