5 Replies Latest reply on Oct 16, 2008 2:41 PM by MicahVillmow

    data transfer rate between CPU and GPU supported from brook+ and CAL

    jfkong
      How can I speed up data transfer between the CPU and the GPU?

      For Brook+:
      Data transfer using streamRead and streamWrite is very slow (a maximum of around 160 MB/s on
      my machine: CentOS 5 x86-64 (RHEL 5 like), AMD/ATI 4870, SDK 1.2 beta).

      For CAL:
      there are several related function calls:
      calResAllocLocal1D, calResAllocLocal2D, calResAllocRemote1D, calResAllocRemote2D, calMemCopy
      (which is a DMA transfer according to the programming guide), calResMap, and calResUnmap.

      As I understand from the programming guide, I can allocate either GPU (local) or CPU (remote) memory. There are two ways to transfer data between the CPU and the GPU:

      1>>> Before I transfer data to the allocated memory resource, I have to map the resource to get a CPU pointer. Using that pointer, on the CPU side, I can transfer the data with regular indexed array reads and writes. What is the difference between local and remote memory in this case, then? I guess that for remote memory, these reads/writes behave like regular CPU memory accesses, and the actual transfer to the GPU only happens during kernel execution, when the data is demanded. For local memory,
      the actual transfer happens right after calResUnmap, using DMA or perhaps some slower path?
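
      Method 1 might look roughly like the following sketch, pieced together from the CAL programming guide (the names device, hostData, and width are assumptions, error handling is omitted, and this will not compile without the CAL SDK headers):

      ```c
      /* Method 1 (sketch): map the local resource, write through the
       * CPU pointer, unmap. */
      CALresource res;
      CALvoid*    ptr;
      CALuint     pitch;

      calResAllocLocal1D(&res, device, width, CAL_FORMAT_FLOAT_4, 0);
      calResMap(&ptr, &pitch, res, 0);    /* obtain a CPU-visible pointer     */
      memcpy(ptr, hostData, width * 16);  /* ordinary CPU writes, float4 data */
      calResUnmap(res);                   /* for a local resource, the copy to
                                             the GPU presumably happens here  */
      ```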


      2>>> Before I transfer data, I allocate two memory resources of the same size (one local, one remote). On the CPU side, I write the data into the remote memory, and then I use calMemCopy (DMA) to copy from remote to local. I would guess this is similar to cudaMemcpy. However, if I count everything, the bandwidth is around 280 MB/s. If I exclude the CPU-side transfer to/from the remote memory, the bandwidth is above 1 GB/s, which is similar to cudaMemcpy with pageable memory. BTW: there also seems to be no equivalent of CUDA's "pinned memory" concept in CAL.
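
      Method 2 might be sketched like this (again an assumption-laden sketch rather than verified code: ctx, device, hostData, and width are placeholders, and error handling is omitted):

      ```c
      /* Method 2 (sketch): stage data in a remote (host) resource, then
       * DMA it to a local (GPU) resource with calMemCopy. */
      CALresource remoteRes, localRes;
      CALmem      remoteMem, localMem;
      CALevent    event;
      CALvoid*    ptr;
      CALuint     pitch;

      calResAllocRemote1D(&remoteRes, &device, 1, width, CAL_FORMAT_FLOAT_4, 0);
      calResAllocLocal1D(&localRes, device, width, CAL_FORMAT_FLOAT_4, 0);

      calResMap(&ptr, &pitch, remoteRes, 0);
      memcpy(ptr, hostData, width * 16);               /* fill staging buffer */
      calResUnmap(remoteRes);

      calCtxGetMem(&remoteMem, ctx, remoteRes);        /* bind to the context */
      calCtxGetMem(&localMem, ctx, localRes);
      calMemCopy(&event, ctx, remoteMem, localMem, 0); /* queue the DMA       */
      while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING)
          ;                                            /* wait for completion */
      ```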


      I expect the second method to be the faster one. I would appreciate any further explanation and discussion.

        • data transfer rate between CPU and GPU supported from brook+ and CAL
          MicahVillmow
          jfkong,
          The equivalent of CUDA pinned memory is calCtxResCreate, which is in cal_ext.h. We are working on an example that shows how to use this correctly. As for analyzing memory performance, it isn't as simple as you state: there are cases where method 1 is faster and cases where method 2 is faster. Please view http://coachk.cs.ucf.edu/courses/CDA6938/ and look at the performance modeling slides for more information.
            • data transfer rate between CPU and GPU supported from brook+ and CAL
              jfkong


              Originally posted by: MicahVillmow jfkong, The equivalent to cuda pinned memory is calCtxResCreate which is in cal_ext.h. We are working on an example that shows how to use this correctly. For analysis of memory, it isn't as simple as you state as there are cases where 1 is faster and cases where 2 is faster. Please view http://coachk.cs.ucf.edu/courses/CDA6938/ and look at the performance modeling slides for more information.


              My point is that CUDA has only one way of transferring data between the CPU and the GPU (cudaMemcpy, either sync or async), whereas CAL basically has two. You either allocate remote/local memory, work through the derived CPU pointer, and hope that the data transfer is handled automatically by CAL; or you allocate two equivalent remote/local resources, work through the derived CPU pointer, and then explicitly invoke calMemCopy to do the DMA. There are differences between the two for which I cannot find any explanation in the documentation. BTW: I took the class.

            • data transfer rate between CPU and GPU supported from brook+ and CAL
              MicahVillmow
              Aside from application-specific performance differences between the two methods, the major difference is that calResMap/calResUnmap on a local surface is an implicit synchronous copy, whereas calMemCopy can be considered asynchronous.
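
              Since calMemCopy returns an event rather than blocking, CPU work can in principle overlap the DMA. A hedged sketch (doOtherCpuWork and the surrounding names are placeholders, and this assumes the resources are already bound to the context):

              ```c
              CALevent event;
              calMemCopy(&event, ctx, remoteMem, localMem, 0); /* queue DMA */
              doOtherCpuWork();              /* overlaps with the transfer  */
              while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING)
                  ;                          /* synchronize before use      */
              ```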

              On a side note, comparing CAL's options to CUDA's is a bit off, because they sit at different levels of abstraction. If you look at CUDA's device driver interface, it also has multiple ways of allocating and copying memory. An equivalent comparison would be to pit CUDA's memory copies against Brook+'s memory copies (which are sync or async depending on heuristics in the runtime), and then compare CAL to CUDA's lower-level interface.

              As for the documentation issue, is the documentation in the doc/html directory insufficient? If there is something you believe should be added, please post it to the 1.2 SDK feedback page.
                • data transfer rate between CPU and GPU supported from brook+ and CAL
                  jfkong

                  Thanks for the reply.

                  I guess it is remote/local memory and calResMap/calResUnmap that are confusing me. As a software developer, I am mostly interested in moving data quickly between the CPU and the GPU. We'd like to see performance data for the different transfer paths CAL supports, and a good code example would be perfect. The DownloadReadback tutorial is currently not sufficient for that purpose. We'd like to see bandwidth numbers such as host-to-device, device-to-host, device-to-device, etc. Actually I am measuring that right now. haha

                • data transfer rate between CPU and GPU supported from brook+ and CAL
                  MicahVillmow
                  I'll add that as a feature request for future SDK releases.