Archives Discussions

zhuzxy · ‎09-20-2011

For GPGPU, we can use multiple work items to do the copy, but on the CPU, where the number of work items may be very small, what's the best practice for memcpy? E.g., to copy 17 lines of 17 chars each, what's the best practice in theory? Copy the bytes one by one?

LeeHowes · ‎09-20-2011

How would you do it normally on the CPU? I would assume an unrolled SSE loop per core. So do that in OpenCL too, but for your convenience you can use the vector types instead of SSE intrinsics.
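As a rough illustration of the "SSE loop" idea for the 17 x 17 case from the question, here is a minimal plain C sketch, not a tuned implementation. It assumes each 17-byte row can be moved as one unaligned 16-byte vector plus a 1-byte tail. The names `copy_row17` and `copy_block17` and the pitch parameters are hypothetical, and it falls back to `memcpy` when SSE2 is not available. Inside an OpenCL kernel the 16-byte move would correspond to `vload16`/`vstore16` on `uchar` pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

enum { ROWS = 17, ROW_BYTES = 17 };  /* the 17 x 17 case from the question */

/* Copy one 17-byte row: one unaligned 16-byte move plus a 1-byte tail. */
static void copy_row17(unsigned char *dst, const unsigned char *src)
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
#else
    memcpy(dst, src, 16);  /* portable fallback when SSE2 is unavailable */
#endif
    dst[16] = src[16];
}

/* Copy ROWS rows. The pitches (hypothetical parameters) are the number of
   bytes between the starts of consecutive rows in each buffer. */
static void copy_block17(unsigned char *dst, size_t dst_pitch,
                         const unsigned char *src, size_t src_pitch)
{
    assert(dst_pitch >= ROW_BYTES && src_pitch >= ROW_BYTES);
    for (int r = 0; r < ROWS; ++r)
        copy_row17(dst + (size_t)r * dst_pitch, src + (size_t)r * src_pitch);
}
```

Because the row count is a compile-time constant, the compiler is free to unroll the loop. Whether this beats a plain `memcpy` per row on a given CPU is something to measure rather than assume.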

It depends on what you're trying to achieve, though. If you're just doing a massive memcpy you might be better off creating a native kernel that calls an external library with a parallel memcpy routine.

twentz · ‎09-21-2011

I've never benchmarked it, but there's an OpenCL kernel function called "async_work_group_copy", and my guess is that it optimizes memory movement (although it would also require synchronization, depending on your program).

notzed · ‎09-24-2011

Originally posted by: twentz I've never benchmarked it, but there's an OpenCL kernel function called "async_work_group_copy", and my guess is that it optimizes memory movement (although it would also require synchronization, depending on your program).

I would guess that async_work_group_copy is really just a way to access the asynchronous DMA transfer system on a Cell BE: it pretty much maps 1:1 to a simplified view of the hardware interface (and without it, the Cell BE is pretty much knee-capped). I imagine every other implementation is just there for completeness and might not necessarily be as optimised, or even asynchronous.

Although there's nothing to say it couldn't be.

How to do an optimized memcpy in a kernel for OpenCL on the CPU?