zhuzxy
Journeyman III

How to do a lot of small memcpys for OpenCL on CPU?

For native C code, the memcpy() function is very efficient. But for OpenCL on the CPU, I don't know how to copy data from a given position to another location in memory.

In my problem, I need to copy many pieces of data from an image into an array. The copy start positions come from a previous kernel's results, and the copy lengths vary. I have tried copying element by element, but it is extremely inefficient. And because of alignment problems, I cannot do the copy using vector types.

Can anyone tell me how to do it efficiently?

 

9 Replies
antzrhere
Adept III

Have you tried clEnqueueCopyImageToBuffer()  ?


I think he means inside his OpenCL kernel, but without more description, I don't know...

That said, I think inside your kernel everything *probably* is already aligned... though I don't do much with OpenCL, so don't quote me on that.  If not, and the data is sufficiently large, you can copy single elements until you reach an aligned boundary (use & instead of % to test alignment), then use vector copies until you run out of 16-byte chunks to move, and finally unwind back to single 4-byte or single-byte elements (depending on what you're using) to finish up.  There may be better ways in OpenCL to do it, ways to get the compiler to generate burst copying at the hardware level, but that's beyond me... for GPGPU I work at a lower (and more painful) level...
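A minimal host-C sketch of that head/body/tail pattern (the function name and the 16-byte chunk size are my choices for illustration, nothing OpenCL-specific):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes: the unaligned head one byte at a time, the aligned
 * middle in 16-byte chunks (where a compiler can emit vector moves),
 * and the tail one byte at a time again. */
static void copy_head_body_tail(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* head: advance until dst is 16-byte aligned (& instead of %) */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        dst[i] = src[i];
        ++i;
    }
    /* body: 16 bytes at a time; constant-size memcpy lowers to a
     * single vector load/store pair on most compilers */
    for (; i + 16 <= n; i += 16)
        memcpy(dst + i, src + i, 16);
    /* tail: leftover bytes */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```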

 

 

notzed
Challenger

Generally for efficiency, all threads in the same work-group should cooperate on the same copy.

Ideally each thread reads consecutive memory addresses or, for images, close x/y locations; see e.g. the AMD OpenCL Programming Guide, section 4.15.2 ("Memory Tiling"), for why this is so for images.

How you map this to your particular problem is where the difficulty lies ...

 


I'd like to explain my problem more clearly. In my problem there are multiple kernels, and each kernel's output is the input to the next kernel.

In the last kernel, I have an 800x600 image stored in a one-dimensional array, and I get a collection of x,y coordinates from the previous kernel's output. What I need is to copy a 19x19 rect at each x,y into a new contiguous array and do some calculation on the rect. As the x,y values are random, it is hard to use vectors because of alignment problems on both the source and destination addresses. The x,y set is fairly large, and OpenCL on the CPU does not seem to provide a way to do the copy as efficiently as native host-side C code. This degrades the performance of the overall algorithm.
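In plain-C terms, the copy being described is one row-wise copy per tile row; a minimal sketch, assuming an 8-bit, row-major 800x600 image (the helper name and the pixel format are my assumptions, not the poster's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { IMG_W = 800, IMG_H = 600, TILE = 19 };

/* Copy the TILE x TILE block whose top-left corner is (x, y) out of a
 * row-major 8-bit image into a contiguous TILE*TILE buffer, one
 * memcpy per row. Hypothetical helper for illustration. */
static void extract_tile(unsigned char *dst,
                         const unsigned char *img, int x, int y)
{
    for (int row = 0; row < TILE; ++row)
        memcpy(dst + (size_t)row * TILE,
               img + (size_t)(y + row) * IMG_W + x,
               TILE);
}
```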


Sorry - I missed the bit about you using OpenCL on the CPU before (maybe I need glasses, since it's right there in the topic).  OpenCL on the CPU is pretty much the same as C on the CPU, and memcpy isn't really all that magic either.

If this is related to your other post about vectors, I'd say the overhead of marshalling into a vector is outweighing any advantage of a single vector multiply, and that's where your problem is, not with the memory loads.

 


Originally posted by: notzed Sorry - I missed the bit about you using OpenCL on the CPU before (maybe I need glasses, since it's right there in the topic).  OpenCL on the CPU is pretty much the same as C on the CPU, and memcpy isn't really all that magic either.

If this is related to your other post about vectors, I'd say the overhead of marshalling into a vector is outweighing any advantage of a single vector multiply, and that's where your problem is, not with the memory loads.

 



Ha!  I missed it too!  I guess then, my question is why OpenCL?  Are you planning on hitting up the GPU/APU side later?

Anyhow, I would expect everything to be aligned already, so hopefully there is no marshalling into vectors, since, again, I'd hope, their vectors are just float*, int*, etc.: 16-byte-aligned blocks of contiguous data.  That maps well to SSE and GPGPU, almost zero overhead, etc.

I guess the problem is: what is your image format?  Is each element 8 bits, 16 bits, 32 bits?  19x19 doesn't give us a lot to go on; that could be 2 16-byte copies per row, or it could be 5.  It's true that, since it's an image, reading 19x19 from an arbitrary location isn't going to be as fast as it could be, but here is what I'd do.

First off, do you have to copy?  Memory operations, even on the CPU, are expensive; even L1 cache can have an extra cycle associated with it.  If you can just calculate 19 row offsets, it's probably faster than copying 361 bytes at 8 bits per pixel (greyscale, or 256-color), and especially 1,444 bytes at 8 bits per channel, 4-channel RGBA.  That's 19 operations vs. 361 in x86 mode without using SSE for copies.  Basically, if you can destroy your original image, and operations on pixels don't trigger re-reads of the 19x19 section, you're better off calculating the offsets and working on the original data in place.
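The offsets-instead-of-copying idea can be sketched like this (hypothetical helper; 8-bit pixels and an 800-wide image assumed, matching the earlier description):

```c
#include <assert.h>
#include <stddef.h>

enum { IMG_W = 800, TILE = 19 };

/* Instead of copying TILE*TILE pixels, precompute one offset per row;
 * pixel (r, c) of the tile is then img[offs[r] + c]. That is 19
 * multiply-adds versus 361 byte copies. */
static void tile_row_offsets(size_t offs[TILE], int x, int y)
{
    for (int r = 0; r < TILE; ++r)
        offs[r] = (size_t)(y + r) * IMG_W + (size_t)x;
}
```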

If you have to copy, you have a few options, with various complexity and performance.  Best bet is to do aligned reads: calculate an offset at or before the start of the 19x19 data you need, and overread a little.  Read 16 bytes at a time, and pad the end of the image by 16 bytes so that if you overread a little, it won't crash.  This still means calculating the offsets, but you could break your image into 19 19-element buffers, so there's only 1 offset.
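A rough sketch of that aligned-overread scheme in plain C (the function name is mine; the constant-size memcpy stands in for an aligned vector move, and the source buffer must be padded so the overread cannot fault):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round the source pointer down to a 16-byte boundary and copy whole
 * 16-byte chunks covering [src, src+n). The destination receives up
 * to 15 extra leading and trailing bytes; the wanted data starts at
 * the returned offset. The source must be padded by 16 bytes at the
 * end (and not sit at the very start of its allocation) so the
 * overread stays inside valid memory. */
static size_t copy_aligned_overread(uint8_t *dst, const uint8_t *src, size_t n)
{
    uintptr_t a = (uintptr_t)src & ~(uintptr_t)15; /* aligned start */
    const uint8_t *p = (const uint8_t *)a;
    size_t skip  = (size_t)(src - p);              /* bytes before src */
    size_t total = (skip + n + 15) & ~(size_t)15;  /* round up to chunks */
    for (size_t i = 0; i < total; i += 16)
        memcpy(dst + i, p + i, 16);                /* stands in for movdqa */
    return skip;
}
```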

Next best bet would probably be unaligned reads, but still using vectors to get 16 bytes at a time.  OpenCL might not let you do that, I don't know; SSE can use movdqu for unaligned data, with a penalty.  lddqu was supposed to do what I described above, but doesn't seem to on anything but NetBurst, and that didn't help NetBurst much...

Next best bet is to move unsigned longs one at a time.  In OpenCL that's 8 bytes at a time; if you're in x64 mode, that's a single register at a time.

You can probably see where this is going, but the next best bet is unsigned ints, which is 4 bytes at a time, or a single x86 protected-mode register at a time.
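The descending word sizes can be sketched as follows (hypothetical helper; each constant-size memcpy compiles to a single load/store pair on mainstream compilers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes using the widest plain registers first: 8-byte words
 * (one x64 register per move), then one 4-byte word, then single
 * bytes for whatever remains. */
static void copy_words(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        memcpy(dst + i, src + i, 8);      /* 8 bytes per move */
    if (i + 4 <= n) {
        memcpy(dst + i, src + i, 4);      /* one 4-byte move */
        i += 4;
    }
    for (; i < n; ++i)
        dst[i] = src[i];                  /* leftover bytes */
}
```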

Anything less, and you're wasting cycles 🙂

 


Originally posted by: corry

Ha!  I missed it too!  I guess then, my question is why OpenCL?  Are you planning on hitting up the GPU/APU side later?

Anyhow, I would expect everything to be aligned already, so hopefully there is no marshalling into vectors, since, again, I'd hope, their vectors are just float*, int*, etc.: 16-byte-aligned blocks of contiguous data.  That maps well to SSE and GPGPU, almost zero overhead, etc.



I was referring to this post: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=155781&enterthread=y

Which actually looks buggy anyway: it's not getting a 19x19 chunk, it's getting 3 consecutive runs of 19 elements offset by x,y ...

And looking at that code it is unclear why one would need a copy anyway (unless those same 19x19 tiles are being reprocessed many times).

 

 

Next best bet would probably be unaligned reads, but still using vectors to get 16 bytes at a time.  OpenCL might not let you do that, I don't know; SSE can use movdqu for unaligned data, with a penalty.  lddqu was supposed to do what I described above, but doesn't seem to on anything but NetBurst, and that didn't help NetBurst much...

vloadn() is the only way to load unaligned data.  That is probably the only option here.
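For reference, vloadn's offset argument is scaled by the vector width, and the load has no alignment requirement. A plain-C stand-in for vload16 on uchar data would look like this (a simulation for illustration only, not the OpenCL built-in itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated result type standing in for OpenCL's uchar16. */
typedef struct { uint8_t v[16]; } uchar16_sim;

/* Mimics vload16(offset, p) on uchar data: an unaligned 16-byte load.
 * Note the offset is in units of whole 16-element vectors, per the
 * vloadn definition. Compilers typically lower the fixed-size memcpy
 * to a single unaligned vector load (movdqu on x86). */
static uchar16_sim vload16_sim(size_t offset, const uint8_t *p)
{
    uchar16_sim r;
    memcpy(r.v, p + 16 * offset, 16);
    return r;
}
```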

 


The post you referred to is part of the algorithm I am using to deal with the data after the memcpy. It is just a sample, and I want to find out the right way to do such calculations using OpenCL. I am just studying how to use OpenCL on the CPU platform.

The reason I use memcpy is that I did a test: if I do not do the memcpy and do the calculation on the original image directly, the performance is about 4 ms. If I do the memcpy and calculate on the copied data, it is about 2.8 ms. I guess the cache helps the performance. But the problem is that the memcpy operation takes away all the performance gain.

I may try vload and see if I can get a performance gain.

And thanks for your advice.

 


Originally posted by: notzed

vloadn() is the only way to load unaligned data.  That is probably the only option here.
Oh my...
I'm not sure how vloadn works, nor how the CPU would implement it vs. the GPU... the CPU with vectors would likely do a movdqu for an unaligned move, which works better than 2 movdqa's for 2 aligned moves (then unpacking to get at the unaligned data).  The GPU, I suspect, would have to do the 2 separate moves...
However, I think that's a discussion well beyond your problem.  notzed is right: you need to rethink your algorithm, and really everything, to use SIMD if you want to use vectors.  The CPU with vectors is SIMD, with SSE.  GPUs, as I just found out, aren't exactly SIMD, but it maps well and will make better use of the GPU...