Hi all, I have a sequential optical flow algorithm that I recently parallelized using OpenCL. When I run the code on an NVIDIA GPU, the speedup is promising. But when I run it on an AMD or Intel CPU, it is slower than the sequential algorithm on the CPU. Can anyone give me an idea of what causes this?
By the way, I profiled the memory copy time, and it takes a large portion of the total time. If the program runs on the CPU, the data should already be in CPU memory, right? Then why does the copy take so long?