I'm developing a program that keeps a lot of data in GPU memory, and it seems like the global buffer does the job.
However, when I want to read the results back after kernel execution, there doesn't seem to be a way to map or copy only part of the global memory. This slows my application down considerably: even mapping the memory seems to take time proportional to the total amount of global memory allocated, although I only need to read back one word per thread.
I tried allocating several global buffers, but that doesn't seem to work. Is there any solution to this problem?