What is the best buffer for this?
Many of my kernels need to return a small (<1 kB) vector of flags that determines whether a subsequent GPU-to-host memory transfer is needed.
Do I understand correctly that the best memory buffer for this vector would be
pre-pinned memory allocated on the host and accessed by the host via map/unmap commands? The GPU should also use that buffer directly.
i.e.
1. buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR) (but without the CL_MEM_READ_ONLY flag)
2. address = clEnqueueMapBuffer( buffer )
3. memset( address )
4. clEnqueueUnmapMemObject( buffer )
5. clEnqueueNDRangeKernel( buffer )
6. address = clEnqueueMapBuffer( buffer )
7. read by the CPU to check whether flag==1
8. goto 3.
Also, should buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY) perhaps be used instead of buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR)?
Given that only about 1 launch in a few hundred will change a flag from 0 to 1, which is better: to speed up GPU access by using uncached memory, or to leave the memory cached to speed up the subsequent check by the CPU?