Originally posted by: Raistmer

Many of my kernels need to return a small (<1 kB) vector of flags that determines whether a subsequent GPU-to-host memory transfer is needed. Do I understand correctly that the best buffer for this vector is pre-pinned memory allocated on the host and accessed by the host via map/unmap commands? The GPU should also use that buffer directly, i.e.:

1. buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR) (but without the CL_MEM_READ_ONLY flag)
2. address = clEnqueueMapBuffer( buffer )
3. memset( address )
4. clEnqueueUnmapMemObject( buffer )
5. clEnqueueNDRangeKernel( buffer )
6. address = clEnqueueMapBuffer( buffer )
7. read by CPU to check if flag==1
8. goto 3.

Or should buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY) be used instead of buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR)? Given that only 1 launch in a few hundred will change a flag from 0 to 1, which is better: speeding up GPU access by using uncached memory, or leaving the memory cached to speed up the subsequent check by the CPU?
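Your buffer type is correct for this situation. One more thing: you should use CL_MEM_WRITE_ONLY, since the kernel writes into this buffer. The read/write flags describe how the kernel accesses the buffer, so CL_MEM_READ_ONLY would not fit here.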