Dear fellow developers,
It seems that creating an OpenCL buffer with both CL_MEM_READ_ONLY and CL_MEM_ALLOC_HOST_PTR results in the AMD platform allocating write-combined host memory. A simple example that reproduces this behavior is posted below. (I am using a Radeon Pro WX 5100 on Windows 10 with the latest Radeon Pro driver. The OS is 64-bit, but I am compiling the example as a 32-bit application.)
One thing that is rather curious is that when not passing CL_MEM_ALLOC_HOST_PTR but calling the map-command directly instead, the host allocation made available by the runtime is not allocated as write-combined (cf. the output generated by the program.)
#define __CL_ENABLE_EXCEPTIONS

// C++ includes
#include <iostream>
#include <string>
#include <vector>

// Windows API
#include <Windows.h>

// OpenCL includes
#include <CL/cl.hpp>

int main( void ) {
    try {
        std::vector< cl::Platform > platforms;
        std::vector< cl::Device > devices;

        // Platform selection
        cl::Platform::get( &platforms );
        const cl::Platform &platform = platforms[ 0 ];

        // Device selection
        platform.getDevices( CL_DEVICE_TYPE_GPU, &devices );
        const cl::Device &device = devices[ 0 ];

        // Print platform information
        std::string name;
        std::string version;
        platform.getInfo( CL_PLATFORM_NAME, &name );
        platform.getInfo( CL_PLATFORM_VERSION, &version );
        std::cout << "(Using the platform " << name << " at version " << version << ")" << std::endl;

        cl_context_properties props[ 3 ] = { CL_CONTEXT_PLATFORM, (cl_context_properties) (platform) (), 0 };
        cl::Context ctx( device, props );
        cl::CommandQueue queue( ctx, device );
        size_t bufferSize = 2048 * 1024 * sizeof( float );

        // Case 1: CL_MEM_READ_ONLY only, mapped for reading
        {
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY, bufferSize );
            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));
            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }
            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }

        // Case 2: CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, mapped for reading
        {
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize );
            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));
            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }
            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }

        queue.finish();
    } catch ( cl::Error &error ) {
        std::cerr << "OpenCL C++ API Exception during " << error.what() << ": " << error.err() << std::endl;
    }
    return 0;
}
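The check in the program above works because PAGE_WRITECOMBINE is a modifier bit that is OR-ed onto a base protection such as PAGE_READWRITE. The following is only a minimal sketch of that decoding step. The constants are local stand-ins for the values in the Windows headers so that it compiles without <Windows.h>, and the helper names isWriteCombined and describeCaching are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local stand-ins for the Windows memory protection constants (assumption:
// on Windows you would use the definitions from <Windows.h> instead).
constexpr std::uint32_t kPageReadWrite    = 0x04;   // PAGE_READWRITE
constexpr std::uint32_t kPageNoCache      = 0x200;  // PAGE_NOCACHE
constexpr std::uint32_t kPageWriteCombine = 0x400;  // PAGE_WRITECOMBINE

// Hypothetical helper: true if the write-combine modifier bit is set.
constexpr bool isWriteCombined( std::uint32_t protect ) {
    return ( protect & kPageWriteCombine ) != 0;
}

// Hypothetical helper: names the caching behaviour encoded in a protection
// value such as MEMORY_BASIC_INFORMATION::Protect or ::AllocationProtect.
std::string describeCaching( std::uint32_t protect ) {
    if ( protect & kPageWriteCombine ) return "write-combined";
    if ( protect & kPageNoCache ) return "uncached";
    return "cached";
}

// Compile-time sanity checks of the bit test.
static_assert( isWriteCombined( kPageReadWrite | kPageWriteCombine ), "modifier bit must be detected" );
static_assert( !isWriteCombined( kPageReadWrite ), "plain read/write pages are not write-combined" );
```

On Windows, one would pass memInfo.Protect or memInfo.AllocationProtect from VirtualQuery to these helpers.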
I would like to argue that automatically allocating the host memory associated with a CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR buffer as write-combined is not a good idea, and I believe this should be classified as a bug. For example, one thread could fill a buffer and use it in a computation while another thread reads the same buffer in order to save it to a log file. With a write-combined allocation, this read takes a long time (up to 26 times slower on my system). Instead, I would like to suggest that the runtime only allocate the host memory as write-combined if CL_MEM_HOST_WRITE_ONLY is specified in addition (as is kind of suggested by the OpenCL specification).
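To make the suggestion concrete, here is a minimal sketch of the decision rule I have in mind. It is not the actual AMD runtime logic, which I do not know. The constants are local stand-ins for the bit values in CL/cl.h so the snippet compiles without the OpenCL headers, and wantsWriteCombined is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins for the cl_mem_flags bits from CL/cl.h (assumption: with the
// OpenCL headers available you would use the real macros instead).
constexpr std::uint64_t kMemReadOnly      = 1u << 2;  // CL_MEM_READ_ONLY
constexpr std::uint64_t kMemAllocHostPtr  = 1u << 4;  // CL_MEM_ALLOC_HOST_PTR
constexpr std::uint64_t kMemHostWriteOnly = 1u << 7;  // CL_MEM_HOST_WRITE_ONLY

// Hypothetical policy: choose write-combined host memory only when the
// application has promised that the host will never read the buffer.
constexpr bool wantsWriteCombined( std::uint64_t flags ) {
    const std::uint64_t required = kMemReadOnly | kMemAllocHostPtr | kMemHostWriteOnly;
    return ( flags & required ) == required;
}

// The case from this post: no host-access promise, so no write-combining.
static_assert( !wantsWriteCombined( kMemReadOnly | kMemAllocHostPtr ), "host may still read" );
// With the explicit promise, write-combining becomes a safe optimization.
static_assert( wantsWriteCombined( kMemReadOnly | kMemAllocHostPtr | kMemHostWriteOnly ), "host writes only" );
```

Under this rule, existing code that reads such a buffer on the host would keep cacheable memory, and only code that opts in would get the write-combined allocation.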
Any comments on this observation? Thanks in advance for your replies.
Kind regards
bcaf01
PS: I would appreciate it if someone could add me to the white-list and move this topic to the appropriate developer forum!
Changed the title to give a better description of the issue.
You've been whitelisted now.
Regards,
Hi,
Usually, a buffer created with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR indicates that the programmer wants a pre-pinned (zero-copy) buffer to pass data from the host to the kernel. The host writes the data that the kernel will read. Because the buffer is read-only on the kernel side, there is little reason to read it again on the host side. In general, the data flow is one-directional.
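If the host also needs to read the data (for logging, say), one way to stay within this one-directional pattern is to assemble the data in ordinary host memory and copy it into the mapped region in a single sequential pass. The following is only a sketch under assumptions: publishToMapped is a hypothetical helper, and mapped stands in for a pointer obtained by mapping the buffer for writing, which the snippet itself does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies staged data into the mapped region using only
// sequential writes. `mapped` stands in for the pointer returned by
// enqueueMapBuffer; `capacity` is the number of floats the region can hold.
void publishToMapped( float *mapped, std::size_t capacity, const std::vector<float> &staging ) {
    assert( staging.size() <= capacity );
    if ( staging.empty() ) return;  // nothing to copy
    // The mapped region is written once and never read back on the host.
    std::memcpy( mapped, staging.data(), staging.size() * sizeof( float ) );
}
```

Any host-side reader then works on the staging vector, which lives in regular cacheable memory, rather than on the mapped pointer.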
Also note that CL_MEM_HOST_WRITE_ONLY was added to the spec later (in OpenCL 1.2).
"One thing that is rather curious is that when not passing CL_MEM_ALLOC_HOST_PTR but calling the map-command directly instead, the host allocation made available by the runtime is not allocated as write-combined (cf. the output generated by the program.)"
Not passing CL_MEM_ALLOC_HOST_PTR makes it a regular device buffer. Mapping it in read-only mode indicates that the host only wants to read the buffer, not write to it. So it is not the same as allocating a pre-pinned buffer with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR.
Regards,