Archives Discussions

bcaf01 · 08-10-2017

Dear fellow developers,

It seems that creating an OpenCL buffer with both CL_MEM_READ_ONLY and CL_MEM_ALLOC_HOST_PTR results in the AMD platform allocating write-combined host memory. A simple example that reproduces this behavior is posted below. (I am using a Radeon Pro WX 5100 on Windows 10 (64-bit, although I am compiling the example as a 32-bit application) with the latest Radeon Pro driver.)

One thing that is rather curious is that when not passing CL_MEM_ALLOC_HOST_PTR but calling the map-command directly instead, the host allocation made available by the runtime is not allocated as write-combined (cf. the output generated by the program.)

#define __CL_ENABLE_EXCEPTIONS
// C++ includes
#include <iostream>
#include <string>
#include <vector>
// Windows API
#include <Windows.h>
// OpenCL includes
#include <CL/cl.hpp>
int main( void ) {
    try {
        std::vector< cl::Platform > platforms;
        std::vector< cl::Device > devices;
        // Platform selection
        cl::Platform::get( &platforms );
        const cl::Platform &platform = platforms[ 0 ];
        // Device selection
        platform.getDevices( CL_DEVICE_TYPE_GPU, &devices );
        const cl::Device &device = devices[ 0 ];
        // Print platform information
        std::string name;
        std::string version;
        platform.getInfo( CL_PLATFORM_NAME, &name );
        platform.getInfo( CL_PLATFORM_VERSION, &version );
        std::cout << "(Using the platform " << name << " at version " << version << ")" << std::endl;
        cl_context_properties props[ 3 ] = { CL_CONTEXT_PLATFORM, (cl_context_properties) (platform) (), 0 };
        cl::Context ctx( device, props );
        cl::CommandQueue queue( ctx, device );
        size_t bufferSize = 2048 * 1024 * sizeof( float );
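        // Case 1: buffer created without CL_MEM_ALLOC_HOST_PTR, then mapped for reading
        {
            std::cout << "CL_MEM_READ_ONLY:" << std::endl;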
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY, bufferSize );
            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));
            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }
            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }
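        // Case 2: buffer created with CL_MEM_ALLOC_HOST_PTR, then mapped for reading
        {
            std::cout << "CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR:" << std::endl;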
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize );
            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));
            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }
            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }
        queue.finish();
    } catch ( cl::Error &error ) {
        std::cerr << "OpenCL C++ API Exception during " << error.what() << ": " << error.err() << std::endl;
    }
    return 0;
}

I would argue that automatically allocating the host memory associated with a CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR buffer as write-combined is not a good idea, and I believe this should be classified as a bug. For example, one thread might fill a buffer and use it in a computation while another thread reads the same buffer in order to save it to a log file. With a write-combined allocation, reading the buffer takes a long time (up to 26 times slower on my system). Instead, I would suggest that the runtime allocate the host memory as write-combined only if CL_MEM_HOST_WRITE_ONLY is specified in addition (as the OpenCL specification seems to suggest).
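To make the difference between the two policies concrete, here is a minimal sketch. It models only the two cases observed in this thread and the rule proposed above; it is not the driver's actual decision logic, which is unknown. The SKETCH_ constants and both function names are hypothetical. The bit values mirror those in CL/cl.h and are redefined here only so the snippet compiles without an OpenCL SDK.

```cpp
#include <cassert>
#include <cstdint>

// cl_mem_flags is a 64-bit bitfield; the values below mirror CL/cl.h.
// The SKETCH_ names are hypothetical stand-ins, not part of any OpenCL header.
using MemFlags = std::uint64_t;
constexpr MemFlags SKETCH_MEM_READ_ONLY       = MemFlags(1) << 2;  // CL_MEM_READ_ONLY
constexpr MemFlags SKETCH_MEM_ALLOC_HOST_PTR  = MemFlags(1) << 4;  // CL_MEM_ALLOC_HOST_PTR
constexpr MemFlags SKETCH_MEM_HOST_WRITE_ONLY = MemFlags(1) << 7;  // CL_MEM_HOST_WRITE_ONLY

// True if every bit in `required` is set in `flags`.
constexpr bool hasAll(MemFlags flags, MemFlags required) {
    return (flags & required) == required;
}

// Model of the two observations in this thread (an assumption about the
// pattern, not the runtime's real code): READ_ONLY | ALLOC_HOST_PTR was
// write-combined, READ_ONLY alone was not.
constexpr bool observedUsesWriteCombined(MemFlags flags) {
    return hasAll(flags, SKETCH_MEM_READ_ONLY | SKETCH_MEM_ALLOC_HOST_PTR);
}

// Proposed rule: write-combined only if the application also promises
// that the host will never read the buffer.
constexpr bool proposedUsesWriteCombined(MemFlags flags) {
    return hasAll(flags, SKETCH_MEM_READ_ONLY | SKETCH_MEM_ALLOC_HOST_PTR |
                         SKETCH_MEM_HOST_WRITE_ONLY);
}

// Small self-check covering the flag combination from the reproduction above.
inline void checkPolicySketch() {
    const MemFlags repro = SKETCH_MEM_READ_ONLY | SKETCH_MEM_ALLOC_HOST_PTR;
    assert(observedUsesWriteCombined(repro));
    assert(!proposedUsesWriteCombined(repro));
    assert(proposedUsesWriteCombined(repro | SKETCH_MEM_HOST_WRITE_ONLY));
}
```

Under the proposed rule, the flags used in the reproduction would yield ordinary cacheable memory, and an application that really wants write-combined memory would opt in by adding CL_MEM_HOST_WRITE_ONLY.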

Any comments on this observation? Thanks in advance for your replies.

Kind regards

bcaf01

PS: I would appreciate it if someone could add me to the white-list and move this topic to the appropriate developer forum!

Changed the title to give a better description of the issue.

dipak · 08-11-2017

You've been whitelisted now.

Regards,

dipak · 08-17-2017

Hi,

Usually, creating a buffer with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR indicates that the programmer wants a pre-pinned (zero-copy) buffer to pass data from the host to a kernel. The host writes the data, and the kernel reads it. Because the buffer is read-only on the kernel side, there is little reason to read it back on the host side. In general, the data flow is one-directional.

Also note that CL_MEM_HOST_WRITE_ONLY was added to the spec later.
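For the logging scenario from the original post, one way to stay within this one-directional model is to keep a cacheable shadow copy on the host, write it into the mapped region in a single sequential pass, and let the logging thread read only the shadow copy. The following is a minimal sketch under stated assumptions: publishToMapped and sumForLog are hypothetical helper names, and `mapped` stands in for the pointer that a write mapping of the buffer (for example with CL_MAP_WRITE) would return. Here it is an ordinary float pointer so the snippet compiles without OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies the cacheable shadow data into the mapped
// region in one sequential pass and never reads from `mapped`, which is
// the access pattern write-combined memory is suited to.
inline void publishToMapped(float* mapped, const std::vector<float>& shadow) {
    if (shadow.empty()) {
        return;  // nothing to copy
    }
    assert(mapped != nullptr);
    std::memcpy(mapped, shadow.data(), shadow.size() * sizeof(float));
}

// Hypothetical logging path: reads only the cacheable shadow copy, so it
// never touches the (possibly write-combined) mapped memory.
inline double sumForLog(const std::vector<float>& shadow) {
    double total = 0.0;
    for (float value : shadow) {
        total += value;
    }
    return total;
}
```

The cost of this design is a second host copy of the data, so it trades memory for fast host-side reads. Whether that is acceptable depends on the buffer size and on how often the log is written.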

bcaf01 wrote: "One thing that is rather curious is that when not passing CL_MEM_ALLOC_HOST_PTR but calling the map-command directly instead, the host allocation made available by the runtime is not allocated as write-combined (cf. the output generated by the program.)"

Not passing CL_MEM_ALLOC_HOST_PTR makes it a regular device buffer. Mapping it in read-only mode indicates that the host only wants to read the buffer, not write to it. So it is not the same as allocating a pre-pinned buffer with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR.

Regards,

Buffers with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR always have write-combined host allocation