bcaf01

Buffers with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR always have write-combined host allocation

Discussion created by bcaf01 on Aug 10, 2017
Latest reply on Aug 17, 2017 by dipak

Dear fellow developers,

it seems that creating an OpenCL buffer with both CL_MEM_READ_ONLY and CL_MEM_ALLOC_HOST_PTR results in the AMD platform allocating write-combined host memory. A simple example that reproduces this behavior is posted below. (I am using a Radeon Pro WX 5100 on Windows 10 (64-bit, although I am compiling the example as a 32-bit application) with the latest Radeon Pro driver.)

 

Curiously, when I omit CL_MEM_ALLOC_HOST_PTR and simply map the buffer, the host allocation made available by the runtime is not write-combined (cf. the output generated by the program).

 

#define __CL_ENABLE_EXCEPTIONS

// C++ includes
#include <iostream>
#include <string>
#include <vector>

// Windows API
#include <Windows.h>

// OpenCL includes
#include <CL/cl.hpp>


int main( void ) {
    try {
        std::vector< cl::Platform > platforms;
        std::vector< cl::Device > devices;

        // Platform selection
        cl::Platform::get( &platforms );
        const cl::Platform &platform = platforms[ 0 ];

        // Device selection
        platform.getDevices( CL_DEVICE_TYPE_GPU, &devices );
        const cl::Device &device = devices[ 0 ];

        // Print platform information
        std::string name;
        std::string version;
        platform.getInfo( CL_PLATFORM_NAME, &name );
        platform.getInfo( CL_PLATFORM_VERSION, &version );
        std::cout << "(Using the platform " << name << " at version " << version << ")" << std::endl;

        cl_context_properties props[ 3 ] = { CL_CONTEXT_PLATFORM, (cl_context_properties) (platform) (), 0 };
        cl::Context ctx( device, props );
        cl::CommandQueue queue( ctx, device );

        size_t bufferSize = 2048 * 1024 * sizeof( float );

        {
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY, bufferSize );

            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));

            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }

            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }
        {
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize );

            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));

            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }

            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }

        queue.finish();
    } catch ( cl::Error &error ) {
        std::cerr << "OpenCL C++ API Exception during " << error.what() << ": " << error.err() << std::endl;
    }

    return 0;
}

 

I would like to argue that automatically allocating the host memory associated with a CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR buffer as write-combined is not a good idea, and I believe this should be classified as a bug. You could have, for example, one thread filling a buffer and using it in a computation while another thread reads the same buffer in order to save it to a log file. With a write-combined allocation, this read takes a long time (up to 26 times slower on my system). I would instead suggest that the runtime allocate the host memory as write-combined only if CL_MEM_HOST_WRITE_ONLY is specified in addition (as the OpenCL specification more or less suggests).
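
The rule I am proposing can be written down as a small predicate over the buffer flags. This is only a sketch of the suggested policy, not how the AMD runtime actually decides: shouldUseWriteCombined and the k-prefixed constants are hypothetical names, and the flag values are repeated from the Khronos cl.h header so that the snippet compiles without the OpenCL SDK.

```cpp
#include <cstdint>

// cl_mem_flags is a 64-bit bitfield; the values below mirror cl.h.
using MemFlags = std::uint64_t;

constexpr MemFlags kMemReadOnly      = MemFlags{ 1 } << 2;  // CL_MEM_READ_ONLY
constexpr MemFlags kMemAllocHostPtr  = MemFlags{ 1 } << 4;  // CL_MEM_ALLOC_HOST_PTR
constexpr MemFlags kMemHostWriteOnly = MemFlags{ 1 } << 7;  // CL_MEM_HOST_WRITE_ONLY

// Hypothetical policy: choose a write-combined host allocation only when the
// application asked for a host allocation AND promised never to read it from
// the host.
constexpr bool shouldUseWriteCombined( MemFlags flags ) {
    return ( flags & kMemAllocHostPtr ) != 0 && ( flags & kMemHostWriteOnly ) != 0;
}
```

Under this rule, the CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR buffer from my example would get ordinary cached host memory, and only a buffer that additionally carries CL_MEM_HOST_WRITE_ONLY would be write-combined.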

 

Any comments on this observation? Thanks in advance for your replies.

 

 

Kind regards

bcaf01

 

PS: I would appreciate it if someone could add me to the white-list and move this topic to the appropriate developer forum!

 

Changed the title to give a better description of the issue.
