
bcaf01
Journeyman III

Buffers with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR always have write-combined host allocation

Dear fellow developers,

It seems that creating an OpenCL buffer with both CL_MEM_READ_ONLY and CL_MEM_ALLOC_HOST_PTR results in the AMD platform allocating write-combined host memory. A simple example that reproduces this behavior is posted below. (I am using a Radeon Pro WX 5100 on Windows 10 (64-bit, although I compile the example as a 32-bit application) with the latest Radeon Pro driver.)

One thing that is rather curious: when CL_MEM_ALLOC_HOST_PTR is omitted and the buffer is simply mapped, the host allocation provided by the runtime is not write-combined (cf. the output generated by the program).

#define __CL_ENABLE_EXCEPTIONS

// C++ includes
#include <iostream>
#include <string>
#include <vector>

// Windows API
#include <Windows.h>

// OpenCL includes
#include <CL/cl.hpp>

int main( void ) {
    try {
        std::vector< cl::Platform > platforms;
        std::vector< cl::Device > devices;

        // Platform selection
        cl::Platform::get( &platforms );
        const cl::Platform &platform = platforms[ 0 ];

        // Device selection
        platform.getDevices( CL_DEVICE_TYPE_GPU, &devices );
        const cl::Device &device = devices[ 0 ];

        // Print platform information
        std::string name;
        std::string version;
        platform.getInfo( CL_PLATFORM_NAME, &name );
        platform.getInfo( CL_PLATFORM_VERSION, &version );
        std::cout << "(Using the platform " << name << " at version " << version << ")" << std::endl;

        cl_context_properties props[ 3 ] = { CL_CONTEXT_PLATFORM, (cl_context_properties) (platform) (), 0 };
        cl::Context ctx( device, props );
        cl::CommandQueue queue( ctx, device );

        size_t bufferSize = 2048 * 1024 * sizeof( float );

        // Case 1: CL_MEM_READ_ONLY only; the buffer is mapped without CL_MEM_ALLOC_HOST_PTR
        {
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY, bufferSize );
            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));

            // Query the page protection of the mapped host pointer
            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }

            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }

        // Case 2: CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR
        {
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize );
            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));

            // Query the page protection of the mapped host pointer
            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }

            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }

        queue.finish();
    } catch ( cl::Error &error ) {
        std::cerr << "OpenCL C++ API Exception during " << error.what() << ": " << error.err() << std::endl;
    }

    return 0;
}
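The check in the program uses a bitwise AND because PAGE_WRITECOMBINE is a modifier bit that Windows combines with a base protection such as PAGE_READWRITE. The minimal sketch below illustrates why an equality comparison would miss it. The constant values are copied from the Windows SDK so that the snippet builds without Windows.h; the names `kPageReadWrite`, `kPageWriteCombine` and `isWriteCombined` are hypothetical and used only for illustration.

```cpp
// Values copied from the Windows SDK (winnt.h) so the sketch builds anywhere.
constexpr unsigned long kPageReadWrite    = 0x04;   // PAGE_READWRITE
constexpr unsigned long kPageWriteCombine = 0x400;  // PAGE_WRITECOMBINE (modifier bit)

// Hypothetical helper: true if the write-combine modifier is set in a
// protection value such as MEMORY_BASIC_INFORMATION::Protect.
constexpr bool isWriteCombined( unsigned long protect ) {
    return ( protect & kPageWriteCombine ) != 0;
}

// A write-combined, read-write page carries both bits ...
static_assert( isWriteCombined( kPageReadWrite | kPageWriteCombine ),
               "modifier bit is detected next to the base protection" );

// ... so comparing the whole value against PAGE_WRITECOMBINE would fail.
static_assert( ( kPageReadWrite | kPageWriteCombine ) != kPageWriteCombine,
               "an equality test would miss the combined value" );

// A plain read-write page is not reported as write-combined.
static_assert( !isWriteCombined( kPageReadWrite ),
               "no modifier bit, no write-combining" );
```

The same test applies to both `Protect` and `AllocationProtect`, which is how the program above reports the two values.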

I would argue that automatically allocating the host memory behind a CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR buffer as write-combined is not a good idea, and I believe it should be classified as a bug. Consider, for example, one thread that fills a buffer and uses it in a computation while another thread reads the same buffer in order to save it to a log file. With a write-combined allocation, reading the buffer takes a long time (up to 26 times slower on my system). I would instead suggest that the runtime allocate the host memory as write-combined only if CL_MEM_HOST_WRITE_ONLY is specified as well (as the OpenCL specification more or less suggests).
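To make the suggestion concrete, here is a minimal sketch of the two policies written as predicates over the buffer flags. The flag values mirror those in CL/cl.h so that the snippet builds without the OpenCL headers. The function names are hypothetical, and `observedWriteCombined` only models the behavior seen in the test program; it says nothing about how the AMD runtime is actually implemented.

```cpp
#include <cstdint>

// Flag values copied from CL/cl.h (cl_mem_flags is a 64-bit bitfield).
constexpr std::uint64_t kMemReadOnly      = 1u << 2;  // CL_MEM_READ_ONLY
constexpr std::uint64_t kMemAllocHostPtr  = 1u << 4;  // CL_MEM_ALLOC_HOST_PTR
constexpr std::uint64_t kMemHostWriteOnly = 1u << 7;  // CL_MEM_HOST_WRITE_ONLY

// Model of the observed behavior (an assumption based on the program output):
// read-only for the kernel plus a runtime-allocated host pointer.
constexpr bool observedWriteCombined( std::uint64_t flags ) {
    return ( flags & kMemReadOnly ) != 0 && ( flags & kMemAllocHostPtr ) != 0;
}

// Suggested policy: additionally require the explicit promise that the
// host will only ever write to the buffer.
constexpr bool proposedWriteCombined( std::uint64_t flags ) {
    return observedWriteCombined( flags ) && ( flags & kMemHostWriteOnly ) != 0;
}

// The flag combination from the test program differs between the two policies.
static_assert( observedWriteCombined( kMemReadOnly | kMemAllocHostPtr ),
               "observed: write-combined" );
static_assert( !proposedWriteCombined( kMemReadOnly | kMemAllocHostPtr ),
               "proposed: regular cached memory" );
static_assert( proposedWriteCombined( kMemReadOnly | kMemAllocHostPtr | kMemHostWriteOnly ),
               "proposed: write-combined only with the host-write-only hint" );
```

Under the suggested policy, an application that never reads the buffer on the host opts in with one extra flag, while the logging scenario above keeps cached host memory.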

Any comments on this observation? Thanks in advance for your replies.

Kind regards

bcaf01

PS: I would appreciate it if someone could add me to the white-list and move this topic to the appropriate developer forum!

Changed the title to give a better description of the issue.

2 Replies
dipak
Big Boss

You've been whitelisted now.

Regards,

dipak
Big Boss

Hi,

Usually, creating a buffer with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR indicates that the programmer wants a pre-pinned (zero-copy) buffer for passing data from the host to a kernel. The host writes the data, and the kernel reads it. Because the buffer is read-only on the kernel side, there is little reason for the host to read it back. The data flow is generally one-directional.

Also note that CL_MEM_HOST_WRITE_ONLY was added to the spec later.
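That one-directional pattern can be sketched as follows: keep a regular (cached) staging copy on the host, write it to the mapped pointer in one sequential pass, and let any host-side reader, such as a logging thread, use the staging copy instead of the mapping. This is only a sketch under assumptions: the helper name `publishToMapped` is hypothetical, and the mapped pointer is assumed to come from a call such as `enqueueMapBuffer` with `CL_MAP_WRITE`. A plain array can stand in for it, so the snippet builds without OpenCL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy staged host data into a mapped buffer region.
// `mapped` is assumed to point to write-combined memory, so it is only
// written, front to back, and never read.
// Returns the number of elements actually copied.
inline std::size_t publishToMapped( const std::vector< float > &staged,
                                    float *mapped, std::size_t capacity ) {
    assert( mapped != nullptr );
    const std::size_t count = std::min( staged.size(), capacity );
    std::copy_n( staged.begin(), count, mapped );
    return count;
}
```

A logging thread would then read `staged`, which lives in ordinary cached memory, rather than the mapped pointer.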

bcaf01 wrote: "One thing that is rather curious is that when not passing CL_MEM_ALLOC_HOST_PTR but calling the map-command directly instead, the host allocation made available by the runtime is not allocated as write-combined (cf. the output generated by the program.)"

Not passing CL_MEM_ALLOC_HOST_PTR makes it a regular device buffer. Mapping it in read-only mode indicates that the host only wants to read the buffer, not write to it. So it's not the same as allocating a pre-pinned buffer with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR.

Regards,
