OpenCL

FriedrichGuenther · ‎07-09-2021

I have already posted this question to the Khronos Forums as well as Stack Overflow to no avail. For a small program I wrote, the use of image2d_t memory objects as opposed to regular buffers would be beneficial (I think I could save on logic and compute on the ALU/FPUs). For computations I read pgm grayscale images to matrices stored in row-major order and copy those to buffers.

If I understand it correctly, for image2d_t memory objects I would have to blow up those matrices to matrices with four times as many entries, specifying for OpenCL what data is put in what channel. I tried doing something like this:

std::vector<float> A(256, 0.0);
for(int i=0; i<256; i++)
{
	A[i] = 255.0f - i;
	std::cout << A[i] << "  ";
}
std::cout << std::endl;
	
// Blow up to float4 array
std::vector<float> A_img(1024, 0.0);
for(int i=0; i<256; i++)
{
	A_img[4*i] = A[i];
	A_img[(4*i)+3] = 1;
}

Maybe due to my device being little endian, the order of entries in the float4-vectors of the image2d_t would be reversed; I am not sure on that.

However, I can't seem to get this working. Attached you find a minimal working example of my efforts. The code is supposed to create a 16x16 grayscale image with a white pixel in the top left corner, a black pixel in the bottom right corner and any grayscale in between, "blow it up to a float4-vector", copy it to a device, read it, write it to another image and read back the copy.

If i run the code posted below on my Macbook, the "copied image" by the kernel contains only zeroes.

Any input on what I missed or am doing wrong would be greatly appreciated. Even a hint on literature or an open source project where I might find information on this would be really helpful, I couln't find anything.

#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#if defined(__APPLE__) 
#include <OpenCL/cl2.hpp>
#else 
#include <CL/cl2.hpp>
#endif
#include <iostream>
#include <string>
#include <vector>
#include <fstream>

int main(void)
{
	// Set up platform, device and context
	std::vector<cl::Platform> platforms;
	std::vector<cl::Device> devices;
	cl::Device default_device;
	cl::Platform::get(&platforms);
	
	if (platforms.size() == 0)
	{
		std::cout << "No OpenCL platform found, check installation!" << std::endl;
		exit(-1);
	}
	platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
	
	if (devices.size() == 0)
	{
		std::cout << "No devices found in platform, check installation!" << std::endl;
		exit(-1);
	}
	default_device = devices[0];
	cl::Context context(default_device);
	
	std::ifstream program_file("read_write_image.cl");
	std::string program_string(std::istreambuf_iterator<char>(program_file), (std::istreambuf_iterator<char>()));
	cl::Program::Sources source { program_string };
	cl::Program dummy_program(context, source);
    if (dummy_program.build()!=CL_SUCCESS)
	{
        std::cout << "Error building: " << dummy_program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<< std::endl;
        exit(-1);
    }
	cl::Kernel kernel(dummy_program, "read_write_image");
	cl::CommandQueue queue(context, default_device);
	
	// Set up dummy grayscale image
	std::vector<float> A(256, 0.0);
	for(int i=0; i<256; i++)
	{
		A[i] = 255.0f - i;
		std::cout << A[i] << "  ";
	}
	std::cout << std::endl;
	
	// Blow up to float4 array
	std::vector<float> A_img(1024, 0.0);
	for(int i=0; i<256; i++)
	{
		A_img[4*i] = A[i];
		A_img[(4*i)+3] = 1;
	}
	std::vector<float> B_img(1024, 0.0);
	
	cl::ImageFormat grayscale(CL_R, CL_FLOAT);
	cl::Image2D Input_Image(context, CL_MEM_READ_ONLY, grayscale, 16, 16);
	cl::Image2D Output_Image(context, CL_MEM_WRITE_ONLY, grayscale, 16, 16);
	
	std::array<cl::size_type, 3> origin {0,0,0};
	std::array<cl::size_type, 3> region {16, 16, 1};

	queue.enqueueWriteImage(Input_Image, CL_TRUE, origin, region, 0, 0, &A_img[0]);

	kernel.setArg(0, Input_Image);
	kernel.setArg(1, Output_Image);
	
	queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(16,16), cl::NullRange, NULL); 
	queue.enqueueReadImage(Output_Image, CL_TRUE, origin, region, 0, 0, &B_img[0]);

	for(int i=0; i<1024; i++)
	{
		std::cout << B_img[i] << "  ";
	}	
	std::cout << std::endl;
	
	return EXIT_SUCCESS;
}

Now the kernel code:

__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_CLAMP|CLK_FILTER_LINEAR;

__kernel void read_write_image(read_only image2d_t input_image, write_only image2d_t output_image)
{
	int i = get_global_id(0);
	int j = get_global_id(1);
	float tmp = read_imagef(input_image, sampler, (int2) (i,j)).x;
	write_imagef(output_image, (int2) (i,j), (float4) (tmp,0,0,1));
}

dipak · ‎07-14-2021

I think, the "Blow up to float4 array" code section is not required for this image format: "cl::ImageFormat grayscale(CL_R, CL_FLOAT)". In this case, each image element is just a float value (i.e. single channel of "float" type or 1 x 4bytes). In enqueueWriteImage call, expected buffer size is (image width * image height * sizeof(float)). So, you can directly pass the "A" buffer to the enqueueWriteImage API. Similarly, "B_img" buffer needs to be modified accordingly while reading the output image.

Another point to note. If you expect that output_image should be same as input_image, please modify the kernel code as below:

Option1: Use "filter mode" as CLK_FILTER_NEAREST

OR

Option2: For "filter mode" as CLK_FILTER_LINEAR, call read_imagef with "coord " value as "(float2) (i + 0.5f, j + 0.5f)"

Thanks.

View solution in original post

dipak · ‎07-14-2021

I think, the "Blow up to float4 array" code section is not required for this image format: "cl::ImageFormat grayscale(CL_R, CL_FLOAT)". In this case, each image element is just a float value (i.e. single channel of "float" type or 1 x 4bytes). In enqueueWriteImage call, expected buffer size is (image width * image height * sizeof(float)). So, you can directly pass the "A" buffer to the enqueueWriteImage API. Similarly, "B_img" buffer needs to be modified accordingly while reading the output image.

Another point to note. If you expect that output_image should be same as input_image, please modify the kernel code as below:

Option1: Use "filter mode" as CLK_FILTER_NEAREST

OR

Option2: For "filter mode" as CLK_FILTER_LINEAR, call read_imagef with "coord " value as "(float2) (i + 0.5f, j + 0.5f)"

Thanks.

FriedrichGuenther · ‎07-14-2021

@dipak Thank you very much for the input. I have tried switching A_img and B_img for A and B respectively, but with image format CL_R I always get segmentation faults on my AMD GPU and no program output at all on my Intel GPU (macOS Big Sur).

Switching the Image format to CL_RGBA and only writing to the first component showed me a few things:

My device being little endian does not reverse the order of the components of my vectors
The minimal working example is actually a working example now
Reading integer coordinates with CL_FILTER_LINEAR does not work at all (which is fine, I need to write floating point coordinates anyways)

I used the blown up versions A_img and B_img in the first place, because I understood Table 6.2 from Chapter 6 of "OpenCL in Action" that image2d_t memory objects are basically float4-vectors, which probably isn't the case, but the read_imagef functions seem to be supposed to return float4 vectors in any case.

Due to portability and compatibility, I will probably switch to the CL_RGBA channel format. Here is the updated minimal working example:

#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#if defined(__APPLE__) 
#include <OpenCL/cl2.hpp>
#else 
#include <CL/cl2.hpp>
#endif
#include <iostream>
#include <string>
#include <vector>
#include <fstream>

int main(void)
{
	// Set up platform, device and context
	std::vector<cl::Platform> platforms;
	std::vector<cl::Device> devices;
	cl::Device default_device;
	cl::Platform::get(&platforms);
	
	if (platforms.size() == 0)
	{
		std::cout << "No OpenCL platform found, check installation!" << std::endl;
		exit(-1);
	}
	platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
	
	if (devices.size() == 0)
	{
		std::cout << "No devices found in platform, check installation!" << std::endl;
		exit(-1);
	}
	default_device = devices[0];
	cl::Context context(default_device);
	
	std::ifstream program_file("read_write_image.cl");
	std::string program_string(std::istreambuf_iterator<char>(program_file), (std::istreambuf_iterator<char>()));
	cl::Program::Sources source { program_string };
	cl::Program dummy_program(context, source);
    if (dummy_program.build()!=CL_SUCCESS)
	{
        std::cout << "Error building: " << dummy_program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<< std::endl;
        exit(-1);
    }
	cl::Kernel kernel(dummy_program, "read_write_image");
	cl::CommandQueue queue(context, default_device);
	
	// Set up dummy grayscale image
	std::vector<float> A(256, 0.0);
	for(int i=0; i<256; i++)
	{
		A[i] = 255.0f - i;
	}
	
	// Blow up to float4 array
	std::vector<float> A_img(1024, 0.0);
	for(int i=0; i<256; i++)
	{
		A_img[4*i] = A[i];
	}
	std::vector<float> B_img(1024, 0.0);
	
	cl::ImageFormat grayscale(CL_RGBA, CL_FLOAT);
	cl::Image2D Input_Image(context, CL_MEM_READ_ONLY, grayscale, 16, 16);
	cl::Image2D Output_Image(context, CL_MEM_WRITE_ONLY, grayscale, 16, 16);
	
	std::array<cl::size_type, 3> origin {0,0,0};
	std::array<cl::size_type, 3> region {16, 16, 1};

	queue.enqueueWriteImage(Input_Image, CL_TRUE, origin, region, 0, 0, &A_img[0]);

	kernel.setArg(0, Input_Image);
	kernel.setArg(1, Output_Image);
	
	queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(16,16), cl::NullRange, NULL); 
	queue.enqueueReadImage(Output_Image, CL_TRUE, origin, region, 0, 0, &B_img[0]);

	for(int i=0; i<1024; i++)
	{
		std::cout << B_img[i] << "  ";
	}	
	std::cout << std::endl;
	
	return EXIT_SUCCESS;
}

together with the kernel

__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_CLAMP|CLK_FILTER_NEAREST;

__kernel void read_write_image(read_only image2d_t input_image, write_only image2d_t output_image)
{
	int i = get_global_id(0);
	int j = get_global_id(1);
	float tmp = read_imagef(input_image, sampler, (int2) (i,j)).x;
	// printf("%5.2f  ", tmp);
	write_imagef(output_image, (int2) (i,j), (float4) (tmp,0,0,1));
}

dipak · ‎07-14-2021

Hi @FriedrichGuenther ,

 I have tried switching A_img and B_img for A and B respectively, but with image format CL_R I always get segmentation faults

Could you please try the attached host code to see if it works for image format CL_R?

Thanks.

FriedrichGuenther · ‎07-14-2021

@dipak Hey! Thank you very much for the enthusiasm, I appreciate it a lot! Does the code work for you? Running it on my Macbook yields the attached output. If possible, I want to run the code on both my Linux machine and my Macbook, because I will be using the Macbook for the project presentation. (I compiled your code to the executable "MWE").

Screenshot 2021-07-15 at 00.09.21.png

dipak · ‎07-15-2021

Just for a quick test, I ran the code on a Windows laptop with vega device and it worked fine.

Thanks.

FriedrichGuenther · ‎07-15-2021

@dipakI am very sorry, on Linux using my AMD GPU i get the following output:

No idea what the issue is 😞

dipak · ‎07-19-2021

Thanks for trying the code. I'm also not sure about the issue because it's working fine on my setup.

If you want, you may put some debugging code to identify the problematic code-section or api. The printf statements inside the kernel can be helpful to check the color values.

Thanks.

FriedrichGuenther · ‎07-20-2021

@dipakHi! Attached you find my modified program with control prints and the produced output:

#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#if defined(__APPLE__) 
#include <OpenCL/cl2.hpp>
#else 
#include <CL/cl2.hpp>
#endif
#include <iostream>
#include <string>
#include <vector>
#include <fstream>


int main(void)
{
	// Set up platform, device and context
	std::vector<cl::Platform> platforms;
	std::vector<cl::Device> devices;
	cl::Device default_device;
	cl::Platform::get(&platforms);

	if (platforms.size() == 0)
	{
		std::cout << "No OpenCL platform found, check installation!" << std::endl;
		exit(-1);
	}
	platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);

	if (devices.size() == 0)
	{
		std::cout << "No devices found in platform, check installation!" << std::endl;
		exit(-1);
	}
	default_device = devices[0];
	cl::Context context(default_device);

	std::ifstream program_file("read_write_image.cl");
	std::string program_string(std::istreambuf_iterator<char>(program_file), (std::istreambuf_iterator<char>()));
	cl::Program::Sources source{ program_string };
	cl::Program dummy_program(context, source);
	if (dummy_program.build() != CL_SUCCESS)
	{
		std::cout << "Error building: " << dummy_program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << std::endl;
		exit(-1);
	}
	cl::Kernel kernel(dummy_program, "read_write_image");
	cl::CommandQueue queue(context, default_device);

	// Set up dummy grayscale image
	std::vector<float> A(256, 0.0);
	for (int i = 0; i < 256; i++)
	{
		A[i] = 255.0f - i;
		// std::cout << A[i] << "  ";
	}
	// std::cout << std::endl;

	// the below section is not required
	// Blow up to float4 array
	/*
	std::vector<float> A_img(256, 0.0);  //A_img(1024, 0.0)
	for (int i = 0; i < 256; i++)
	{
		//A_img[4 * i] = A[i];
		//A_img[(4 * i) + 3] = 1;
	}
	*/

	std::vector<float> B_img(256, 0.0); //B_img(1024, 0.0)
	
	std::cout << "Allocated memory for B_img" << std::endl;
	
	cl::ImageFormat grayscale(CL_R, CL_FLOAT);
	cl::Image2D Input_Image(context, CL_MEM_READ_ONLY, grayscale, 16, 16);
	cl::Image2D Output_Image(context, CL_MEM_WRITE_ONLY, grayscale, 16, 16);

	std::cout << "Set up image format and image2d_t memory objects" << std::endl;

	std::array<size_t, 3> origin = { 0, 0, 0 };
	std::array<size_t, 3> region = { 16, 16, 1 };

	std::cout << "Starting image transfer" << std::endl;
	queue.enqueueWriteImage(Input_Image, CL_TRUE, origin, region, 0, 0, &A[0]);
	std::cout << "Image transfered to GPU" << std:: endl;

	kernel.setArg(0, Input_Image);
	kernel.setArg(1, Output_Image);
	
	std::cout << "Kernel arguments set" << std::endl;

	queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(16, 16), cl::NullRange, NULL);
	std::cout << "Kernel enqueued" << std::endl;
	queue.enqueueReadImage(Output_Image, CL_TRUE, origin, region, 0, 0, &B_img[0]);

	bool matched = true;
	for (int i = 0; i < 256; i++)
	{
		if (A[i] != B_img[i]) {
			std::cout << "i = " << i << " A: " << A[i] << " B_img: " << B_img[i] << std::endl;
			matched = false;
			break;
		}
	}

	if(matched) 
	{
		std::cout << "Image matched";
	}
	else 
	{
		std::cout << "Image mismatch";
	}

	std::cout << std::endl;

	return EXIT_SUCCESS;
}

... and the output (on my linux machine running the RX 6900XT):

Allocated memory for B_img
Set up image format and image2d_t memory objects
Starting image transfer
Image transfered to GPU
Kernel arguments set
Segmentation fault (core dumped)

The print-statement in the kernel was uncommented btw. The program segfaults immediately. If there is anything else I can do, let me know. If I should try this whole thing again without POCL installed, I can set up a fresh install of Ubuntu 20.04 LTS.

FriedrichGuenther · ‎07-20-2021

I have just finished setting up a fresh Ubuntu with AMD-drivers, OpenCL headers, C++ wrapper for OpenCL and clinfo. Now the minimal working example also works for me.

dipak · ‎07-21-2021

It's good to hear that the code is working fine now.

Thanks.

OpenCL

OpenCL question: How to write an array to image2d_t?