
binghy
Adept II

Max search kernel: error "the thread tried to read from or write to a virtual address the kernel does not have access"


Hi everybody,

I know that the error I'm running into is probably due to buffer indexing, but since I can't see it I'm getting really frustrated and could use some help. Or maybe it depends on some flag related to buffer allocation or usage.

Explanation: I have a VS2010 project that executes several NDRange kernels, such as multiplications/additions and FFTs/IFFTs (the latter using all of local memory). As a last step I have a data sequence (placed in a buffer) of 85,932 integer elements. I need to find the maximum of this sequence, so I chose to use local memory, exploiting it completely and thereby reducing the number of work groups involved. Even though this is a classic algorithm for the CPU, I want to try implementing it on the GPU to reduce the amount of data downloaded to the host (1 integer instead of 85,932 integer values).

GPU features (AMD ATI Mobility Radeon HD 5650):

1) max local size = 256 WIs

2) local memory size = 32,768 bytes

3) CUs = 5

So: points per work group = 32,768 / sizeof(int) = 8,192 values

      points per WI = 8,192 / 256 = 32

Moreover, since a work size of 85,932 elements doesn't fit the wavefront size (it's not a multiple), I thought to round it up to 90,112 elements (11 full work groups of 8,192 points), padding the input data sequence with zeros, but this is another matter I guess. I also don't know whether this buffer copy is actually necessary for data consistency, or whether I can directly pass the buffer (of 85,932 integer elements) to a kernel sized for 90,112 elements, with the GPU handling it automatically.
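For reference, here is a minimal sketch of how the padded size can be computed on the host (names like REAL_SIZE and POINTS_PER_GROUP are mine, just for illustration):

#define REAL_SIZE        85932

#define POINTS_PER_GROUP 8192                                                                      /* 32,768 bytes of local memory / sizeof(int) */

size_t num_groups  = (REAL_SIZE + POINTS_PER_GROUP - 1) / POINTS_PER_GROUP;      /* = 11 */

size_t padded_size = num_groups * POINTS_PER_GROUP;                                          /* = 90,112 */

The host array below is zero-initialised, so the 4,180 padding elements are 0 (which is fine for a max search as long as the real data is non-negative).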

Anyway, I explain the full code:

a) ON THE HOST:

cl_buffer_region region_max;

int input[90112] = {0};

[...]

previous_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, 85932*sizeof(int), NULL, &err);                                                                          // Device buffer, output of the previous kernel

input_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(input), input, &err);                         // or CL_MEM_USE_HOST_PTR instead? Pinned buffer, or uncached if USE is specified?

output_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int), NULL, &err);                                                                                         // Device buffer

region_max.size = 85932*sizeof(int);

clEnqueueCopyBuffer(queue, previous_buffer, input_buffer, 0, 0, region_max.size, 1, &previous_kernel, &actual_copy);

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buffer);

err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buffer);

err |= clSetKernelArg(kernel, 2, (size_t)local_mem_size, NULL);           // local_mem_size = 32,768 bytes (full local memory)

err |= clSetKernelArg(kernel, 3, sizeof(points_per_work_group), &points_per_work_group);

global_size = 90112 / 32;          // input data size / points per WI = 2,816 WIs

local_size = 256;                  // GPU max work-group size

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 1, &actual_copy, &actual_kernel);
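Then, once the kernel finishes, the single maximum is read back to the host, something like this (a minimal sketch; result is just an illustrative name for the host variable):

int result = 0;

err = clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, 0, sizeof(int), &result, 1, &actual_kernel, NULL);          // blocking read of the single integer produced by the kernel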

b) ON THE GPU:

__kernel void max_kernel(__global int *input_buffer, __global int *output_buffer, __local int *l_data, uint points_per_group)

{

     uint g_ID, g_size, l_ID, l_size, n_ID, i, points_per_item, l_addr, g_addr, N2;

     int x1, x2;

     g_ID = get_global_id(0);

     l_ID = get_local_id(0);

     n_ID = get_group_id(0);

     g_size = get_global_size(0);

     l_size = get_local_size(0);

     points_per_item = points_per_group / l_size;               // = 32

     N2 = points_per_item / 2;                                             // = 16

     l_addr = l_ID * N2;                                                        // each WI writes N2 = 16 partial maxima to local memory, at indices l_ID*16 .. l_ID*16+15 (0 to 4,095 across the group)

     g_addr = n_ID*points_per_group + l_ID*(N2*2);         // each WI reads 32 consecutive input values; across the 11 groups this indexes 0 to 90,111, covering the whole padded buffer (90,112 elements)

     // 1) PERFORMING MAX SEARCH ON ALL THE POINTS HANDLED BY EACH WI (32 points/WI)

     for (i=0; i<points_per_item; i+=2)

     {

          x1 = input_buffer[g_addr];

          x2 = input_buffer[g_addr+1];

         

          l_data[l_addr] = max(x1,x2);

          l_addr++;

          g_addr += 2;

     }

     barrier(CLK_LOCAL_MEM_FENCE);

     [...]

...And then I continue the max search, working on smaller and smaller slices, to finally compare the max values of each group. But this first loop already gets my kernel stuck with the error "the thread tried to read from or write to a virtual address the kernel does not have access".
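For reference, the continuation after the [...] above follows the usual in-group tree reduction pattern, roughly like this (a sketch of the standard technique, not necessarily my exact code; it assumes output_buffer[0] is pre-initialised to INT_MIN on the host and the final comparison across groups is done with atomic_max):

     // Tree reduction over the 4,096 partial maxima in local memory:
     // halve the active range each step until one value per group remains.

     for (uint stride = (l_size * N2) / 2; stride > 0; stride >>= 1)

     {

          for (i = l_ID; i < stride; i += l_size)

               l_data[i] = max(l_data[i], l_data[i + stride]);

          barrier(CLK_LOCAL_MEM_FENCE);

     }

     // Work item 0 of each group compares the group's maximum against the global one.

     if (l_ID == 0)

          atomic_max((volatile __global int *)output_buffer, l_data[0]);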

I am quite sure it's an indexing problem, related to reading from the global buffer or writing to local memory, even if the index evolution I inspect with CodeXL (as long as the kernel doesn't get stuck) does not seem incorrect to me. As I said before, I don't know whether it could instead be a problem with the buffer flags used (the buffers involved in the copy are of different kinds, a device buffer and a pinned(?) buffer), or a problem due to the repeated use of full local memory (in earlier kernels), or a problem related to the number/size of all the buffers used in the VS project (I read there was a limit somewhere in the specs).

I hope I've been clear and not too annoying, and I look forward to your feedback.

Best regards,

Marco

dipak
Big Boss

My apologies for this delayed reply.

Have you found the problem, or does it still exist? If it still exists, it would be helpful if you could provide a complete, reproducible code base so that we can check it at our end.

Regards,


Hi dipak,

thanks for replying. No worries, I worked on it and other things, and solved it in the end. It was just an error in the buffer allocation on the host used to read back the computed data. -_-'

Maybe this post can be deleted, to clean things up a bit.

Now I've got another problem, but I guess it's better to open a new topic.

Regards,

Marco


Good to hear that you solved the problem.

Yes, it's better to open a new thread for a new issue rather than continuing an existing one. We always encourage that.

Regards,
