3 Replies Latest reply on Jun 22, 2015 5:18 AM by dipak

    max search kernel. Error type "read from or write to a virtual address the kernel does not have access".


      Hi everybody,

      I know that probably the error I'm incurring in is due to buffer indexing, but since I don't see it I'm really getting frustrated. I need help a bit. Or maybe it could depend on some flag related to buffer allocation/way of use.


      Explanation: VS2010 project which executes different NDRange kernels, such multiplications/additions and FFTs/IFFTs (in this case with use of full local memory), then as a last step I have a data sequence (placed on a buffer) with a size of 85,932 integer elements. On this sequence I've got to find the max, so I've chosen to use local memory, exploiting it completely and reducing so the number of groups involved. Even if it's a classical algorithm to be executed on the CPU, I wanna try implementing it on the GPU to reduce the amount of data downloaded on the host (1 integer instead of 85,932 integer values).


      GPU features (AMD ATI Mobility Radeon HD 5650):  1) max local size = 256 WIs

                                                                                           2) local memory size = 32,768 bytes

                                                                                           3) CUs =  5


      So: points per work group = 32,768 / sizeof(int) = 8,192 values

            points per WI = 8,192 / 256 = 32


      Moreover, since a work space size of 85,932 WIs doesn't fit the wavefront number (since it's not a multiple), I thought to adjust it to 90,112 WIs, adding just zeros to the input data sequence, but this is another matter I guess. I also don't know exactly if it is necessary to execute this buffer copy for data consistency, or I can directly pass the buffer (of 85,932 integer elements) in a kernel where the global size is equal to 90,112 WIs, with the GPU handling it automatically.


      Anyway, I explain the full code:



      a) ON THE HOST:


      cl_buffer_region region_max;

      int input[90112] = {0};


      previous_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, 85932*sizeof(int), NULL, &err);                                                                          // Device buffer, output of the previous kernel

      input_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY(USE)_HOST_PTR, sizeof(input), &input, &err);                         // Pinned buffer or Uncached if USE specified?

      output_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int), NULL, &err);                                                                                         // Device buffer


      region_max.size = 85932*sizeof(int);


      clEnqueueCopyBuffer(queue, previous_buffer, input_buffer, 0, 0, region_max.size, 1, &previous_kernel, &actual_copy);


      err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buffer);

      err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buffer);

      err |= clSetKernelArg(kernel, 2, size_t(local_mem_size), NULL);           ---->>>>>      local_mem_size = 32,768 bytes (full)    

      err |= clSetKernelArg(kernel, 3, sizeof(points_per_work_group), &points_per_work_group);


      global_size = ... = 90112 (input data size) / 32 (points per WI) = 2,816;

      local_size = 256;  (GPU max)


      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_size, local_size, 1, &actual_copy, &actual_kernel);



      b) ON THE GPU:


      __kernel void max_kernel(__global int *input_buffer, __global int *output_buffer, __local int *l_data, uint points_per_group)


           uint g_ID, g_size, l_ID, l_size, n_ID, i, points_per_item, l_addr, g_addr, N2;

           int x1, x2;


           g_ID = get_global_id(0);

           l_ID = get_local_id(0);

           n_ID = get_group_id(0);

           g_size = get_global_size(0);

           l_size = get_local_size(0);


           points_per_item = points_per_group / l_size;               // = 32

           N2 = points_per_item / 2;                                             // = 16

           l_addr = l_ID * N2;                                                        // so to have on the first iteration each group handling data input from 0 to 4080 buffer index, incrementing by 16

           g_addr = n_ID*points_per_group + l_ID*(N2*2);         // so to have indexing input buffer from 0 to 90080, with each WI handling 32 values, covering completely the input buffer size (90,112 elements)




           for (i=0; i<points_per_item; i+=2)


                x1 = input_buffer[g_addr];

                x2 = input_buffer[g_addr+1];


                l_data[l_addr] = max(x1,x2);



                g_addr += 2;





      ...And then I continue max search algorithm working on smaller slices, to finally compare max values of each group. But this first loop gets my kernel stucked with the error "the thread tried to read from or write to a virtual address the kernel does not have access".

      I am quite sure it's a problem of indexing related to reading from global buffer or writing to local memory, even if controlling index evolution with CodeXL, as long as the kernel doesn't get stucked, does not seem to me uncorrect. As I told before, don't know if it could be instead a problem of buffer flags used (buffer involved in copy process are different, Device buffer and Pinned(?) buffer), or a problem due to the repetitive use of full local memory (in other kernels before), or a problem related to the number/size of all the buffer used in the VS project (I read there was a limit somwhere in the specs).


      Hope I've been clear, not annoying and to receive a feedback.


      Best regards,