Question regarding memory management, NDRange and optimization

Question asked by Anon5710 on Mar 1, 2012
Latest reply on Mar 1, 2012 by notzed



I am currently writing my thesis on genome assembly and am working on a parallel approach based on the overlap principle.

(Each string is compared with all other strings to find the best overlap.)


[Attached images: 1.jpeg, 2.jpeg]


The kernel I currently use has one input buffer in which all reads are serialized. Using get_global_id(0), some creative indexing and a few for loops, each instance of the kernel (thread?) compares its own read, the one at index get_global_id(0), with all the other reads and records the overlap information in the output buffer according to the structure shown above.
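
For reference, the comparison that each kernel instance performs for one pair of reads can be sketched on the CPU. This is only a minimal sketch, assuming "overlap" means the longest suffix of one read that equals a prefix of the other; the name best_overlap and the use of std::string are illustrative and do not come from the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original kernel): returns the smallest
// offset j such that the suffix of `a` starting at j equals a prefix of `b`,
// i.e. the start of the longest suffix/prefix overlap. Returns -1 if none.
int best_overlap(const std::string& a, const std::string& b) {
    for (std::size_t j = 0; j < a.size(); ++j) {
        const std::size_t len = a.size() - j;   // length of the candidate overlap
        if (len > b.size()) continue;           // suffix longer than b cannot be a prefix of b
        if (a.compare(j, len, b, 0, len) == 0)  // compare a[j..] with b[0..len)
            return static_cast<int>(j);
    }
    return -1;
}
```

A CPU version like this can also serve as a check on the kernel's output for a small set of reads.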


Here's my kernel (also available as an attachment):

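#pragma OPENCL EXTENSION cl_amd_printf : enable
#define MAX_READ_LENGTH 750
__kernel void compare_string( __global char* input, __global int* output, const unsigned int count, const unsigned int N )
{
    // global_id -> index of string 1
    int x = get_global_id(0); // 0 -> count
    int y,k,j;
    if(x >= (int)count) return; // the global size is padded to a multiple of 256
    for(y=0;y<count;y++) // y : index of string 2
        // NOTE: the j and k loop headers, the break and the braces were missing
        // from the posted code; the loop bounds below are assumed.
        for(j=0;j<MAX_READ_LENGTH-1;j++) // j : candidate start of the overlap in string 1
            if(input[x*(MAX_READ_LENGTH-1)+j] == input[y*(MAX_READ_LENGTH-1)])
            {
                for(k=0;k+j<MAX_READ_LENGTH-1;k++) // k : number of characters matched so far
                    if(input[x*(MAX_READ_LENGTH-1)+j+k] != input[y*(MAX_READ_LENGTH-1)+k])
                        break;
                if( k+j == MAX_READ_LENGTH-1 ) // the match ran to the end of string 1
                {
                    output[x * count *3+ y *3+0] = j;
                    output[x * count *3+ y *3+1] = 0;
                    output[x * count *3+ y *3+2] = k;
                }
            }
}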
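
The output layout used by the kernel, three integers per (x, y) pair at output[x * count * 3 + y * 3 + field], can be written as a small host-side helper. This is a sketch only; the function name result_index and the field names are hypothetical labels, since the kernel uses nothing but the numeric offsets 0, 1 and 2.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the three integers stored per comparison.
enum Field { kStartInRead = 0, kReserved = 1, kMatchLength = 2 };

// Flat index of one field of the (x, y) comparison, mirroring
// output[x * count * 3 + y * 3 + field] in the kernel.
std::size_t result_index(std::size_t x, std::size_t y, std::size_t count, Field field) {
    assert(x < count && y < count);  // both reads must exist
    return (x * count + y) * 3 + static_cast<std::size_t>(field);
}
```

On the host, output_data[result_index(x, y, count, kMatchLength)] would then read the value the kernel stored as k for that pair.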




With this in mind, I have some OpenCL questions :-)


Question 1 : NDRange (from 1D to 2D ?) (easy ?)

Currently I initialize and execute the OpenCL kernel like this:


//setting up input & output OpenCL buffers
    cl::Buffer inputBuffer ( context , CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR , sizeof(cl_char)*MAX_READ_LENGTH*count, input_data);
    cl::Buffer resultBuffer( context , CL_MEM_READ_WRITE , sizeof (cl_int )*count*count*3 );    //3 integers per compare

    //starting kernel 
    cl::Kernel kernel(program , "compare_string" ) ;
    cl::KernelFunctor kernel_func = kernel.bind( cmdqueue ,cl::NDRange((int)((count/256)+1)*256), cl::NDRange(256));

    //executing kernel
    kernel_func(inputBuffer,resultBuffer,count,count );

    //data retrieval ?
    cmdqueue.enqueueReadBuffer( resultBuffer , true , 0 , sizeof(cl_int)*count*count*3, output_data);

My NDRange goes from 0 to N, where N is the number of reads. In the kernel code, the for loop on line 9 (the loop over y) also goes from 0 to N.

Isn't it possible to remove the for loop on line 9 and replace it with get_global_id(1)? I have tried this, but simply changing the kernel.bind call (line 7) as follows does not work:


cl::KernelFunctor kernel_func = kernel.bind( cmdqueue ,cl::NDRange((int)((count/256)+1)*256,(int)((count/256)+1)*256), cl::NDRange(256,256));


This results in the error "OpenCL error: clEnqueueNDRangeKernel(-54)", which is not really helpful.

Does anyone have an idea how to do this right?
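
As a note on the error code: -54 is CL_INVALID_WORK_GROUP_SIZE. A local size of 256 x 256 asks for 65536 work-items per work-group, which is far more than devices typically allow (the limit is reported by CL_DEVICE_MAX_WORK_GROUP_SIZE). The sketch below shows one way to compute a padded 2D global size for a smaller square work-group. It is an illustration under assumptions: the names round_up, Range2D and make_range are hypothetical, and the limit of 256 used in the example is a stand-in for whatever the device reports.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of `multiple` (n itself if already aligned).
std::size_t round_up(std::size_t n, std::size_t multiple) {
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

struct Range2D {
    std::size_t global_x, global_y;  // padded global size per dimension
    std::size_t local_x, local_y;    // work-group size per dimension
};

// Hypothetical helper: uses a square work-group and pads the global size.
// max_group_size stands in for the value of CL_DEVICE_MAX_WORK_GROUP_SIZE.
Range2D make_range(std::size_t count, std::size_t local_edge, std::size_t max_group_size) {
    assert(local_edge * local_edge <= max_group_size);  // 256 * 256 would fail here
    return Range2D{round_up(count, local_edge), round_up(count, local_edge),
                   local_edge, local_edge};
}
```

With make_range(count, 16, 256), the values would be passed as cl::NDRange(r.global_x, r.global_y) and cl::NDRange(r.local_x, r.local_y). Because the global size is padded, the kernel would also need to return early when get_global_id(0) or get_global_id(1) is not smaller than count.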


Question 2 : memory management (concurrent access)


Once question 1 is resolved, I will run into another problem (after a few adjustments): several threads could try to access the same global variable at the same time.


My understanding of the barrier function is as follows :

some code
some code
if( condition )
     write to global memory 


Each thread executes lines 1 and 2, waits for all other running threads at line 3, and then the threads execute one after the other (serially?), ensuring correct access to global memory.


Is this understanding correct ?
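
A note on the premise, offered with some caution: in OpenCL, barrier() synchronizes only the work-items of one work-group, and it does not make them run one after the other afterwards; it only guarantees that every work-item in the group has reached the barrier before any of them continues. Whether a conflict exists at all depends on the write pattern. With the output layout of the kernel above, each (x, y) pair owns its own three integers, so a 2D version would not have two work-items writing to the same element. The sketch below checks this by enumeration on the CPU; slots_are_disjoint is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU check: emulate one "work-item" per (x, y) pair and count
// how many times each output element would be written.
bool slots_are_disjoint(std::size_t count) {
    std::vector<int> writes(count * count * 3, 0);
    for (std::size_t x = 0; x < count; ++x)
        for (std::size_t y = 0; y < count; ++y)
            for (std::size_t f = 0; f < 3; ++f)
                ++writes[x * count * 3 + y * 3 + f];  // same index as the kernel
    for (int w : writes)
        if (w != 1) return false;  // 0 = never written, >1 = shared by several work-items
    return true;
}
```

If several work-items did have to update one shared value, for example a single best overlap per read, atomic functions such as atomic_max (available from OpenCL 1.1) or a reduction would be the tools to look at, rather than a barrier.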


Question 3 : general optimization


I am currently comparing characters with each other. Would this be faster if I used integers? (I have heard about vectors, but any additional information is welcome.)
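
To make the idea concrete, here is a minimal CPU sketch of comparing four characters per step by loading them into one 32-bit integer. The name common_prefix is hypothetical. Whether this is actually faster on a GPU depends on the device and on the memory access pattern, so it would have to be measured; in OpenCL C the closest equivalents are the vector types such as char4 or char16 together with vload4 / vload16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Length of the common prefix of a and b (up to n bytes), comparing four
// bytes per step and falling back to single bytes on a mismatch or at the tail.
std::size_t common_prefix(const char* a, const char* b, std::size_t n) {
    std::size_t i = 0;
    while (i + 4 <= n) {
        std::uint32_t wa, wb;
        std::memcpy(&wa, a + i, 4);  // memcpy avoids unaligned-access problems
        std::memcpy(&wb, b + i, 4);
        if (wa != wb) break;         // the mismatch is somewhere in these four bytes
        i += 4;
    }
    while (i < n && a[i] == b[i]) ++i;  // finish byte by byte
    return i;
}
```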


I calculate my indexes quite frequently. I assume that storing the static part in private memory would result in less calculation?

(Or is this already handled by the compiler?)
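
As an illustration of what storing the static part would look like, the sketch below computes the two base offsets once, outside the inner loop. It is a CPU sketch with a hypothetical function name (suffix_matches) and a stride parameter that stands in for MAX_READ_LENGTH-1. Compilers generally perform this kind of loop-invariant hoisting on their own, so the gain may be small or zero; in a kernel, such local variables already live in private memory (usually registers).

```cpp
#include <cassert>
#include <cstddef>

// Returns true if read x, from offset j to its end, matches the start of read y.
// `stride` is the distance between consecutive reads in the serialized buffer.
bool suffix_matches(const char* input, std::size_t stride,
                    std::size_t x, std::size_t y, std::size_t j) {
    assert(j <= stride);
    const char* read_x = input + x * stride + j;  // hoisted: x*stride + j
    const char* read_y = input + y * stride;      // hoisted: y*stride
    for (std::size_t k = 0; k + j < stride; ++k)
        if (read_x[k] != read_y[k]) return false;
    return true;
}
```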




Regards, Kevin