AnsweredAssumed Answered

Question regarding : memory management,  NDrange() and optimization

Question asked by Anon5710 on Mar 1, 2012
Latest reply on Mar 1, 2012 by notzed

Hello,

 

I am currently writing my thesis about genome assembly, currently i am working on parallel approach based upon the overlap principle.

(each string is compared with  all other strings to find the best overlap )

 

1.jpeg 2.jpeg

 

The current kernel i use has one input buffer where all reads are serialized, by use of the get_global_id(0) function, creative index usage and some for loops, each instance of the kernel (thread?),   compares his read get_global_id(0) read with all the other reads and records the overlap information in the output buffer according the structure shown above.

 

Here's my kernel (or see compare2.cl as attachment) :

#pragma OPENCL EXTENSION cl_amd_printf : enable
#define MAX_READ_LENGTH 750
__kernel void compare_string( __global char* input, __global int* output, const unsigned int count, const unsigned int N )
{
    //global_id -> get index value of start string1 
    int x = get_global_id(0); // 0 -> count 
    int y,k,j;

    for(y=0;y<count;y++) // y  : get index value of string 2
    {
         for(j=1;j<MAX_READ_LENGTH-1;j++)
                {
            if(input[x*(MAX_READ_LENGTH-1)+j] == input[y*(MAX_READ_LENGTH-1)])
                    {
                for(k=1;k<MAX_READ_LENGTH-1-j;k++)
                {
                    if(input[x*(MAX_READ_LENGTH-1)+j+k] != input[y*(MAX_READ_LENGTH-1)+k])
                    {
                        break;
                    }
                          }
                if( k+j == MAX_READ_LENGTH-1 )
                        {
                    output[x * count *3+ y *3+0] = j;
                    output[x * count *3+ y *3+1] = 0;
                    output[x * count *3+ y *3+2] = k;    
                        } 
            }            
        }

    }
}

 

 

Now with this in mind i have some open-cl questions :-)

 

Question 1 : NDRange (from 1D to 2D ?) (easy ?)

Currently i initialize and execute the open cl kernel like this.

 

//setting up input & output OpenCL buffers
    cl::Buffer inputBuffer ( context , CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR , sizeof(cl_char)*MAX_READ_LENGTH*count, input_data);
    cl::Buffer resultBuffer( context , CL_MEM_READ_WRITE , sizeof (cl_int )*count*count*3 );    //3 integers per compare

    //starting kernel 
    cl::Kernel kernel(program , "compare_string" ) ;
    cl::KernelFunctor kernel_func = kernel.bind( cmdqueue ,cl::NDRange((int)((count/256)+1)*256), cl::NDRange(256));

    //executing kernel
    kernel_func(inputBuffer,resultBuffer,count,count );

    //data retrieval ?
    cmdqueue.enqueueReadBuffer( resultBuffer , true , 0 , sizeof(cl_int)*count*count*3, output_data);


My NDRange goes from 0 to N, with N the number of reads there are. Now in the kernel code @line 9 i have a for loop that goes from 0 to N as-well.

Isn't it possible to remove the for loop on line 9 and replace it with get_global_id(1) ?  I've tried to this but   simply changing line 7 does not work.

 

cl::KernelFunctor kernel_func = kernel.bind( cmdqueue ,cl::NDRange((int)((count/256)+1)*256,(int)((count/256)+1)*256), cl::NDRange(256,256));

 

Wich results in this error : OpenCL error: clEnqueueNDRangeKernel(-54) ,not really helpfull.

Anyone a idea how to do this right ?

 

Question 2 : memory management (concurrent access)

 

When question 1 is resolved i am going to have some other problems (after a few adjustments), namely several threads could try to access the same global variable at the same time.

 

My understanding of the barrier function is as follows :

some code
some code
barrier(GLOBAL_MEM)
if( condition )
     write to global memory 

 

each thread will execute line 1 & 2, wait for all other running threads on line 3, and than each thread will excute one ather the other (serial ? ) ensuring correct acces to the global memory .

 

Is this understanding correct ?

 

Question 3 : general optimization

 

I am now comparing characters with each other, would this go faster if i was using integers ?  (i do have heard about vectors, but any supplemental information is welcome )

 

I calculate my indexes quite frequently, i assume allocating a private memory space, and storing the static part would result in less calculation ?

(or is this already resolved when compiling ?

 

 

 

Regards, Kevin

 

 

 


Attachments

Outcomes