
himanshu_gautam
Grandmaster

Re: Problem with the radix sort code written


l_zeroes and l_ones are just 2 buffers. I don't understand why you are allocating 5 buffers, each 1024 * sizeof(int) = 4 KB in size.

Also, are you able to launch the kernel if you reduce your local memory footprint?

shreedhar_pawar
Adept II

Re: Problem with the radix sort code written


I am not able to launch the kernel even if I reduce the size to just 15. I have created only three local buffers; the rest are global.

Also, in the atomic operation you explained above, I don't understand how you dump the local arrays into the global arrays, i.e. the following part of the code:

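__local int gPos;  // shared by the whole work-group; declare at kernel function scope

if (get_local_id(0) == 0) {
    // Work-item 0 reserves oneCounter slots in g_ones[];
    // atomic_add returns the counter value from before the addition.
    int localCounter = oneCounter;
    gPos = atomic_add(globalOneCounter, localCounter);
}
barrier(CLK_LOCAL_MEM_FENCE);  // make gPos visible to every work-item

// Each work-item copies elements tid, tid + local size, ... of l_ones[].
for (int tid = get_local_id(0); tid < oneCounter; tid += get_local_size(0))
{
    g_ones[gPos + tid] = l_ones[tid];
}
barrier(CLK_LOCAL_MEM_FENCE);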

I don't understand how the for loop works. What is globalOneCounter? You haven't declared it. Why do you check only get_local_id(0) == 0, and what is gPos? Please explain.

himanshu_gautam
Grandmaster

Re: Problem with the radix sort code written


g_ones[] is a __global buffer (a cl_mem object) that you allocate on the host. It has the same size as the "data[]" array.

globalOneCounter and globalZeroCounter are just 2 integers that you allocate and initialize to 0.

(Initializing can be done via a separate kernel -- that will be faster. I am not sure if the OpenCL runtime will initialize them to 0 by default if you declare them as "__global uint globalOneCounter, globalZeroCounter".)

As for the dumping code, work-item 0 first atomically adds the work-group's count to the global counter. atomic_add returns the counter's old value, and that value is the position in the g_ones[] array where this work-group should start writing. (Read up on the semantics of atomic_add.)

The FOR loop is the usual GPGPU way of running a FOR loop inside a workgroup so that all workitems together iterate over an array. This FOR loop just copies the data from the local ones array to the global ones array.
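To make the reserve-then-copy pattern concrete, here is a rough serial emulation in plain C that runs on the host. It is only a sketch, not kernel code: the function name flush_group and the parameters g_capacity and local_size are assumptions made up for illustration, and C11's atomic_fetch_add stands in for OpenCL's atomic_add (both return the value held before the addition).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Serial emulation of one work-group flushing its local "ones" buffer
 * into the shared global array. l_ones, oneCounter, g_ones, gPos and
 * globalOneCounter mirror the snippet in this thread; flush_group,
 * g_capacity and local_size are hypothetical names for this sketch. */
int flush_group(const int *l_ones, int oneCounter,
                int *g_ones, size_t g_capacity,
                atomic_int *globalOneCounter, int local_size)
{
    /* What work-item 0 does: reserve oneCounter slots. The return value
     * is the counter BEFORE the addition, i.e. this group's start offset. */
    int gPos = atomic_fetch_add(globalOneCounter, oneCounter);
    assert((size_t)gPos + (size_t)oneCounter <= g_capacity);

    /* What happens after the barrier: work-item lid copies elements
     * lid, lid + local_size, lid + 2 * local_size, ... so the group
     * covers the whole local array exactly once. */
    for (int lid = 0; lid < local_size; ++lid) {
        for (int tid = lid; tid < oneCounter; tid += local_size) {
            g_ones[gPos + tid] = l_ones[tid];
        }
    }
    return gPos;
}
```

If two groups call this one after another with 3 and 5 elements, the first gets start offset 0 and the second gets 3, so their outputs never overlap. On the GPU the order in which groups reserve their ranges is not fixed, but each group still gets its own disjoint range.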

bsp2020
Elite

Re: Problem with the radix sort code written


Both CUDA and OpenCL are based on the same programming model (the CPU coordinates the work, and kernels on the GPU do the actual parallel work). You will benefit enormously from the class, as it will teach you parallel programming concepts and algorithms. CUDA is used to implement the ideas described in the lecture videos, but if you are adventurous, you can implement the algorithms in OpenCL yourself. If you are not interested in learning CUDA at all, just watch the videos and do the in-video quizzes, and you can come back to your current project after unit 4 (hint: the unit 4 assignment is writing a fast radix sort).

Yes. I am talking about doing the counting and calculation on the CPU and GPU in each iteration. You won't get much speedup, because moving data back and forth over the PCIe bus will negate any performance gain you might get. Also, if you know how to do it in parallel, counting on the GPU is much faster than doing it on the CPU.

shreedhar_pawar
Adept II

Re: Problem with the radix sort code written


Thanks, buddy. I will surely take the course!
