
shunyo
Journeyman III

Using cl_float3 in parallel reduction example

I adapted the OpenCL parallel reduction example for a bunch of floats. Now I want to expand the code to cl_float3, so that I can find the minimum of an array of cl_float3. I thought it was a straightforward expansion from float to float3 in the kernel, but I am getting garbage values back from the kernel. Below is the kernel:

__kernel void pmin3(__global float3 *src,
                    __global float3 *gmin,
                    __local  float3 *lmin,
                    __global float  *dbg,
                    uint             nitems,
                    uint             dev)
{
   uint count  = nitems / get_global_size(0);
   uint idx    = (dev == 0) ? get_global_id(0) * count
                            : get_global_id(0);
   uint stride = (dev == 0) ? 1 : get_global_size(0);

   // Private min for the work-item
   float3 pmin = (float3)(pow(2.0,32.0)-1, pow(2.0,32.0)-1, pow(2.0,32.0)-1);

   for (int n = 0; n < count; n++, idx += stride) {
      pmin.x = min(pmin.x, src[idx].x);
      pmin.y = min(pmin.y, src[idx].y);
      pmin.z = min(pmin.z, src[idx].z);
   }

   // Reduce values within the work-group into local memory
   barrier(CLK_LOCAL_MEM_FENCE);
   if (get_local_id(0) == 0)
      lmin[0] = (float3)(pow(2.0,32.0)-1, pow(2.0,32.0)-1, pow(2.0,32.0)-1);
   for (int n = 0; n < get_local_size(0); n++) {
      barrier(CLK_LOCAL_MEM_FENCE);
      if (get_local_id(0) == n) {
         lmin[0].x = min(lmin[0].x, pmin.x);
         lmin[0].y = min(lmin[0].y, pmin.y);
         lmin[0].z = min(lmin[0].z, pmin.z);
      }
   }
   barrier(CLK_LOCAL_MEM_FENCE);

   // Write to __global gmin which will contain the work-group minima
   if (get_local_id(0) == 0)
      gmin[get_group_id(0)] = lmin[0];

   // Collect debug information
   if (get_global_id(0) == 0) {
      dbg[0] = get_num_groups(0);
      dbg[1] = get_global_size(0);
      dbg[2] = count;
      dbg[3] = stride;
   }
}

                                                                                    

__kernel void min_reduce3(__global float3 *gmin)
{
   for (int n = 0; n < get_global_size(0); n++) {
      barrier(CLK_GLOBAL_MEM_FENCE);
      if (get_global_id(0) == n) {
         gmin[0].x = min(gmin[0].x, gmin[n].x);
         gmin[0].y = min(gmin[0].y, gmin[n].y);
         gmin[0].z = min(gmin[0].z, gmin[n].z);
      }
   }
   barrier(CLK_GLOBAL_MEM_FENCE);
}

I think the problem is with get_global_id(0) and get_global_size(0), which give the id and size over the entire index space instead of just the number of float3 rows. Any suggestions?

2 Replies
himanshu_gautam
Grandmaster

That is a very bad way to write a reduce operation. You are effectively blocking all the threads and running just one thread at a time. And from a correctness point of view, CLK_GLOBAL_MEM_FENCE cannot provide global synchronization, so errors are expected.

You should learn from the Reduction sample in the APP SDK how to parallelize reduction-type operations.

You need to do something like this (assuming 1024 total threads and a work-group size of 256). Stage 1: each thread reads two elements from the global buffer and selects the minimum of the two. Stage 2: only half the threads work (128 threads per work-group); each reads two of the values selected in Stage 1 and finds the minimum of those. Stage 3: only 64 threads per work-group work, and so on.
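A minimal sketch of that staged scheme for float3 data could look like the one below (the kernel name and arguments are placeholders, and a power-of-two work-group size is assumed):

__kernel void tree_min3(__global const float3 *src,
                        __global float3       *gmin,
                        __local  float3       *lmin,
                        uint                   nitems)
{
   uint lid = get_local_id(0);
   uint gid = get_global_id(0);

   // Stage 1: each work-item loads two elements and keeps the smaller one.
   // fmin() operates component-wise on vector types, so no .x/.y/.z code
   // is needed.
   float3 m = (float3)(FLT_MAX, FLT_MAX, FLT_MAX);
   if (2 * gid < nitems)
      m = src[2 * gid];
   if (2 * gid + 1 < nitems)
      m = fmin(m, src[2 * gid + 1]);
   lmin[lid] = m;
   barrier(CLK_LOCAL_MEM_FENCE);

   // Stages 2, 3, ...: halve the number of active work-items each pass
   // (128, 64, ... for a work-group of 256).
   for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
      if (lid < s)
         lmin[lid] = fmin(lmin[lid], lmin[lid + s]);
      barrier(CLK_LOCAL_MEM_FENCE);
   }

   // One partial minimum per work-group.
   if (lid == 0)
      gmin[get_group_id(0)] = lmin[0];
}

With 1024 work-items in groups of 256 this consumes 2048 input elements and leaves four partial minima in gmin, which still need a second kernel launch or a host-side pass to combine, since work-groups cannot synchronize with each other.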


Hey, thanks for the reply. I will look through the example from the APP SDK. Thanks for the pointer.

Arpan
