aj_guillon

__constant effect on kernel bandwidth

Discussion created by aj_guillon on Oct 18, 2010
Latest reply on Oct 19, 2010 by himanshu.gautam

I have a kernel that uses an array of properties, where each element is defined like this: typedef struct { float a; float b; float c; } type_t;

Upon close reading of the OpenCL programming guide, I've realized that my __constant type_t* properties kernel argument might not lead to efficient performance.  According to the guide, if each thread accesses a different property the __constant data will simply be put into __global.  In my application, I have many data points which store an index into the properties array, and this is used for computation.  In particular, I do something like:

...computation... * properties[thread.property_id].a
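To make the access pattern concrete, here is a minimal host-side sketch in plain C, where one loop iteration stands in for one work-item. The point_t record, its value field, and the function name scale_by_property_a are hypothetical; the post only says that each data point stores an index into the properties array.

```c
#include <stddef.h>

typedef struct { float a; float b; float c; } type_t;

/* Hypothetical per-point record: a value plus an index into the properties array. */
typedef struct { float value; int property_id; } point_t;

/* Host-side stand-in for the kernel body: one iteration per work-item. */
void scale_by_property_a(const type_t *properties, const point_t *points,
                         float *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid) {
        /* Data-dependent index: not known at compile time, and it can
           differ from one work-item to the next. */
        out[gid] = points[gid].value * properties[points[gid].property_id].a;
    }
}
```

In the real kernel the loop disappears and gid would come from get_global_id(0); the indexing expression is the part that matters here.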

My understanding is that because the access is not known at compile time, and because all threads will take different paths, the __constant space will not be used.  Because every thread (on the order of 1 000 000) requires access to these properties (on the order of 50), it seems like my code will encounter a huge bottleneck if the constant data is simply in global memory in one place, since access to that memory will be serialized (according to the guide).

My question is... how can I prevent this given the problem I've outlined?  If global memory is used, is the compiler intelligent enough to perhaps duplicate the constants in global memory so that there are no bank conflicts?  One solution I can think of... is to traverse all properties with a select() statement... so I can do something like

for (int i = 0; i < CONSTANT_SIZE; ++i) { my_constant = select(my_constant, properties[i].a, i == thread.property_id); }

But for around 50 constants, this can be a tad wasteful.
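For reference, the scan-and-select idea can be checked on the host with a small C sketch. select_f is a hypothetical scalar stand-in for OpenCL's built-in select(a, b, c), which for scalar arguments yields b when c is non-zero and a otherwise; lookup_a_by_scan and its parameters are likewise names made up for this illustration.

```c
typedef struct { float a; float b; float c; } type_t;

/* Scalar stand-in for OpenCL's select(a, b, c): b if c is non-zero, else a. */
static float select_f(float a, float b, int c)
{
    return c ? b : a;
}

/* Visit every entry in a fixed order so that each work-item reads the same
   address on the same iteration, keeping only the entry that matches. */
float lookup_a_by_scan(const type_t *properties, int count, int property_id)
{
    float my_constant = 0.0f;
    for (int i = 0; i < count; ++i) {
        my_constant = select_f(my_constant, properties[i].a, i == property_id);
    }
    return my_constant;
}
```

Every index in the loop is uniform across work-items, which is the property the __constant path wants, but the cost is count reads and selects per lookup instead of one.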

Finally, what would you say the real effect on kernel bandwidth can be?  In my case, I calculate that one function operates at a bandwidth of 50 GB/s (on an HD 5970), but it has this problem with __constant.  Could accesses to __constant being treated as __global, as outlined in the guide, cause a significant drop in performance (on the order of 20 GB/s)?
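For anyone comparing numbers, one common way to arrive at a figure like this is bytes moved divided by kernel time. The sketch below assumes that definition; the post does not say how the 50 GB/s was actually obtained, and effective_bandwidth_gbs and its parameters are hypothetical names.

```c
/* Effective bandwidth in GB/s (1 GB = 1e9 bytes), assuming it is defined as
   total bytes read plus bytes written, divided by elapsed kernel time. */
double effective_bandwidth_gbs(double bytes_read, double bytes_written,
                               double seconds)
{
    return (bytes_read + bytes_written) / seconds / 1e9;
}
```

Timing the same kernel with the properties passed as __constant and then as __global, and comparing the two results, would show how large the drop really is.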

Thanks, I hope this question is clear.
