aj_guillon
Adept I

__constant effect on kernel bandwidth

I have a kernel that uses an array of properties like this: typedef struct { float a; float b; float c; } type_t.

Upon close reading of the OpenCL programming guide, I've realized that my __constant type_t* properties kernel argument might not lead to efficient performance.  According to the guide, if each thread accesses a different property the __constant data will simply be put into __global.  In my application, I have many data points which store an index into the properties array, and this is used for computation.  In particular, I do something like:

...computation... * properties[thread.property_id].a

My understanding is that because the access is not known at compile-time, and because all threads will take different paths, the __constant space will not be used.  Because every thread (on the order of 1 000 000) requires access to these properties (on the order of 50), it seems like my code will encounter a huge bottleneck if the constant data is simply in global memory in one place, since access to that memory will be serialized (according to the guide).

My question is... how can I prevent this given the problem I've outlined?  If global memory is used, is the compiler intelligent enough to perhaps duplicate the constants in global memory so that there are no bank conflicts?  One solution I can think of... is to traverse all properties with a select() statement... so I can do something like

for(int i = 0; i < CONSTANT_SIZE; ++i)  { my_constant = select(my_constant, properties[i].a, (int)(i == thread.property_id)); }  But for around 50 constants, this can be a tad wasteful.
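To make the idea concrete, here is a minimal sketch of the scan in plain C, written for the host so it can be compiled and checked without a GPU. The function names lookup_direct and lookup_scan are hypothetical, and the ternary stands in for OpenCL's select(a, b, c), which returns b where c is true. It only shows that the scan returns the same value as the indexed load; it says nothing about which one is faster on a given device.

```c
/* Property record from the original post. */
typedef struct { float a; float b; float c; } type_t;

/* Direct lookup: one load whose address depends on run-time data. */
float lookup_direct(const type_t *properties, int property_id)
{
    return properties[property_id].a;
}

/* Scan lookup (hypothetical helper): read every entry at an index that
   depends only on the loop counter, and keep the one that matches.
   The ternary plays the role of select(result, properties[i].a, cond). */
float lookup_scan(const type_t *properties, int count, int property_id)
{
    float result = 0.0f;
    for (int i = 0; i < count; ++i) {
        result = (i == property_id) ? properties[i].a : result;
    }
    return result;
}
```

With about 50 entries the scan performs about 50 reads per work-item to replace a single indexed read, which is the waste mentioned above.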

Finally, what would you say the real effect on kernel bandwidth can be?  In my case, I measure one function as operating at a bandwidth of 50GB/s (on an HD 5970), but this problem exists with __constant.  Could accesses to __constant as __global as outlined in the guide cause a significant drop in performance (on the order of 20GB/s)?

Thanks, I hope this question is clear.

himanshu_gautam
Grandmaster

Hi aj_guillon
Constant elements can only be placed in the constant cache (bandwidth ~2 TB/s) in direct-addressed mode, i.e. when the address is known at compile time. If the address of an element has to be computed, the element is fetched from global memory at run time.

But constant elements are both L1 and L2 cached, while global variables are only L1 cached. So if you have an appropriate access pattern, you can get much better performance using the __constant qualifier.

Note: do not exceed the limit for constant buffer allocation. (Use the clinfo sample to find out the limit for your GPU.)
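As a rough check of that limit for the table in the question, the sketch below computes the footprint of the properties array in plain C. ASSUMED_CONSTANT_LIMIT is an assumption: it is the 64 KB minimum that the OpenCL full profile requires for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, not the value for any particular GPU, so substitute the figure clinfo reports. The helper names are hypothetical.

```c
#include <stddef.h>

/* Property record from the original post: three floats, 12 bytes. */
typedef struct { float a; float b; float c; } type_t;

/* Assumption: full-profile minimum for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.
   Replace with the value clinfo reports for the actual device. */
#define ASSUMED_CONSTANT_LIMIT (64u * 1024u)

/* Bytes occupied by `count` property records. */
size_t properties_bytes(size_t count)
{
    return count * sizeof(type_t);
}

/* Nonzero if `count` records fit within `limit_bytes` of constant memory. */
int fits_in_constant(size_t count, size_t limit_bytes)
{
    return properties_bytes(count) <= limit_bytes;
}
```

For the roughly 50 entries described in the question this comes to 600 bytes, far below the assumed limit, so the size of the table is unlikely to be the constraint here.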

Can you post the outputs of the Globalbandwidth sample and the constant bandwidth sample? You can see the difference there.
