cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Fr4nz
Journeyman III

Bit count in OpenCL

Does it exist in OpenCL a fast way (a function or an extension to the standard) to count "1" bits present in a word (for ie, an unsigned int), without having to use loops, shifts and masks or fast techniques like precomputed lookup tables (which would require using a portion of constant memory...)?

If I'm not wrong, ATI videocards have an instruction that does this...I'm guessing if this is usable in OpenCL, given that there's noting written about this topic in the OpenCL specs.

0 Likes
9 Replies
Fr4nz
Journeyman III

Up!

0 Likes

Fr4nz,

There is no functionality currently to do this.

0 Likes

While the native bitcount isn't available for OpenCL (yet?), you can possibly be inspired by this page:

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

It lists quite a few methods, a few of them without loops and without lookup tables yet are still very fast.

There are even benchmarks but as they are written in C and measured on a CPU, you probably shouldn't rely on it.

0 Likes

Originally posted by: galmok While the native bitcount isn't available for OpenCL (yet?), you can possibly be inspired by this page:

 

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

 

It lists quite a few methods, a few of them without loops and without lookup tables yet are still very fast.

 

There are even benchmarks but as they are written in C and measured on a CPU, you probably shouldn't rely on it.



Thank you galmok. I've already considered the methods reported in that page, but there are problems when using these with vectorized variables:

- precomputed tables: can be vectorized but are slightly slower that naive-vectorized for loop. Dunno why...;

- dense/sparse methods: this method, when using vectorized variables, requires the use of a condition like "any(var != 0))" in the for loop, which uses the function any: the use of this function has a cost, and this method turns out to be slower than the naive-vectorized loop approach;

- MIT Hakmem: this method, in theory, should be perfect because it allows the use of vectorized variables and it makes few arithmetical operations (only 10 ops per 32-bit word, instead of 64 ops). In reality, I've found that it doesn't always give better result than the naive approach, and this is quite a mistery for me. Maybe it uses too much registers? Or we have some inefficiences at a driver level? I've reported a kernel below that uses this approach, maybe Omkaranathan can shed some light about this issue...

__kernel void secondpass(__global uint4 *inputVD, __global uint *inputPrefixCk, __global uint *inputPosAllKeyCandidates, __local uint *tmpCandItems, __global uint4 *outputKeyIntersections, __global uint *outputCounts, const int numCandidates, const int dimRow4, const int k, const int soglia, const int start, const int end) { const int gid = get_global_id(0); const int lid = get_local_id(0); const int lsize = get_local_size(0); // Private vectorized variables uint4 tmpIntersect; // Used when computing temporary AND-intersections; uint4 tmpCOUNTS = 0; // Used when computing the amount of 1 bits in a word; uint4 init; // Dummy variable used for "for lops" uint4 tmp; // Dummy variable used during bit counting; init.x = (gid < numCandidates) ? soglia : k; for( ; init.x<k; init.x++) tmpCandItems[lid + init.x*lsize] = inputPrefixCk[gid*k + init.x]; for(init.x=start; init.x<end; init.x++) { if(gid < numCandidates) { tmpIntersect = outputKeyIntersections[inputPosAllKeyCandidates[gid]*(end-start) + (init.x-start)]; for(init.y=soglia; init.y<k; init.y++) tmpIntersect &= inputVD[tmpCandItems[lid + init.y*lsize]*dimRow4 + init.x]; // BIT COUNT: counts bit in tmpIntersect vectorized variable (MIT Hakmem method) tmp = tmpIntersect - ((tmpIntersect >> (uint4)1) & (uint4)033333333333) - ((tmpIntersect >> (uint4)2) & (uint4)011111111111); tmpCOUNTS += ((tmp + (tmp >> (uint4)3)) & (uint4)030707070707) % (uint4)63; } } if(gid < numCandidates) outputCounts[gid] += (tmpCOUNTS.x + tmpCOUNTS.y + tmpCOUNTS.z + tmpCOUNTS.w); }

0 Likes

Hmm the problem could be the limited integer size (24 bits on 5870) and that only 1 multiply or addition is possible per core per clock (again, 5870). The special core can do a multiplication or a bit shift (if I read it right... I would like to know if this is true).

So, you want to avoid divisions (modulos) and bit shift in order to keep the code on the stream cores.

A few other bitcount algorithms here:

http://www-graphics.stanford.edu/~seander/bithacks.html

 

0 Likes

Originally posted by: galmok Hmm the problem could be the limited integer size (24 bits on 5870) and that only 1 multiply or addition is possible per core per clock (again, 5870). The special core can do a multiplication or a bit shift (if I read it right... I would like to know if this is true).


Where did you get these infos (especially the 24-bit thing)? I thought 5xxx could manage 32-bit integer words...

 

A few other bitcount algorithms here:

http://www-graphics.stanford.edu/~seander/bithacks.html



Algos contained in this link are mostly the same reported above, so nothing useful for our GPU 😕

Anyway it would be reaaally nice if OpenCL let us use directly the bit count instruction available in our videocards...

0 Likes

I read it here:

http://www.tomshardware.com/reviews/radeon-hd-5870,2422-4.html

 

0 Likes

galmok,
All integer operations(except for special 24bit operations) are 32bits. The 24bit operations are faster and can execute on all pipes whereas some 32bit integer operations can only operate on the special function pipe. The review you read is referring the 'fast' 24bit operations only, not that the GPU only does 24bit integer math.
0 Likes

Originally posted by: MicahVillmow galmok, All integer operations(except for special 24bit operations) are 32bits. The 24bit operations are faster and can execute on all pipes whereas some 32bit integer operations can only operate on the special function pipe. The review you read is referring the 'fast' 24bit operations only, not that the GPU only does 24bit integer math.


Micah, what about the performance issues I encounter under certain conditions when using MIT Hakmem counting method (also reported in the kernel posted above)? Does it use too many register? Or what?

0 Likes