9 Replies Latest reply on Mar 15, 2010 5:25 PM by Fr4nz

    Bit count in OpenCL

    Fr4nz

      Does it exist in OpenCL a fast way (a function or an extension to the standard) to count "1" bits present in a word (for ie, an unsigned int), without having to use loops, shifts and masks or fast techniques like precomputed lookup tables (which would require using a portion of constant memory...)?

      If I'm not wrong, ATI videocards have an instruction that does this...I'm guessing if this is usable in OpenCL, given that there's noting written about this topic in the OpenCL specs.

        • Bit count in OpenCL
          Fr4nz

          Up!

            • Bit count in OpenCL
              omkaranathan

              Fr4nz,

              There is no functionality currently to do this.

                • Bit count in OpenCL
                  galmok

                  While the native bitcount isn't available for OpenCL (yet?), you can possibly be inspired by this page:

                  http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

                  It lists quite a few methods, a few of them without loops and without lookup tables yet are still very fast.

                  There are even benchmarks but as they are written in C and measured on a CPU, you probably shouldn't rely on it.

                    • Bit count in OpenCL
                      Fr4nz

                       

                      Originally posted by: galmok While the native bitcount isn't available for OpenCL (yet?), you can possibly be inspired by this page:

                       

                      http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

                       

                      It lists quite a few methods, a few of them without loops and without lookup tables yet are still very fast.

                       

                      There are even benchmarks but as they are written in C and measured on a CPU, you probably shouldn't rely on it.



                      Thank you galmok. I've already considered the methods reported in that page, but there are problems when using these with vectorized variables:

                      - precomputed tables: can be vectorized but are slightly slower that naive-vectorized for loop. Dunno why...;

                      - dense/sparse methods: this method, when using vectorized variables, requires the use of a condition like "any(var != 0))" in the for loop, which uses the function any: the use of this function has a cost, and this method turns out to be slower than the naive-vectorized loop approach;

                      - MIT Hakmem: this method, in theory, should be perfect because it allows the use of vectorized variables and it makes few arithmetical operations (only 10 ops per 32-bit word, instead of 64 ops). In reality, I've found that it doesn't always give better result than the naive approach, and this is quite a mistery for me. Maybe it uses too much registers? Or we have some inefficiences at a driver level? I've reported a kernel below that uses this approach, maybe Omkaranathan can shed some light about this issue...

                      __kernel void secondpass(__global uint4 *inputVD, __global uint *inputPrefixCk, __global uint *inputPosAllKeyCandidates, __local uint *tmpCandItems, __global uint4 *outputKeyIntersections, __global uint *outputCounts, const int numCandidates, const int dimRow4, const int k, const int soglia, const int start, const int end) { const int gid = get_global_id(0); const int lid = get_local_id(0); const int lsize = get_local_size(0); // Private vectorized variables uint4 tmpIntersect; // Used when computing temporary AND-intersections; uint4 tmpCOUNTS = 0; // Used when computing the amount of 1 bits in a word; uint4 init; // Dummy variable used for "for lops" uint4 tmp; // Dummy variable used during bit counting; init.x = (gid < numCandidates) ? soglia : k; for( ; init.x<k; init.x++) tmpCandItems[lid + init.x*lsize] = inputPrefixCk[gid*k + init.x]; for(init.x=start; init.x<end; init.x++) { if(gid < numCandidates) { tmpIntersect = outputKeyIntersections[inputPosAllKeyCandidates[gid]*(end-start) + (init.x-start)]; for(init.y=soglia; init.y<k; init.y++) tmpIntersect &= inputVD[tmpCandItems[lid + init.y*lsize]*dimRow4 + init.x]; // BIT COUNT: counts bit in tmpIntersect vectorized variable (MIT Hakmem method) tmp = tmpIntersect - ((tmpIntersect >> (uint4)1) & (uint4)033333333333) - ((tmpIntersect >> (uint4)2) & (uint4)011111111111); tmpCOUNTS += ((tmp + (tmp >> (uint4)3)) & (uint4)030707070707) % (uint4)63; } } if(gid < numCandidates) outputCounts[gid] += (tmpCOUNTS.x + tmpCOUNTS.y + tmpCOUNTS.z + tmpCOUNTS.w); }

                • Bit count in OpenCL
                  MicahVillmow
                  galmok,
                  All integer operations(except for special 24bit operations) are 32bits. The 24bit operations are faster and can execute on all pipes whereas some 32bit integer operations can only operate on the special function pipe. The review you read is referring the 'fast' 24bit operations only, not that the GPU only does 24bit integer math.
                    • Bit count in OpenCL
                      Fr4nz

                       

                      Originally posted by: MicahVillmow galmok, All integer operations(except for special 24bit operations) are 32bits. The 24bit operations are faster and can execute on all pipes whereas some 32bit integer operations can only operate on the special function pipe. The review you read is referring the 'fast' 24bit operations only, not that the GPU only does 24bit integer math.


                      Micah, what about the performance issues I encounter under certain conditions when using MIT Hakmem counting method (also reported in the kernel posted above)? Does it use too many register? Or what?