5 Replies Latest reply on Mar 20, 2010 11:25 PM by Fr4nz

    Performance inconsistencies when testing various bit-counting methods

    Fr4nz

      Yesterday I spent part of my day testing various bit-counting methods on my ATI 5770 and I'm incurring in strange performance inconsistencies, that is, bit-counting methods that should be faster are slower. I'm here in order to understand if these inconsistencies have to do with hardware limits, poor kernel design or inefficiences/bugs in current ATI OpenCL implementation.


      All the kernels I'm reporting below as references makes these simple things: consider an uint4 variable, let's call it tmpIntersect, that is continuously updated by some AND-operations. Then, every time this variable is updated:

      1) Apply a bit-counting method in order to count the bits contained in tmpIntersect;
      2) Update the partial sum, tmpCOUNTS, with the result obtained in "1)";


      The bit-counting methods I've tested are 3 (I'm leaving out the MIT-Hakmem method for the moment), all taken from http://tinyurl.com/6p5rm9 and adapted for GPU and vectorization:

      1) Naive-vectorized method (in theory the slowest one): it consists of a FOR loop that executes 32 times 3 operations for every tmpIntersect component: 1) an AND with a mask, in order to check if the least significant bit is equal to "1", 2) a SUM that increases the partial sum with the result obtained from the AND and, in the end, 3) a RIGHT SHIFT.
      All these operations can be made using vectorized variables, so we have parallelism at operation level;

      2) Usage of a precomputed 8-bit lookup table (in theory faster than naive-vectorized method): at the host side we precompute a table with 256 entries, each one containing the number of "1" bits that its corresponding index, thought as the numerical value of an 8-bit unsigned word, contains. This table is loaded in constant memory as an uchar array.
      In the kernel, when we want to compute the number of "1" bits in tmpIntersect, for every component we mask the first 24 bits, then we make a lookup in the table and update the partial sum; we repeat this procedure 3 times, every time doing an 8-bit shift to the right. In total we have 10 operations per component. Also this method can be completely vectorized, as you can see from the second kernel reported;

      3) Usage of a precomputed 16-bit lookup table (in theory the fastest method): the idea is the same as above, the advantage is that we can analyze larger parts (16 bits instead of 8) of every word, lessening the total number of instructions used (5 operations per component). Also this method is completely vectorizable;


      Okay, the problem is that method "2)" is only lightly faster than "1)" and method "3)" is slower than "2)" but also than "1)" !, which is quite nonsensical!
      I'm starting to think that we have some inefficiences in current OpenCL implementation, but I'd wish to know from AMD staff why method 2) gives me poor performance results, if compared to what I expected, and, above all, why method 3), which should be the fastest one, is actually the slowest.

      Thank you for any answer!

      PS: My specs => Ubuntu 9.04 64bit, ATI 5770, Catalyst 10.2 and OpenCL 2.01 implementation.

      /** * FIRST KERNEL: using naive-vectorized bit counting method **/ __kernel void intersect(__global uint4 *inputVD, __global uint *inputPrefixCk, __local uint4 *tmpBuffer, __local uint *tmpCandItems, __global uint *outputCounts, const int k, const int numItems, const int dimRowVec4, const int numIntersections) { // Work-group properties; const int gid = get_global_id(0); const int lid = get_local_id(0); const int lsize = get_local_size(0); // Mask used during bit-counting const uint4 MASK = 1; // Counting variable uint4 tmpCOUNTS = 0; // Temporary variable used for AND-intersections computing; uint4 tmpIntersect; // Load of the item of the itemset given to this work-item; int init = (gid < numIntersections) ? 0 : k; for( ; init<k; init++) tmpCandItems[lid + init*lsize] = inputPrefixCk[gid*k + init]; // Beginning of current slice processing for(int i=0; i<dimRowVec4; i++) { // Load of the current slice in local memory init = (lid < numItems) ? lid : numItems; for( ; init<numItems; init+=lsize) tmpBuffer[init] = inputVD[init*dimRowVec4 + i]; barrier(CLK_LOCAL_MEM_FENCE); if(gid < numIntersections) { // First AND-intersection; tmpIntersect = tmpBuffer[tmpCandItems[lid + (k-1)*lsize]] & tmpBuffer[tmpCandItems[lid + (k-2)*lsize]]; // Successive AND-intersections; for(init=(k-3); init>=0; init--) tmpIntersect &= tmpBuffer[tmpCandItems[lid + init*lsize]]; // Naive-vectorized for-loop bit-count method for(init=0; init<32; init++) { tmpCOUNTS += (tmpIntersect & MASK); tmpIntersect = tmpIntersect >> MASK; } } } // End of current slice processing // write final result if(gid < numIntersections) outputCounts[gid] = tmpCOUNTS.x + tmpCOUNTS.y + tmpCOUNTS.z + tmpCOUNTS.w; } /** * SECOND KERNEL: using 8-bit precomputed lookup table **/ __kernel void intersect(__global uint4 *inputVD, __global uint *inputPrefixCk, __local uint4 *tmpBuffer, __local uint *tmpCandItems, __global uint *outputCounts, __constant uchar lookupTable[256], const int k, const int numItems, const int dimRowVec4, const int numIntersections) { // Work-group properties; const int gid = get_global_id(0); const int lid = get_local_id(0); const int lsize = get_local_size(0); // Mask used during bit-counting; const uint4 MASK = 255; // variable containing partial sum; uint4 tmpCOUNTS = 0; // Temporary variable used for AND-intersections computing; uint4 tmpIntersect; // Variable dummy; uint4 dummy; // Variable used when retrieving values from the lookup table; uint4 lookupValues; // Load of the item of the itemset given to this work-item; int init = (gid < numIntersections) ? 0 : k; for( ; init<k; init++) tmpCandItems[lid + init*lsize] = inputPrefixCk[gid*k + init]; // Beginning of current slice processing for(int i=0; i<dimRowVec4; i++) { // Load of the current slice in local memory init = (lid < numItems) ? lid : numItems; for( ; init<numItems; init+=lsize) tmpBuffer[init] = inputVD[init*dimRowVec4 + i]; barrier(CLK_LOCAL_MEM_FENCE); if(gid < numIntersections) { // First AND-intersection; tmpIntersect = tmpBuffer[tmpCandItems[lid + (k-1)*lsize]] & tmpBuffer[tmpCandItems[lid + (k-2)*lsize]]; // Successive AND-intersections; for(init=(k-3); init>=0; init--) tmpIntersect &= tmpBuffer[tmpCandItems[lid + init*lsize]]; // Beginning bit-counting: using precomputed 8-bit lookup table; dummy = tmpIntersect & MASK; lookupValues = (uint4)(lookupTable[dummy.x], lookupTable[dummy.y], lookupTable[dummy.z], lookupTable[dummy.w]); tmpCOUNTS += lookupValues; dummy = (tmpIntersect >> (uint4)8); dummy &= MASK; lookupValues = (uint4)(lookupTable[dummy.x], lookupTable[dummy.y], lookupTable[dummy.z], lookupTable[dummy.w]); tmpCOUNTS += lookupValues; dummy = (tmpIntersect >> (uint4)16); dummy &= MASK; lookupValues = (uint4)(lookupTable[dummy.x], lookupTable[dummy.y], lookupTable[dummy.z], lookupTable[dummy.w]); tmpCOUNTS += lookupValues; dummy = (tmpIntersect >> (uint4)24); dummy &= MASK; lookupValues = (uint4)(lookupTable[dummy.x], lookupTable[dummy.y], lookupTable[dummy.z], lookupTable[dummy.w]); tmpCOUNTS += lookupValues; // END of bit-counting } } // End of current slice processing // write final result if(gid < numIntersections) outputCounts[gid] = tmpCOUNTS.x + tmpCOUNTS.y + tmpCOUNTS.z + tmpCOUNTS.w; } /** * THIRD KERNEL: using 16-bit precomputed lookup table **/ __kernel void intersect(__global uint4 *inputVD, __global uint *inputPrefixCk, __local uint4 *tmpBuffer, __local uint *tmpCandItems, __global uint *outputCounts, __constant uchar lookupTable[65536], const int k, const int numItems, const int dimRowVec4, const int numIntersections) { // Work-group properties; const int gid = get_global_id(0); const int lid = get_local_id(0); const int lsize = get_local_size(0); // Mask used during bit-counting; const uint4 MASK = 65535; // variable containing partial sum; uint4 tmpCOUNTS = 0; // Temporary variable used for AND-intersections computing; uint4 tmpIntersect; // Variable dummy; uint4 dummy; // Variable used when retrieving values from the lookup table; uint4 lookupValues; // Load of the item of the itemset given to this work-item; int init = (gid < numIntersections) ? 0 : k; for( ; init<k; init++) tmpCandItems[lid + init*lsize] = inputPrefixCk[gid*k + init]; // Beginning of current slice processing for(int i=0; i<dimRowVec4; i++) { // Load of the current slice in local memory init = (lid < numItems) ? lid : numItems; for( ; init<numItems; init+=lsize) tmpBuffer[init] = inputVD[init*dimRowVec4 + i]; barrier(CLK_LOCAL_MEM_FENCE); if(gid < numIntersections) { // First AND-intersection; tmpIntersect = tmpBuffer[tmpCandItems[lid + (k-1)*lsize]] & tmpBuffer[tmpCandItems[lid + (k-2)*lsize]]; // Successive AND-intersections; for(init=(k-3); init>=0; init--) tmpIntersect &= tmpBuffer[tmpCandItems[lid + init*lsize]]; // Beginning bit-counting: using precomputed 16-bit lookup table; dummy = tmpIntersect & MASK; lookupValues = (uint4)(lookupTable[dummy.x], lookupTable[dummy.y], lookupTable[dummy.z], lookupTable[dummy.w]); tmpCOUNTS += lookupValues; dummy = (tmpIntersect >> (uint4)16); dummy &= MASK; lookupValues = (uint4)(lookupTable[dummy.x], lookupTable[dummy.y], lookupTable[dummy.z], lookupTable[dummy.w]); tmpCOUNTS += lookupValues; } } // End of current slice processing // write final result if(gid < numIntersections) outputCounts[gid] = tmpCOUNTS.x + tmpCOUNTS.y + tmpCOUNTS.z + tmpCOUNTS.w; }

        • Performance inconsistencies when testing various bit-counting methods
          MicahVillmow
          I haven't analyzed the kernels to much, but just from looking at the ISA:
          Approach 1: 161 ALU bundles, 46 CF instructions, 31 Registers
          Approach 2: 124 ALU bundles, 63 CF instructions, 32 Registers
          Approach 3: 102 ALU bundles, 55 CF instructions, 24 Registers

          The ratio's of ALU to CF is low, ideally you want > 5:1 ratio, i.e. too much control flow, a simple way to fix this is unroll loops. The first approach goes down the fast memory path(; RatOpIsUsed = 0x2), the second and third approach go down the slower memory path(; RatOpIsUsed = 0x402). This is caused by your reads from the uchar pointer which in the current implementation go down the slow path.

          Hopefully this helps in better optimizing your code.
            • Performance inconsistencies when testing various bit-counting methods
              Fr4nz

               

              Originally posted by: MicahVillmow I haven't analyzed the kernels to much, but just from looking at the ISA: Approach 1: 161 ALU bundles, 46 CF instructions, 31 Registers Approach 2: 124 ALU bundles, 63 CF instructions, 32 Registers Approach 3: 102 ALU bundles, 55 CF instructions, 24 Registers The ratio's of ALU to CF is low, ideally you want > 5:1 ratio, i.e. too much control flow, a simple way to fix this is unroll loops. The first approach goes down the fast memory path(; RatOpIsUsed = 0x2), the second and third approach go down the slower memory path(; RatOpIsUsed = 0x402). This is caused by your reads from the uchar pointer which in the current implementation go down the slow path. Hopefully this helps in better optimizing your code.


               

              Hi micah, thank you for the quick answer. Well, 2) and 3) actually don't use loops when bit-counting (you can see that with your own eyes, it's all unrolled), so maybe the only thing that could keep ALU to CF ratio low is that the operations made when bit-counting contained in kernels 2) and 3) aren't properly vectorized. What do you think?

              The other point of your answer, that is, using uint instead of uchar for the lookup table, was partly useful: I had some performance improvement in "2)" ("3)" can't use uint because of the lookup table size, which would be 65536*4 bytes = 256 Kbytes > constant mem size). Better than nothing

              Do you plan to use the "fast path" approach when reading uchar variables in upcoming releases?

            • Performance inconsistencies when testing various bit-counting methods
              MicahVillmow
              Fr4nz,
              __constant uchar lookupTable[65536] is equivalent to __constant uint lookupTable[16384], if 4 uchar's are packed into a single uint.

              The new release should correctly put the uchar loads down the fast path.
                • Performance inconsistencies when testing various bit-counting methods
                  malcolm3141

                  Take a look at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive

                   

                  Then use the following code which doesn't require any tables...

                   

                  v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
                  v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
                  c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count



                  Malcolm


                  EDIT: Note also that Cypress has ICOUNTBITS IL instruction which corresponds to BCNT_INT ASM instruction. OpenCL doesn't have an equivalent builtin AFAIK.
                    • Performance inconsistencies when testing various bit-counting methods
                      Fr4nz

                       

                      Originally posted by: malcolm3141 Take a look at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive

                       

                       

                       

                      Then use the following code which doesn't require any tables...

                       

                       

                       

                      v = v - ((v >> 1) & 0x55555555); // reuse input as temporary
                      v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
                      c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count



                      Malcolm


                      EDIT: Note also that Cypress has ICOUNTBITS IL instruction which
                      corresponds to BCNT_INT ASM instruction. OpenCL doesn't have an
                      equivalent builtin AFAIK.

                       

                      Hi Malcom,

                      I'm aware there are various fast bit-counting methods (another one I've tried is "MIT Hakmem") which makes few operations and are fully vectorizable, potentially resulting very good for a GPU implementation. Unfortunately it seems that, under certain conditions, there are some performance problems with the actual AMD OpenCL implementation.

                      For example, I've implemented a specific algorithm in two ways: the first one consist in a monolithic-kernel version, and in this case I've found that these "fast" bit-counting methods work flawlessly and very well.

                      The second way consist in a two-kernel version (kernel A, executed first, and kernel B, executed after A, which also implements the bit counting phase): under certain conditions (especially when lots of threads are executed), using these fast bit-count approaches worsen performance instead of improving them.

                      I've tried to reduce register usage (for example splitting the operations made by these methods above many lines), but I had no luck...so, the only reasonable culprit is the actual OpenCL implementation.

                       

                      PS: Yes, it's very bad that OpenCL doesn't have a dedicated function for bit counting like CUDA...