37 Replies Latest reply on Sep 16, 2009 4:45 PM by ryta1203

    Compute versus Pixel, odd results

    ryta1203

      I'm looking at some results I got from running pixel and compute modes.

      For almost the same code, the compute shader runs much worse when ALU is not the bottleneck (even when it is, it's still a little slower). Does anyone have an idea why this is the case? There is only 1 output and I'm only using the global buffer for the output; the inputs still use texture fetches.

      For example, with 12 inputs, an ALU:Fetch ratio of 4.0 (according to the SKA equation; it's actually 16.0), 1 output and 5000 iterations, I get the following times:

      CS: 18.639 (ALU is not the bottleneck here; I'm not sure what is. It would appear to be memory, but there is only 1 output and the inputs are texture fetches)

      PS: 11.5 (ALU IS the bottleneck here)

      Anyone have an idea why the big difference?

       

      EDIT: Forgot to mention: run on an HD4870, no branching, no data reuse, float4 data types; the code is almost exactly the same for both kernels (minus the domain calculations for the compute shader).

        • Compute versus Pixel, odd results
          MicahVillmow
          Ryta,
          The most likely culprit here is how you are hitting the caches with the texture fetches. If you block the CS group accesses so they match the PS pattern, you should get equivalent or better results. Right now the CS is requesting data in a 64x1 manner; a better method would be to request it in a 16x4 or 8x8 manner.
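One way to see the difference (a rough sketch in Python; WIDTH is a hypothetical domain width): the 64 threads of a linear group form a 64x1 strip of texels, while a 16x4 split keeps the same 64 threads inside a compact tile that fits a 2D cache much better.

```python
# Sketch: 2D texel footprint of one 64-thread group, linear (64x1)
# vs blocked (16x4). WIDTH is a hypothetical domain width.
WIDTH = 256

def linear_coords(flat_tids):
    """64x1 group: consecutive flat ids map to consecutive texels in a row."""
    return [(t % WIDTH, t // WIDTH) for t in flat_tids]

def blocked_coords(flat_tids):
    """16x4 block: x = low 4 bits, y = next 2 bits (within the block)."""
    return [(t & 0xF, t >> 4) for t in flat_tids]

group0 = range(64)
lin = linear_coords(group0)   # a 64x1 strip: x spans 64 texels, y is constant
blk = blocked_coords(group0)  # a 16x4 tile: x spans 16 texels, y spans 4
print(max(x for x, _ in lin) - min(x for x, _ in lin) + 1)  # 64
print(max(x for x, _ in blk) - min(x for x, _ in blk) + 1)  # 16
```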
            • Compute versus Pixel, odd results
              ryta1203

              I get 0 for the cache counter in pixel shader mode and a little more than that in CS mode (~3.0), which should be the percentage of hits, so in pixel shader mode my kernels aren't getting any cache hits.

              If I use an 8x8 manner, how does that affect my domain calculation?

              I hope the docs get updated to include compute shader mode in v2.0; I'm just going off of the samples right now, and it looks like they all use 64x1, unless I'm misunderstanding you. Is that right?

            • Compute versus Pixel, odd results
              MicahVillmow
              Yes, they only use 64x1; however, if you look at the performance of compute_matmult versus simple_matmult over the whole range (64x64 -> 4kx4k), you will see a point where compute_matmult just dies while simple_matmult keeps performing well. This is where the caching structure helps.
              • Compute versus Pixel, odd results
                MicahVillmow
                The only sample that I know of that does this in any similar manner is lds_transpose, which breaks it into a strided 8x8.
                • Compute versus Pixel, odd results
                  MicahVillmow
                  ATI CAL 1.4.0_beta\samples\app\lds_transpose\docs

                  is where the doc should reside in your installation.
                  • Compute versus Pixel, odd results
                    MicahVillmow
                    Well, the basics are just doing a 1D->2D transform, then a 2D->2D transform, and then another 2D->1D transform.
                    So, you have a 64x1x1 group size and you want to transform it into a 16x4. The easiest way to do this is to calculate x = vTidInGrpFlat & 0x3 and y = vTidInGrpFlat >> 2; this gives you the x and y positions within any group.
                    Now to transform your group id correctly, you need to find out how many groups fit in a row. You have width, and you want to divide it by 16; that is the max number of groups per row.
                    So vGrpIdFlat % (width >> 4) gives you the thread's group id in the x dimension and vGrpIdFlat / (width >> 4) gives you the thread's group id in the y dimension.
                    Now in order to get your correct x/y location, you need to multiply your x group location by 16 and your y group location by 4 (since your group size is 16x4) and then add in your position within a group.
                    This should give you an x/y location for each thread in a blocking pattern similar to what the LDS_Transpose doc does, and getting your new AbsTid is as simple as y * width + x.

                    There might be some logic problems with the math, but if you write it out on a sheet of paper it becomes a lot easier to understand and you can probably catch any errors.
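Micah's steps can be sketched in Python (names are hypothetical; note the in-group split here uses & 0xF / >> 4, which yields a 16x4 block — the & 0x3 / >> 2 split in the post yields 4x16, as Micah notes further down the thread):

```python
# Sketch of the 64x1 -> 16x4 group transform described above.
# Names are hypothetical; width is the domain width in threads.
BLOCK_W, BLOCK_H = 16, 4

def flat_to_xy(grp_id_flat, tid_in_grp_flat, width):
    # Position within the 16x4 group: x = low 4 bits, y = upper bits.
    gx = tid_in_grp_flat & (BLOCK_W - 1)
    gy = tid_in_grp_flat >> 4
    groups_per_row = width >> 4          # 16-wide groups per domain row
    bx = grp_id_flat % groups_per_row    # group index in x
    by = grp_id_flat // groups_per_row   # group index in y
    x = bx * BLOCK_W + gx
    y = by * BLOCK_H + gy
    return x, y, y * width + x           # (x, y, new AbsTid)

print(flat_to_xy(0, 34, 256))            # thread 34 of group 0 -> (2, 2, 514)
```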

                      • Compute versus Pixel, odd results
                        riza.guntur

                         MicahVillmow, that means that in Brook+ the performance of a compute shader will never come near that of a CAL compute shader, since Brook+ doesn't support 2D or 3D group sizes?

                          • Compute versus Pixel, odd results
                            godsic

                            Sometimes I can't understand AMD. Why do you include features even when they are slow and inefficient?

                            My experience with OpenGL (CAL is just a simplified API version of OpenGL) shows the true face of the GPU:

                            1. Try to avoid texture fetching in shaders (kernels) because of TMU limitations and inefficient samplers.

                            2. NEVER USE INT TYPES!!!!!! NEVER!!!! Only 1 of the 5 scalar units can perform INT operations. The same rule applies to some FP operations (read the R600 doc carefully).

                            3. Try to avoid flow control in shaders; the number of FCUs is limited. If you need flow control, please use it carefully; at least 16 pixels (elements) must be coherent.

                            AND THE MAIN RULE - VECTORIZE all operations to 4 components.

                            ALWAYS USE THE NATIVE TYPE FLOAT4!!! FOR CONSTS, RESOURCES, etc.

                              • Compute versus Pixel, odd results
                                Gipsel

                                 

                                Originally posted by: godsic

                                2. NEVER USE INT TYPES!!!!!! NEVER!!!! Only 1 of the 5 scalar units can perform INT operations. The same rule applies to some FP operations (read the R600 doc carefully).



                                 

                                Not entirely true. With RV7x0, integer multiplications are handled by the t unit. Addition, binary operations, and shifts can be handled by all ALU units. RV6x0 restricts shifts to the t unit as well.

                                And yes, I have some code here that uses more than 4 of the 5 slots on average, exclusively with integer operations.

                            • Compute versus Pixel, odd results
                              ryta1203

                               

                              Originally posted by: MicahVillmow Well, the basics are just doing a 1D->2D transform and then a 2D->2D transform and then do another 2D->1D transform. So, you have a 64x1x1 group size and you want to transform it into a 16x4, the easiest way to do this is to calculate x = vTidInGrpFlat & 0x3 and y =vTidInGrpFlat >> 2, this gives you the x and y positions within any group. Now to transform your group id correctly, you need to find out how many groups fit in a row. So you have width, and you want to divide it by 16 and that is the max number of groups per row. So vGrpIdFlat % (width >> 4) gives you the threads group id in the x dimension and vGrpIdFlat / (width >> 4) give you the threads group id in the y dimension. Now in order to get your correct x/y location, you need to multiply your x group location by 16 and your y group location by 4(since your group size is 16x4) and then add in your position within a group. This should give you an x/y location for each thread in a blocking pattern similar to what the LDS_Transpose doc does and getting your new AbsTid is as simple as y * width + x. There might be some logic problems with the math, but if you write it out on a sheet of paper it becomes a lot easier to understand and you can probably catch any errors.


                              Once you calculate the x and y positions, why can't you just sample off of those? Do you really need to calculate the AbsTid, since the sample is 2D?

                              • Compute versus Pixel, odd results
                                ryta1203

                                 

                                Originally posted by: MicahVillmow Well, the basics are just doing a 1D->2D transform and then a 2D->2D transform and then do another 2D->1D transform. So, you have a 64x1x1 group size and you want to transform it into a 16x4, the easiest way to do this is to calculate x = vTidInGrpFlat & 0x3 and y =vTidInGrpFlat >> 2, this gives you the x and y positions within any group. Now to transform your group id correctly, you need to find out how many groups fit in a row. So you have width, and you want to divide it by 16 and that is the max number of groups per row. So vGrpIdFlat % (width >> 4) gives you the threads group id in the x dimension and vGrpIdFlat / (width >> 4) give you the threads group id in the y dimension. Now in order to get your correct x/y location, you need to multiply your x group location by 16 and your y group location by 4(since your group size is 16x4) and then add in your position within a group. This should give you an x/y location for each thread in a blocking pattern similar to what the LDS_Transpose doc does and getting your new AbsTid is as simple as y * width + x. There might be some logic problems with the math, but if you write it out on a sheet of paper it becomes a lot easier to understand and you can probably catch any errors.


                                 

                                Micah,

                                Sorry, a few more questions:

                                1. Why are you ANDing by 3?

                                2. What are "rows"?

                                3. Can you adjust the block size, not just the group size?

                                4. What do you mean by "add in your position within the group"? How do we find this?

                                Thanks.

                              • Compute versus Pixel, odd results
                                MicahVillmow
                                Ryta,
                                We are always trying to improve our documentation with the feedback that we receive but I can't give any specifics on what does/does not make it into future releases.
                                • Compute versus Pixel, odd results
                                  MicahVillmow
                                  The read is based off of the X/Y positions, but the write is based off of the new AbsTid.
                                    • Compute versus Pixel, odd results
                                      ryta1203

                                      Ok, this makes sense, thanks.

                                      • Compute versus Pixel, odd results
                                        ryta1203

                                         

                                        Originally posted by: MicahVillmow The read is based off of the X/Y positions, but the write is based off of the new AbsTid.


                                        Micah,

                                          Does this look right to you:

                                        Given a domain size of 256x256 and a Group Size of 16x4x1 (instead of 64x1x1).

                                        Accessing AbsTid 34 of 64x1x1 is:

                                        1. Block ID: 32 >> 4 = 2

                                        2. Tid/Block: 32 AND 15 = 2

                                        3a. Block ID X-dim: 2 MOD (256/16) = 2

                                        3b. Block ID Y-dim: 2/16 = 0

                                        4a. Tid/Block X-dim: 2 AND 3 = 2

                                        4b. Tid/Block Y-dim: 2 >> 2 = 0

                                        5a. Start Block X-dim: 2*4 = 8

                                        5b. Start Block Y-dim: 0*16 = 0

                                        6a. Start Thread X-dim: 8

                                        6b. Start Thread Y-dim: 0 + 2 = 2

                                        So I'm accessing coordinates 8, 2 for this problem? Why doesn't this look right to me?

                                        EDIT: Sorry that should be 34 AbsTid, not 32, sorry, fixed.

                                      • Compute versus Pixel, odd results
                                        MicahVillmow
                                        Ryta,
                                        Actually I got my rows and columns backward; AND'ing by 3 would give you a 4x16 block instead of a 16x4 block.
                                        So, let me go into a little more detailed explanation that I hope will answer your questions. This is in a similar vein to the LDS_transpose doc but should be more general.
                                        Given an execution domain of NxM and a group size G, we can execute CEIL((N*M)/G) groups, represented as P.
                                        So, in the 1D space we will have Group IDs from 0-(P-1), ThreadIDs from 0-(G-1) and AbsThreadIDs from 0-(N*M-1).
                                        Now the problem is that this linear progression of AbsThreadIDs from 0-(N*M-1) does not fit well into the 2D caching structure of our graphics chips, so we want to block our groups to get spatial locality in our memory accesses.
                                        So to get better cache performance, we would then block our groups into a more compact structure, in this case 16x4.
                                        The restriction of doing this then becomes that N must be evenly divisible by 16 and M must be evenly divisible by 4.
                                        So, the first step is to transform ThreadInGrpID from a single dimension of 0-(G-1) to two dimensions of 0-15 and 0-3, called G`.
                                        The easiest way to do this is to calculate it via the following:
                                        G`.x = G & 0xF and G`.y = G >> 4
                                        Now we have to modify P so that our group id is in two dimensions. In order to do this we pass down via constant buffers the number of groups per row (GPR) and 1/GPR, and this calculation gives us P`.
                                        P`.x = P % GPR, P`.y = P * (1/GPR), and these need to be scaled based on our new blocking for a group, so we multiply P`.x by 16 and P`.y by 4.
                                        P` gives the thread location in the 2-dimensional space of each block and G` gives the thread location within each block, so to get the absolute thread id in 2 dimensions we just add P` and G` together.

                                        Now, you can adjust the block size and the group size, but the optimal size is usually a multiple of the wavefront size, and on HD48XX the wavefront size is 64 threads. The reason you don't want more than one wavefront in a group is that synchronization is required (i.e. a barrier) to guarantee that all wavefronts in a group exit the SIMD before a wavefront from another group enters the SIMD. This barrier is not a cheap operation, as it is emulated in software using something akin to a spinlock.
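A sketch of the general blocking Micah describes, for a hypothetical N x M domain with 64-thread groups blocked as 16x4 (names are illustrative; the P`.x term is taken as P mod GPR here, matching the earlier % (width >> 4) formula). The coverage check at the end confirms the blocking visits every (x, y) exactly once:

```python
# General blocking sketch: domain N x M, 64-thread groups blocked as 16x4.
# GPR = groups per row; N divisible by 16, M divisible by 4.
N, M = 64, 8
GPR = N // 16

def blocked_xy(p, g):
    """p = flat group id, g = flat thread-in-group id -> absolute (x, y)."""
    gx, gy = g & 0xF, g >> 4                   # G`: position inside the block
    px, py = (p % GPR) * 16, (p // GPR) * 4    # P`: scaled group origin
    return px + gx, py + gy

# Every (group, thread) pair should cover each (x, y) exactly once.
seen = {blocked_xy(p, g) for p in range((N * M) // 64) for g in range(64)}
print(len(seen) == N * M)                      # True: the blocking is a bijection
```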
                                          • Compute versus Pixel, odd results
                                            ryta1203

                                             

                                            Originally posted by: MicahVillmow Ryta, Actually I got my rows and columns backward, AND'ing by 3 would give you a 4x16 block instead of a 16x4 block. So, let me go into a little more detailed explanation that I hope will answer your questions. This is in a similiar vein to the LDS_transpose doc but should be more general. Given an execution domain of NxM and a group size G, we can execute CEIL((N*M)/G) groups, represented as P. So, in the 1D space we will have Group ID's from 0-(P-1), we have ThreadID from 0-(G-1) and AbsThreadId from 0 - (N*M-1) Now the problem is that this linear progression of AbsThreadID's from 0-(N*M-1) does not fit well into the 2D caching structure of our graphics chips, so we want to block our groups to get spacial locality in our memory accesses. So to get better cache performance, we would then block our groups into a more compact structure, and in this case 16x4. The restrictions of doing this then becomes that N must be evenly divisible by 16 and M must be evenly divisible by 4. So, the first step is to transform ThreadInGrpID from a single dimension of 0-(G-1) to a two dimension of 0-16 and 0-4, called G`. The easiest way to do this is to calculate it via the following: G`.x = G & 0xF and G`.y = G >> 4 Now we have to modify P so that our group id is now in two dimensions. In order to do this we pass down via constant buffers the number of groups per row(GPR) and 1/GPR, and this calculation gives us P`. P`.x = P / GPR, P`.y = P * (1/GPR), and these need to be scaled based on our new blocking for a group so we multiply P`.x by 16 and P`.y by 4. P` gives the thread location in the 2 dimensional space of each block and G` gives the thread location within each block, so to get the absolute thread id in 2 dimensions we just add P` and G` together. 
Now, you can adjust the block size and the group size, but the optimal size is usually a multiple of the wavefront size and on HD48XX the wavefront size is 64 threads. The reason you don't want more than one wavefront in a group is that synchronization is required(i.e. barrier) to guarantee that all wavefronts in a group exit the SIMD before another wavefront from another group enters the SIMD. This barrier is not a cheap operation as it is emulated in software using something akin to a spinlock.


                                            OK, lol, one more question (whew!): What is the number of groups per row? What is a row?

                                              • Compute versus Pixel, odd results
                                                ryta1203

                                                 

                                                Originally posted by: ryta1203
                                                Originally posted by: MicahVillmow Ryta, Actually I got my rows and columns backward, AND'ing by 3 would give you a 4x16 block instead of a 16x4 block. So, let me go into a little more detailed explanation that I hope will answer your questions. This is in a similiar vein to the LDS_transpose doc but should be more general. Given an execution domain of NxM and a group size G, we can execute CEIL((N*M)/G) groups, represented as P. So, in the 1D space we will have Group ID's from 0-(P-1), we have ThreadID from 0-(G-1) and AbsThreadId from 0 - (N*M-1) Now the problem is that this linear progression of AbsThreadID's from 0-(N*M-1) does not fit well into the 2D caching structure of our graphics chips, so we want to block our groups to get spacial locality in our memory accesses. So to get better cache performance, we would then block our groups into a more compact structure, and in this case 16x4. The restrictions of doing this then becomes that N must be evenly divisible by 16 and M must be evenly divisible by 4. So, the first step is to transform ThreadInGrpID from a single dimension of 0-(G-1) to a two dimension of 0-16 and 0-4, called G`. The easiest way to do this is to calculate it via the following: G`.x = G & 0xF and G`.y = G >> 4 Now we have to modify P so that our group id is now in two dimensions. In order to do this we pass down via constant buffers the number of groups per row(GPR) and 1/GPR, and this calculation gives us P`. P`.x = P / GPR, P`.y = P * (1/GPR), and these need to be scaled based on our new blocking for a group so we multiply P`.x by 16 and P`.y by 4. P` gives the thread location in the 2 dimensional space of each block and G` gives the thread location within each block, so to get the absolute thread id in 2 dimensions we just add P` and G` together. 
Now, you can adjust the block size and the group size, but the optimal size is usually a multiple of the wavefront size and on HD48XX the wavefront size is 64 threads. The reason you don't want more than one wavefront in a group is that synchronization is required(i.e. barrier) to guarantee that all wavefronts in a group exit the SIMD before another wavefront from another group enters the SIMD. This barrier is not a cheap operation as it is emulated in software using something akin to a spinlock.


                                                OK, lol, one more question (whew!): What is the number of groups per row? What is a row?

                                                And how does this differ from the number of blocks per row (as you mentioned earlier in your walkthrough example)?

                                            • Compute versus Pixel, odd results
                                              MicahVillmow
                                              Ryta,
                                              1 is incorrect. Block ID is AbsTid >> 6 because your block size is still 64 threads even though it is 16x4.
                                              So by my calculations:
                                              1. Block ID: 34 >> 6 = 0
                                              2. TidInBlock: 34 & 63 = 34
                                              3a. Block ID X-dim: 0, since BlockID is 0 (but would normally be BlockID % numBlocksPerRow)
                                              3b. Block ID Y-dim: 0, since BlockID is 0 (but would normally be BlockID / numBlocksPerRow)
                                              4a. Tid/Block X-dim: TidInBlock(34) & 0xF = 2
                                              4b. Tid/Block Y-dim: TidInBlock(34) >> 4 = 2
                                              5a. Start Block X-dim: 0*16 = 0
                                              5b. Start Block Y-dim: 0*4 = 0
                                              6a. Start Thread X-dim: 0 + 2 = 2
                                              6b. Start Thread Y-dim: 0 + 2 = 2
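Micah's walkthrough can be re-traced step by step in Python (a sketch, assuming a hypothetical 256-wide domain and a 16x4 block of 64 threads):

```python
# Re-tracing the steps above for AbsTid 34, 16x4 block, 256-wide domain.
abstid = 34
block_id = abstid >> 6            # 1. 64 threads per block -> 0
tid_in_block = abstid & 63        # 2. -> 34
blocks_per_row = 256 // 16
bx = block_id % blocks_per_row    # 3a. -> 0
by = block_id // blocks_per_row   # 3b. -> 0
tx = tid_in_block & 0xF           # 4a. -> 2
ty = tid_in_block >> 4            # 4b. -> 2
x = bx * 16 + tx                  # 5a/6a. -> 2
y = by * 4 + ty                   # 5b/6b. -> 2
print(x, y)                       # 2 2
```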
                                                • Compute versus Pixel, odd results
                                                  ryta1203

                                                  Micah,

                                                     Then why do they use 4 in the lds_transpose example? l0.x is 4

                                                  • Compute versus Pixel, odd results
                                                    ryta1203

                                                     

                                                    Originally posted by: MicahVillmow Ryta, 1 is incorrect. Block ID is AbsTid >> 6 because your block size is still 64 threads even though it is 16x4. So by my calculations 1. Block ID: 34 >> 6 = 0 2. TidInBlock: 34 & 63 = 34 3a. Block ID X-dim: 0 since BlockID is 0(but would normally be BlockID % (numBlocksPerRow)) 3b. Block ID Y-dim: 0 since BlockID is 0(but would normally be BlockID / (numBlocksPerRow)) 4a. Tid/Block X-dim: TidInBlock(34) & 0xF = 2 4b. Tid/Block Y-dim: 2 >> 4 = 2 5a. Start Block X-dim: 0*4 = 0 5b. Start Block Y-dim: 0*16 = 0 6a. Start Thread X-dim: 0 + 2 = 2 6b. Start Thread Y-dim: 0 + 2 = 2


                                                    So this should work, provided numBlocksPerRow = width/16 for a 16x4 block size.

                                                     

                                                    "dcl_literal l0, 6, 63, 15, 4\n" "dcl_literal l1, 16, 0, 0, 0\n" "ishr r0.x, vaTid0.x, l0.x\n" // Block ID "and r0.y, vaTid0.x, l0.y\n" // Tid within a block "umod r0.z, r0.x, cb0[0].z\n" // Get the block id in the x direction "udiv r0.w, r0.x, cb0[0].z\n" // Get the block id in the y direction "iand r1.z, r0.y, l0.z\n" // Get the tid within a block in the x direction "ishr r1.w, r0.y, l0.w\n" // Get the tid within a block in the y direction "imul r1.x, r0.z, l0.w\n" // Get the starting X position of the block "imul r1.y, r0.w, l1.x\n" // Get the starting y position of the block "iadd r11.x, r1.x, r1.z\n" // Get the x position of the thread "iadd r11.y, r1.y, r1.w\n" // Get the y position of the thread "imad r10.x, r11.y, cb0[0].x, r11.x\n"

                                                      • Compute versus Pixel, odd results
                                                        ryta1203

                                                         

                                                        Originally posted by: ryta1203
                                                        Originally posted by: MicahVillmow Ryta, 1 is incorrect. Block ID is AbsTid >> 6 because your block size is still 64 threads even though it is 16x4. So by my calculations 1. Block ID: 34 >> 6 = 0 2. TidInBlock: 34 & 63 = 34 3a. Block ID X-dim: 0 since BlockID is 0(but would normally be BlockID % (numBlocksPerRow)) 3b. Block ID Y-dim: 0 since BlockID is 0(but would normally be BlockID / (numBlocksPerRow)) 4a. Tid/Block X-dim: TidInBlock(34) & 0xF = 2 4b. Tid/Block Y-dim: 2 >> 4 = 2 5a. Start Block X-dim: 0*4 = 0 5b. Start Block Y-dim: 0*16 = 0 6a. Start Thread X-dim: 0 + 2 = 2 6b. Start Thread Y-dim: 0 + 2 = 2


                                                        So this should work, provided numBlocksPerRow = width/16 for a 16x4 block size.

                                                         

                                                         

                                                        So this kernel does not produce the expected results... only zeroes. I'm not sure this is the method you described above.

                                                          • Compute versus Pixel, odd results
                                                            ryta1203

                                                            It appears that when I execute "iand r0.y, vAbsTidFlat.x, l0.x", the only result I ever get is zero.

                                                            Even for the simple kernel below, I still just get zero. So vAbsTidFlat.x is always zero?? What register should I be using here instead?

                                                            const char * HILKernel =
                                                            "il_cs_2_0\n"
                                                            "dcl_num_thread_per_group 64\n"
                                                            "dcl_literal l0, 0x6, 0x3F, 0xFFFF, 0x4\n"
                                                            "iand r0.y, vAbsTidFlat.x, l0.z\n"
                                                            "mov g[vAbsTidFlat.x], r0.y\n"
                                                            "ret_dyn\n"
                                                            "end\n";

                                                              • Compute versus Pixel, odd results
                                                                ryta1203

                                                                Actually, after using vAbsTidFlat and AND'ing it with a value read in from a texture, it's fine.

                                                                I only get zeroes when I use a literal. Apparently I am using literals wrong somehow; I'm not sure exactly. Even if I use a constant buffer with the same value as the literal, I get the correct results, but not when I use the literal.

                                                                  • Compute versus Pixel, odd results
                                                                    ryta1203

                                                                    Is there any way we can get an advance document on this subject? Something covering the general case with good diagrams/figures?

                                                                      • Compute versus Pixel, odd results
                                                                        ryta1203

                                                                        Micah,

                                                                          I have solved this issue... as always, it's something a lot simpler than expected... I forgot to convert the texture coordinates from int to float (the thread id is given as an int, but the texture coords need to be floats)... as soon as I did that, BAM, it worked fine. I knew I was missing something; I didn't think it was that, though.

                                                                        I believe I am using a 16x4 approach (though I'm copying a lot of the lds example and it supposedly uses an 8x8 approach).

                                                                        EDIT: Forgot why I originally posted: to say Thanks, so Thanks!
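For reference, the conversion described might look something like this in IL (a sketch only; the register numbers are illustrative, and it assumes a 2D resource/sampler pair at index 0):

```
itof r2.xy, r11.xy                        ; integer thread x/y -> float texcoords
sample_resource(0)_sampler(0) r3, r2.xy   ; 2D fetch now reads the intended texel
```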

                                                            • Compute versus Pixel, odd results
                                                              MicahVillmow
                                                              The lds_transpose example is using a hardware-specific feature to get a very high-performing transpose. The sample does 4 16x16 transposes per wavefront, so it sets the group size to 64. Each thread transposes 4 float4's, so we need to offset our thread ids within the block to a stride of 4. I looked at the doc recently and it needs to be clearer, so I'll try to clean it up a little before the next release.
                                                                • Compute versus Pixel, odd results
                                                                  ryta1203

                                                                   

                                                                  Originally posted by: MicahVillmow The lds_transpose example is using hardware specific feature to get a very high performing transpose. The sample does 4 16x16 transposes per wavefront, so sets the group size to 64. Each thread transposes 4 float4's, so we need to offset our thread id's within the block to a stride of 4. I looked at the doc recently and it needs to be more clear, so I'll try to clean it up a little before the next release.


                                                                  Also, how about some better figures... ones that show everything, similar to what the CUDA Programming Guide has to offer. Their guide clearly shows how the threads and blocks are organized, just from the figures.

                                                                • Compute versus Pixel, odd results
                                                                  MicahVillmow
                                                                  There are 4 blocks per group, so the number of groups per row is number of blocks per row / 4. A row is the width of the 2D texture + the height of a group. So if you had a 64x64 texture, there would be 4 blocks per row, since each block is 16x16 and there would be 1 group per row since each group is 4 blocks.
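The row arithmetic above can be checked with a quick sketch (hypothetical names) for the 64x64 texture case with 16x16 blocks and 4 blocks per group:

```python
# Checking the row arithmetic: 64x64 texture, 16x16 blocks, 4 blocks/group.
tex_width = 64
block_width = 16
blocks_per_group = 4
blocks_per_row = tex_width // block_width            # 4
groups_per_row = blocks_per_row // blocks_per_group  # 1
print(blocks_per_row, groups_per_row)                # 4 1
```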
                                                                    • Compute versus Pixel, odd results
                                                                      ryta1203

                                                                       

                                                                      Originally posted by: MicahVillmow There are 4 blocks per group, so the number of groups per row is number of blocks per row / 4. A row is the width of the 2D texture + the height of a group. So if you had a 64x64 texture, there would be 4 blocks per row, since each block is 16x16 and there would be 1 group per row since each group is 4 blocks.


                                                                      So if you were going with an 8x8 (block size) arrangement then you would have:

                                                                      8 blocks per group

                                                                      2 Groups per row

                                                                      A row would be 64+8=72

                                                                      8 blocks per row

                                                                      ??

                                                                       

                                                                    • Compute versus Pixel, odd results
                                                                      MicahVillmow
                                                                      So, if you were still doing 4 blocks per group, and 8x8 block size, then you would have 2 groups per row. But if you kept it at 1 group = 4 * 16x16, then you would have 8 blocks per group and 1 group per row.
                                                                        • Compute versus Pixel, odd results
                                                                          ryta1203

                                                                          Is it possible to do 8 blocks per group with an 8x8 block size? That makes sense to me, since it would still fit in a row (1 group per row).

                                                                          And by your general description, you don't need the absolute thread id to start with at all?

                                                                          You just need to know the group size, the block size you want, the groups per row, and the number of groups. Are you flip-flopping the terms group and block? The transpose uses Block ID and Thread ID a lot, and you haven't really used either; you also haven't needed the absolute thread id to calculate the 2D absolute thread id.