
Compute versus Pixel, odd results
MicahVillmow Aug 12, 2009 5:49 PM (in response to ryta1203)Ryta,
The most likely culprit here is how you are hitting the caches with the texture fetches. If you block the CS group accesses so they match the PS pattern, you should get equivalent or better results. Right now the CS is requesting data in a 64x1 manner, but a better method would be to request it in a 16x4 or 8x8 manner.
Compute versus Pixel, odd results
ryta1203 Aug 12, 2009 6:02 PM (in response to MicahVillmow)I get 0 for the cache counter in pixel shader mode and a little more than that in CS mode (~3.0), which should be the percentage of hits, so my pixel shader kernels aren't getting any cache hits.
If I use an 8x8 manner, how does that affect my domain calculation?
I hope the docs get updated to include compute shader mode in v2.0 of the GPU guide. I'm just going off of the samples right now, and it looks like they all use 64x1, unless I'm misunderstanding you. Is that right?


Compute versus Pixel, odd results
MicahVillmow Aug 12, 2009 6:16 PM (in response to ryta1203)Yes, they only use 64x1. However, if you look at the performance of compute_matmult versus simple_matmult over the whole range (64x64 up to 4Kx4K), you will see a point where compute_matmult just dies while simple_matmult keeps performing well. This is where the caching structure helps.
Compute versus Pixel, odd results
ryta1203 Aug 12, 2009 6:57 PM (in response to MicahVillmow)Micah,
Let me rephrase my question:
How do I implement the 8x8 manner (or anything different from 64x1)?


Compute versus Pixel, odd results
MicahVillmow Aug 12, 2009 7:08 PM (in response to ryta1203)The only sample I know of that does this in any similar manner is lds_transpose, which breaks it into a strided 8x8.
Compute versus Pixel, odd results
ryta1203 Aug 13, 2009 10:06 PM (in response to MicahVillmow)Micah,
Looking at the lds_transpose, it's not easy to tell where this happens? Are you refering to when the x/y dimensions are calculated for texture fetches?
Is there any documentation that talks about this?


Compute versus Pixel, odd results
MicahVillmow Aug 13, 2009 10:11 PM (in response to ryta1203)ATI CAL 1.4.0_beta\samples\app\lds_transpose\docs
is where the doc should reside in your installation.
Compute versus Pixel, odd results
ryta1203 Aug 13, 2009 10:22 PM (in response to MicahVillmow)Micah,
Thanks. It's very hard from the documentation to abstract this idea out, particularly since so little is known about how the cache works or the memory heirarchy of the ATI GPUs.
I don't suppose ATI/AMD will be letting us have this information anytime soon!?


Compute versus Pixel, odd results
MicahVillmow Aug 13, 2009 10:32 PM (in response to ryta1203)Well, the basics are just doing a 1D->2D transform, then a 2D->2D transform, and then another 2D->1D transform.
So, say you have a 64x1x1 group size and you want to transform it into 16x4. The easiest way to do this is to calculate x = vTidInGrpFlat & 0x3 and y = vTidInGrpFlat >> 2; this gives you the x and y positions within any group.
Now, to transform your group id correctly, you need to find out how many groups fit in a row. You have width, and dividing it by 16 gives the max number of groups per row.
So vGrpIdFlat % (width >> 4) gives you the thread's group id in the x dimension and vGrpIdFlat / (width >> 4) gives you the thread's group id in the y dimension.
Now, in order to get your correct x/y location, you need to multiply your x group location by 16 and your y group location by 4 (since your group size is 16x4) and then add in your position within the group.
This should give you an x/y location for each thread in a blocking pattern similar to what the LDS_Transpose doc does, and getting your new AbsTid is as simple as y * width + x.
There might be some logic problems with the math, but if you write it out on a sheet of paper it becomes a lot easier to understand and you can probably catch any errors.
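As Micah says, writing the index math out helps. Here is a rough Python sketch of the 64x1 -> 16x4 remapping described above. The names (`tid_in_grp_flat`, `grp_id_flat`, `width`) are illustrative stand-ins for the IL registers vTidInGrpFlat and vGrpIdFlat and the constant-buffer width, not real APIs. Note that a later post in this thread corrects the in-group constants (& 0x3 / >> 2 would give a 4x16 block); the sketch below uses the corrected 16x4 form (& 0xF / >> 4).

```python
def remap_16x4(tid_in_grp_flat, grp_id_flat, width):
    """Map a flat 64-thread group onto a 16x4 tile of a 2D domain."""
    # Position within a 16x4 group (corrected constants: 16 wide, 4 tall).
    x_in_grp = tid_in_grp_flat & 0xF   # 0..15
    y_in_grp = tid_in_grp_flat >> 4    # 0..3
    # How many 16-wide groups fit across one row of the domain.
    groups_per_row = width >> 4
    grp_x = grp_id_flat % groups_per_row
    grp_y = grp_id_flat // groups_per_row
    # Scale the group coordinates by the group dimensions, then add
    # the position within the group.
    x = grp_x * 16 + x_in_grp
    y = grp_y * 4 + y_in_grp
    abs_tid = y * width + x            # new AbsTid, as in the post above
    return x, y, abs_tid
```

For a 256-wide domain, thread 17 of group 0 lands at (1, 1), and group 16 starts a new row of groups four pixels down, which is exactly the spatial blocking the caches want.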
Compute versus Pixel, odd results
riza.guntur Aug 17, 2009 4:52 AM (in response to MicahVillmow)MicahVillmow, does that mean the performance of a Brook+ compute shader will never come near that of CAL's compute shader, since Brook+ doesn't support 2D or 3D group sizes?

Compute versus Pixel, odd results
godsic Aug 17, 2009 3:21 PM (in response to riza.guntur)Sometimes I can't understand AMD. Why do you include capabilities even when they are slow and inefficient?
My experience with OpenGL (CAL is just a simplified API version of OpenGL) shows the true face of the GPU:
1. Try to avoid texture fetching in shaders (kernels) because of TMU limitations and inefficient samplers.
2. NEVER USE INT TYPES!!!!!! NEVER!!!! Only 1 of the 5 scalar units can perform INT operations. The same rule applies to some FP operations (read the R600 doc carefully).
3. Try to avoid flow control in shaders; the number of flow-control units is limited. If you need flow control, use it carefully: at least 16 pixels (elements) must be coherent.
AND THE MAIN RULE: VECTORIZE all operations to 4 components.
ALWAYS USE THE NATIVE TYPE FLOAT4!!! FOR CONSTS, RESOURCES, etc.

Compute versus Pixel, odd results
Gipsel Aug 17, 2009 4:18 PM (in response to godsic)Originally posted by: godsic
2. NEVER USE INT TYPES!!!!!! NEVER!!!! Only 1 from 5 scalar units can perform INT operations. Same rule for some FP operations (read R600 doc carefuly)
Not entirely true. On RV7x0, integer multiplications are handled by the t unit; additions, binary operations, and shifts can be handled by all ALU units. RV6x0 restricts shifts to the t unit as well.
And yes, I have here some code using more than 4 of the 5 slots on average exclusively with integer operations.

Compute versus Pixel, odd results
godsic Aug 17, 2009 5:54 PM (in response to Gipsel)Maybe; I am an HD2600XT user and all my information relates to the R6xx GPU series. But the other points are probably still valid even on the R7xx architecture.
AMD?

Compute versus Pixel, odd results
ryta1203 Aug 19, 2009 3:01 PM (in response to godsic)Micah,
Since it's obvious this information and layout is vital to getting good compute shader performance, is this going to be included in the upcoming documentation?
It's still a little fuzzy for me why this helps.





Compute versus Pixel, odd results
ryta1203 Aug 19, 2009 5:43 PM (in response to MicahVillmow)Originally posted by: MicahVillmow Well, the basics are just doing a 1D>2D transform and then a 2D>2D transform and then do another 2D>1D transform. [...]
Once you calculate the x and y positions, why can't you just sample off of these? Do you really need to calculate the AbsTid since the sample is 2D?

Compute versus Pixel, odd results
ryta1203 Aug 20, 2009 2:54 PM (in response to MicahVillmow)Originally posted by: MicahVillmow Well, the basics are just doing a 1D>2D transform and then a 2D>2D transform and then do another 2D>1D transform. [...]
Micah,
Sorry, a few more questions:
1. Why are you ANDing by 3?
2. What are "rows"?
3. Can you adjust the block size, not just the group size?
4. What do you mean by "add in your position within the group"? How do we find this?
Thanks.


Compute versus Pixel, odd results
MicahVillmow Aug 19, 2009 3:25 PM (in response to ryta1203)Ryta,
We are always trying to improve our documentation with the feedback that we receive but I can't give any specifics on what does/does not make it into future releases. 
Compute versus Pixel, odd results
MicahVillmow Aug 19, 2009 5:52 PM (in response to ryta1203)The read is based off of the X/Y positions, but the write is based off of the new AbsTid.
Compute versus Pixel, odd results
ryta1203 Aug 19, 2009 5:56 PM (in response to MicahVillmow)Ok, this makes sense, thanks.

Compute versus Pixel, odd results
ryta1203 Aug 20, 2009 4:52 PM (in response to MicahVillmow)Originally posted by: MicahVillmow The read is based off of the X/Y positions, but the write is based off of the new AbsTid.
Micah,
Does this look right to you:
Given a domain size of 256x256 and a Group Size of 16x4x1 (instead of 64x1x1).
Accessing AbsTid 34 of 64x1x1 is:
1. Block ID: 34 >> 4 = 2
2. Tid/Block: 34 AND 15 = 2
3a. Block ID Xdim: 2 MOD (256/16) = 2
3b. Block ID Ydim: 2/16 = 0
4a. Tid/Block Xdim: 2 AND 3 = 2
4b. Tid/Block Ydim: 2 >> 2 = 0
5a. Start Block Xdim: 2*4 = 8
5b. Start Block Ydim: 0*16 = 0
6a. Start Thread Xdim: 8
6b. Start Thread Ydim: 0 + 2 = 2
So I'm accessing coordinates 8, 2 for this problem? Why doesn't this look right to me?
EDIT: Sorry, that should be AbsTid 34, not 32; fixed.


Compute versus Pixel, odd results
MicahVillmow Aug 20, 2009 7:44 PM (in response to ryta1203)Ryta,
Actually I got my rows and columns backward, AND'ing by 3 would give you a 4x16 block instead of a 16x4 block.
So, let me go into a little more detailed explanation that I hope will answer your questions. This is in a similar vein to the LDS_transpose doc, but should be more general.
Given an execution domain of NxM and a group size G, we can execute CEIL((N*M)/G) groups, represented as P.
So, in the 1D space we have group IDs from 0 to (P-1), thread IDs from 0 to (G-1), and AbsThreadIds from 0 to (N*M - 1).
Now the problem is that this linear progression of AbsThreadIDs from 0 to (N*M - 1) does not fit well into the 2D caching structure of our graphics chips, so we want to block our groups to get spatial locality in our memory accesses.
So to get better cache performance, we would then block our groups into a more compact structure, in this case 16x4.
The restriction this imposes is that N must be evenly divisible by 16 and M must be evenly divisible by 4.
So, the first step is to transform ThreadInGrpID from a single dimension of 0 to (G-1) into two dimensions, 0 to 15 and 0 to 3, called G`.
The easiest way to do this is to calculate it via the following:
G`.x = G & 0xF and G`.y = G >> 4
Now we have to modify P so that our group id is also in two dimensions. In order to do this we pass down via constant buffers the number of groups per row (GPR) and 1/GPR, and this calculation gives us P`.
P`.x = P % GPR and P`.y = P / GPR (computed via P * (1/GPR)); these then need to be scaled by our new group blocking, so we multiply P`.x by 16 and P`.y by 4.
P` gives the thread location in the 2 dimensional space of each block and G` gives the thread location within each block, so to get the absolute thread id in 2 dimensions we just add P` and G` together.
Now, you can adjust the block size and the group size, but the optimal size is usually a multiple of the wavefront size, and on HD48XX the wavefront size is 64 threads. The reason you don't want more than one wavefront in a group is that synchronization (i.e., a barrier) is required to guarantee that all wavefronts in a group exit the SIMD before a wavefront from another group enters the SIMD. This barrier is not a cheap operation, as it is emulated in software using something akin to a spinlock.
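The G`/P` construction above can be sketched in Python as a sanity check (the names and the `block_w`/`block_h` parameters are illustrative, not CAL identifiers; the x coordinate uses a modulo by GPR, consistent with Micah's earlier walkthrough). A quick way to convince yourself the math is right is to confirm the mapping is a permutation of the domain:

```python
def blocked_abs_tid(flat_tid, n, m, block_w=16, block_h=4):
    """Map a flat thread id over an NxM domain to a blocked (x, y).

    Assumes n % block_w == 0 and m % block_h == 0, per the
    divisibility restriction stated above.
    """
    group_size = block_w * block_h
    gpr = n // block_w                 # groups per row (GPR)
    g = flat_tid % group_size          # ThreadInGrpID, 0..(G-1)
    p = flat_tid // group_size         # flat group id, 0..(P-1)
    gx, gy = g % block_w, g // block_w # G`: position within the group
    px, py = p % gpr, p // gpr         # P`: group position in 2D
    # Scale P` by the group dimensions and add the in-group position.
    return px * block_w + gx, py * block_h + gy

# Every flat thread id lands on a unique (x, y): the mapping is a
# permutation of the 64x16 domain.
n, m = 64, 16
coords = {blocked_abs_tid(t, n, m) for t in range(n * m)}
assert len(coords) == n * m
```

Thread 63 of group 0, for instance, lands at (15, 3): the far corner of the first 16x4 tile rather than 63 pixels along one row.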
Compute versus Pixel, odd results
ryta1203 Sep 3, 2009 3:22 PM (in response to MicahVillmow)Originally posted by: MicahVillmow Ryta, Actually I got my rows and columns backward, AND'ing by 3 would give you a 4x16 block instead of a 16x4 block. [...]
OK, lol, one more question (whew!): What is the number of groups per row? What is a row?

Compute versus Pixel, odd results
ryta1203 Sep 3, 2009 4:54 PM (in response to ryta1203)Originally posted by: ryta1203
Originally posted by: MicahVillmow Ryta, Actually I got my rows and columns backward, AND'ing by 3 would give you a 4x16 block instead of a 16x4 block. [...]
OK, lol, one more question (whew!): What is the number of groups per row? What is a row?
And how does this differ from the number of blocks per row (as you mentioned earlier in your walkthrough example)?



Compute versus Pixel, odd results
MicahVillmow Aug 20, 2009 7:57 PM (in response to ryta1203)Ryta,
1 is incorrect. Block ID is AbsTid >> 6 because your block size is still 64 threads even though it is 16x4.
So by my calculations
1. Block ID: 34 >> 6 = 0
2. TidInBlock: 34 & 63 = 34
3a. Block ID Xdim: 0 since BlockID is 0 (but would normally be BlockID % numBlocksPerRow)
3b. Block ID Ydim: 0 since BlockID is 0 (but would normally be BlockID / numBlocksPerRow)
4a. Tid/Block Xdim: TidInBlock(34) & 0xF = 2
4b. Tid/Block Ydim: TidInBlock(34) >> 4 = 2
5a. Start Block Xdim: 0*4 = 0
5b. Start Block Ydim: 0*16 = 0
6a. Start Thread Xdim: 0 + 2 = 2
6b. Start Thread Ydim: 0 + 2 = 2
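Micah's corrected walkthrough for AbsTid 34 can be checked mechanically; this short Python sketch (variable names are illustrative) reproduces each numbered step:

```python
abs_tid = 34
# 1. Block ID: 64 threads per block even though it is shaped 16x4.
block_id = abs_tid >> 6          # 34 >> 6 = 0
# 2. Thread id within the block.
tid_in_block = abs_tid & 63      # 34 & 63 = 34
# 3a/3b. Block 0, so its x/y block coordinates (and hence the block's
# starting x/y offsets in steps 5a/5b) are all 0.
# 4a/4b. Position within the 16x4 block.
x_in_block = tid_in_block & 0xF  # 34 & 15 = 2
y_in_block = tid_in_block >> 4   # 34 >> 4 = 2
# 6a/6b. Block start (0, 0) plus the in-block position.
x, y = 0 + x_in_block, 0 + y_in_block
assert (x, y) == (2, 2)
```

So thread 34 reads from coordinates (2, 2), not (8, 2) as in the earlier attempt.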
Compute versus Pixel, odd results
ryta1203 Aug 21, 2009 4:51 PM (in response to MicahVillmow)Micah,
Then why do they use 4 in the lds_transpose example? l0.x is 4

Compute versus Pixel, odd results
ryta1203 Sep 11, 2009 12:58 PM (in response to MicahVillmow)Originally posted by: MicahVillmow Ryta, 1 is incorrect. Block ID is AbsTid >> 6 because your block size is still 64 threads even though it is 16x4. [...]
So this should work, provided numBlocksPerRow = width/16 for a 16x4 block size.
"dcl_literal l0, 6, 63, 15, 4\n"
"dcl_literal l1, 16, 0, 0, 0\n"
"ishr r0.x, vaTid0.x, l0.x\n"            // Block ID
"and r0.y, vaTid0.x, l0.y\n"             // Tid within a block
"umod r0.z, r0.x, cb0[0].z\n"            // Get the block id in the x direction
"udiv r0.w, r0.x, cb0[0].z\n"            // Get the block id in the y direction
"iand r1.z, r0.y, l0.z\n"                // Get the tid within a block in the x direction
"ishr r1.w, r0.y, l0.w\n"                // Get the tid within a block in the y direction
"imul r1.x, r0.z, l0.w\n"                // Get the starting x position of the block
"imul r1.y, r0.w, l1.x\n"                // Get the starting y position of the block
"iadd r11.x, r1.x, r1.z\n"               // Get the x position of the thread
"iadd r11.y, r1.y, r1.w\n"               // Get the y position of the thread
"imad r10.x, r11.y, cb0[0].x, r11.x\n"

Compute versus Pixel, odd results
ryta1203 Sep 15, 2009 4:12 PM (in response to ryta1203)Originally posted by: ryta1203
Originally posted by: MicahVillmow Ryta, 1 is incorrect. Block ID is AbsTid >> 6 because your block size is still 64 threads even though it is 16x4. [...]
So this should work, provided numBlocksPerRow = width/16 for a 16x4 block size.
So this kernel does not produce the expected results... only zeroes. I'm not sure this is the method you described above.

Compute versus Pixel, odd results
ryta1203 Sep 15, 2009 4:25 PM (in response to ryta1203)It appears that for "iand r0.y, vAbsTidFlat.x, l0.x" the only result I ever get is zero.
Even for the simple kernel below, I still just get zero. So vAbsTidFlat.x is always zero?? What register should I be using here instead?
const char * HILKernel =
"il_cs_2_0\n"
"dcl_num_thread_per_group 64\n"
"dcl_literal l0, 0x6, 0x3F, 0xFFFF, 0x4\n"
"iand r0.y, vAbsTidFlat.x, l0.z\n"
"mov g[vAbsTidFlat.x], r0.y\n"
"ret_dyn\n"
"end\n";

Compute versus Pixel, odd results
ryta1203 Sep 15, 2009 4:57 PM (in response to ryta1203)Actually, after using vAbsTidFlat and ANDing it with a value read in from a texture, it's fine.
I only get zeroes when I use a literal. Apparently I'm using the literals wrong somehow; I'm not sure exactly. Even if I use a constant buffer with the same value as the literal I get the correct results, but not when I use the literal.

Compute versus Pixel, odd results
ryta1203 Sep 15, 2009 5:20 PM (in response to ryta1203)Is there any way we can get an advance document on this subject? Something covering the general case, with good diagrams/figures?

Compute versus Pixel, odd results
ryta1203 Sep 16, 2009 4:45 PM (in response to ryta1203)Micah,
I have solved this issue... as always, it was something a lot simpler than expected: I forgot to convert the texture coordinates from int to float (the thread id is given as an int, but the texture coords need to be floats). As soon as I did that, BAM, it worked fine. I knew I was missing something; I just didn't think it was that.
I believe I am using a 16x4 approach (though I'm copying a lot of the lds example and it supposedly uses an 8x8 approach).
EDIT: Forgot why I originally posted: to say Thanks, so Thanks!







Compute versus Pixel, odd results
MicahVillmow Aug 21, 2009 4:54 PM (in response to ryta1203)The lds_transpose example uses a hardware-specific feature to get a very high-performing transpose. The sample does 4 16x16 transposes per wavefront, so it sets the group size to 64. Each thread transposes 4 float4's, so we need to offset our thread ids within the block with a stride of 4. I looked at the doc recently and it needs to be clearer, so I'll try to clean it up a little before the next release.
Compute versus Pixel, odd results
ryta1203 Sep 3, 2009 4:44 PM (in response to MicahVillmow)Originally posted by: MicahVillmow The lds_transpose example is using hardware specific feature to get a very high performing transpose. [...]
Also, how about some better figures: ones that show everything, similar to what the CUDA Programming Guide offers. Their guide shows very clearly, from the figures alone, how the threads and blocks are organized.


Compute versus Pixel, odd results
MicahVillmow Sep 3, 2009 4:59 PM (in response to ryta1203)There are 4 blocks per group, so the number of groups per row is number of blocks per row / 4. A row is the width of the 2D texture + the height of a group. So if you had a 64x64 texture, there would be 4 blocks per row, since each block is 16x16 and there would be 1 group per row since each group is 4 blocks.
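The arithmetic in Micah's 64x64 example above can be written out directly (a trivial sketch; the variable names are illustrative):

```python
tex_w = 64                 # width of the 2D texture
block_w = 16               # each lds_transpose block is 16x16
blocks_per_group = 4       # 4 blocks make up one group in this sample

blocks_per_row = tex_w // block_w                 # 64 / 16 = 4
groups_per_row = blocks_per_row // blocks_per_group  # 4 / 4 = 1

assert (blocks_per_row, groups_per_row) == (4, 1)
```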
Compute versus Pixel, odd results
ryta1203 Sep 3, 2009 5:23 PM (in response to MicahVillmow)Originally posted by: MicahVillmow There are 4 blocks per group, so the number of groups per row is number of blocks per row / 4. A row is the width of the 2D texture + the height of a group. So if you had a 64x64 texture, there would be 4 blocks per row, since each block is 16x16 and there would be 1 group per row since each group is 4 blocks.
So if you were going with an 8x8 (block size) arrangement then you would have:
8 blocks per group
2 Groups per row
A row would be 64+8=72
8 blocks per row
??


Compute versus Pixel, odd results
MicahVillmow Sep 3, 2009 5:28 PM (in response to ryta1203)So, if you were still doing 4 blocks per group, and 8x8 block size, then you would have 2 groups per row. But if you kept it at 1 group = 4 * 16x16, then you would have 8 blocks per group and 1 group per row.
Compute versus Pixel, odd results
ryta1203 Sep 3, 2009 6:13 PM (in response to MicahVillmow)Is it possible to do 8 blocks per group with an 8x8 block size? That makes sense to me, since it would still fit in a row (1 group per row).
And by your general description, you don't need the absolute thread id to start with at all?
You just need to know the group size, the block size you want, the groups per row, and the number of groups. Are you flip-flopping the terms group and block? The transpose uses Block ID and Thread ID a lot, and you haven't really used either; you also haven't needed to access the absolute thread id to calculate the 2D absolute thread id.
