7 Replies Latest reply on Aug 6, 2009 1:38 PM by hagen

    how to  locate the thread in a kernel



      1. How do I request a given number of threads? By setting the domain size?

         Taking cal_idct as an example, look at the code below:

          // Setup a computation domain

          g_calDomain3D.width = Info.Width; // assume 256

          g_calDomain3D.height = Info.Height; // assume 256

          g_calDomain3D.depth = 1;


      Does this request 256 x 256 threads?


      2. As we know, in CAL we can organize threads into thread groups (also called blocks), and organize thread groups into a grid, as in the code below (from the cal_idct example):

      CALevent event = 0;

          g_calProgramGrid.func             = g_calFunc;

          g_calProgramGrid.flags            = 0;

          g_calProgramGrid.gridBlock.width  = 64; //needs to be = thread group size as given in IL kernel.

          g_calProgramGrid.gridBlock.height = 1;

          g_calProgramGrid.gridBlock.depth  = 1;

          g_calProgramGrid.gridSize.width   = (g_calDomain3D.width * g_calDomain3D.height + 

      g_calProgramGrid.gridBlock.width - 1) / 


          g_calProgramGrid.gridSize.height  = 1;

          g_calProgramGrid.gridSize.depth   = 1;


      In the IL kernel, we can get the absolute thread id through vaTid. What confuses me is that in this example every thread processes an 8x8 block, so processing the entire matrix should need only 256 * 256 / 64 threads, which conflicts with the total number of threads requested above. If each thread processed only one element of the matrix, we would need 256 * 256 threads. What is the correct way to understand this? Does the organization of the threads determine the work each thread does, or is it the other way around, or something else?
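To make the arithmetic in question 2 concrete, here is a minimal C sketch of the launch-size calculation. It assumes that the divisor cut off in the gridSize.width snippet above is the group width (the usual round-up division); that is a guess, not something the post confirms. `groups_needed` and `total_threads` are hypothetical helper names and are not part of the CAL API.

```c
#include <assert.h>

/* Hypothetical helper, not part of CAL: number of thread groups needed to
   cover `work_items` when each group holds `group_size` threads.
   Rounds up so that a partial last group is still launched. */
unsigned groups_needed(unsigned work_items, unsigned group_size)
{
    assert(group_size > 0);
    return (work_items + group_size - 1) / group_size;
}

/* Total threads actually launched: number of groups times threads per group. */
unsigned total_threads(unsigned work_items, unsigned group_size)
{
    return groups_needed(work_items, group_size) * group_size;
}
```

For the numbers in the post, `groups_needed(256 * 256, 64)` gives 1024 groups, i.e. 65536 threads in total. By the same arithmetic, one thread per 8x8 block of a 256 x 256 matrix would need only 1024 threads, or 16 groups of 64, which is the mismatch being asked about.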


      3. Now consider a Brook+ application. In Brook+, we can set the domain size explicitly through the kernel interface, or let it default to the size of the output stream. Does the domain size likewise mean the total number of threads?

      I find that in Brook+ we can only set the domain size through the kernel interface and the thread group size with the Attribute keyword; we cannot organize the thread groups into a grid. We can get the thread id within a group with the instanceInGroup() function, but how can we get the absolute thread id or the thread group id in the kernel? If I want a thread to process more than one element, e.g. an 8x8 block, I need that thread id information.



        • how to  locate the thread in a kernel

          In Brook+, the number of threads is determined by the size of the streams.  You don't have to define it explicitly.  The Stream Computing User Guide explains the concept of streams.  (And domain size is not the way to do it.)

          The function instance() gives you the thread id.  (See user guide sect. 2.9)

            • how to  locate the thread in a kernel

              I don't think instance() returns the thread id. Did you notice the code in the .hlsl file generated by brcc?  See below:


              int4 __getOutputIndex(int pos, int4 outStream, int4 outBuffer)
              {
                  int4 outIndex = int4(0, 0, 0, 0);
                  int2 intPos = int2(pos, 1);
                  int index = intPos.x;
                  int height = index / outStream.x;
                  outIndex.x = index - height * outStream.x;
                  outIndex.z = height / outStream.y;
                  outIndex.y = height - outIndex.z * outStream.y;

                  return outIndex;
              }



              struct csThreadInfo
              {
                  int tid : SV_RelThreadId;
                  int atid : SV_AbsThreadId;
                  int gid : SV_ThreadGroupId;
              };


              [NumThreads(64), LocalDataShare(0), LocalDataShareRel]


              main (csThreadInfo __threadInfo)


              int __instanceInGroup = __threadInfo.tid;

              int4 __indexof_c;

              int4 __indexofoutput;

              __indexofoutput = __getOutputIndex(__threadInfo.atid, __outputStreamShape, __outputBufferShape);

              __indexof_c = __indexofoutput;



              __instanceInGroup );



              It seems that instance() is calculated from the tid value in the csThreadInfo struct.

               The documentation says that instance() returns the index of the element the kernel is currently being mapped over (user guide A.5). Does that element index mean the thread id? I think the element here means one unit of the output stream. If so, a thread can process only one element. But sometimes a thread processes a block of elements, as in the cal_idct example. How, then, should I understand the relation between thread id and element index? I am not quite sure I got it. A little confused :-S
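To see what the generated helper does with a thread id, here is a plain-C port of `__getOutputIndex` from the excerpt above, offered as a sketch only. `Int4` and `get_output_index` are hypothetical names standing in for the HLSL `int4` type and the generated function, and the `outBuffer` argument is dropped because the excerpt never reads it. Note that in the excerpt it is the absolute id (`atid`) that is passed to `__getOutputIndex`, while `tid` only feeds `__instanceInGroup`.

```c
#include <assert.h>

/* Plain-C stand-in for the HLSL int4 type. */
typedef struct { int x, y, z, w; } Int4;

/* C port of the generated __getOutputIndex: turns a linear absolute thread id
   into an (x, y, z) element index for an output stream that is out_w
   elements wide and out_h elements high. */
Int4 get_output_index(int abs_tid, int out_w, int out_h)
{
    assert(out_w > 0 && out_h > 0);
    Int4 idx = {0, 0, 0, 0};
    int row = abs_tid / out_w;      /* row counted across all z slices */
    idx.x = abs_tid - row * out_w;  /* column within the row */
    idx.z = row / out_h;            /* slice index */
    idx.y = row - idx.z * out_h;    /* row within the slice */
    return idx;
}
```

With a 16 x 16 output stream, absolute thread id 17 maps to element (x = 1, y = 1, z = 0), so in this mapping each absolute thread id corresponds to exactly one output element.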

                • how to  locate the thread in a kernel

                  In a Brook+ kernel, each element of the stream is mapped to a thread.  instance() returns the id of that element, which is also the thread id.

                  If you want each thread to process several input elements, you need to pass them into the kernel as several input streams, or alternatively use a gather stream.

                  Are you writing the kernel in Brook+ or in CAL/IL?  If you are coding in Brook+, why look at the HLSL?

                    • how to  locate the thread in a kernel

                      I was just confused about the relation between the thread id and the element index. Consider adding two 16x16 matrices to get another 16x16 matrix. If we pass a 16x16 stream to the kernel, we request 16x16 threads to do the addition. If each thread in the kernel processes one element, we do need all those threads, and the thread id is the same as the element index. But if we make a thread process two elements and write both to the output stream, only half of those threads are needed, aren't they?

                      We get the thread id through instance(), which is also the element index. In this case, what happens to the extra threads? Or do the extra threads I mean not exist at all? I am new to this; can you help me figure it out?
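The bookkeeping behind this question can be sketched on the CPU in plain C. This is not Brook+ code and does not model how Brook+ schedules threads; `add_thread` and `add_blocked` are hypothetical names, and the "launch" is just a serial loop. The point is only that when the thread count is derived from the work per thread, thread id and element index stop being the same number.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one hypothetical "thread": adds `per_thread` consecutive
   elements starting at element index tid * per_thread. */
static void add_thread(size_t tid, size_t per_thread, size_t n,
                       const float *a, const float *b, float *c)
{
    size_t first = tid * per_thread;
    for (size_t k = 0; k < per_thread && first + k < n; ++k)
        c[first + k] = a[first + k] + b[first + k];
}

/* Serial stand-in for a kernel launch: runs every thread in turn and
   returns how many threads were needed to cover n elements. */
size_t add_blocked(size_t n, size_t per_thread,
                   const float *a, const float *b, float *c)
{
    assert(per_thread > 0);
    size_t threads = (n + per_thread - 1) / per_thread;
    for (size_t tid = 0; tid < threads; ++tid)
        add_thread(tid, per_thread, n, a, b, c);
    return threads;
}
```

For a 16x16 matrix (256 elements), `add_blocked(256, 2, a, b, c)` uses 128 threads and `add_blocked(256, 1, a, b, c)` uses 256; in the first case thread `tid` owns elements `2 * tid` and `2 * tid + 1`.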

                        • how to  locate the thread in a kernel

                          Doing one element per thread is the right way to do it with streams and will give you the best performance.

                          If you really want to do 2 elements per thread, you will need to pass the whole matrices in as gather streams, and the Brook+ code quickly becomes more complex. There is no reason to do this, and you will surely see a performance hit.

                          See the sample code in atibrook/samples/CPP/apps/SimpleMatMul for how to work with gather streams.

                            • how to  locate the thread in a kernel

                              I have read SimpleMatMul sample before and I know how to work with gather stream. 

                              You mean that doing one element per thread gives the best performance? Since we have to access the same data to compute different elements of the result, I think that if a thread fetches the data once and produces several results, so that each thread does more work, performance should be better. What do you think?

                              Did you read the sample code in  ATI CAL1.4.0_beta\samples\app\cal_idct? In the kernel string, each thread processes a block of elements and produces a block of outputs, and it performs better than the one-element-per-thread version I wrote in Brook+.

                              The following is the part of the code in CAL IDCT kernel:


                               // save 8x8 DCT coefficient block location

                               "ishl r16.x, vaTid.x, l8.w\n"


                               // load packed 8x8 DCT coefficients using texture cache

                               "mov  r0, g[r16.x+0]\n" 

                               "mov  r2, g[r16.x+1]\n" 

                               "mov  r4, g[r16.x+2]\n" 

                               "mov  r6, g[r16.x+3]\n" 

                               "mov  r8, g[r16.x+4]\n" 

                               "mov r10, g[r16.x+5]\n" 

                               "mov r12, g[r16.x+6]\n" 

                               "mov r14, g[r16.x+7]\n" 

                              The code above first gets the absolute thread id and then maps it to an 8x8 block, which is processed later. So I wonder whether I can do the same thing in Brook+.
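The index arithmetic in that IL excerpt can be written out in C as a sketch. The value of the literal `l8.w` is not shown in the excerpt, so the shift amount is a parameter here; using 3 (eight `g[]` entries per thread, matching the eight `mov` loads) is an assumption. `block_base` and `block_rows` are hypothetical names.

```c
#include <assert.h>

enum { ROWS_PER_BLOCK = 8 };  /* eight g[] loads per thread in the excerpt */

/* Mirrors "ishl r16.x, vaTid.x, l8.w": shifts the absolute thread id left to
   get the first g[] index of this thread's block. `shift` stands in for
   l8.w, whose value is not shown in the excerpt. */
unsigned block_base(unsigned abs_tid, unsigned shift)
{
    assert(shift < 32);
    return abs_tid << shift;
}

/* Fills out[0..7] with the g[] indices that one thread reads
   (base + 0 through base + 7). */
void block_rows(unsigned abs_tid, unsigned shift, unsigned out[ROWS_PER_BLOCK])
{
    unsigned base = block_base(abs_tid, shift);
    for (unsigned i = 0; i < ROWS_PER_BLOCK; ++i)
        out[i] = base + i;
}
```

Under the assumed shift of 3, thread 5 would read `g[40]` through `g[47]` and thread 6 would start at `g[48]`, so consecutive threads own consecutive, non-overlapping blocks.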

                                • how to  locate the thread in a kernel

                                  Yes, for it to pay off you need to do a fair number of operations per memory fetch, so a matrix addition kernel doing one element per thread isn't going to be a high performer in the first place.  I assumed you would do more computation per element.  But if the amount of work per element is really small, you can certainly block the elements and define your own access pattern.