2 Replies Latest reply on Jun 26, 2010 4:10 AM by ztatsuch

    Compute shader mode by 4870


      I have a question.
      My video card is a single-GPU 4870, so it has 10 Compute Units with 16 Stream Cores per Compute Unit.
      Now, consider the IL declarations below:
      "dcl_num_thread_per_group 64\n"
      "dcl_lds_size_per_thread 16\n"
      If each thread needs 1 Stream Core, then "64 threads / 1 group" means "4 Compute Units / 1 group".
      In my case, that means only 2 groups can run at the same time. Do the remaining 2 Compute Units sit idle during execution
      (especially in a case where the LDS is used within a group)?

      Should I change "dcl_num_thread_per_group 64\n" to "dcl_num_thread_per_group 32\n" so that 5 groups (10 Compute Units) can run at the same time?

        • Compute shader mode by 4870

          "4 Compute Units/1 Group" is not accurate.

          There are 10 compute units, each with 16 thread processors (or whatever they are calling them these days, it's too hard to keep track), and each thread processor (TP) is a 5-wide VLIW processor.

          The "groups" (I will call them wavefronts) are 64 threads each. The threads are organized into 16 quads (2x2 threads), or at least they were the last time I checked, lol.

          Then there are two wavefront slots, odd and even, per compute unit (I will call compute units SIMD engines from here on out).

          So, on one SIMD engine you have two wavefronts (one for each slot) running eight instructions over eight cycles. Each wavefront is organized into 16 quads (one quad per thread processor), and each quad is 2x2 threads.

          So, 16*2*2 = 64 threads = one wavefront, and you have two wavefronts running 8 instructions over 8 cycles (so they say).
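
          To make the arithmetic concrete, here is a minimal Python sketch using only the figures quoted in this thread. The helper names are made up for illustration, and the rule that a group is padded up to whole wavefronts is an assumption on my part, not something stated above. Under that assumption, dropping the group size to 32 would not spread work over more SIMD engines; it would just leave half the lanes of a wavefront unused.

```python
# Illustrative arithmetic only; constants mirror the figures quoted in this thread.
SIMD_ENGINES = 10          # compute units on a single-GPU 4870
THREAD_PROCESSORS = 16     # thread processors per SIMD engine
THREADS_PER_QUAD = 2 * 2   # one quad (2x2 threads) per thread processor
WAVEFRONT_SIZE = THREAD_PROCESSORS * THREADS_PER_QUAD  # 16 * 2 * 2 = 64


def wavefronts_per_group(threads_per_group):
    """Wavefronts needed to hold one group.

    ASSUMPTION: a group is padded up to a whole number of wavefronts.
    """
    if threads_per_group <= 0:
        raise ValueError("threads_per_group must be positive")
    return -(-threads_per_group // WAVEFRONT_SIZE)  # ceiling division


def lane_utilization(threads_per_group):
    """Fraction of wavefront lanes doing useful work for this group size."""
    lanes = wavefronts_per_group(threads_per_group) * WAVEFRONT_SIZE
    return threads_per_group / lanes


if __name__ == "__main__":
    for size in (32, 64, 128):
        print(size, wavefronts_per_group(size), lane_utilization(size))
```

          With these numbers, a 64-thread group fits in exactly one wavefront on one SIMD engine, so 10 such groups could be in flight across the 10 engines before the second slot is even considered.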

          Hope this helps.

          Also, "run at the same time" is somewhat tricky terminology.

          Technically, only two wavefronts run on a SIMD engine at any ONE time; however, wavefronts are queued and scheduled based on resource usage (essentially GPRs used). This allows a wavefront to be switched out for another at the end of an ISA clause to better hide latency (for example, if the running WF is doing some fetching while the waiting WF will use the ALU units, etc.).
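
          As a rough illustration of "scheduled based on GPRs used", the sketch below estimates how many wavefronts could be queued on one SIMD engine for a given register count per thread. Both the function name and the register budget of 256 are placeholders I am assuming for the example; check the hardware documentation for the real limit on your chip. The point is only the shape of the relationship: fewer GPRs per thread means more wavefronts available to swap in while another one waits on a fetch.

```python
# Hypothetical occupancy estimate; the budget below is an assumed placeholder.
ASSUMED_GPR_BUDGET = 256  # ASSUMPTION: registers available per thread lane


def queued_wavefronts(gprs_per_thread, gpr_budget=ASSUMED_GPR_BUDGET):
    """Estimate how many wavefronts fit in the register file of one SIMD engine.

    Simple model: each queued wavefront reserves `gprs_per_thread` registers
    per lane, so the count is the budget divided by that, rounded down.
    """
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return gpr_budget // gprs_per_thread


if __name__ == "__main__":
    for gprs in (8, 16, 32, 64):
        print(gprs, "GPRs/thread ->", queued_wavefronts(gprs), "wavefronts queued")
```

          Only two of those queued wavefronts execute at any one time; the rest are the pool the scheduler can switch to at a clause boundary.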

          You can find A LOT of VERY useful information both on the ATI Stream forum and the AMD OpenCL forum (developer forums that is) simply by searching them. Try keywords like "wavefront", etc, etc...