20 Replies Latest reply on Feb 18, 2010 12:57 PM by CaptainN

    SR (Shared Registers), sharing level.

    CaptainN
      Is SRx shared per thread of a SIMD, per thread of an even/odd wavefront, or per thread of each wavefront number (within a SIMD)?

      Ultimately, the task is to pass data between kernels using SR (shared registers). SR registers are supposed to persist from thread to thread (within a wavefront) across kernel invocations launched from calCtxRunProgramGridArray.

      Micah,

      I really tried to keep this post short.

      Based on a number of answers and posts (http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=116932, http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=115872, http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=121826), I wrote a number of tests and still cannot say for sure what the sharing level of SR registers is. According to the documentation (http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf), SR is shared per SIMD.

      However, the results do not really confirm this statement. I used an RV730 (HD 4670) chip for these experiments.

      I allocated a 128x4 array of float4. The size is doubled; the reason will become clear below.

      A total of 3 kernels are used for the experiments (run from calCtxRunProgramGridArray): krn#1 resets sr0 and g[0..511]; krn#2 increments sr0.x by 1; krn#3 reads sr0 and writes it to g[0..511] (so that even and odd wavefronts can be compared). Every element of the g[] array (float4) holds the sr0 value: the number of increments by krn#2 (x), the tid of krn#1 (y), the tid of krn#2 (z), and the tid of krn#3 (w) (yes, I am following lpw's idea).

      Experiments:

      1.       Thread block == 32, group size == 8. A total of 256 threads run, to occupy full capacity. g[0..255] was populated as expected, i.e. g[0..255].x == 1 (incremented by 1 in krn#2) and g[0..255].y == g[0..255].z == g[0..255].w == 0..255. In this case SR looks like it is shared per SIMD.

       

      2.       Thread block == 32, group size == 16 for krn#1 and krn#3, but group size == 8 for krn#2. A total of 512 threads run, to occupy full capacity twice (for the kernels that set up SR and read SR into g[]). Every SIMD gets scheduled twice for krn#1 and krn#3, so there are odd and even wavefronts. However, I set the group size of krn#2 to 8, so every thread of krn#2 runs only once, on the odd wavefront, where sr0.x is incremented by 1.

      Krn#3 populates g[0..511] this way: g[0..511].x == 1 (sr0.x), g[0..511].y == g[0..511].z == g[0..511].w == 0..511. Both even and odd wavefronts of krn#3 read sr0.x as 1, so again it looks like SR is shared per SIMD.

       

      3.       Thread block == 32, group size == 16. A total of 512 threads run, to occupy full capacity twice. Every SIMD gets scheduled twice, so there are odd and even wavefronts for all 3 kernels. g[0..511] was populated this way: g[0..511].x == 1, g[0..511].y == g[0..511].z == g[0..511].w == 0..511. If SR were shared per SIMD, this value would have to be 2, because the even and odd wavefronts should each increment it once, so the total should be 2 by the time krn#3 reads sr0 to place it into g[].x. This makes me think SR is shared per wavefront (one “instance” of sr0 for the even wavefront and another “instance” for the odd wavefront).

       

      If I increase the group size for krn#2 to 32 (while keeping the group size of krn#3 and krn#1 at 16), the values found in SR are neither 2 (as expected if it is shared per odd/even wavefront) nor 4 (if it is shared per SIMD), but 3. How come? This is possibly a wrong setup, as the group size differs between kernel invocations (it seems OK, though), but it possibly looks like SR is shared per wavefront number (so each of the 4 wavefronts has its own sr0; not confirmed, though).

      Also, if krn#2 increases sr0.x by 2 (two add sr0.x, sr0.x, l0.w ops, where l0.w is set to 1.0), with thread block == 32 and group size == 16 for all 3 kernels (example 3 above), g[0..511].x comes out as 3, which I cannot explain either.
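The expectations in the experiments above can be written down as a small bookkeeping model. This is only a sketch in plain C, not CAL or IL code, and it rests on assumptions that are not confirmed anywhere in this thread: the thread group equals one wavefront, the groups are spread evenly over the SIMDs, and wavefront w on a SIMD uses SR copy w % sr_copies. The name simulate_sr0_x and its parameters are hypothetical.

```c
#define MAX_COPIES 4

/* Model of a single lane on a single SIMD (hypothetical, for reasoning only).
 * krn#1 resets sr0, krn#2 runs krn2_waves wavefronts and wavefront w adds
 * `step` to copy (w % sr_copies), then wavefront `reader` of krn#3 reads
 * copy (reader % sr_copies).
 * sr_copies == 1 models "shared per SIMD", sr_copies == 2 models one copy
 * for even and one for odd wavefronts. sr_copies must be 1..MAX_COPIES. */
static float simulate_sr0_x(int krn2_waves, int sr_copies, float step, int reader)
{
    float sr0_x[MAX_COPIES] = {0.0f, 0.0f, 0.0f, 0.0f};  /* krn#1: reset */

    for (int w = 0; w < krn2_waves; w++)
        sr0_x[w % sr_copies] += step;                     /* krn#2: increment */

    return sr0_x[reader % sr_copies];                     /* krn#3: read back */
}
```

With 8 SIMDs, a krn#2 group size of 8, 16 or 32 corresponds to krn2_waves of 1, 2 or 4. Under these assumptions, experiment 3 gives 2 for one copy and 1 for two copies (1 was observed), while the group-size-32 run gives 4 or 2 and the double-increment run gives 4 or 2 (3 was observed in both), so neither hypothesis in this simple model reproduces every measurement.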

      So, what is the sharing level of SR?

        • SR (Shared Registers), sharing level.
          CaptainN

          IL code used to get the results (see post above):


          /* krn#1: zero sr0, store this kernel's thread id in sr0.y, and clear g[tid]. */
          static CALchar krn1[] =
          "il_cs_2_0\n"
          "dcl_lds_sharing_mode _wavefrontAbs\n"
          "dcl_num_thread_per_group 32\n"
          "dcl_shared_temp sr2\n"
          "dcl_cb cb0[1]\n" //dummy
          "dcl_literal l0, 0, 0, 0, 0\n"
          "itof sr0, l0\n"
          "itof sr0.y, vaTid.x\n"
          "mov g[vaTid.x], r0.0000\n"
          "end\n";

          /* krn#2: add 1.0 to sr0.x if tid < cb0[0].x, and store this kernel's thread id in sr0.z. */
          static CALchar krn2[] =
          "il_cs_2_0\n"
          "dcl_lds_sharing_mode _wavefrontAbs\n"
          "dcl_num_thread_per_group 32\n"
          "dcl_shared_temp sr2\n"
          "dcl_cb cb0[1]\n"

          "dcl_literal l0, 1.0, 1.0, 1.0, 1.0\n"
          "ftoi r0, cb0[0]\n"

          "ult r17, vaTid.x, r0.x\n"
          "if_logicalnz r17.x\n" //this condition always satisfied
          "    add sr0.x, sr0.x, l0.x\n"
          //"    add sr0.x, sr0.x, l0.x\n" //second increment...
          "endif\n"

          "itof sr0.z, vaTid.x\n"
          "end\n";

          /* krn#3: store this kernel's thread id in sr0.w and write sr0 out to g[tid]. */
          static CALchar krn3[] =
          "il_cs_2_0\n"
          "dcl_lds_sharing_mode _wavefrontAbs\n"
          "dcl_num_thread_per_group 32\n"
          "dcl_shared_temp sr2\n"
          "dcl_cb cb0[1]\n" //dummy

          "  mov r0.x, vaTid.x\n"
          "  itof sr0.w, vaTid.x\n"
          "  mov g[r0.x], sr0\n"

          "end\n";

            • SR (Shared Registers), sharing level.
              hazeman

              As no one seems to be too eager to answer I will add my 4 cents.

              First of all, you are really brave to use a group size which isn't a multiple of 64. On 4xxx the warp size is 64, and a warp is sometimes called a hardware thread on GPUs (a GPU thread isn't a real CPU thread). So using a group size smaller than 64 is like trying to split a CPU thread. I really don't know how CAL handles group sizes which aren't a multiple of 64. This can be one source of your problems.

              The other one could be that SR registers are shared between warps. So the first threads of all warps access the same register, the second threads access another register, and so on. ATI calls it vertical sharing or something like that.

              There is a nice slide showing this in the 3xxx/4xxx docs.

              • SR (Shared Registers), sharing level.
                CaptainN

                Hi Hazeman, thanks for the reply!

                As for warp size == 64, isn't that the same as wavefront size? I understand it is an important parameter here, but the RV730 I work with has 8 SIMDs with wavefront size == 32, which makes me think it has 8 thread processors per SIMD (in two quads). Do the reported wavefront size and the warp size have the same physical roots?

                 

                 

                  • SR (Shared Registers), sharing level.
                    hazeman

                     

                    Originally posted by: CaptainN

                     

                    As for warp size == 64, isn't that the same as wavefront size?



                    I think the term wavefront isn't used anymore (I found it on some early ATI slides). It's probably the same as a warp.

                    I understand it is an important parameter here, but the RV730 I work with has 8 SIMDs with wavefront size == 32, which makes me think it has 8 thread processors per SIMD (in two quads). Do the reported wavefront size and the warp size have the same physical roots?

                     

                    RV730 has warp size 64. Also it doesn't have 8 simd cores. It has 4.

                    Unfortunately, NVidia/ATI decided to use the term thread to describe something that isn't really a thread (so they could say "we can run >1000 threads" and sell more cards). This leads to quite a lot of confusion.

                    You should think of a SIMD engine as one CPU with strange properties, and a warp should be considered a normal thread. A GPU thread should be considered one operation in a vector instruction (like SSE multiply or add, which do 4 operations at once, so that would be 4 GPU threads).

                    So your RV730 has 4 CPUs, each of which can execute 64-wide instructions (to confuse things more, each instruction can do from 1 to 5 basic operations).

                     

                     

                • SR (Shared Registers), sharing level.
                  MicahVillmow
                  Hazeman,
                  I think this is a better analogy for how things work. There are two types of threads in a CPU, a hardware thread and a software thread. There is usually one hardware thread per core, but there can be multiple software threads per hardware thread. Intel's Hyperthreading gives the illusion that there are twice as many cores, but actually it just runs two hardware threads per core. The GPU is similar but 'wider' in almost every aspect.

                  To map this onto the GPU (ignoring CF/TEX), going from the bottom up: each software thread (work-item in OCL) can execute 1-5 instructions per cycle, which is similar to SSE. A hardware thread (wavefront) can execute 1-64 work-items in parallel, depending on the wavefront size and the number of work-items available. Each core (SIMD) executes two wavefronts in parallel, but can have up to 32 wavefronts waiting to execute, depending on resources. Each device can have between 1 and 20 SIMDs, depending on the generation and model.

                  Hope this clears things up.
                    • SR (Shared Registers), sharing level.
                      hazeman

                       

                      Originally posted by: MicahVillmow Hazeman, I think this is a better analogy for how things work. There are two types of threads in a CPU, a hardware thread and a software thread.


                      Imho it would be better to give the gpu thread a new name, maybe fiber or subthread, and call the wavefront a thread. That way people new to the gpu world wouldn't be confused about what is what (and almost everybody is at the beginning).

                      But whatever analogy we use, the whole problem arose from the abuse of the thread concept by NVidia/ATI.

                       

                       

                      • SR (Shared Registers), sharing level.
                        CaptainN

                        Dear Hazeman!

                        I think you are mistaken: calDeviceGetAttribs returns (in the CALdeviceattribs struct):

                        target: CAL_TARGET_730 (Device Manager shows HD 4600, and it is a card with one HDMI output, so it is an HD 4670).

                        wavefrontSize: 32

                        numberOfSIMD: 8

                        (hard to believe the CAL runtime is wrong).

                        IMHO, a wavefront (or warp) on RV730 IS 32 hw threads per SIMD. Micah, please confirm.

                        They are still threads, running shaders. I see it as a SIMD running wavefrontSize threads in parallel, which makes a wavefront. A wavefront finishes when all its hw threads (32 in the case of RV730) finish; then the next wavefront goes. For me the SSE analogy is OK, but only as a first approximation, because in SSE one hardware thread issues a true SIMD command, while on the GPU I have 32 threads (in the case of RV730) living their own lives. In SSE all registers must be implicitly loaded before the op takes place. Here, everything is driven by the data source or by thread scheduling... Thinking of a SIMD as a “strange” CPU IMHO makes it more complicated, especially for compute shaders, where control is done by thread scheduling.

                        Back to the problem: I am looking to transfer data between wavefronts (within 1 kernel) and between kernels, and so far I see only 2 things that can help me: SR and LDS. Also, I want atomic operations that every hw thread (part of a wavefront) performs on a variable that will be transferred between wavefronts.

                        As Micah mentioned, on 4xxx 2 wavefronts run in parallel on 1 SIMD if you schedule twice as many threads as there are thread processors in the chip (for 770 that is 640x2=1280 threads). I can prove neither that SR belongs to the hw thread number within a wavefront, nor that it belongs to the odd/even wavefront number. I think I am missing something, which is why my question remains valid: what is the SR sharing level? At this moment it is easy for me to believe that SR is broken (maybe at the IL level).

                        BTW, I made it work via LDS: I can easily see the output from every SIMD thread, and indeed the increment happens a number of times equal to the hw thread passes. Since ops on LDS storage (abs addressing case) are not atomic (read from LDS, ++, write to LDS), is there a chance that my LDS storage is incremented incorrectly if even and odd wavefronts run in parallel and one of the increments is lost? (I am afraid it is possible. Then how can I do atomic ops on shared variables, when everything can be upset by a parallel wavefront on the same SIMD operating on the same location (memory or LDS)?)
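The lost-increment worry can be illustrated with a deterministic sketch in plain C. It is not GPU code: the "LDS location" is an ordinary float, the two "wavefronts" are structs holding a temporary register, and the interleaving is written out by hand. All names here (Wave, rmw_read, run_interleaved and so on) are hypothetical.

```c
/* One wavefront's private temporary for a read-modify-write sequence. */
typedef struct { float tmp; } Wave;

static void rmw_read(Wave *w, const float *lds)  { w->tmp = *lds; }    /* read from LDS */
static void rmw_add(Wave *w)                     { w->tmp += 1.0f; }   /* ++            */
static void rmw_write(const Wave *w, float *lds) { *lds = w->tmp; }    /* write to LDS  */

/* Even wavefront completes its update before the odd one starts. */
static float run_serialized(void)
{
    float lds = 0.0f;
    Wave even, odd;
    rmw_read(&even, &lds); rmw_add(&even); rmw_write(&even, &lds);
    rmw_read(&odd, &lds);  rmw_add(&odd);  rmw_write(&odd, &lds);
    return lds;
}

/* Both wavefronts read the old value before either writes back,
 * so one of the two increments is overwritten. */
static float run_interleaved(void)
{
    float lds = 0.0f;
    Wave even, odd;
    rmw_read(&even, &lds);
    rmw_read(&odd, &lds);
    rmw_add(&even);
    rmw_add(&odd);
    rmw_write(&even, &lds);
    rmw_write(&odd, &lds);
    return lds;
}
```

In this model run_serialized() returns 2.0 and run_interleaved() returns 1.0. Whether the hardware can actually produce the second ordering in absolute sharing mode is exactly the open question of the post; the sketch only shows why a non-atomic read, add, write sequence is unsafe if it can.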

                        I can attach/post the app code if needed, but it is built on the samples and is a bit ugly because of R&D :)

                         

                          • SR (Shared Registers), sharing level.
                            hazeman

                             

                            Originally posted by: CaptainN 

                             

                            Here, everything is driven by the data source or by thread scheduling... Thinking of a SIMD as a “strange” CPU IMHO makes it more complicated, especially for compute shaders, where control is done by thread scheduling.

                            When you need to heavily optimize your code, this approach of treating the SIMD as a vector CPU is what is usually used (look at Volkov's paper about optimizing matrix multiplication).

                            And unfortunately the gpu only simulates the scheduling of "gpu threads".

                            Let's consider an example. Assume that we have only one wavefront running (64 gpu threads). The pseudocode is:

                            --- code

                            float a = input_memory[thread_idx];
                            if( a>3 ) {
                             a = a + 10;
                            } else {
                             a = a + 20;
                            }

                            --- code

                            If the gpu threads were real threads, then some threads would do a = a + 10 and the others would do a = a + 20;

                            The gpu would do 64 operations in total.

                            But this isn't the case on the gpu.

                            This code is translated into sequential code with 64-wide operations.

                            --- code

                            bool64 cmp = vector64_a > vector64(3);
                            result_a = vector64_a + vector64(10);
                            result_b = vector64_a + vector64(20);
                            // The select below is done in hardware. This is the only advantage of the "gpu thread" model:
                            // you don't need to write masking code explicitly, as the hardware/compiler does it for you.
                            vector64_a = select64(cmp, result_a, result_b);

                            --- code

                            Counting them up, the gpu does 128 operations.

                            So instead of 64 operations we have 128 (and depending on the kernel, the difference can be much bigger).

                            This is exactly why you can't think of gpu threads as real threads when you want to write high-performance code.
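The operation count in the example above can be checked with a small CPU-side sketch in plain C. It follows the simplified model in the post (both sides of the branch are always evaluated for the whole wavefront) and counts only the additions. The function names and the WAVE constant are made up for illustration.

```c
#define WAVE 64   /* lanes in one wavefront, as assumed in the post */

/* "Real threads" view: every lane takes exactly one side of the branch.
 * Returns the number of additions performed. */
static int run_as_independent_threads(float a[WAVE])
{
    int adds = 0;
    for (int i = 0; i < WAVE; i++) {
        if (a[i] > 3.0f) a[i] = a[i] + 10.0f;
        else             a[i] = a[i] + 20.0f;
        adds++;
    }
    return adds;
}

/* Vector view: both sides are computed for all lanes, then a per-lane
 * select keeps the result that matches the comparison mask. */
static int run_as_predicated_wavefront(float a[WAVE])
{
    int cmp[WAVE];
    float result_a[WAVE], result_b[WAVE];
    int adds = 0;

    for (int i = 0; i < WAVE; i++) cmp[i] = a[i] > 3.0f;
    for (int i = 0; i < WAVE; i++) { result_a[i] = a[i] + 10.0f; adds++; }
    for (int i = 0; i < WAVE; i++) { result_b[i] = a[i] + 20.0f; adds++; }
    for (int i = 0; i < WAVE; i++) a[i] = cmp[i] ? result_a[i] : result_b[i];

    return adds;
}
```

Both functions leave the same values in a[], but the first performs 64 additions and the second 128, which matches the count given in the post.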

                              • SR (Shared Registers), sharing level.
                                CaptainN

                                Dear Hazeman,

                                As for SIMD count/wavefront size, this is what I got, and it is a regular 730 from the store. We need somebody from AMD or another person to run a third test to confirm what is going on. How can you be so sure? Do you have the same board in your possession?

                                As for the logic you describe, I agree with you 100%; this is what is happening. Thinking of hw threads as real threads (say, the Y direction, how a wavefront propagates) gives me the flexibility to reason about barriers and other sync tools, and makes it easy to think about LDS/SR physics. Given that all threads are clocked from the same source, you are thinking in the X direction, per op per clock, and it is hard for me to call that a HW thread, as a thread presumes instruction execution rather than simultaneous vector operations.

                                Coming back to my question:

                                Since ops on LDS storage (abs addressing case) are not atomic (read from LDS, ++, write to LDS), is there a chance that my LDS storage is incremented incorrectly if even and odd wavefronts run in parallel and one of the increments is lost? (I am afraid it is possible. Then how can I do atomic ops on shared variables, when everything can be upset by a parallel wavefront on the same SIMD operating on the same location (memory or LDS)?)

                                 

                          • SR (Shared Registers), sharing level.
                            MicahVillmow
                            hazeman,
                            This is fixed with the OpenCL Nomenclature, it will just take a little bit for everyone to use it. These are work-items(an individual unit of execution) and work-groups(a group of work-items that execute on the same device).
                              • SR (Shared Registers), sharing level.
                                ryta1203

                                Micah,

                                 I'm not a big fan of the term "item", it's simply too vague. Then again, no one asked me. lol.

                                • SR (Shared Registers), sharing level.
                                  hazeman

                                   

                                  Originally posted by: MicahVillmow hazeman, This is fixed with the OpenCL Nomenclature, it will just take a little bit for everyone to use it. These are work-items(an individual unit of execution) and work-groups(a group of work-items that execute on the same device).


                                  Imho it isn't: there is nothing in OCL to describe a wavefront (a work-group can contain multiple wavefronts). Besides, OCL nomenclature isn't only for gpus. And in practice, when people talk about optimizing kernels they say hardware thread (and it will probably stick).

                                  CaptainN - It's simply impossible for a true 46xx card to have 8 SIMDs and a wavefront size of 32. Maybe your card is some mobile chipset (a notebook, maybe?). I know that ATI sold some rv6xx chipsets as 4xxx cards in the mobile market.

                                   

                                • SR (Shared Registers), sharing level.
                                  MicahVillmow
                                  RV770 is 10 SIMDs with a wavefront size of 64
                                  RV730 is 8 half-wide SIMDs with a wavefront size of 32
                                  RV710 is 4 quarter-wide SIMDs with a wavefront size of 16
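As a quick cross-check of these figures against the thread counts used earlier in the thread, the table can be put into a few lines of plain C. The Chip struct and the helper name are hypothetical; only the three rows of numbers come from the post above.

```c
/* SIMD count and wavefront size per chip, as listed in the post above. */
typedef struct {
    const char *name;
    int simds;
    int wavefront_size;
} Chip;

static const Chip chips[] = {
    { "RV770", 10, 64 },
    { "RV730",  8, 32 },
    { "RV710",  4, 16 },
};

/* Threads needed to put `waves_per_simd` full wavefronts on every SIMD. */
static int threads_to_fill(const Chip *c, int waves_per_simd)
{
    return c->simds * c->wavefront_size * waves_per_simd;
}
```

One wavefront per SIMD comes to 640 threads on RV770 and 256 on RV730, and two per SIMD comes to 1280 and 512. These are the same numbers CaptainN used for "full capacity" and "full capacity twice".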
                                  • SR (Shared Registers), sharing level.
                                    MicahVillmow
                                    CaptainN,
                                    Sorry for the late response on this question. An SR register is atomic within a single IL instruction. It is shared along the work-item ordinal, so it must be updated in a single cycle. If you want atomics on LDS, you need to use the 5XXX series, as the 4XXX series does not expose any method of doing atomic operations in IL.
                                      • SR (Shared Registers), sharing level.
                                        CaptainN

                                        Micah, thank you for the reply!

                                        To follow up:

                                        In abs sharing mode, do even and odd wavefronts point to the same LDS location, so that there is a chance they conflict when both write to it (e.g. odd wavefront, thread ordinal #0 and even wavefront, thread ordinal #0 pointing to the same LDS offset)? Should I worry about it, and if so, how can I protect a location from being updated by another wavefront running concurrently on the same SIMD? (Note: in the case of histograms, there is a chance that a bin increment is lost due to concurrent wavefronts. It should be a general issue, so I believe there must be a technique to solve it.)

                                      • SR (Shared Registers), sharing level.
                                        MicahVillmow
                                        CaptainN,
                                        The only way to guarantee that LDS operations don't clobber each other is to use relative sharing mode. In absolute sharing mode, if you have more than 1 wavefront per SIMD, the data cannot be guaranteed to be correct.
                                          • SR (Shared Registers), sharing level.
                                            CaptainN

                                            Micah,

                                            Can a SIMD (say SIMD#0) have more than 1 thread group being processed at the same time, regardless of LDS sharing mode?

                                            (In other words, if dcl_num_thread_per_group is set to 64 (for 770), and of course the corresponding gridBlock.width of CALProgramGrid is set to 64 as well, would a given SIMD #0 ever have more than 1 wavefront scheduled at any given time?)

                                          • SR (Shared Registers), sharing level.
                                            MicahVillmow
                                            CaptainN,
                                            If absolute address mode is specified only 1 wavefront will execute at a time per SIMD, but if the SIMD is empty and another wavefront is available it will execute on that SIMD.
                                              • SR (Shared Registers), sharing level.
                                                CaptainN

                                                It seems this will solve my problem. Limiting the thread group to the wavefront size (for absolute addressing mode) means only 1 wavefront is executed, hence every thread can safely read-modify-write LDS storage without worrying that something else will interfere with the data in LDS! (So only 1 thread group will be executed on a SIMD at any given time.)

                                                From what you said: in relative addressing mode, will more than 1 thread group be scheduled on a SIMD (say SIMD#0), with more than 1 wavefront, even and odd, executing at the same time (if the thread group is set to the wavefront size)?

                                                • SR (Shared Registers), sharing level.
                                                  CaptainN

                                                  And first of all, Micah, I should thank you for the light in these tunnels. Please keep up the good work, and best wishes to you and your team.