9 Replies Latest reply on Dec 23, 2008 9:01 PM by twinclouds

    ATI Stream for Communication link simulation

    twinclouds
      Is the FireStream Processor suitable for such tasks?

      When I visited the AMD website, I "discovered" the FireStream processor.  It looks like the processor might be useful for communication link simulations that require a lot of resources, but I want to make sure I understand it correctly.

      In evaluating the performance of a communication link, we often run the same simulation program multiple times under different parameters, e.g. different SNR points and different channels.  These runs are relatively independent.  From the description of the FireStream Processor, it sounds like such tasks are naturally suited to it.  Is this correct?  I would like to make sure before we start going down this route.
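
      To make "relatively independent" concrete, here is a minimal sketch of such a sweep. BPSK over an AWGN channel is only an assumed stand-in for the real link model, and the names simulate_ber and snr_points_db are hypothetical. Each run owns its random generator and shares no state with the others, so the runs could execute in any order or at the same time.

```python
import math
import random

def simulate_ber(snr_db, n_bits=20000, seed=0):
    """Monte Carlo bit error rate of BPSK over AWGN at one Eb/N0 point (assumed model)."""
    rng = random.Random(seed)                # private RNG: no state shared between runs
    ebn0 = 10.0 ** (snr_db / 10.0)           # Eb/N0 as a linear ratio
    sigma = math.sqrt(1.0 / (2.0 * ebn0))    # noise std dev for unit-energy symbols
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0
        received = symbol + rng.gauss(0.0, sigma)
        decided = 1 if received > 0.0 else 0
        if decided != bit:
            errors += 1
    return errors / n_bits

snr_points_db = [0, 2, 4, 6]
# Every entry depends only on its own (snr, seed) pair.
ber_by_snr = {snr: simulate_ber(snr, seed=i) for i, snr in enumerate(snr_points_db)}
```

      In this form, one (SNR point, channel, seed) combination would correspond to one independent unit of work.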

      I would really appreciate it if someone could share their opinions and experience.

      Thanks.

        • ATI Stream for Communication link simulation
          pbhani

          I have no experience with this kind of simulation, but from your description it does sound like something that will map well to data-parallel architectures like GPUs. The only thing to be careful about is the number of kernel instances that you need. To get the maximum performance out of GPUs, your application needs to launch thousands of work-items to keep the GPU busy. If that is not the case for you, you might be underutilizing the GPU and get a smaller speedup.

           

            • ATI Stream for Communication link simulation
              twinclouds

              Hi, thank you for your prompt response.  I have a few questions that maybe you can answer.

              1. You said one needs "to launch thousands of work-items to keep the GPU busy."  How many processing units are there anyway?  My impression is that there are at most a few hundred.  If this is the case, why do we need to launch thousands of work-items?

              2. Assuming we run only one instance, what clock speed would an ordinary AMD or Intel CPU need to produce the results in the same amount of time?

              3. Regarding program memory requirements: for executing the same C program, how much memory does the Stream processor need compared with an ordinary CPU?  Does the amount of memory scale with the number of instances launched?

              I don't really need to fully utilize the processor's power.  I will be happy if it can generate as many results as, say, 20 ordinary CPUs in the same amount of time.

              I really appreciate your reply.  Sorry if some of the questions sound naive.  I just want to determine quickly whether this is something I would like to do.

              Fuyun

                • ATI Stream for Communication link simulation
                  pbhani

                  1. GPUs need a lot of threads to hide latencies incurred due to memory operations. Typically, if you have more threads (work items), the GPU engines have a better chance of hiding memory latencies, giving you better performance.
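
                  A back-of-the-envelope sketch of why the work-item count has to exceed the number of processing units: if a work-item computes for some cycles and then waits on memory, each unit needs roughly 1 + (memory latency / compute cycles) work-items available so that another one can run during the wait. This is a simplified model, and the unit count and cycle counts below are made-up illustrative values, not measurements of any AMD GPU.

```python
import math

def work_items_to_hide_latency(n_units, compute_cycles, memory_latency_cycles):
    """Rough count of work-items needed so that no unit idles during memory waits."""
    # One work-item computing, plus enough others to cover its memory wait.
    per_unit = 1 + math.ceil(memory_latency_cycles / compute_cycles)
    return n_units * per_unit

# Assumed numbers: 320 units, 20 compute cycles per 400-cycle memory access.
needed = work_items_to_hide_latency(n_units=320, compute_cycles=20,
                                    memory_latency_cycles=400)
print(needed)  # 6720
```

                  Under these assumptions a few hundred units already call for several thousand work-items.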

                  2. I don't have specific numbers, but a single instance would definitely perform quite poorly compared to x86, as GPUs have extremely high memory latencies compared to CPUs.

                  3. Depending on what you are doing, the memory footprint could remain the same or, in the worst case, double, as you could be creating two copies of the data: one for the CPU and the other for the GPU.

                  GPU performance doesn't necessarily come for free. You need to try and do the right things to hit the performance sweet spot. I feel though that the same is usually true for CPU performance optimizations as well.

                  Read the documentation, use the software tools to write your code, analyse the performance, then go back to the "read the documentation" step :-)

                  • ATI Stream for Communication link simulation
                    twinclouds

                    Thanks.  These all make sense.  I just want to get an idea before I go down the road.

                    As for writing the code, do I need a Stream Processor to develop it?  Of course I would not be able to run the code without one, but can I at least get a feel for it?

                • ATI Stream for Communication link simulation
                  Ceq
                  There are many AMD graphics cards compatible with Brook+, not just the FireStream processor; in
                  general every card above Radeon 2xxx will work (although without some features or with less performance).

                  If you need double-precision math or the scatter function you'll need a Radeon 38xx or 48xx.

                  What is more, Brook+ has a software backend, so you can always try it even if you don't have the
                  right hardware. To enable it, set the environment variable BRT_RUNTIME=CPU.
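
                  If you launch your runs from a script, a minimal sketch of setting that variable for a child process could look like this. Only the variable name and value come from the text above; the helper names and "./my_brook_sim" are hypothetical placeholders, not part of the SDK.

```python
import os
import subprocess

def brook_cpu_env(base_env=None):
    """Return a copy of the environment with the Brook+ CPU backend selected."""
    env = dict(os.environ if base_env is None else base_env)
    env["BRT_RUNTIME"] = "CPU"  # name and value as stated in the post
    return env

def run_on_cpu_backend(command):
    """Run a program (given as a list of arguments) with BRT_RUNTIME=CPU set."""
    return subprocess.run(command, env=brook_cpu_env(), check=True)

# Hypothetical usage; "./my_brook_sim" stands in for your own Brook+ executable:
# run_on_cpu_backend(["./my_brook_sim"])
```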

                  If you want to have a look at the documentation, download the SDK and look in the doc folder.
                  • ATI Stream for Communication link simulation
                    Ceq
                    Well, I recommend that you use the 3450 and turn off your integrated graphics; if your IGP isn't from ATI,
                    you probably won't be able to install both drivers at the same time.

                    The Radeon 3450 will be OK for testing, although it is based on the Radeon 2400 architecture, which is quite old.
                    It has only 40 processors and can reach about 48 GFlops (quite far from the Radeon 4850 with 1 TFlop).
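
                    Those peak figures can be reproduced with simple arithmetic: processors x clock x 2, counting one multiply-add per processor per clock as two operations. The clock speeds (600 MHz and 625 MHz) and the 800-processor count for the 4850 are quoted from memory of the published specs, so treat them as assumptions.

```python
def peak_gflops(stream_processors, clock_mhz, flops_per_clock=2):
    """Theoretical single-precision peak, assuming one multiply-add per clock."""
    return stream_processors * clock_mhz * flops_per_clock / 1000.0

hd3450 = peak_gflops(40, 600)    # 48.0 GFlops
hd4850 = peak_gflops(800, 625)   # 1000.0 GFlops, i.e. 1 TFlop
ratio = hd4850 / hd3450          # roughly a factor of 20
```

                    These are theoretical peaks; real code, especially memory-bound code, will see much less.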

                    If you want to work with Brook+, since stream computing is a different programming model, I advise you not
                    only to read the manual but also to look at and try to understand the examples that come with the SDK.