Archives Discussions

twinclouds · ‎12-16-2008

Is the FireStream Processor suitable for such tasks?

When I visited the AMD website, I "discovered" the FireStream processor. It looks like the processor might be useful for communication link simulations that require a lot of resources, but I want to make sure I understand it correctly.

In evaluating the performance of a communication link, we often run the same simulation program multiple times with different parameters, e.g. different SNR points and different channels. These runs are relatively independent. From the description of the FireStream Processor, it sounds like such tasks are naturally suited to it. Is this correct? I would like to make sure before we start going down this route.

I would really appreciate it if someone could share their opinions and experience.

Thanks.

pbhani · ‎12-17-2008

I have no experience with this kind of simulation, but from your description it does sound like something that will map well to data-parallel architectures like GPUs. The only thing to be careful about is the number of kernel instances you need. To get maximum performance out of a GPU, your application needs to launch thousands of work-items to keep it busy. If that is not the case for you, you might underutilize the GPU and see a smaller speedup.
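To make the work-item count concrete, here is a rough Python sketch (not Brook+ code) of how such a parameter sweep might be enumerated. The SNR grid, the channel names, and the idea of splitting each run into independent blocks are all assumptions made up for illustration; whether a given simulation can be split that way depends on the code.

```python
from itertools import product

def enumerate_work_items(snr_points_db, channels, blocks_per_run):
    """List every independent (snr, channel, block) combination.

    Each tuple stands for one unit of work that could, in principle,
    be mapped to one GPU work-item. All inputs are hypothetical.
    """
    return list(product(snr_points_db, channels, range(blocks_per_run)))

# Hypothetical sweep: 11 SNR points (0, 2, ..., 20 dB) and 4 channel models.
snr_points_db = [float(s) for s in range(0, 22, 2)]
channels = ["awgn", "flat_fading", "multipath_a", "multipath_b"]

# One work item per run versus 100 independent blocks per run.
runs_only = enumerate_work_items(snr_points_db, channels, blocks_per_run=1)
with_blocks = enumerate_work_items(snr_points_db, channels, blocks_per_run=100)

print(len(runs_only))    # 44
print(len(with_blocks))  # 4400
```

In this made-up sweep, the 44 runs alone fall far short of thousands of work-items; splitting each run into independent blocks is one possible way to get into that range.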

twinclouds · ‎12-17-2008

Hi, thank you for your prompt response. I have a few questions that maybe you can answer.

1. You said one needs "to launch thousands of work-items to keep the GPU busy." How many processing units are there anyway? My impression is that there are at most a few hundred. If this is the case, why do we need to launch thousands of work-items?

2. Assuming we run only one instance, what clock speed would an ordinary AMD or Intel CPU need to produce the results in the same amount of time?

3. Program memory requirement: to execute the same C program, how much memory does the Stream processor need compared with an ordinary CPU? Does the amount of memory scale with the number of instances launched?

I don't really need to fully utilize the processor's power. I will be happy if it can generate as many results as, say, 20 ordinary CPUs in the same amount of time.

I really appreciate your reply. Sorry if some of the questions sound naive. I just want to determine quickly whether this is something I would like to do.

Fuyun

pbhani · ‎12-22-2008

1. GPUs need a lot of threads to hide latencies incurred due to memory operations. Typically, if you have more threads (work items), the GPU engines have a better chance of hiding memory latencies, giving you better performance.

2. I don't have specific numbers, but a single instance would definitely perform quite poorly compared to x86, as GPUs have extremely high memory latencies compared to CPUs.

3. Depending on what you are doing, the memory footprint could remain the same or, in the worst case, double, since you could be creating two copies of the data: one for the CPU and the other for the GPU.

GPU performance doesn't necessarily come for free. You need to try and do the right things to hit the performance sweet spot. I feel though that the same is usually true for CPU performance optimizations as well.

Read the documentation, use the software tools to write your code, analyse the performance, go back to reading the documentation step 🙂

twinclouds · ‎12-22-2008

Thanks. These all make sense. I just want to get an idea before I go down the road.

As for writing the code, do I need a Stream Processor to develop it? Of course I could not run the code on the hardware without one, but can I at least get a feel for it?

Ceq · ‎12-23-2008

There are many AMD graphics cards compatible with Brook+, not just the FireStream processor. In
general, every card from the Radeon 2xxx series up will work (although without some features or with lower performance).

If you need double precision math or the scatter function, you'll need a Radeon 38xx or 48xx.

What is more, Brook+ has a software backend, so you can always try it even if you don't have the
right hardware. To enable it, define the environment variable BRT_RUNTIME=CPU
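As a minimal sketch of setting that variable for a single program launch, assuming you start the program from a Python script: the child command below is only a stand-in that echoes the variable, and you would replace it with the path to your own compiled Brook+ program (hypothetical here). The helper name `run_with_cpu_backend` is also made up for this example.

```python
import os
import subprocess
import sys

def run_with_cpu_backend(command):
    """Run `command` with BRT_RUNTIME=CPU added to its environment."""
    env = dict(os.environ)       # copy, so the parent process is untouched
    env["BRT_RUNTIME"] = "CPU"   # value exactly as given in the post above
    return subprocess.run(command, env=env, capture_output=True, text=True)

# Stand-in child process that only prints the variable it received.
demo = [sys.executable, "-c", "import os; print(os.environ['BRT_RUNTIME'])"]
result = run_with_cpu_backend(demo)
print(result.stdout.strip())
```

Setting the variable per launch rather than system-wide makes it easy to switch between the software backend and the real hardware later.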

If you want to have a look at the documentation, download the SDK and look in the doc folder.

twinclouds · ‎12-23-2008

O.K. I will try my existing HD3450 card to begin with. That one should work, right?

twinclouds · ‎12-23-2008

One more question for Ceq. If I plug the HD3450 into the PCI Express slot for stream processing, should I connect my monitor to it or keep my monitor connected to the onboard video?

(I will read the manual during the holiday time off, but I have gotten used to not reading manuals when installing new software and hardware nowadays. Maybe I have gotten too lazy 🙂)

Ceq · ‎12-23-2008

Well, I recommend that you use the 3450 and turn off your integrated graphics. If your IGP isn't from ATI,
you probably won't be able to install both drivers at the same time.

The Radeon 3450 will be OK for testing, although it is based on the Radeon 2400 architecture, which is quite old.
It has only 40 processors and can reach about 48 GFlops (quite far from the Radeon 4850 with 1 TFlop).
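As a rough check on those figures, theoretical peak throughput can be estimated as processors × flops per clock × core clock. The sketch below assumes one multiply-add (2 flops) per processor per clock, core clocks of 600 MHz for the 3450 and 625 MHz for the 4850, and 800 processors on the 4850; none of these values are stated above, so treat them as assumptions.

```python
def peak_gflops(stream_processors, core_clock_mhz, flops_per_clock=2):
    """Theoretical single-precision peak: units x flops/clock x clock."""
    return stream_processors * flops_per_clock * core_clock_mhz / 1000.0

# Assumed values (not from the post): 600 MHz for the 3450,
# and 800 processors at 625 MHz for the 4850.
hd3450 = peak_gflops(40, 600)
hd4850 = peak_gflops(800, 625)

print(hd3450)            # 48.0
print(hd4850)            # 1000.0
print(hd4850 / hd3450)   # ratio between the two peaks
```

These are peak numbers only; real code rarely gets close to them, especially when it is limited by memory access.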

If you want to work with Brook+, since stream computing is a different programming model, I advise you not
only to read the manual but also to look at and try to understand the examples that come with the SDK.

twinclouds · ‎12-23-2008

Thanks for your quick reply and detailed suggestions. I will try to see how far I can go before I ask you the next question.

Have a Happy Holiday Season.

ATI Stream for Communication link simulation