While browsing the AMD website, I "discovered" the FireStream processor. It looks like the processor might be useful for communication-link simulations that require a lot of resources, but I want to make sure I understand it correctly.
When evaluating the performance of a communication link, we often run the same simulation program many times under different parameters, e.g. different SNR points and different channels. These runs are largely independent. From the description of the FireStream processor, it sounds like such tasks are naturally suited to it. Is this correct? I would like to make sure before we go down this route.
I would really appreciate it if someone could share their opinions and experience.
I have no experience with this kind of simulation, but from your description it does sound like something that will map well to data-parallel architectures like GPUs. The one thing to be careful about is the number of kernel instances you need. To get maximum performance out of a GPU, your application needs to launch thousands of work-items to keep the GPU busy. If that is not the case for you, you may be underutilizing the GPU and see a lower speedup.
Hi, thank you for your prompt response. I have a few questions; maybe you can answer them.
1. You said one needs "to launch thousands of work-items to keep the GPU busy." How many processing units are there, anyway? My impression is that there are at most a few hundred. If that is the case, why do we need to launch thousands of work-items?
2. Assuming we run only one instance, what clock speed would an ordinary AMD or Intel CPU need to produce the results in the same amount of time?
3. Program memory requirements: to execute the same C program, how much memory does the Stream processor need compared with an ordinary CPU? Does the amount of memory scale with the number of instances launched?
I don't really need to fully utilize the processor's power. I will be happy if it can generate results as fast as, say, 20 ordinary CPUs.
I really appreciate your reply. Sorry if some of the questions sound naive. I just want to determine quickly whether this is something I would like to pursue.
1. GPUs need a lot of threads to hide the latencies incurred by memory operations. Typically, the more threads (work-items) you have, the better chance the GPU engines have of hiding memory latency, giving you better performance.
2. I don't have specific numbers, but single-instance performance would definitely be quite poor compared to x86, as GPUs have extremely high memory latencies compared to CPUs.
3. Depending on what you are doing, the memory footprint could stay the same or, in the worst case, double, since you may be keeping two copies of the data: one for the CPU and one for the GPU.
GPU performance doesn't necessarily come for free. You need to do the right things to hit the performance sweet spot, though I'd say the same is usually true of CPU performance optimization as well.
Read the documentation, use the software tools to write your code, analyse the performance, then go back to reading the documentation 🙂
Thanks. This all makes sense. I just wanted to get an idea before going down this road.
As for writing the code: do I need a Stream processor to develop it? Of course I couldn't run the code, but could I at least get a feel for it?
One more question for Ceq: if I plug the HD3450 into the PCI Express slot for stream processing, should I connect my monitor to it, or keep the monitor on the onboard video?
(I will read the manual during the holiday time off, but I've gotten used to not reading manuals when installing new software and hardware these days. Maybe I've gotten too lazy 🙂)
Thanks for your quick reply and detailed suggestions. I will see how far I can get before asking my next question.
Have a Happy Holiday Season.