I have just tried porting a relatively simple 1D problem submitted to me by a colleague, which I thought, at first sight, would be well suited to GPU solution. The problem consists of solving a 1D equation giving pressure as a function of space and time, via finite differences. The particularity here is that it needs to be time-stepped for a *very* large number of periods.
Hence, I need to call my resolution kernel on a 1D vector of length 3000 around 5e7 times.
The kernel I have written is fairly straightforward, and has a large number of multiplications per memory access, so I think it should be well suited to the GPU. However, I am getting a *very* bad FLOP rate. The program performs a few hundred time steps (kernel calls) per second; this equates to substantially less than 1 GFLOP/s (on an HD4970).
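To make the numbers concrete, here is a rough back-of-envelope sketch. The 50 flops per grid point and the 300 calls/s are my assumptions for illustration, not measured values from the program:

```python
# Back-of-envelope estimate of the observed arithmetic throughput.
# Assumed numbers (not measured): ~50 flops per grid point per time
# step; "a few hundred" kernel calls per second taken as 300.
points_per_step = 3000    # 1D vector length
flops_per_point = 50      # assumed arithmetic per point per step
calls_per_second = 300    # observed kernel-call rate (approximate)
total_steps = 5e7         # required number of time steps

throughput = points_per_step * flops_per_point * calls_per_second
print(f"Sustained rate: {throughput / 1e9:.3f} GFLOP/s")

total_runtime_hours = total_steps / calls_per_second / 3600
print(f"Projected wall time: {total_runtime_hours:.0f} hours")
```

Even with generous assumptions, 3000 points per launch is a tiny amount of work, so the sustained rate lands in the tens of MFLOP/s, consistent with launch overhead dominating.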
Is this due to an irreducible overhead linked to kernel calling? If so, would switching to CAL help? Or is my problem just not such a good match for a GPU after all?