Hello all,
I have just tried porting a relatively simple 1D problem submitted to me by a colleague, which at first sight I thought would be well suited to GPU computation. The problem consists of solving a 1D equation giving pressure as a function of space and time, via finite differences. The particularity here is that it needs to be time-stepped for a *very* large number of periods.
Hence, I need to call my solver kernel on a 1D vector of length 3000 around 5e7 times.
The kernel I have written is fairly straightforward and has a large number of multiplications per memory access, so I think it should be well suited to the GPU. However, I am getting *very* poor FLOP rates. The program performs a few hundred time steps (kernel calls) per second, which equates to substantially less than 1 GFLOP/s (on an HD4970).
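To give an idea of the structure, here is a rough CPU-side sketch with placeholder names (assuming a standard wave-equation-style update, not my actual code); on the GPU the inner loop is the kernel body and the outer loop becomes one kernel call per time step:

// Rough sketch only: placeholder names, example coefficient value.
#include <vector>

int main()
{
    const int  N      = 3000;        // grid points
    const long NSTEPS = 50000000L;   // ~5e7 time steps
    const float coef  = 0.25f;       // c^2 * dt^2 / dx^2 (example value)

    std::vector<float> p_old(N, 0.0f), p(N, 0.0f), p_new(N, 0.0f);

    for (long step = 0; step < NSTEPS; ++step) {
        for (int i = 1; i < N - 1; ++i) {            // the "kernel" body
            p_new[i] = 2.0f * p[i] - p_old[i]
                     + coef * (p[i + 1] - 2.0f * p[i] + p[i - 1]);
        }
        p_old.swap(p);                               // rotate the time levels
        p.swap(p_new);
    }
    return 0;
}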
Is this due to an irreducible overhead associated with each kernel call? If so, would switching to CAL help? Or is my problem just not such a good match for the GPU after all?
Thanks!
How much off-GPU transfer are you doing per kernel call?
Micah,
I don't want to derail the thread... but I haven't noticed any real difference in performance with different domain sizes (i.e. a multiple of 8 or not). I have run experiments from 256x256 to 4096x4096 (varying the domain along the way by some odd number, say 93 for example) and didn't see any degradation in performance between any of the domains; they all ran in the same time. Just FYI.
Micah,
Sorry, I had an error in my code. Fixed, and you are certainly correct... thanks for making me double-check my work.
dukeleto,
The kernel call overhead also depends on your kernel argument characteristics and the runtime features used. Features like scatter, stream resizing, and the domain of execution have a lot more overhead. Are you using any of these features?
After thinking about it, my problem seems to be similar to dukeleto's.
It is not so big, but it needs to be called a lot of times (millions, billions, up to trillions), sequentially from one input to the next; inside that operation, however, there is a parallel function.
Is there any way for a kernel to be called millions of times sequentially inside the GPU, from one input to the next, without a call from the host?
for example:
for (int epoch = 0; epoch < num_of_epoch; epoch++) {
    for (int i = 0; i < (int) yB; i++) {
        Stream<float4> myu_min(rank[2], streamSizeMinOfVecCluster);
        Stream<float4> myu_max_of_min(rank[2], streamSizeMaxOfMin);
        myufy(i, fuzzy_number, vec_ref, myu);    // from one input to another
        minimum_myu_cluster(myu, myu_min);       // parallel function, but for one input only
        max_of_min_myu(myu_min, myu_max_of_min);
    }
}
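Even if I hoist the Stream allocations out of the loops (sketch below, assuming the stream sizes do not change between iterations), I still have one host-side call per input:

// Same loop as above, but with the streams created once, outside the loops.
Stream<float4> myu_min(rank[2], streamSizeMinOfVecCluster);
Stream<float4> myu_max_of_min(rank[2], streamSizeMaxOfMin);

for (int epoch = 0; epoch < num_of_epoch; epoch++) {
    for (int i = 0; i < (int) yB; i++) {
        myufy(i, fuzzy_number, vec_ref, myu);    // still called from the host per input
        minimum_myu_cluster(myu, myu_min);
        max_of_min_myu(myu_min, myu_max_of_min);
    }
}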