7 Replies Latest reply on Jul 18, 2009 12:40 PM by riza.guntur

    minimum time cost of a kernel call

    dukeleto
      is there one?

      Hello all,

      I have just tried porting a relatively simple 1D problem submitted to me by a colleague, which, I thought at first sight, would be well suited to GPU solution. The problem consists of solving a 1D equation giving pressure as a function of space and time, via finite differences. The particularity here is that it needs to be time-stepped for a *very* large number of periods.

      Hence, I need to call my resolution kernel on a 1D vector of length 3000 around 5e7 times.

      The kernel I have written is fairly straightforward and has a large number of multiplications per memory access, so I think it should be well suited to the GPU. However, I am getting *very* poor FLOP rates. The program performs a few hundred time steps (kernel calls) per second; this equates to substantially less than 1 GFLOP/s (on an HD4970).

      Is this due to an irreducible overhead linked to kernel calls? If so, would switching to CAL help? Or is my problem just not such a good match for the GPU after all?

      Thanks!
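      For what it's worth, one way to gauge whether a fixed per-call cost dominates is to time the kernel at two stream sizes and fit T(n) = overhead + per_element * n. A minimal sketch in plain C++ (the timings below are made up for illustration, not measured on any actual card):

```cpp
#include <cassert>

// Model: time per kernel call = fixed launch overhead + per-element cost.
// Timing the same kernel at two stream sizes lets you solve for both terms.
struct LaunchModel {
    double overhead_us;     // fixed cost per call, microseconds
    double per_element_us;  // marginal cost per element, microseconds
};

LaunchModel fit_launch_model(double t1_us, int n1, double t2_us, int n2) {
    double per_element = (t2_us - t1_us) / double(n2 - n1);
    double overhead    = t1_us - per_element * n1;
    return {overhead, per_element};
}

// Hypothetical example: 3000 elements in 40 us, 3,000,000 elements in
// 1040 us. The fit gives ~39 us of fixed overhead, i.e. almost the entire
// small call is launch cost rather than arithmetic.
```

      If the fitted overhead really is tens of microseconds, then 5e7 launches cost on the order of half an hour of pure overhead, regardless of how fast the arithmetic itself is.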

        • minimum time cost of a kernel call
          ryta1203

          How many off-GPU transfers are you doing per kernel call?

          • minimum time cost of a kernel call
            MicahVillmow
            dukeleto,
            The problem is multi-pronged. First off, I'm not sure if Brook+ transforms 1D vectors into 2D, but our hardware is optimized for 2D arrays, so a vector of length 3000 will run slower than a matrix of, say, 200x15. Given how our hardware works, you want to make your dimensions a multiple of 8 if possible, so a matrix of 8x375, or 375x8, would probably have better cache access than 3000x1.
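            To illustrate the reshaping described above, here is the index arithmetic for folding a length-3000 stream into a 375x8 layout. This is plain C++ showing only the mapping; whether Brook+ performs such a transformation for you is exactly the open question.

```cpp
#include <cassert>

// Fold a 1D index into a 2D (row, col) position whose width is a multiple
// of 8, and back. A 3000-element vector becomes 375 rows x 8 columns.
struct Index2D { int row; int col; };

Index2D to2d(int i, int width)    { return { i / width, i % width }; }
int     to1d(Index2D p, int width) { return p.row * width + p.col; }
```

            If the length were not divisible by 8, the last row would need padding; the kernel would then have to guard against the padded tail elements.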

            Second, 3000 elements is very small and you are not going to fully utilize the graphics card. See simple_matmult at 64x64 versus 2kx2k. With a data set that small, the transfer overhead, launch overhead and many other factors take up a larger portion of the execution time relative to the actual calculation.

            Third, is there a way to do multiple time-stepped iterations at the same time? Instead of doing 1 computation per kernel call, bunch 1000 or 2000 of them together and do them in parallel. However, this requires that each iteration not depend directly on the previous iteration. Iterative algorithms and problems don't map easily onto the GPU.
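            The batching idea can be sketched in plain C++ (the update rule here is a toy stand-in, not the actual finite-difference kernel): one "call" advances many independent problem instances by one step, so the per-call overhead is amortized over the whole batch.

```cpp
#include <cassert>
#include <vector>

// One "kernel call" advances every independent instance by one time step.
// The update rule is a placeholder; the point is the batching structure:
// 1000 instances per call instead of 1.
void step_all(std::vector<double>& states) {
    for (double& s : states) s = 0.5 * s + 1.0;  // toy update, fixed point 2.0
}

std::vector<double> run(int instances, int steps) {
    std::vector<double> states(instances, 0.0);
    for (int t = 0; t < steps; ++t)
        step_all(states);  // one launch per time step covers the whole batch
    return states;
}
```

            This only works when the instances are independent of each other; each instance's time steps remain strictly sequential.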

            Hope this helps.
              • minimum time cost of a kernel call
                ryta1203

                Micah,

                  I don't want to derail the thread... but I haven't noticed any real difference in performance with different domain sizes (meaning a multiple of 8 or not). I have run experiments from 256x256 to 4096x4096 (varying the domain along the way by some odd number, say 93) and didn't see any degradation in performance between any of the domains; they all ran in the same time. Just FYI.

              • minimum time cost of a kernel call
                MicahVillmow
                Ryta,
                If they are all running in the same time with varying domain sizes, then the execution is not the bottleneck and something else is. This suggests that whatever example you are running is not stressing the hardware enough for the domain size to make a difference.

                  • minimum time cost of a kernel call
                    ryta1203

                    Micah,

                      Sorry, I had an error in my code. Fixed, and you are certainly correct... thanks for making me double-check my work.

                      • minimum time cost of a kernel call
                        gaurav.garg

                        dukeleto,

                        The kernel call overhead also depends on your kernel argument characteristics and the runtime features used. Features like scatter, stream resizing, and execution domain have a lot more overhead. Are you using any of these features?

                          • minimum time cost of a kernel call
                            riza.guntur

                            After some thought, my problem seems to be similar to dukeleto's.

                            It is not so big, but it needs to be called a lot (millions, billions, even trillions of times), going from one input to the next sequentially; inside each operation, however, there exist parallel functions.

                            Is there any way for a kernel to be called a million times inside the GPU, sequentially from one input to another, without a call from the host?

                            for example:

                             

                            for (int epoch = 0; epoch < num_of_epoch; epoch++) {
                                for (int i = 0; i < (int) yB; i++) {
                                    Stream<float4> myu_min(rank[2], streamSizeMinOfVecCluster);
                                    Stream<float4> myu_max_of_min(rank[2], streamSizeMaxOfMin);
                                    myufy(i, fuzzy_number, vec_ref, myu);  // from one input to another
                                    minimum_myu_cluster(myu, myu_min);     // parallel function but for one input only
                                    max_of_min_myu(myu_min, myu_max_of_min);
                                }
                            }