Archives Discussions

Raistmer · ‎07-20-2009

kernels listed - some comments, please

In attempts to speed up currently too slow my Brook+ code I trying different variants of kernels that do same tasks.

There was comment in another thread that performance degradation is big when scatter stream with more than 8192 elements is used.
So I rewrited kernel to avoid usage of scatter array at all.
Unfortunately, it seems this move only increased app CPU time (and elapsed time too).

What is wrong?

kernel void GPU_coadd_kernel54(float4 src[][],int size[],out float4 dest<>) { int threadID=instance().y; int i=instance().x; float4 o; float4 i2; float4 i21; int ln=(size[threadID]+3)>>2; if(i>=ln) return;//R:thread unneeded //for(;i<ln;i++){ i2=src[threadID][2*i]; i21=src[threadID][2*i+1]; o.x=i2.x+i2.y; o.y=i2.z+i2.w; o.z=i21.x+i21.y; o.w=i21.z+i21.w; dest=o; //} } kernel void GPU_coadd_kernel4(float4 src[][],int size[],out float4 dest[][]) { int threadID=instance().y; int i=0; float4 o; float4 i2; float4 i21; int ln=(size[threadID]+3)>>2; for(;i<ln;i++){ i2=src[threadID][2*i]; i21=src[threadID][2*i+1]; o.x=i2.x+i2.y; o.y=i2.z+i2.w; o.z=i21.x+i21.y; o.w=i21.z+i21.w; dest[threadID]=o; } }

riza_guntur · ‎07-21-2009

Wow... After reading this, I just know that we can use gather and scatter at the same time

Raistmer · ‎07-22-2009

Originally posted by: riza.guntur

Wow... After reading this, I just know that we can use gather and scatter at the same time

Yes, we can... but performance of both these kernels still worse than simple CPU loop. I was advised in another thread not to use scatter if possible. So I was pretty disappointed when kernel w/o scatter took even more time. And great lack of understanding the reasons of such event here (lack of guides and samples only add to frustration)...
Still no advantages from using GPU... Maybe I'm missing something apparent to GPU-oriented peoples? Or Brook+ not intended for high performance computations at all?....
Accordingly profilers and my own measurements that coadd loop takes most time in function I trying to port to ATI GPU. So, it's very important to get all performance benefits here that possible...

Ceq · ‎07-22-2009

If array size is very unbalanced is normal that the first kernel is slow.

By the way, you can delete the line "if(i >= ln)..." in the first kernel because is slowing down the kernel, undefined values will be written either way.

Raistmer · ‎07-22-2009

Thank you for comments!

Originally posted by: Ceq

If array size is very unbalanced is normal that the first kernel is slow.

What you mean? Possible matrix sizes are from 4096x479 to 4096x3.
I can change first number cause it defines how many separate iterations will be performed simultaneously, but can't touch second number.

By the way, you can delete the line "if(i >= ln)..." in the first kernel because is slowing down the kernel, undefined values will be written either way.

Again, could you give more info on this topic, please?
In CUDA it's method to end unneeded thread faster and to free computational resources for next warp. Here threads executed by wavefronts (analog of warp for CUDA, right?) so I thought if all threads in wavefront will have early exit then all wavefront will free computational resources faster for next one.
Data matrix size big enough so there are certainly more threads than 1 or 2 wavefronts can accomodate.
Does that "if" still unneeded in such case? (And why - please, explain if possible...)

Ceq · ‎07-22-2009

Well, if 10% of your rows have more than 400 elements and 90% have less than 10 elements the second kernel does less work because it doesn't process unneeded elements.

About the condition it depends again on the branch coherence and memory bandwidth, in this example a real jump is used instead of predication. I think is better to test it and see how it performs, if you give it a try let me know the results.

Note that even if you return before writing any output the original data could be overwritten.

Raistmer · ‎07-22-2009

Originally posted by: Ceq

Well, if 10% of your rows have more than 400 elements and 90% have less than 10 elements the second kernel does less work because it doesn't process unneeded elements.

I feel unsure about row and column terms in this context.
If I change dimensions from 2048x3 to 3x2048 could this improve performance?
And ( I asked in another thread but maybe it's better to consolidate discussion here) what will be faster: (code attached)

I will benchmark variant w/o if and post here.

1) o.xy=inp[tID].xz+inp[tID].yw; o.zw=inp[tID][i+1].xz+inp[tID][i+1].yw; 2) o.xy=inp[tID].xz+inp[tID].yw; o.zw=inp[i+1][tID].xz+inp[i+1][tID].yw; with kernel void k(float4 inp[][], out float4 o<>);

riza_guntur · ‎07-22-2009

From the best and only User Guide available, it seems all kernel would wait until the longest running thread is finish, see 1.2.3 about flow control in example one.

Thinking about which thread to be created to use limited resources available, there is one thing we need to know, that is about domain of execution. From there I think we could define how much thread to be made since the number of kernel maps directly to the number of elements in the output.

In your case if you can define better execution domain, there would be less kernel created, no need for if, and faster execution.

But still, there is somebody needs to explain how to define execution domain since all those examples don't talk much about this. I still don't understand how to make use of execution domain in scatter even after months analyzing it.

Gipsel · ‎07-23-2009

Frankly, doing only 4 additions in a kernel is never going to be blazingly fast if you have to fetch 32Bytes for doing these additions. Your kernels are severly bandwidth limited. You have approximately 0.1 Byte available memory bandwidth per operation on a GPU (slightly below 1 Byte per ALU clause), for your kernel you need 8 Bytes per operation (32Bytes per thread). As you don't reuse any values, caching won't help here. By the way, a nvidia GPU would face the same bandwidth constraints.

Without knowing your algorithm, I would suggest to implement these additions into the kernel creating the input stream. Just do the calculations for two currently neighbouring output positions within one thread and combine them in the end. If you have enough threads (best would be >>10k) you may even see a performance improvement in this kernel because of better utilization of the instruction slots (more independent instructions available).

But actually I still don't get why you fiddle around with this some kind of a partial reduction kernel that much. With the array sizes you mentioned, the kernel surely completes in less than a millisecond (if the data is in the graphics card RAM). I hope the creation of the data you want to add takes quite a bit longer, so that addition should not be a severe bottleneck

Apropos partial reduction, the simplest (but slightly slower) possibility to do your task would the creation of the dest stream with exactly half the size in one dimension (the one that should be reduced) as the source stream and then calling an appropriate reduction kernel. But such a reduction takes provisions for a lot of unnecessary things. So it would be a bit slower. But taking the memory bandwidth limitation into account I would guess it wouldn't hurt much, even considering it is translated to 34 ALU clauses instead of just 8 like the normal kernel below. Remember , we have less than 1 Byte per ALU clause on a HD4870 (HD4850 only 0.5 Bytes) available and we need to fetch 32 Bytes.

reduce void coadd_reduce(float4 src<>, reduce float4 dest<>) { float4 temp=dest; dest.xy = temp.xz + temp.yw; dest.zw = src.xz + src.yw; } kernel void GPU_coadd_kernel(float4 src[][], out float4 dest<>) { float4 i2; float4 i21; int2 index=instance().xy; index.x <<= 1; i2=src[index.y][index.x]; i21=src[index.y][index.x+1]; dest.xy = i2.xz + i2.yw; dest.zw = i21.xz + i21.yw; }

Raistmer · ‎07-30-2009

Thanks for suggestions, will try

About algorithm - it's some kind of folding.
Co-addition performed in loop many times so long data array eventually becomes 2-3 elements in size.
That's why from 479 to 3 in one of dimentions.

Full data path:
~64kb-length array packed into small one by addition of elements with some stride (this stride changes from iteration to iteration). This small array then will be added with itself with stride of half array size (many times too). Additionally one more addition performed between i and i+1 elements of array (coadd loop), many times too.
And for each new array signal finding performed, that is, each element of array compared with some threshold value. If power of bin exceeds threshold, some data should be retrieved from this array (power value, bin number + powers of near bins too).
In CPU version, where all operations are performed on the same memory location, all data reside in L2 and partially in L1 caches, so memory issues not so devastating...

Gipsel · ‎07-30-2009

Originally posted by: Raistmer Thanks for suggestions, will try About algorithm - it's some kind of folding. Co-addition performed in loop many times so long data array eventually becomes 2-3 elements in size. That's why from 479 to 3 in one of dimentions.

That sounds almost like a partial reduction would fit the bill.

Full data path: ~64kb-length array packed into small one by addition of elements with some stride (this stride changes from iteration to iteration). This small array then will be added with itself with stride of half array size (many times too). Additionally one more addition performed between i and i+1 elements of array (coadd loop), many times too. And for each new array signal finding performed, that is, each element of array compared with some threshold value. If power of bin exceeds threshold, some data should be retrieved from this array (power value, bin number + powers of near bins too). In CPU version, where all operations are performed on the same memory location, all data reside in L2 and partially in L1 caches, so memory issues not so devastating...

You can try to block the memory accesses also on GPUs taking advantage of the higher bandwidth of the texture caches. But generally it sound like a quite memory intensive algorithm which isn't perfect for a GPU. Guess that's the reason a fast quad core isn't that much slower than a GT200 with that stuff

Raistmer · ‎07-31-2009

Originally posted by: Gipsel

You can try to block the memory accesses also on GPUs taking advantage of the higher bandwidth of the texture caches.

Is it possible via Brook+ or IL level required?

Raistmer · ‎08-10-2009

Originally posted by: Gipsel
Remember , we have less than 1 Byte per ALU clause on a HD4870 (HD4850 only 0.5 Bytes) available and we need to fetch 32 Bytes.

Ok, now I try to do few coadd levels in one kernel call keeping coadded values in registers. So, value fetched from memory only once and used for few additions instead of just one.
But such kernels (aspecially if float4 type used) became very big and ugly with increase of initial size of array.
Absent of local arrays is critical here (I need to give separate name for each register holding result....)

Raistmer · ‎08-10-2009

2 Ceq:
EDITED
After few test repetitions it seems version w/o early thread termination slightly faster. So I remove "if" condition.

kernel void GPU_fetch_array_kernel54ct(float src[],double per[],int per_int[],int n_per[],out float4 dest<>){ //R: here we should form dest buffer element one by one, no arrays inside kernel allowed //So we should return back to initial fetch order that was so unoptimal for CPU version //will hope here access to memory with stride will not degrade performance so much as it was for CPU version int j=instance().y; int threadID=instance().x; int k=0; int l=0; float4 acc=float4(0.f,0.f,0.f,0.f); double period=per[threadID]; int size=(per_int[threadID]+3)>>2; if(j>size)return;//R:this thread not needed //for(;j<per_int[threadID];j++){ for(k=0;k<n_per[threadID];k++){ l=(int)(period*(double)k + 0.5); l+=(4*j);//R: index to data array computed //R:turn it into offset in float4 array //l4=l>>2; //offset=l-4*l4; acc.x=src; acc.y=src[l+1]; acc.z=src[l+2]; acc.w=src[l+3]; /* if(offset==0) acc+=src[l4]; else if(offset==1){ acc.xyz+=src[l4].yzw; acc.w+=src[l4+1].x; }else if(offset==2){ acc.xy+=src[l4].zw; acc.zw+=src[l4+1].xy; }else if(offset==3){ acc.x+=src[l4].w; acc.yzw+=src[l4+1].xyz; } */ } dest=acc; //} }

Raistmer · ‎08-01-2009

Originally posted by: Ceq
About the condition it depends again on the branch coherence and memory bandwidth, in this example a real jump is used instead of predication. I think is better to test it and see how it performs, if you give it a try let me know the results.

Difference in speed almost unnoticeable for whole program, but minimal time for kernel execution different.
if presents:
T_FFA_coadd: total=2.48e+012, N=36744, <>=67628609 (6.76e+007), min=8287096 (8.29e+006)
w/o if:
T_FFA_coadd: total=2.50e+012, N=36744, <>=67999251 (6.80e+007), min=5697864 (5.70e+006)

That is, version w/o if looks little faster indeed but maybe it's just variation in processing time (total and mean value differ just in opposite way)

Archives Discussions

Why kernel w/o scatter stream slower than kernel based on scatter stream?