kernel void GPU_coadd_kernel54(float4 src[][], int size[], out float4 dest<>)
{
    int threadID = instance().y;
    int i = instance().x;
    float4 o;
    float4 i2;
    float4 i21;
    int ln = (size[threadID] + 3) >> 2;
    if (i >= ln) return; // R: thread unneeded
    i2  = src[threadID][2*i];
    i21 = src[threadID][2*i + 1];
    o.x = i2.x  + i2.y;
    o.y = i2.z  + i2.w;
    o.z = i21.x + i21.y;
    o.w = i21.z + i21.w;
    dest = o;
}

kernel void GPU_coadd_kernel4(float4 src[][], int size[], out float4 dest[][])
{
    int threadID = instance().y;
    int i = 0;
    float4 o;
    float4 i2;
    float4 i21;
    int ln = (size[threadID] + 3) >> 2;
    for (; i < ln; i++) {
        i2  = src[threadID][2*i];
        i21 = src[threadID][2*i + 1];
        o.x = i2.x  + i2.y;
        o.y = i2.z  + i2.w;
        o.z = i21.x + i21.y;
        o.w = i21.z + i21.w;
        dest[threadID][i] = o;
    }
}
Wow... After reading this, I have just learned that we can use gather and scatter at the same time.
Originally posted by: riza.guntur
Wow... After reading this, I have just learned that we can use gather and scatter at the same time.
If the array sizes are very unbalanced, it is normal that the first kernel is slow.
By the way, you can delete the line "if(i >= ln)..." in the first kernel because it slows the kernel down; undefined values will be written either way.
Originally posted by: Ceq
If the array sizes are very unbalanced, it is normal that the first kernel is slow.
By the way, you can delete the line "if(i >= ln)..." in the first kernel because it slows the kernel down; undefined values will be written either way.
Well, if 10% of your rows have more than 400 elements and 90% have fewer than 10 elements, the second kernel does less work because it doesn't process unneeded elements.
About the condition, it depends again on branch coherence and memory bandwidth; in this example a real jump is used instead of predication. I think it is better to test it and see how it performs; if you give it a try, let me know the results.
Note that even if you return before writing any output, the original data could be overwritten.
Originally posted by: Ceq
Well, if 10% of your rows have more than 400 elements and 90% have fewer than 10 elements, the second kernel does less work because it doesn't process unneeded elements.
1) o.xy = inp[tID][i].xz + inp[tID][i].yw;
   o.zw = inp[tID][i+1].xz + inp[tID][i+1].yw;

2) o.xy = inp[i][tID].xz + inp[i][tID].yw;
   o.zw = inp[i+1][tID].xz + inp[i+1][tID].yw;

with kernel void k(float4 inp[][], out float4 o<>);
From the best and only User Guide available, it seems all kernel instances would wait until the longest-running thread is finished; see section 1.2.3 about flow control, example one.
Thinking about which threads to create so as to use the limited resources available, there is one thing we need to know: the domain of execution. From there, I think we can define how many threads to create, since the number of kernel instances maps directly to the number of elements in the output.
In your case, if you can define a better execution domain, fewer kernel instances would be created, there would be no need for the if, and execution would be faster.
But still, somebody needs to explain how to define the execution domain, since all those examples don't say much about it. I still don't understand how to make use of the execution domain in scatter, even after months of analyzing it.
Frankly, doing only 4 additions in a kernel is never going to be blazingly fast if you have to fetch 32 bytes to do those additions. Your kernels are severely bandwidth limited. You have approximately 0.1 bytes of memory bandwidth available per operation on a GPU (slightly below 1 byte per ALU clause), while your kernel needs 8 bytes per operation (32 bytes per thread). As you don't reuse any values, caching won't help here. By the way, an NVIDIA GPU would face the same bandwidth constraints.
Without knowing your algorithm, I would suggest implementing these additions in the kernel that creates the input stream. Just do the calculations for two neighbouring output positions within one thread and combine them at the end. If you have enough threads (best would be >>10k), you may even see a performance improvement in this kernel because of better utilization of the instruction slots (more independent instructions available).
But actually I still don't get why you fiddle around with this partial-reduction kernel so much. With the array sizes you mentioned, the kernel surely completes in less than a millisecond (if the data is in the graphics card's RAM). I hope the creation of the data you want to add takes quite a bit longer, so the addition should not be a severe bottleneck.
Apropos partial reduction, the simplest (but slightly slower) way to do your task would be to create the dest stream with exactly half the size of the source stream in one dimension (the one to be reduced) and then call an appropriate reduction kernel. But such a reduction makes provisions for a lot of unnecessary things, so it would be a bit slower. Taking the memory bandwidth limitation into account, though, I would guess it wouldn't hurt much, even considering it translates to 34 ALU clauses instead of just 8 like the normal kernel below. Remember, we have less than 1 byte per ALU clause available on an HD4870 (HD4850: only 0.5 bytes) and we need to fetch 32 bytes.
reduce void coadd_reduce(float4 src<>, reduce float4 dest<>)
{
    float4 temp = dest;
    dest.xy = temp.xz + temp.yw;
    dest.zw = src.xz + src.yw;
}

kernel void GPU_coadd_kernel(float4 src[][], out float4 dest<>)
{
    float4 i2;
    float4 i21;
    int2 index = instance().xy;
    index.x <<= 1;
    i2  = src[index.y][index.x];
    i21 = src[index.y][index.x + 1];
    dest.xy = i2.xz  + i2.yw;
    dest.zw = i21.xz + i21.yw;
}
Originally posted by: Raistmer
Thanks for the suggestions, I will try them. About the algorithm: it's some kind of folding. The co-addition is performed in a loop many times, so the long data array eventually shrinks to 2-3 elements. That's why it goes from 479 to 3 in one of the dimensions.
That sounds almost like a partial reduction would fit the bill.
Full data path: a ~64k-length array is packed into a small one by addition of elements with some stride (this stride changes from iteration to iteration). This small array is then added to itself with a stride of half the array size (many times too). Additionally, one more addition is performed between elements i and i+1 of the array (the coadd loop), again many times. And for each new array, signal finding is performed; that is, each element of the array is compared with some threshold value. If the power of a bin exceeds the threshold, some data should be retrieved from this array (the power value, the bin number, plus the powers of nearby bins). In the CPU version, where all operations are performed on the same memory location, all data resides in the L2 and partially in the L1 cache, so memory issues are not so devastating...
You can try to block the memory accesses on GPUs too, taking advantage of the higher bandwidth of the texture caches. But generally it sounds like quite a memory-intensive algorithm, which isn't perfect for a GPU. Guess that's the reason a fast quad core isn't that much slower than a GT200 with that stuff.
Originally posted by: Gipsel
You can try to block the memory accesses on GPUs too, taking advantage of the higher bandwidth of the texture caches.
Originally posted by: Gipsel
Remember, we have less than 1 byte per ALU clause available on an HD4870 (HD4850: only 0.5 bytes) and we need to fetch 32 bytes.
kernel void GPU_fetch_array_kernel54ct(float src[], double per[], int per_int[], int n_per[], out float4 dest<>)
{
    // R: here we should form the dest buffer element by element, no arrays inside a kernel allowed.
    // So we return to the initial fetch order that was so suboptimal for the CPU version;
    // hopefully strided memory access will not degrade performance as much as it did on the CPU.
    int j = instance().y;
    int threadID = instance().x;
    int k = 0;
    int l = 0;
    float4 acc = float4(0.f, 0.f, 0.f, 0.f);
    double period = per[threadID];
    int size = (per_int[threadID] + 3) >> 2;
    if (j > size) return; // R: this thread not needed
    for (k = 0; k < n_per[threadID]; k++) {
        l = (int)(period * (double)k + 0.5);
        l += 4 * j; // R: index into the data array computed
        acc.x += src[l];
        acc.y += src[l + 1];
        acc.z += src[l + 2];
        acc.w += src[l + 3];
        /* alternative float4 fetch with offset handling:
        l4 = l >> 2; offset = l - 4*l4;
        if (offset == 0) acc += src[l4];
        else if (offset == 1) { acc.xyz += src[l4].yzw; acc.w  += src[l4+1].x; }
        else if (offset == 2) { acc.xy  += src[l4].zw;  acc.zw += src[l4+1].xy; }
        else if (offset == 3) { acc.x   += src[l4].w;   acc.yzw += src[l4+1].xyz; }
        */
    }
    dest = acc;
}
Originally posted by: Ceq
About the condition, it depends again on branch coherence and memory bandwidth; in this example a real jump is used instead of predication. I think it is better to test it and see how it performs; if you give it a try, let me know the results.