Raistmer
Adept II

Kernel can't add 4 numbers, please help!

0 Likes
15 Replies
gaurav_garg
Adept I

Could you post it in a more readable format? I had a hard time reading it. Maybe you can mail it to the e-mail address mentioned in my profile.

0 Likes

Originally posted by: gaurav.garg

Could you post it in a more readable format? I had a hard time reading it. Maybe you can mail it to the e-mail address mentioned in my profile.


Thanks for the offer, will do right now!

[
About posting in a more readable format: I edited the post many times, fighting with [ i ] being turned into italics and with "i <" being turned into I don't even know what; it just eats the rest of the line....
If AMD representatives think this forum engine is just right for developers, I understand why AMD still has no compiler of its own and no more or less decent performance libraries....
]
0 Likes

What is your system configuration? I have recently seen some issues with scatter on Vista.

 

0 Likes

Originally posted by: gaurav.garg What is your system configuration? I have recently seen some issues with scatter on Vista.

 

Vista x86 SP1, Business Edition.

Catalyst 9.2 (because newer versions can't handle big streams).

Radeon HD4870 GPU.

 

0 Likes

This is a standalone sample that produces the same error:

1+1=0 ?? (On the CAL backend; the CPU backend computes correctly.)

// Needs the Brook+ runtime headers and the generated header that declares
// GPU_coadd_kernel3 (neither is shown in this post).
#include <cstdio>

int main(){
    unsigned int buf_size[2];
    unsigned int thread_num_coadd=3;
    buf_size[0]=4;
    buf_size[1]=thread_num_coadd;
    brook::Stream<float>* gpu_temp_coadd_old=NULL;
    brook::Stream<float>* gpu_temp_coadd=new brook::Stream<float>(2,buf_size);
    buf_size[0]=2;

    // Fill the 3x4 source array with ones and upload it.
    float cpu_temp[3][4];
    for(int i=0;i<thread_num_coadd;i++)
        for(int j=0;j<4;j++)
            cpu_temp[i][j]=1.0f;
    gpu_temp_coadd->read(cpu_temp);

    int temp_coadd_working_length[]={2,2,2};
    brook::Stream<int> *gpu_temp_coadd_working_length=NULL;
#if 1
    fprintf(stderr,"buf_size(coadd loop) is (%u,%u)\n",buf_size[0],buf_size[1]);
#endif
    {
        // The previous output becomes the input; allocate a new 2x3 output.
        if(gpu_temp_coadd_old)delete gpu_temp_coadd_old;
        gpu_temp_coadd_old=gpu_temp_coadd;
        gpu_temp_coadd=new brook::Stream<float>(2,buf_size);

        if(gpu_temp_coadd_working_length) delete gpu_temp_coadd_working_length;
        gpu_temp_coadd_working_length=new brook::Stream<int>(1,&thread_num_coadd);
        gpu_temp_coadd_working_length->read(temp_coadd_working_length);
        GPU_coadd_kernel3(*gpu_temp_coadd_old,*gpu_temp_coadd_working_length,*gpu_temp_coadd);
#if 1
        gpu_temp_coadd->finish();
#endif
        if(gpu_temp_coadd->error())
            fprintf(stderr,"ERROR: GPU_coadd_kernel3(coadd loop): %s\n",gpu_temp_coadd->errorLog());
#if 1
        if(true){
            float t1[4096];
            float t2[4096];
            float ta[3*4096];
            fprintf(stderr,"ARRAYS just after coadd:\n");

            // Read back only the last row (y from 2 to 3) of both streams.
            unsigned int begin[]={0,2};
            unsigned int end[]={2,3};
            unsigned int end_old[]={2*2,3};
            brook::Stream<float>& g1=gpu_temp_coadd_old->domain(begin, end_old);
            g1.write(t1);
            if(g1.error())fprintf(stderr,"ERROR: g1:%s\n",g1.errorLog());
            brook::Stream<float>& g2=gpu_temp_coadd->domain(begin, end);
            g2.write(t2);
            if(g2.error())fprintf(stderr,"ERROR: g2:%s\n",g2.errorLog());
            g2.write(ta);
            if(g2.error())fprintf(stderr,"ERROR: g2->ta:%s\n",g2.errorLog());

            for(int i=0;i<2;i++){
                fprintf(stderr,"Old[%d]=%.9g,old[%d]=%.9g,new[%d]=%.9g\n",2*i,t1[2*i],2*i+1,t1[2*i+1],i,t2[i]);
            }
            for(int i=0;i<2;i++){
                fprintf(stderr,"Old[%d]=%.9g,old[%d]=%.9g,new[%d]=%.9g\n",2*i,t1[2*i],2*i+1,t1[2*i+1],i,t2[i]);
            }
        }
#endif
    }
    //R: coadd block end
}

---------------

0 Likes

Raistmer,
Try using something like pastebin (http://www.pastebin.com) to paste your code and provide a link. It allows for much easier reading than pasting code into the forum directly.
0 Likes

Originally posted by: MicahVillmow

Raistmer,

Try using something like pastebin (http://www.pastebin.com) to paste your code and provide a link. It allows for much easier reading than pasting code into the forum directly.


OK, I will, because I need help with my own problem with the ATI Stream SDK (for now it looks like a fresh bug under Vista).
But the natural extension of such advice will be "try to use other boards, and then try to use products of another vendor"... Unfortunately, I have already bought 2 Radeons; I will think twice next time....
0 Likes

Link to a standalone test case that shows the same problem (CAL backend, Vista; no problems on the CPU backend, Win2003x64).
1+1=0 by CAL version 😉

http://pastebin.com/meaaf6ed

0 Likes

Possible workaround:
(see the comments at the size variable)
http://pastebin.com/m4b983c48
0 Likes

For the case when size is two, it seems that you are writing only to the first two rows of the output, while in host code you are reading back only the last row, which stays uninitialized. That's why you see zeros.

Some basics on Brook+ kernels, not sure if you know them already -

instance().x gives the column number, which ranges from 0 to size-1.

dest[threadID][ i ] means you are writing to row threadID and column i of dst. That would mean that you are writing the sub-matrix from (0,0) to (1,1) of dst.

In host code, you are reading from the last row of both the src and dst streams. As you can guess, the last row of the dst stream was not updated inside the kernel.
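
To make the explanation above concrete, here is a minimal CPU-only sketch in plain C++ (no Brook+ calls). The real kernel is only available through the pastebin link, so the body of `coadd_invocation` is an assumption reconstructed from this thread: it adds adjacent pairs of one source row into one destination row. `Grid` and `run_with_x_as_thread_id` are hypothetical names, and zero stands in for "uninitialized".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 2D stream stand-in with `width` columns and `height` rows, stored row-major.
struct Grid {
    std::size_t width, height;
    std::vector<float> data;
    Grid(std::size_t w, std::size_t h, float fill)
        : width(w), height(h), data(w * h, fill) {}
    float& at(std::size_t row, std::size_t col) {
        assert(row < height && col < width);
        return data[row * width + col];
    }
};

// Assumed kernel body (not taken from the thread): one invocation adds
// adjacent pairs of src row `threadID` into dst row `threadID`.
void coadd_invocation(Grid& src, Grid& dst, std::size_t threadID, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i)
        dst.at(threadID, i) = src.at(threadID, 2 * i) + src.at(threadID, 2 * i + 1);
}

// One invocation per element of dst, with threadID taken from the x
// (column) position, so threadID only ever takes the values 0 .. width-1.
Grid run_with_x_as_thread_id() {
    Grid src(4, 3, 1.0f);  // 3 rows of 4 ones, like cpu_temp[3][4]
    Grid dst(2, 3, 0.0f);  // 3 rows of 2 values; 0 stands in for "uninitialized"
    for (std::size_t y = 0; y < dst.height; ++y)
        for (std::size_t x = 0; x < dst.width; ++x)
            coadd_invocation(src, dst, /*threadID=*/x, /*len=*/2);
    return dst;
}
```

Under these assumptions rows 0 and 1 of `dst` end up holding 2.0, while row 2, the row the host code reads back, is never touched.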

0 Likes

Originally posted by: gaurav.garg

Some basics on Brook+ kernels, not sure if you know them already -


instance().x gives the column number, which ranges from 0 to size-1.


The column <-> row relation seems reversed in kernel code relative to host code.
I use a 1D stream as an ordinary stream that will define the domain of execution, right?
It should have only the x dimension greater than 1; the y dimension should be 1, correct?
Are you suggesting that if I use instance().y I will receive the correct result in my case?


It leads to a big question:
What defines the domain of execution for a kernel like this?
kernel a(float b[][],int c<>,out float d[][]);
I thought the size of the c stream would define how many invocations of this kernel are run.
Should I use the dimensions of stream d instead to determine how many kernel invocations will be launched?
0 Likes

No, it is the first output stream that defines the domain of execution. So, in your case it is size * 3.
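
As a rough illustration of the "size * 3" figure, the following sketch just multiplies out stream dimensions on the CPU. It assumes that the number of invocations equals the element count of the stream that defines the domain; `invocation_count` is a hypothetical helper, not a Brook+ function.

```cpp
#include <cassert>
#include <cstddef>

// Element count of a stream with the given dimensions, taken here as the
// number of kernel invocations when that stream defines the domain.
std::size_t invocation_count(const unsigned int* dims, std::size_t rank) {
    assert(dims != nullptr && rank > 0);
    std::size_t n = 1;
    for (std::size_t i = 0; i < rank; ++i)
        n *= dims[i];
    return n;
}
```

With the output stream from the test case, dimensions {2, 3}, this gives 6 invocations, whereas the 1D working-length stream of size 3 would give only 3.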

The column-row relationship is actually the same everywhere. The width (column number) is the first index in instance(), in the domain operator and in the stream dimension array. You just need to take care with stream indexing, which follows C-style indexing (row first, then column).
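
The two orderings can be put side by side in a small plain C++ sketch. `Dims`, `linear_index` and `linear_index_xy` are hypothetical names for illustration only; the sketch assumes row-major storage, as with a C array declared `float a[height][width]`.

```cpp
#include <cassert>
#include <cstddef>

// Stream dimensions in the order used by the host code: {width, height}.
struct Dims { std::size_t width, height; };

// C-style element position: the row index comes first, then the column.
std::size_t linear_index(Dims d, std::size_t row, std::size_t col) {
    assert(row < d.height && col < d.width);
    return row * d.width + col;
}

// The same element addressed in instance() order: x is the column, y is the row.
std::size_t linear_index_xy(Dims d, std::size_t x, std::size_t y) {
    return linear_index(d, /*row=*/y, /*col=*/x);
}
```

For a stream created with dimensions {4, 3}, the element at x = 1, y = 2 is the one a C array would address as `a[2][1]`.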

0 Likes

Thanks a lot!
It means I called the kernel size times more often than needed. That could explain the bad performance, at least in part.
I don't need to call the kernel for each element of the dest stream; that is, a domain of execution is required in my case...
0 Likes

Yes, you need to use a domain of execution. Regarding performance, I guess you would still see bad performance with a 2D non-128-bit scatter stream.

You need to change your kernel to use a 128-bit 1D scatter stream with size < 8192 to get better performance.
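
A quick way to check whether existing data would fit that layout is sketched below in plain C++. It assumes that "128-bit" means four 32-bit floats packed into one stream element and that the "size < 8192" limit applies to the number of such elements; `float4_count` and `fits_1d_float4` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>

// Limit quoted in the thread for a 1D scatter stream ("size < 8192").
const std::size_t kMax1DSize = 8192;

// Number of 128-bit (four-float) elements needed to hold n_floats values,
// rounding up when n_floats is not a multiple of four.
std::size_t float4_count(std::size_t n_floats) {
    assert(n_floats > 0);
    return (n_floats + 3) / 4;
}

// True when the packed data stays below the quoted 1D size limit.
bool fits_1d_float4(std::size_t n_floats) {
    return float4_count(n_floats) < kMax1DSize;
}
```

The six output floats of the test case would pack into two such elements, far below the limit.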

0 Likes