
rexiaoyu
Journeyman III

Problem with simple matrix addition

I declare two 16x16 matrices stored as 1D arrays and add them. In the kernel, each thread processes N elements (N = 1, 2, 4, ...). When N is less than 16 it works fine, but when N reaches 16 some kind of runtime error happens. I cannot figure it out.

The code is below (main.cpp and locate.br):

main.cpp:

#include <stdio.h>
#include <stdlib.h>
#include "brookgenfiles/locate.h"

using namespace brook;

#define SIZE 16
#define SIZE2 256

// Print a len x len matrix stored row-major in a 1D array.
void printMatrix(int len, float m[])
{
    int i, j;
    for (i = 0; i < len; i++)
    {
        for (j = 0; j < len; j++)
        {
            printf("%f, ", m[i * len + j]);
        }
        printf("\n");
    }
}

int main()
{
    //array a, b and c
    float a[SIZE2];
    float b[SIZE2];
    float c[SIZE2];

    int i;
    for (i = 0; i < SIZE2; i++)
    {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    unsigned int msize = SIZE2;
    Stream<float> sa(1, &msize);
    Stream<float> sb(1, &msize);
    Stream<float> sc(1, &msize);

    sa.read(a);
    sb.read(b);

    // Launches SIZE2 (256) threads.
    uint4 domainSize = uint4(SIZE2, 1, 1, 1);
    blockAdd.domainSize(domainSize);
    blockAdd(sa, sb, sc);

    sc.write(c);
    if (sc.error())
    {
        printf("Error occurred! %s\n", sc.errorLog());
        return 1;
    }

    printMatrix(SIZE, c);
    getchar();
    return 0;
}



locate.br:

Attribute[GroupSize(64, 1, 1)]
kernel void
blockAdd(float a[], float b[], out float c[])
{
    int tid = instance().x;

    //every thread processes len elements, len = 1, 2, 4, 6, 8, 16, ....
    //when len = 16, the error occurs
    int len = 16;
    int start = tid * len;
    int i;
    int index;

    for (i = 0; i < len; i++)
    {
        c[start + i] = a[start + i] + b[start + i];
    }
}



0 Likes
7 Replies
gaurav_garg
Adept I

What runtime error do you see? Is it a crash? If so, where does it crash?

0 Likes

The holiday is over and I am back.

The result seems weird. Sometimes I get the correct answer, with matrix C full of 3s (only when len doesn't exceed 16); sometimes it reports a memory error, and now the answer has become an array of random numbers.

Is there something wrong with my algorithm in the kernel?

0 Likes

It seems the indices you are using are out of range. You are running 256 threads, and each of a, b, and c contains only 256 elements.

Also, I would suggest using domainOffset and domainSize together. The Brook+ runtime can ignore the domain-of-execution hint if domainOffset is not specified. Also, check your results without the Attribute qualifier in the kernel.

0 Likes

Thanks for your suggestion.

Does that mean that, to avoid the out-of-range problem, I cannot write more than one element per thread in the kernel?

But I know the cal_idct sample provided with the SDK writes more than one element per thread in its IL kernel. Here is part of the code:

// save 8x8 DCT coefficient block location
"ishl r16.x, vaTid.x, l8.w\n"

// load packed 8x8 DCT coefficients using texture cache
"mov  r0, g[r16.x+0]\n"
"mov  r2, g[r16.x+1]\n"
"mov  r4, g[r16.x+2]\n"
"mov  r6, g[r16.x+3]\n"
"mov  r8, g[r16.x+4]\n"
"mov r10, g[r16.x+5]\n"
"mov r12, g[r16.x+6]\n"
"mov r14, g[r16.x+7]\n"

//DO IDCT
......

// save DCT values
"mov g[r16.x+0], r0\n"
"mov g[r16.x+1], r2\n"
"mov g[r16.x+2], r4\n"
"mov g[r16.x+3], r6\n"
"mov g[r16.x+4], r8\n"
"mov g[r16.x+5], r10\n"
"mov g[r16.x+6], r12\n"
"mov g[r16.x+7], r14\n"

 



In the code above, it first gets the absolute thread id and maps it to an 8x8 block, which is processed later. Finally, it writes these elements back. It works fine, so I wonder whether I can do the same thing in Brook+.



0 Likes
gaurav_garg
Adept I

In your kernel, instance().x returns values from 0 to 255, so writing 16 elements in each thread means accessing memory elements 0 to 4095. But only 256 elements are allocated.
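
The arithmetic in the reply above can be checked with a few lines of plain C++. This is only a sketch of the index math, not Brook+ code; the helper names highestIndexTouched and launchFitsBuffer are made up for illustration.

```cpp
#include <cstddef>

// Highest index touched when `threads` threads each handle `len`
// consecutive elements starting at tid * len (tid = 0 .. threads - 1).
std::size_t highestIndexTouched(std::size_t threads, std::size_t len) {
    return (threads - 1) * len + (len - 1);
}

// True when every access stays inside a buffer of `elements` entries.
bool launchFitsBuffer(std::size_t threads, std::size_t len,
                      std::size_t elements) {
    return threads * len <= elements;
}
```

With 256 threads and len = 16, highestIndexTouched gives 4095, and launchFitsBuffer(256, 16, 256) is false. By the same math, 256 threads with len = 2 already reach index 511, so smaller values of len are also out of range even if no error shows up; 16 threads with len = 16 fit exactly.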

0 Likes

Can you explain why the cal_idct kernel works?

0 Likes

I want to do the matrix addition in a compute shader, with each thread adding multiple elements. But my code above doesn't work as I expect. How should I write the code?
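
One way to picture a launch where each thread adds several elements is to simulate it on the CPU. The sketch below is plain C++, not Brook+; blockAddThread and blockAdd are hypothetical stand-ins for the kernel body and the launch, and it assumes a and b have the same size and len is greater than zero. The idea is to start only ceil(n / len) threads so that the threads tile the buffer exactly once.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the kernel body: one call plays the role of one thread.
// tid is the thread index, len the number of elements per thread.
void blockAddThread(std::size_t tid, std::size_t len,
                    const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c) {
    const std::size_t start = tid * len;
    for (std::size_t i = 0; i < len; ++i) {
        const std::size_t index = start + i;
        // Guard the tail in case len does not divide the buffer size.
        if (index < c.size()) {
            c[index] = a[index] + b[index];
        }
    }
}

// Stand-in for the launch: ceil(n / len) threads instead of n threads.
std::vector<float> blockAdd(const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::size_t len) {
    std::vector<float> c(a.size(), 0.0f);
    const std::size_t threads = (a.size() + len - 1) / len;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        blockAddThread(tid, len, a, b, c);
    }
    return c;
}
```

For the 256-element arrays in this thread and len = 16, this runs 16 threads. In terms of the original main.cpp, that would presumably correspond to a domain size of SIZE2 / len rather than SIZE2, though that has not been checked against the Brook+ runtime here.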

0 Likes