
m_wagner
Journeyman III

Problems with big Matrices

Segfault with matrix-matrix multiplication on matrices larger than 831x831

I already posted this question yesterday, but today I could not find it anymore. So here it is again:

I just want to test the performance of the FireStream 9170 with a simple matrix-matrix multiplication in single precision. So I wrote a really simple kernel like this:

kernel
void matmul_kernel(float a<>, float b<>, out float c<>)
{
    c = a * b;
}

If I run the executable with n x n matrices with n < 832, everything works well. But if I start it with n = 832 or above, I get a segmentation fault.

I read that the size of a 2D stream is limited to 8192x8192, so what is the problem? I compiled it both with and without address translation (-r flag), and the behavior is exactly the same in both cases, except for lower performance with the address-translated code.

I would be thankful for any kind of help.

0 Likes
9 Replies
Ceq
Journeyman III

Well, you don't specify how you allocate memory. It could be related to calling streamWrite on a badly allocated area. If you allocate too much data on the stack, you should change the compilation parameters.
For example, in MSVC:

int function(...) {
float output[4096][4096];
...
}

This would probably abort, so you have to change the stack size in:

project -> properties -> linker -> system -> stack

I hope this helps. If not, what happens if you comment out the line that calls the kernel? Do the Brook+ examples work?
0 Likes

(1) I just allocate 3 matrices with float A;

(2) I am working on 64-bit Linux and compile the files with a GNU Makefile using brcc

(3) The problem remains the same if I comment out the kernel call

(4) The Brook+ examples work fine with n < 4096, that's why I think my program should work with the same matrix sizes

0 Likes
Ceq
Journeyman III

If the Brook+ examples work, I think it is related to the allocation, because they use malloc. So try this:
Change the array declaration from "float A;" to "float *A = (float*)malloc(n * n * sizeof(float));"
If it works this way, the stack was the problem: either keep using malloc or increase the default stack size.
You can also try a tool like MEMWATCH to check that nothing overflows when writing to the data structures.
0 Likes

@ Ceq: You were right. The problem was the memory allocation. Now I allocate the memory with A = allocate_mat_f(); and everything works fine.

But I am still limited to 3072x3072. Why?

3072*3072*sizeof(float) = 36 MB per matrix; times 3 matrices, that is 108 MB. Where is the problem?

0 Likes
Ceq
Journeyman III

It looks like this time the limit could be the amount of memory that can be allocated by "malloc".
Try changing the maximum heap size. In MSVC you can change it in the same place as the stack size. On Linux I'm not sure, but I think it can be done with the "ulimit" command.
Note that if you use floating-point values as indices into huge streams, it is possible to exceed floating-point precision and the results could be wrong, so it's safer to use ints.
0 Likes

I don't think malloc is limited to 3072x3072, because I have used matrices up to 10000x10000 in other programs running on the CPU.

Thanks for your answers.

0 Likes
Ceq
Journeyman III

You are right that 3072x3072 shouldn't be a problem.
Where exactly does your program abort? (It's quite easy to check with a debugger and some checkpoints)

- Kernel call / StreamRead -> CAL / driver issue
- Array initialization / StreamWrite -> memory allocation bug
- Function return -> stack corruption

0 Likes

OK, I tested where the program aborts. It aborts between the first and the second streamRead at n = 3584. That's crazy.

So I put a streamWrite after the first streamRead to wait until the first streamRead has finished. This way the program works with n = 4096, too.

At n ~ 5000 the program aborts before the first streamRead.

n = 8192 would be great, but 4096 is OK. This way I can get a realistic measurement.

0 Likes

In order to perform a streamRead, BRT will allocate some GPU remote (i.e. CPU local) memory, copy the data there, and then run a kernel to copy it to the actual surface (asynchronously).

The problem is likely when the BRT allocates this GPU remote memory. I believe there may be unseen (unreported?) limits. Try using the environment variable USE_NONCACHEABLE and set it to 1.

I've met this problem myself in using CAL directly...

 

0 Likes