Hi,
While trying to implement a particular algorithm we are facing a number of problems.
1. Most important is that local arrays are not supported. In my opinion this is also not well documented in the Stream Computing User Guide - it mentions local shared arrays in A.6.3.1, but the compiler just tells me it's not implemented.
2. When manually unrolling the kernel we faced another problem. The kernel just didn't seem to return anything. The output stream always contained the results of the last successful run.
We managed to reduce the kernel to this:
kernel void dame(unsigned int input<>, out unsigned int output<>, out unsigned int debug<>)
{
unsigned int cnt = 0u;
debug = input;
do {
    do {
        do {
            do {
                do {
                    cnt = cnt + 1u;
                } while(0);
            } while(0);
        } while(0);
    } while(0);
} while(0);
output = cnt;
}
For example, running it straight after boot:
$> ./dame 123
0: 4294967295 dbg: 4294967167
1: 4294967295 dbg: 4294967295
2: 4294967215 dbg: 4294967295
3: 4294967295 dbg: 4294967295
4: 4294770688 dbg: 4294967295
5: 4294377472 dbg: 4261412863
6: 4294901760 dbg: 4294967295
7: 4294901764 dbg: 4294967295
$> BRT_RUNTIME=cpu ./dame 123
0: 1 dbg: 123
1: 1 dbg: 123
2: 1 dbg: 123
3: 1 dbg: 123
4: 1 dbg: 123
5: 1 dbg: 123
6: 1 dbg: 123
7: 1 dbg: 123
Just for reference, the C++ host program (the forum editor ate the angle-bracket parts; reconstructed below):
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include "dame.h"   // brcc-generated kernel declaration (header name assumed)
using namespace std;
#define N 8
int main(int argc, char *argv[]) {
    unsigned int Dim1[] = {N};
    int i;
    brook::Stream<unsigned int> input(1, Dim1);
    brook::Stream<unsigned int> output(1, Dim1);
    brook::Stream<unsigned int> debug(1, Dim1);
    uint32_t input_cpu[N];
    uint32_t output_cpu[N];
    uint32_t debug_cpu[N];
    for (i = 0; i < N; i++) input_cpu[i] = atoi(argv[1]);
    input.read(input_cpu);
    dame(input, output, debug);
    output.write(output_cpu);
    debug.write(debug_cpu);
    for (i = 0; i < N; i++)
        cout << i << ": " << output_cpu[i] << " dbg: " << debug_cpu[i] << endl;
    return 0;
}
Some other notes:
3. The brcc compiler is extremely picky about types. Each integer constant used with unsigned ints must be explicitly marked as unsigned, e.g. 0u. IMHO this makes it harder than it should be to simply copy/paste C code.
4. Will there be support for 64 bit integer types?
5. Will we ever see the Stream KernelAnalyzer for Linux?
Best Regards
Thomas
P.S. atistream-1.4.0_beta-lnx64 // ati-driver-installer-9-6-x86.x86_64 // single FireStream 9170 // SLES10
P.S. I had to edit my post about ten times to get it right because there is no code tag and the WYSIWYG editor gets extremely confused by non-closing [ i ] tags. I hope it is correct now.
Try to check error and errorLog on your output streams. It seems that CAL is not able to compile the kernel with so many nested loops. Removing one pair of do-while makes it work.
Originally posted by: gaurav.garg Try to check error and errorLog on your output streams. It seems that CAL is not able to compile the kernel with so many nested loops. Removing one pair of do-while makes it work.
Yes, it works with only 4 nested loops. But if one looks at the IL documentation, there shouldn't be a limit on nesting loops; it is explicitly stated for the whileloop instruction. So it appears to be a bug, doesn't it?
Originally posted by: gaurav.garg Try to check error and errorLog on your output streams. It seems that CAL is not able to compile the kernel with so many nested loops. Removing one pair of do-while makes it work.
Sorry for not doing proper error checking. Indeed I get the following error:
Kernel Execution : Failed to create Program
I had already noticed that it works with one less while loop. This is indeed a minimal example; in reality we need around 26 while loops.
- Thomas
Originally posted by: ThomasI I did already notice that it works with one less while loop. This is indeed a minimal example, in reality we need around 26 while loops.
What the hell are you doing with 26 nested loops? How long is it supposed to run?
Would it be a possible solution to do the outer loops in host code and simply call the kernel (with 4 nested loops) within those 22 outer loops?
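A minimal sketch of that split in plain C++, with a stand-in function for the kernel (all names hypothetical, and the trivial loop body mirrors the repro kernel above):

```cpp
// Stand-in for the Brook+ kernel restricted to the 4 loop levels CAL can
// currently compile; the body (a trivial increment) mirrors the repro kernel.
unsigned int inner_kernel(unsigned int state) {
    unsigned int cnt = state;
    for (int level = 0; level < 4; ++level)  // the 4 innermost loop levels
        cnt = cnt + 1u;
    return cnt;
}

// The remaining 22 loop levels run on the host; each host iteration would
// launch the kernel with the current outer state instead of nesting deeper.
unsigned int run_with_host_loops() {
    unsigned int state = 0u;
    for (int outer = 0; outer < 22; ++outer)
        state = inner_kernel(state);
    return state;  // 22 launches x 4 inner increments = 88
}
```

The price of this workaround is one kernel launch per outer iteration, so it only pays off if the inner work dominates the launch overhead.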
Originally posted by: Gipsel Originally posted by: ThomasI I had already noticed that it works with one less while loop. This is indeed a minimal example; in reality we need around 26 while loops. What the hell are you doing with 26 nested loops? How long is it supposed to run? Would it be a possible solution to do the outer loops in host code and simply call the kernel (with 4 nested loops) within those 22 outer loops?
First of all I would prefer discussing the actual problem before going to workarounds.
But anyway: The loops result from manually "unrolling" a depth-first search algorithm. This was done because of the missing local arrays, so if there were a feasible way to replicate local arrays this could help.
The code is supposed to run for a few hours.
Originally posted by: ThomasI
First of all I would prefer discussing the actual problem before going to workarounds.
I think there is not too much to discuss. It is a bug in the CAL compiler. The number of nested loops should be unlimited, but it simply fails to compile the kernel. One has to wait for a driver release that may fix it.
Originally posted by: ThomasI But anyway: The loops result from manually "unrolling" a depth-first search algorithm. This was done because of the missing local arrays, so if there were a feasible way to replicate local arrays this could help.
The code is supposed to run for a few hours.
I don't understand what is supposed to run in parallel on the GPU. From a general point of view, traversing a search tree isn't exactly a showcase for a GPU. A breadth-first search algorithm would probably be a much better fit. You could put every level of the tree in an array/stream (if the size is prohibitive, one can easily divide it in such a case) and search that level. A simple reduction after each level (or part of it) would yield the result (if something was found).
But again, the whole thing is not computationally demanding (it would be severely memory bound) if you don't create/calculate the values at the nodes before or when searching the tree, so it would be hard to get a speedup against a fast CPU.
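The level-by-level scheme above can be sketched on the host in plain C++ (the tree shape, children 2n+1/2n+2, and the goal test are invented purely for illustration; on the GPU each level array would be a stream and the goal test a reduction kernel):

```cpp
#include <algorithm>
#include <vector>

// Breadth-first, level-by-level search: each level of a (hypothetical)
// binary tree lives in one array, the "kernel" expands it to the next
// level, and a "reduction" checks whether the goal node appeared.
bool bfs_levels(unsigned goal, int max_depth) {
    std::vector<unsigned> level{0u};            // root node
    for (int d = 0; d < max_depth; ++d) {
        std::vector<unsigned> next;
        next.reserve(level.size() * 2);
        for (unsigned n : level) {              // kernel: expand one level
            next.push_back(2 * n + 1);          // left child
            next.push_back(2 * n + 2);          // right child
        }
        // reduction: did this level contain the goal?
        if (std::find(next.begin(), next.end(), goal) != next.end())
            return true;
        level.swap(next);                       // next level becomes input
    }
    return false;
}
```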
ThomasI,
If you want local arrays, you will need to code in IL. As for the loop level, IL is unlimited and if you find that it is not so, then it should be marked as a bug. However, 22 levels of nested looping will have horrid performance on a graphics card and you would most likely do better using part of the global buffer as your local array than that level of looping.
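A rough sketch of that "part of the global buffer as your local array" idea, in plain C++ (the thread id, the per-thread slice layout, and the branching rule are all made up): each thread owns a fixed-size slice of one flat buffer and runs an iterative DFS with an explicit stack instead of 26 nested loops.

```cpp
#include <vector>

const int MAX_DEPTH = 26;  // per-thread stack capacity

// Emulate a per-thread local array with a private slice of one flat
// ("global") buffer: thread tid owns scratch[tid*MAX_DEPTH .. +MAX_DEPTH-1].
// The tree rule (children 2n+1, 2n+2, nodes below 2^depth_limit) is invented.
bool dfs_with_scratch(std::vector<unsigned>& scratch, int tid,
                      unsigned root, unsigned goal, int depth_limit) {
    unsigned* stack = &scratch[tid * MAX_DEPTH]; // this thread's slice
    int sp = 0;
    stack[sp++] = root;
    while (sp > 0) {
        unsigned n = stack[--sp];
        if (n == goal) return true;
        // push children while respecting the fixed slice size
        if (2 * n + 2 < (1u << depth_limit) && sp + 2 <= MAX_DEPTH) {
            stack[sp++] = 2 * n + 1;
            stack[sp++] = 2 * n + 2;
        }
    }
    return false;
}
```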
Thanks again for the answers!
I don't understand what is supposed to run in parallel on the GPU. From a general point of view, traversing a search tree isn't exactly a showcase for a GPU. A breadth-first search algorithm would probably be a much better fit. You could put every level of the tree in an array/stream (if the size is prohibitive, one can easily divide it in such a case) and search that level. A simple reduction after each level (or part of it) would yield the result (if something was found). But again, the whole thing is not computationally demanding (it would be severely memory bound) if you don't create/calculate the values at the nodes before or when searching the tree, so it would be hard to get a speedup against a fast CPU.
The idea is to use a BFS to precompute partial solutions on the CPU, which are then fully explored on the GPU - so each stream element contains a partial solution. (This is the way it works on other accelerators.) The working set is ~500 bytes; I was assuming it should fit in registers.
Originally posted by: MicahVillmow ThomasI, If you want local arrays, you will need to code in IL. As for the loop level, IL is unlimited and if you find that it is not so, then it should be marked as a bug. However, 22 levels of nested looping will have horrid performance on a graphics card and you would most likely do better using part of the global buffer as your local array than that level of looping.
Should I do something else to report this as a bug or does this forum post suffice?
I think I was making wrong assumptions about the SIMD engines. If I got it right, a FireStream 9170 has 4 SIMD engines - each with 16 thread processors - each with 5 ALUs, so any kind of divergent branching within a SIMD engine kills performance - so I guess you are right.
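A toy model of why that divergence hurts (costs and lane masks are invented; a sketch of the lockstep behaviour, not of any real hardware counter): when lanes of one SIMD engine take different branches, the engine executes both paths with lanes masked out, so every lane pays for the sum of the path lengths.

```cpp
#include <vector>

// Toy cost model for a branch on a 16-wide SIMD engine: if any lane takes
// the 'then' path and any lane takes the 'else' path, all lanes step
// through BOTH paths (masked), so the cost is the sum of both path costs.
int simd_branch_cost(const std::vector<bool>& lane_takes_branch,
                     int then_cost, int else_cost) {
    bool any_then = false, any_else = false;
    for (bool t : lane_takes_branch)
        (t ? any_then : any_else) = true;
    int cost = 0;
    if (any_then) cost += then_cost;  // executed if any lane needs it
    if (any_else) cost += else_cost;  // likewise for the other path
    return cost;
}
```

So a single divergent lane already forces the whole engine through both paths; a uniform branch costs only the taken path.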
Even if you could exceed the depth limitation of 4, there isn't much point in doing it. Even when the loop count is just 2, the overhead of 4 nested loops vs. a single loop of count 16 would already be prohibitive.
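That single-loop replacement can be sketched by index decomposition (loop bounds and body are illustrative only): with 4 nested loops of count 2, the four loop indices are just the four bits of one flat counter, so the kernel needs a single loop level.

```cpp
// One loop of count 16 doing the work of 4 nested loops of count 2:
// the four loop indices are recovered from the bits of the flat counter.
unsigned flat_loop_sum() {
    unsigned sum = 0u;
    for (unsigned idx = 0u; idx < 16u; ++idx) {
        unsigned i = (idx >> 3) & 1u;  // outermost index
        unsigned j = (idx >> 2) & 1u;
        unsigned k = (idx >> 1) & 1u;
        unsigned l = idx & 1u;         // innermost index
        sum += i + j + k + l;          // any body using the four indices
    }
    return sum;
}
```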