Archives Discussions

karx11erx · ‎05-28-2009

I am trying to write kernel functions that process two doubles at a time, for example:

double2 _f1 = {1.0, 1.0};

double2 _f6 = {6.0, 6.0};
double2 _f20 = {20.0, 20.0};
double2 _f42 = {42.0, 42.0};
double2 _f72 = {72.0, 72.0};
double2 _f110 = {110.0, 110.0};
double2 _f156 = {156.0, 156.0};
double2 _f210 = {210.0, 210.0};
double2 _f272 = {272.0, 272.0};

kernel double2 SinD2 (double2 phi<>)
{
double2 phi2 = phi * phi;
return phi * (_f1 - phi2 / _f6 * (_f1 - phi2 / _f20 * (_f1 - phi2 / _f42 * (_f1 - phi2 / _f72 * (_f1 - phi2 / _f110 * (_f1 - phi2 / _f156 * (_f1 - phi2 / _f210 * (_f1 - phi2 / _f272))))))));
}

The _f... variables aren't recognized inside the kernel. Does that mean I cannot use global variables in kernels? I get:

1>WARNING: ASSERT(GetResultSymbol().IsValid() + mDataTypeValue.IsValid() >= 1) failed
1>While processing :191
1>In compiler at AST::DelayedLookup::ResolveSymbols()[astdelayedlookup.cpp:139]
1> *mName = _f1
1>Message: unknown symbol

The reason I am trying this is that something like

double2 v;
v.x = 1.0;
v.y = 1.0;
v = v * 0.5;

is rejected with an error message about type conflicts. Now if I turn off strong type checking, this compiles.

The question is: does "v = v * 0.5;" actually multiply both v.x and v.y by 0.5? It would in GLSL, so I suppose the hardware supports it, but does the CAL compiler?

Another question: provided double2 * double works as intended, could I theoretically double the stream processing speed for doubles by processing double2 data? I assume that when a kernel is passed single doubles, half of the thread's ALUs stay idle. Is that right? My thinking is that each thread processor can process 4 floats, which would mean two doubles (one double2) simultaneously, so a thread given a double2 should process both components in the time it takes to process a single double.

Solution: double2 * double multiplies both components of the double2 by the double value.

Jawed · ‎05-28-2009

As far as I can tell you can't use globals in Brook+ kernels.

The strong type checking stops all those nice shortcuts you get in OpenGL/D3D :(

You won't get increased ALU throughput, because double-precision multiply and multiply-add both use 4 of the 5 lanes. Double-precision add could double in throughput, as it only uses 2 lanes. Unfortunately the compilers don't identify this in your code, so they only issue one DADD at a time.

Double-precision divide is slow.

By using double2 instead of double for streams and gather/scatter operations, you will improve memory performance, because fewer reads/writes are needed for a given amount of data.
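As rough arithmetic for the point above, here is a minimal plain-C sketch (not Brook+ code). It assumes, purely for illustration, that each stream element costs exactly one memory access; `fetch_count` is a hypothetical helper, not part of any AMD API.

```c
#include <stddef.h>

/* Hypothetical model: number of element accesses needed to move n_doubles
   values when each stream element packs `width` doubles
   (1 for double, 2 for double2). Assumes one access per element. */
static size_t fetch_count(size_t n_doubles, size_t width)
{
    /* Round up so a trailing partial element still counts as one access. */
    return (n_doubles + width - 1) / width;
}
```

Under that assumption, 1024 doubles take 1024 accesses as a double stream and 512 as a double2 stream. Real hardware behaviour depends on cache and fetch width, so check with the profiler.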

I put this:

kernel double2 SinD2 (double2 phi)
{
double2 _f1 = double2(1.0, 1.0);

double2 _f6 = double2(6.0, 6.0);
double2 _f20 = double2(20.0, 20.0);
double2 _f42 = double2(42.0, 42.0);
double2 _f72 = double2(72.0, 72.0);
double2 _f110 = double2(110.0, 110.0);
double2 _f156 = double2(156.0, 156.0);
double2 _f210 = double2(210.0, 210.0);
double2 _f272 = double2(272.0, 272.0);

double2 phi2 = phi * phi;
return phi * (_f1 - phi2 / _f6 * (_f1 - phi2 / _f20 * (_f1 - phi2 / _f42 * (_f1 - phi2 / _f72 * (_f1 - phi2 / _f110 * (_f1 - phi2 / _f156 * (_f1 - phi2 / _f210 * (_f1 - phi2 / _f272))))))));

}

kernel void test(double2 A<>, out double2 B<>)
{
B=SinD2(A) ;
}

into Stream Kernel Analyzer where the instructions that run on the chip can be seen.

Jawed

Ceq · ‎05-28-2009

You can't use global variables in kernels. If you use them in only one kernel, you can declare them inside it; if you need them in several places and want to avoid duplicating code, you can use the preprocessor's "#define". Brook+ 1.4 supports some preprocessor directives; just add the "-pp" flag when you call BRCC to compile the source code.
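The "#define" approach can be sketched in plain C (not Brook+; the macro and function names here are hypothetical and the series is shortened for brevity). The constants are substituted textually into each function, so no global variable is involved:

```c
/* Shared constants as preprocessor macros instead of global variables. */
#define F1  1.0
#define F6  6.0
#define F20 20.0

/* Two-term sine approximation: phi - phi^3/6. */
static double sin_2term(double phi)
{
    double phi2 = phi * phi;
    return phi * (F1 - phi2 / F6);
}

/* Three-term sine approximation reusing the same macros. */
static double sin_3term(double phi)
{
    double phi2 = phi * phi;
    return phi * (F1 - phi2 / F6 * (F1 - phi2 / F20));
}
```

In a Brook+ source file the macros would presumably expand to double2 constructor calls such as double2(6.0, 6.0), but that is an assumption I have not compiled.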

The expression you mention doesn't compile because the current version requires both operands to be of the same type: "v" is a double2, but 0.5 is a double. You should write:

double2 v = double2(1.0, 1.0);

v = v * double2(0.5, 0.5);

Another solution is to disable strong type checking with the "-a" flag. Your code will then compile and multiply each component by 0.5, as you say.
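To make the two forms concrete, here is a minimal plain-C model (not Brook+). `double2_t`, `mul2` and `scale2` are hypothetical stand-ins for the double2 type, the component-wise product, and the scalar broadcast that the thread reports for v * 0.5:

```c
/* Hypothetical stand-in for Brook+'s double2; not the real type. */
typedef struct { double x, y; } double2_t;

/* Component-wise product, modelling v * double2(0.5, 0.5). */
static double2_t mul2(double2_t a, double2_t b)
{
    double2_t r = { a.x * b.x, a.y * b.y };
    return r;
}

/* Scalar broadcast, modelling v * 0.5 with strong type checking off. */
static double2_t scale2(double2_t a, double s)
{
    double2_t r = { a.x * s, a.y * s };
    return r;
}
```

In this model both calls give the same components for the same input, which is the behaviour described for the two Brook+ spellings.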

I think using double2 instead of double is better for fetching several values together from memory. Inside kernels, the compiler optimizes arithmetic expressions automatically, reordering instructions to use as many processing units as possible (you can check this using KernelAnalyzer).

By the way, I think you can avoid some of those divisions by storing the reciprocals in the constants and multiplying instead. According to KernelAnalyzer the compiler is smart enough to optimize them, but it could miss some optimizations in complex expressions or with kernel parameters.
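A minimal plain-C sketch of the reciprocal idea, for one component of the SinD2 polynomial (not Brook+; the array and function names are hypothetical). The denominators are the ones from the kernel in this thread. Multiplying by a stored reciprocal can differ from dividing in the last bits, so treat the two forms as equal only to within rounding:

```c
/* Denominators from the kernel: 6, 20, 42, 72, 110, 156, 210, 272. */
static const double DENOM[8] = {
    6.0, 20.0, 42.0, 72.0, 110.0, 156.0, 210.0, 272.0
};

/* Reciprocals, evaluated once at compile time. */
static const double RECIP[8] = {
    1.0 / 6.0,   1.0 / 20.0,  1.0 / 42.0,  1.0 / 72.0,
    1.0 / 110.0, 1.0 / 156.0, 1.0 / 210.0, 1.0 / 272.0
};

/* Division form: same nesting as the original expression,
   built from the innermost term (272) outwards. */
static double sin_div(double phi)
{
    double phi2 = phi * phi;
    double acc = 1.0;
    for (int i = 7; i >= 0; --i)
        acc = 1.0 - phi2 / DENOM[i] * acc;
    return phi * acc;
}

/* Reciprocal form: each division replaced by a multiplication. */
static double sin_recip(double phi)
{
    double phi2 = phi * phi;
    double acc = 1.0;
    for (int i = 7; i >= 0; --i)
        acc = 1.0 - phi2 * RECIP[i] * acc;
    return phi * acc;
}
```

Whether this is faster on the GPU than what the compiler already generates is something to confirm in KernelAnalyzer rather than assume.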

karx11erx · ‎05-28-2009

Ummm ... yeah, thanks, but as I wrote this was solved for me already.

Thanks for the performance hints though.

SOLVED: global variables in kernels