
karx11erx
Journeyman III

No pointers in global structs?

Having a pointer in a global struct causes CAL compile errors

I am passing a pointer in a global struct. No variables of the struct's type are used inside kernels; the struct is only used to pass values to the function that calls the kernels, as in the following short example:

typedef struct tTest { double* a; } tTest;

kernel void TestKernel (float s_in<>, out float s_out<>, float a0, float a1, float a2)
{
    s_out = s_in * a0 * a1 * a2;
}

void TestFunc (float* a_in, float* a_out, tTest* t, int n)
{
    float s_in;
    float s_out;
    float a [3];
    int   i;

    for (i = 0; i < 3; i++)
        a [i] = (float) t->a [i];
    streamRead (s_in, a_in);
    TestKernel (s_in, s_out, a [0], a [1], a [2]);
    streamWrite (s_out, a_out);
}

This leads to errors in the CAL compile step. I cannot quite understand this, as TestFunc should be purely C++.

Unfortunately, not passing the values via a pointer is not an option.

0 Likes
19 Replies
gaurav_garg
Adept I

It looks like brcc fails to compile a struct that contains a pointer, even if the struct is unused in kernels. As a workaround, I would suggest moving this struct definition into a header file and including that header in the .br file, so that brcc can ignore it.
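
For example, something like this (just a sketch; the file names are placeholders):

// *** test_types.h ***
// The pointer-bearing struct lives only here.
typedef struct tTest { double* a; } tTest;

// *** test.br ***
// brcc ignores included headers, so the struct never reaches the CAL compile step.
#include "test_types.h"

kernel void TestKernel (float s_in<>, out float s_out<>, float a0, float a1, float a2)
{
    s_out = s_in * a0 * a1 * a2;
}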

Also, I would suggest moving all your non-kernel code out of the .br file. You can take a look at the C++ API provided with Brook+, which allows you to write non-kernel code in a C++ file and provides a lot more features.

0 Likes

brcc doesn't parse header files?

 

0 Likes

No, it doesn't.

0 Likes

Well, I can't see from the documentation how to put the C++ code in a separate file and call the kernel functions from it, and I don't want to start coding too far beyond regular C/C++. If I could declare the kernel functions as "extern" or something similar, that would help, but it doesn't look like that's the way to go, and I am not going to start coding at the IL level.

For example, I have a function like this:

kernel void merc_s_forward (float4 lp<>, out float4 xy<>, float es, float esp, float k0, float en0, float en1, float en2, float en3, float en4, float ml0)
{
    pj_merc_s_forward_decomp (lp.x, lp.y, xy.x, xy.y, es, esp, k0, en0, en1, en2, en3, en4, ml0);
    pj_merc_s_forward_decomp (lp.z, lp.w, xy.z, xy.w, es, esp, k0, en0, en1, en2, en3, en4, ml0);
}

And this is what the Brook+ and CAL compilers create from it:

__THREAD__ __merc_s_forward merc_s_forward;

How the hell do I tell C++ about this? And all because those shitty tools cannot properly process struct declarations.

This stuff is starting to get too complicated. I might as well do some GPGPU coding via OpenGL with float render targets.

If ATI isn't able to provide a simple and clean C++ interface to their Stream Computing, plus tools that really work, I will have to try whether NVidia's CUDA works better, or wait for OpenCL.

The more I learn about ATI's Stream Computing API and tools, the more it looks like an experiment rather than a serious, mature, well-functioning programming environment, and I am losing my patience with it. I want to get work done, not work around one flaw after another.

Pity.

0 Likes

I would suggest you at least read some documentation or look at some samples before saying it is too complicated. I personally feel the C++ API is very easy to use.

Documentation: Stream Computing User Guide (start reading from Section 2.6)

Samples: under $(BROOKROOT)\samples\cpp

0 Likes

If you read my previous post you will notice that I did that.

 

0 Likes

Hello karx11erx, you should try to split the CPU and Brook+ GPU code; it is easier to maintain and avoids some problems. Here is a small sample showing how to do it:

 

// *** gpu.br ***

kernel void set0(out int o<>) { o = 0; }

 

// *** main.cpp ***

// System libraries
#include <cstdio>
// Brook+ libraries
#include <brook/stream.h>
// Brook+ user functions
#include "brookgenfiles/gpu.h"
using namespace brook;

// Structure with pointer
typedef struct sTest { double* a; } tTest;

int main(int argc, char *argv[]) {
    // System memory data
    int v[4] = {1, 2, 3, 4};
    printf("vIn  = {%i, %i, %i, %i}\n", v[0], v[1], v[2], v[3]);
    
    // GPU memory data
    unsigned int dim[] = { 4 };
    Stream<int> s(1, dim);

    // CPU -> GPU
    s.read(v);
    
    // Kernel call
    set0(s);
    
    // GPU -> CPU
    s.write(v);

    printf("vOut = {%i, %i, %i, %i}\n", v[0], v[1], v[2], v[3]);
    return 0;
}

 

As you can see, the kernel can be called from a normal .cpp file, and this way you avoid having a structure with a pointer in a .br file.

When you run the Brook+ brcc compiler, several files are generated. Include the "brookgenfiles/gpu.h" header file in "main.cpp" to be able to use the kernel, and add "brookgenfiles/gpu.cpp" to your project so the generated code gets compiled. If you still have problems, I can send you the Visual C++ project.

Note that this is only a simplified case of the samples Gaurav mentioned.

 

0 Likes

Thank you. I have already found out how to do that, using the template-class-based interface instead of the legacy Brook interface. I have run into another problem though:

#1 I don't know how to use non-standard math functions from a stream computing math library (tan, atan),

so

#2 I wrote my own versions, just to be able to continue implementing my Stream Computing test code. These functions are needed in several .br files. According to the docs, kernel functions get inlined, so they should be static within their file. They aren't, though:

1>Linking...
1>pj_merc_br.obj : error LNK2005: "void __cdecl __Fabs_cpu(class brt::KernelC *,int,int,bool)" (?__Fabs_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_merc_br.obj : error LNK2005: "void __cdecl __adjlon_cpu(class brt::KernelC *,int,int,bool)" (?__adjlon_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_merc_br.obj : error LNK2005: "void __cdecl __Tan_cpu(class brt::KernelC *,int,int,bool)" (?__Tan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_merc_br.obj : error LNK2005: "void __cdecl __Atan_cpu(class brt::KernelC *,int,int,bool)" (?__Atan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_inv_br.obj : error LNK2005: "void __cdecl __Fabs_cpu(class brt::KernelC *,int,int,bool)" (?__Fabs_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_inv_br.obj : error LNK2005: "void __cdecl __adjlon_cpu(class brt::KernelC *,int,int,bool)" (?__adjlon_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_inv_br.obj : error LNK2005: "void __cdecl __Tan_cpu(class brt::KernelC *,int,int,bool)" (?__Tan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_inv_br.obj : error LNK2005: "void __cdecl __Atan_cpu(class brt::KernelC *,int,int,bool)" (?__Atan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_fwd_br.obj : error LNK2005: "void __cdecl __Fabs_cpu(class brt::KernelC *,int,int,bool)" (?__Fabs_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_fwd_br.obj : error LNK2005: "void __cdecl __adjlon_cpu(class brt::KernelC *,int,int,bool)" (?__adjlon_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_fwd_br.obj : error LNK2005: "void __cdecl __Tan_cpu(class brt::KernelC *,int,int,bool)" (?__Tan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_fwd_br.obj : error LNK2005: "void __cdecl __Atan_cpu(class brt::KernelC *,int,int,bool)" (?__Atan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_tmerc_br.obj : error LNK2005: "void __cdecl __Fabs_cpu(class brt::KernelC *,int,int,bool)" (?__Fabs_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_tmerc_br.obj : error LNK2005: "void __cdecl __adjlon_cpu(class brt::KernelC *,int,int,bool)" (?__adjlon_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_tmerc_br.obj : error LNK2005: "void __cdecl __Tan_cpu(class brt::KernelC *,int,int,bool)" (?__Tan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>pj_tmerc_br.obj : error LNK2005: "void __cdecl __Atan_cpu(class brt::KernelC *,int,int,bool)" (?__Atan_cpu@@YAXPAVKernelC@brt@@HH_N@Z) already defined in geocent_br.obj
1>   Creating library D:\projects\proj-4.6.1\VisualC\Debug\proj.lib and object D:\projects\proj-4.6.1\VisualC\Debug\proj.exp
1>D:\projects\proj-4.6.1\VisualC\Debug\proj.exe : fatal error LNK1169: one or more multiply defined symbols found

Why the heck do these CPU functions even get created when I was actually implementing a kernel subfunction called by another kernel, and how do I solve this?

 

@AMD:

The half-assed Stream Computing API is driving me nuts! And on top of it, Brook+ 1.4 (One Dot Four, not Zero Dot Something!) doesn't even support include files, i.e. I have to copy all global constants into each and every .br file that needs them!

If I hadn't been so far into ATI's stuff, I'd have already switched to CUDA and seen whether that is less lackluster.

 

Edit: Declaring the kernel subfunctions static doesn't help either.

0 Likes

Unfortunately, CPU emulation functions are not inlined. If you are not using the CPU runtime, I would suggest compiling your .br files so that they don't generate CPU emulation code. You need to pass -p cal in the brcc command-line options to generate only CAL code.
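
For example, the brcc invocation could look like this (just a sketch; the .br file name is taken from your linker output):

brcc -p cal pj_merc.br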

0 Likes

Thanks again. I will try and see whether I can arrange my code so that I get this to work.

Any pointers on how to get built-in tan and abs into my brook code?

0 Likes

You can implement your own methods, but with different names, as these names (abs, tan) are currently reserved.
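
For example, a renamed absolute-value function could look like this (an untested sketch; the name Fabs is just borrowed from your linker output):

kernel double Fabs (double x)
{
    // same behaviour as abs, but under a non-reserved name
    double r = x;
    if (x < 0.0)
        r = -x;
    return r;
}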

0 Likes

Does that mean that there are no function implementations behind these reserved words currently?

Btw, with -p CAL it links now. Phew.

 

0 Likes

tan is not there....not sure about abs.

0 Likes

Well, I have implemented my own sin, cos, tan, atan and atan2 functions because I need double precision and the built-in functions all use float. The only one I didn't replace was sqrt, because AFAIK I cannot use bit-wise manipulation of double variables in CAL programs, which I'd need for a fast sqrt implementation - or can I? So I still get somewhat different results between the CPU and GPU paths of my application, but the differences are acceptably small. I would of course prefer double precision math to be fully available on ATI stream computing hardware.

Given all the limitations and bugs of the current Brook+ version, I am really looking forward to OpenCL now, and I hope the next hardware generation will offer true double precision across all functions.

0 Likes

kernel double drsqrt(double x)
// double precision reciprocal square root
{
 float y = rsqrt((float) x) ;
 return 0.5 * y * (3 - (x * (y*y))) ;
}

I cribbed that from NVidia's CUDA forum.

I haven't tested it, though, as I don't have a graphics card to run it on.

Faster than a divide! Shame you'll need to do a divide to get dsqrt, which makes it slower and loses a bit of precision.

Jawed

0 Likes

The sqrt function I can derive from this halves the errors I get - and actually I need the reciprocal square root most of the time anyway.

Thanks a bunch, this was very helpful.

0 Likes

I realised a slight error: it should use a double-precision 3.0, not an integer 3:

kernel double drsqrt(double x)
// double precision reciprocal square root
{
 float y = rsqrt((float) x) ;
 return 0.5 * y * (3.0 - (x * (y*y))) ;
}

This produces a different compilation. I can't test this to ascertain whether it affects precision.

Jawed

0 Likes

Thanks. I have set up the Brook+/CAL compilation so that it flags such things as errors, so I had already fixed this. I have found that your function still doesn't yield the full precision that is possible with double, though. Here is a version I found to yield full double precision:

kernel double Sqrt (double y)
{
    double x = (double) (rsqrt ((float) y)); // float rsqrt as initial estimate
    double z = y * 0.5;
    x = 1.5 * x - (x * x * x * z); // Newton/Raphson refinement of 1/sqrt(y)
    x = 1.5 * x - (x * x * x * z);
    x = 1.5 * x - (x * x * x * z);
    return x * y; // y * rsqrt(y) = sqrt(y)
}

Btw, that function simply implements a Newton/Raphson iteration for the reciprocal square root, using the float rsqrt result as a good initial estimate and therefore converging very fast; the final multiplication by y then turns the result into the square root.
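
In case anyone wants to check the refinement on the CPU side, here is a plain C++ sketch of the same iteration compared against the CRT sqrt (file and function names are just placeholders, and it mimics, not reproduces, the GPU code):

// *** sqrt_check.cpp ***
#include <cmath>
#include <cstdio>

static double Sqrt_ref (double y)
{
    double x = (double) (1.0f / std::sqrt ((float) y)); // float rsqrt estimate, like the kernel
    double z = y * 0.5;
    x = 1.5 * x - (x * x * x * z); // three Newton/Raphson refinement steps
    x = 1.5 * x - (x * x * x * z);
    x = 1.5 * x - (x * x * x * z);
    return x * y; // y * rsqrt(y) = sqrt(y)
}

int main ()
{
    double y = 2.0;
    printf ("Sqrt_ref (%g) = %.17g\n", y, Sqrt_ref (y));
    printf ("sqrt     (%g) = %.17g\n", y, sqrt (y));
    return 0;
}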

I have found that in my application, using self-implemented double precision math (sin, cos, tan, atan, atan2, sqrt, exp, log) slows the computations down by about five percent (I estimate that 3 or 4 of these functions are called per computation). That's acceptable. Compared to the equivalent CPU code on my test hardware, using the built-in functions where available yields a 10-fold speedup, and using my double functions yields a 9.5-fold speedup.

0 Likes

Nice. When I originally found the drsqrt code I tested it in Excel, seemed to get really good results, and didn't investigate further.

I expect your refinement will be appreciated by quite a few people. I suppose at some point there'll be official support for this, using a "macro" of some kind, perhaps like this function. NVidia's double-precision functions are all macros apart from ADD, MUL and MAD.

Jawed

0 Likes