Archives Discussions

geekmaster · ‎06-19-2010

Simple mandelbrot fractal generator using brook+ and glut on a 3870

I have created a very simple fractal generator using Brook+, because I am stuck with an HD 3870. However, the results were surprising: the 3870 turned out to be 5-20 times faster than my dual-core CPU (Intel 6400), depending on the resolution and other settings. Note that the 3870 used an unoptimized kernel and that the CPU used 2 threads.

You can download my program here: http://depositfiles.com/files/llys7tzdz

(If anyone wants to help me optimize the kernel I would be grateful, as I am new to GPGPU.)

To build and run it you will need glut32.dll in the system32 folder, Visual C++, the ATI Stream SDK v1.4 of course, and some other OpenGL libraries installed in your Microsoft SDK folder.

It supports real-time panning and zooming, even on the 3870.

Use + and - for zooming, the arrows for panning and < and > to change the iterations.

If anyone wants to see the source or the executable, let me know.

You can use the code if you want; there is no need to ask for permission.

geekmaster · ‎06-20-2010

So is anybody interested in running it on a 5850 or 5870 and telling me the results?

I am also posting some of the source.

Code to calculate the fractal (similar to the sample provided by AMD, but the GLUT and OpenGL code that shows the fractal to the user makes things far more interesting):

kernel void mandelbrot(float maxIterations, float scale, float offsetx, float offsety, float halfsize, out float mandelbrotStream<> )
{
    float2 vPos = indexof(mandelbrotStream).xy;
    float2 pointt = vPos;
    float x, y, x2, y2;
    float iteration;
    pointt.x = (2.0f / scale) * ((pointt.x - halfsize) / halfsize) + offsetx;
    pointt.y = (2.0f / scale) * ((pointt.y - halfsize) / halfsize) + offsety;
    x = pointt.x;
    y = pointt.y;
    x2 = x*x;
    y2 = y*y;
    for(iteration = 0.0f; (x2+y2 < 4.0f) && (iteration < maxIterations); iteration += 1.0f)
    {
        y = 2.0f*(x*y) + pointt.y;
        x = (x2 - y2) + pointt.x;
        x2 = x*x;
        y2 = y*y;
    }
    mandelbrotStream = iteration/maxIterations;
}

{
  float mandelbrotStream< size, size >;
  mandelbrot(maxIterations, scale, offsetx, offsety, (float)(size / 2), mandelbrotStream);
  streamWrite(mandelbrotStream, mandelbrotArray);
}
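For checking the kernel's output on the CPU, here is a minimal Python sketch of the same per-pixel calculation. The function names `pixel_to_point` and `mandelbrot_value` are made up for this illustration and are not part of the posted program. Python floats are doubles, so values near the set boundary can differ slightly from the single-precision kernel.

```python
def pixel_to_point(px, py, scale, offsetx, offsety, halfsize):
    """Map pixel indices to a point c in the complex plane, as the kernel does."""
    cx = (2.0 / scale) * ((px - halfsize) / halfsize) + offsetx
    cy = (2.0 / scale) * ((py - halfsize) / halfsize) + offsety
    return cx, cy


def mandelbrot_value(cx, cy, max_iterations):
    """Return iteration / max_iterations, mirroring the kernel's loop."""
    x, y = cx, cy              # like the kernel, start from z = c rather than z = 0
    x2, y2 = x * x, y * y
    iteration = 0
    while x2 + y2 < 4.0 and iteration < max_iterations:
        y = 2.0 * (x * y) + cy     # uses the old x, same order as the kernel
        x = (x2 - y2) + cx         # uses the old squares
        x2, y2 = x * x, y * y
        iteration += 1
    return iteration / max_iterations
```

A point that never escapes (for example c = 0) yields 1.0, and a point that is already outside the radius-2 circle yields 0.0.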

code to display the fractal:

glDrawPixels(size, size, GL_BLUE, GL_FLOAT, mandelbrotArray);

Jawed · ‎06-20-2010

The first performance optimisation you can play with is the number of pixels you evaluate per element in the domain of execution. Currently you are evaluating a single pixel for each element.

Brook+ can export up to 32 values per element.

To evaluate 4 pixels in parallel you'd use "out float4 mandelbrotStream<>". For 8 pixels, you'd use "out float4 mandelbrotStream0<>, out float4 mandelbrotStream1<>", and so on.

The main loop then evaluates all the pixels in parallel. The great advantage of doing this is you give the ALUs something to chew on. Your basic kernel doesn't give the GPU much work, see attached (generated from Stream Kernel Analyzer). Here you can see that the 8 instructions in the loop result in only 1 or 2 operations per ALU cycle.

A disadvantage of multiple pixels in parallel is that if one pixel wants to exit after 500 iterations, but another pixel runs for 10,000 iterations, then the first pixel has to wait until the loop is done 10,000 times.
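The cost of that waiting can be estimated with a rough model. The Python sketch below assumes only that all pixels of one element loop together until the slowest one exits; it ignores everything else about the hardware (including divergence across a whole wavefront), and the function names are hypothetical.

```python
def lockstep_cost(iteration_counts):
    """Loop trips needed when all pixels of one element iterate together."""
    return max(iteration_counts)


def wasted_fraction(iteration_counts):
    """Fraction of per-pixel loop slots spent on pixels that already exited."""
    trips = lockstep_cost(iteration_counts)
    slots = trips * len(iteration_counts)
    if slots == 0:
        return 0.0
    useful = sum(iteration_counts)
    return (slots - useful) / slots
```

With the numbers from the paragraph above, `wasted_fraction([500, 10000])` is 0.475: almost half of the loop slots do no useful work for that pair of pixels.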

Another thing to try is to use double precision math. This will allow you to zoom in much further:

kernel void mandelbrot(float maxIterations, double scale, double offsetx, double offsety, double halfsize, out float mandelbrotStream<> )

Replace "float" with "double" for the computations.

Brook+ has trouble with tests using the double type (or at least it did) so you might have to do something like:

for(iteration = 0.0f; (float)(x2+y2 < 4.0) && (iteration < maxIterations); iteration += 1.0f)

This converts the result of (x2+y2 < 4.0) into a float. It doesn't throw away any precision in evaluating whether x2+y2 is less than 4.0.
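To see why double precision allows deeper zooming, the following Python sketch evaluates the kernel's coordinate mapping for two neighbouring pixels and rounds the results to single precision. The zoom settings are hypothetical, and the arithmetic is done in double and rounded only at the end, so this approximates rather than reproduces GPU float behaviour.

```python
import struct


def to_float32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def pixel_x(px, scale, offsetx, halfsize):
    """Same x mapping as the kernel, evaluated in double precision."""
    return (2.0 / scale) * ((px - halfsize) / halfsize) + offsetx


# Hypothetical deep-zoom settings: 1e8x zoom centred near x = -0.75.
scale, offsetx, halfsize = 1.0e8, -0.75, 480.0
a = pixel_x(480, scale, offsetx, halfsize)
b = pixel_x(481, scale, offsetx, halfsize)

doubles_differ = a != b                           # neighbours stay distinct
floats_differ = to_float32(a) != to_float32(b)    # neighbours collapse
```

At this zoom the pixel spacing is far below the spacing of single-precision values near 0.75, so both pixels map to the same float coordinate and the image turns blocky, while the double coordinates remain distinct.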

01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
02 ALU_BREAK: ADDR(51) CNT(6) KCACHE0(CB0:0-15)
     11  y: SETGT_DX10  T0.y,  KC0[0].x,  R0.x
         w: ADD         ____,  R4.x,  R3.x      VEC_021
     12  z: SETGT_DX10  ____,  (0x40800000, 4.0f).x,  PV11.w
     13  x: AND_INT     R7.x,  PV12.z,  T0.y
     14  x: PREDNE_INT  ____,  R7.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED
03 ALU: ADDR(57) CNT(7)
     15  x: ADD         R0.x,  R0.x,  1.0f
         z: MUL*2       ____,  R5.x,  R6.x      VEC_120
     16  x: ADD         R5.x,  R2.x,  PV15.z
         y: ADD         ____,  -R4.x,  R3.x      VEC_120
     17  x: ADD         R6.x,  R1.x,  PV16.y
         t: MUL         R4.x,  PV16.x,  PV16.x
     18  x: MUL         R3.x,  PV17.x,  PV17.x
04 ENDLOOP i0 PASS_JUMP_ADDR(2)

geekmaster · ‎06-20-2010

Thank you Jawed, I will certainly try what you suggested about processing multiple pixels.

Using double will probably result in lower performance, though.

On the other hand, float has very limited precision.

So the best solution is to create two kernels and pick the right one each time.

geekmaster · ‎06-25-2010

After looking at other people's code I figured out a way to use float4 to calculate the fractal. This kernel is 2.2 times faster than the previous one.

Is there a way to make it even faster? Like calculating 16 pixels in each thread?

kernel void mandelbrot(int4 maxIterations, float _scalesize, float offsetx, float offsety, float halfsize, out int mandelbrotStream<>)
{
    float2 pointx = indexof(mandelbrotStream).xy;
    const float y_pos = _scalesize * (pointx.y - halfsize) + offsety;
    const float4 pointsy = float4(y_pos, y_pos, y_pos, y_pos);
    const float4 pointsx = float4(_scalesize * (4.0f * pointx.x - halfsize) + offsetx,
                                  _scalesize * (4.0f * pointx.x + 1.0f - halfsize) + offsetx,
                                  _scalesize * (4.0f * pointx.x + 2.0f - halfsize) + offsetx,
                                  _scalesize * (4.0f * pointx.x + 3.0f - halfsize) + offsetx);
    float4 x, y, x2, y2;
    int4 iteration;
    const int4 zero = int4(0,0,0,0);
    const int4 one = int4(1,1,1,1);
    const float4 two = float4(2.0f, 2.0f, 2.0f, 2.0f);
    float4 x2y2;
    int4 reach_iter = zero;
    int4 reach_bailout = zero;
    int4 itervar = zero;
    x = pointsx;
    y = pointsy;
    x2 = x*x;
    y2 = y*y;
    iteration = zero;
    while(1)
    {
        y = two*(x*y) + pointsy;
        x = (x2 - y2) + pointsx;
        x2 = x*x;
        y2 = y*y;
        x2y2 = x2 + y2;
        reach_iter = int4((iteration.x < maxIterations.x),
                          (iteration.y < maxIterations.y),
                          (iteration.z < maxIterations.z),
                          (iteration.w < maxIterations.w));
        reach_bailout.x = (int)(x2y2.x < 4.0f);
        reach_bailout.y = (int)(x2y2.y < 4.0f);
        reach_bailout.z = (int)(x2y2.z < 4.0f);
        reach_bailout.w = (int)(x2y2.w < 4.0f);
        itervar = int4(reach_iter.x & reach_bailout.x,
                       reach_iter.y & reach_bailout.y,
                       reach_iter.z & reach_bailout.z,
                       reach_iter.w & reach_bailout.w);
        iteration += (one & itervar);
        if(!any(itervar))
            break;
    }
    iteration = (int4(255, 255, 255, 255) * iteration) / maxIterations;
    mandelbrotStream = iteration.x | (iteration.y << 8) | (iteration.z << 16) | (iteration.w << 24);
}
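The last two statements of the kernel scale each iteration count to the range 0..255 and pack four of them into one 32-bit value. Here is a small Python sketch of that packing and of the matching unpacking a host program could do. The helper names are hypothetical. Python integers are unbounded, so the sign bit that a 32-bit int would set when the top byte is 128 or more does not appear here.

```python
def scale_to_byte(iteration, max_iterations):
    """Scale an iteration count to 0..255 with integer division, like the kernel."""
    return (255 * iteration) // max_iterations


def pack_counts(c0, c1, c2, c3):
    """Pack four values in 0..255 into one 32-bit word, lowest byte first."""
    for c in (c0, c1, c2, c3):
        if not 0 <= c <= 255:
            raise ValueError("each count must fit in 8 bits")
    return c0 | (c1 << 8) | (c2 << 16) | (c3 << 24)


def unpack_counts(word):
    """Recover the four 8-bit values from a packed word."""
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))
```

For example, `pack_counts(1, 2, 3, 4)` gives `0x04030201`, and `unpack_counts` returns the original four values.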

geekmaster · ‎06-25-2010

I forgot to mention that _scalesize = 4.0f / (scale * size)

and halfsize = size / 2

where size is the image dimension in pixels (the image is size x size).
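With these definitions the float4 kernel's x mapping is algebraically the same as the first kernel's, since (2 / scale) / halfsize = 4 / (scale * size). The Python sketch below checks this numerically. It assumes that thread index t and lane k (0..3) together cover pixel column 4*t + k, which is how the kernel's `4.0f * pointx.x + k` terms read; the function names are made up for this check.

```python
import math


def x_original(px, scale, offsetx, size):
    """x coordinate as computed by the first (scalar) kernel."""
    halfsize = size / 2
    return (2.0 / scale) * ((px - halfsize) / halfsize) + offsetx


def x_float4(thread_x, lane, scale, offsetx, size):
    """x coordinate as computed by the float4 kernel for one of its 4 lanes."""
    scalesize = 4.0 / (scale * size)
    halfsize = size / 2
    return scalesize * (4.0 * thread_x + lane - halfsize) + offsetx


# Compare both mappings for every column of a 960-pixel-wide image.
all_match = all(
    math.isclose(x_float4(t, k, 1.0, -0.5, 960), x_original(4 * t + k, 1.0, -0.5, 960))
    for t in range(240)
    for k in range(4)
)
```

Under this assumption the output stream of the float4 kernel would be a quarter as wide as the image, with each element carrying four horizontally adjacent pixels.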

With the 3870 I get 100 fps when rendering the fractal with no zoom (scale = 1.0), 256 iterations and 960x960 resolution.

Jawed · ‎06-25-2010

If you are going to pack 4 iteration counts into a 32-bit int, then you can do up to 128 pixels.

You'd use int4, i.e. 16 iteration counts packed into 128 bits. Then you can have 8 outputs from the kernel, i.e. 1024 bits.

(In fact you can have far more outputs from a kernel if you do writes to global memory.)

You have to balance the number of pixels you generate against the branching incoherency penalty. That penalty is due to the pixels that exit quickly having to wait until the pixels that exit slowly are done.
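The 128-pixel figure follows directly from the sizes mentioned above. A short Python sketch of the arithmetic, assuming 8-bit iteration counts as in the posted kernel:

```python
BITS_PER_COUNT = 8        # each iteration count is scaled to 0..255
BITS_PER_INT = 32
INTS_PER_INT4 = 4
OUTPUTS_PER_KERNEL = 8    # the output limit quoted in this thread

counts_per_int = BITS_PER_INT // BITS_PER_COUNT           # 4
counts_per_int4 = counts_per_int * INTS_PER_INT4          # 16
pixels_per_element = counts_per_int4 * OUTPUTS_PER_KERNEL
bits_per_element = pixels_per_element * BITS_PER_COUNT
```

This gives 128 pixels and 1024 bits per element, matching the numbers above.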

geekmaster · ‎06-26-2010

When I tested my program at 960x960 with 256 iterations, my card was downclocked to 300 MHz.

My card only goes to its default clock of 775 MHz when the kernel contains a vast amount of calculations. This probably happens with the new drivers (those after 2010).

So at 960x960 with 1024 iterations I get 70 fps with the 3870.
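For comparing runs at different settings it can help to convert frame rates into pixel throughput. A minimal Python sketch using only the figures quoted in this thread (the function name is hypothetical):

```python
def pixels_per_second(size, fps):
    """Pixels rendered per second for a size x size image at a given frame rate."""
    return size * size * fps


rate_256_iterations = pixels_per_second(960, 100)    # 92,160,000 pixels/s
rate_1024_iterations = pixels_per_second(960, 70)    # 64,512,000 pixels/s
```

Keep in mind that the 256-iteration run was made with the card downclocked to 300 MHz, so the two figures are not a like-for-like comparison.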
