
ryta1203
Journeyman III

Success with Brook+?

I just wanted to ask and see how many users have had real success using Brook+ (NOT CAL). When I say "real success" I mean some complex (non-simple, non-embarrassingly parallel) real-world application using Brook+ that provides a performance improvement.

I think it would be very interesting to see, and I welcome any/all posts. Please post the application and the speedup if you want; that would be great!

Personally, I have had little/no success with some LBM using Brook+. I have currently switched to CUDA/Cell for the time being while I am waiting for Brook+ to become more mature and better documented.
0 Likes
25 Replies
jean-claude
Journeyman III

Hi,

Interesting question indeed, and one that I've been asking myself for several weeks already.

1) I originally had some trouble getting the whole environment running properly on my config...

2) Since the documentation is not precise, I originally thought I was missing something...

3) After having played around with some AMD examples, I decided to do some real Brook programming by converting a full video-recognition application into a GPU-accelerated version.

4) Simple frame-deinterlacing kernels or dilate/erode functions worked quite well and provided an order-of-magnitude speed improvement versus an implementation running on a Core 2 Duo CPU (and I'm not even mentioning some problems encountered with the Brook compiler at this stage).

5) BUT alas: programming some basic modules such as a sort or a median filter, I discovered that the performance improvements were not clear at all.

I went through a lot of trial and error seeking the best way to use the following (see the sketch just after this list):

- indexof

- stream.domain

- "gather" versus smart reordering of streams

In some cases the runtime system would generate unexplained bugs.

Last but not least, trying to dig into the CAL-generated code and the Brook runtime, I'm amazed at the overhead that seems to be added to perform even simple actions (try, for instance, decompiling the SUM example...).

Part of this may come from logical-to-physical domain translations...

Who knows?

Another point is that, apparently, some very useful features existing in CAL are not directly accessible from Brook...

What could be advisable is to use Brook to compile the kernels, and some form of CAL to interface with and manage GPU memory and stream preparation...

Bottom line:

- I like the Brook+ concept and would enjoy using it in a professional environment

- But: the documentation is still poor and imprecise

- Bugs exist in the compiler

- Bugs exist in the runtime

- No guidance (or how-to tutorial) seems to be available covering best practices for getting performance improvements versus CPU code.

I do think Brook+ is still in its infancy, and that our friends from AMD/ATI should speed up developing it to a professional grade and providing answers to their beta testers...

Kind regards

JC

0 Likes

As an example, just have a look at what is generated for a simple kernel whose aim is simply to compute output = 2*input...

 

kernel void times2(out float output<>, float input<>) {
    output = 2.0f * input;
}

Looking at what follows, I have to say I'm puzzled by the complexity ...

Could somebody from ATI please comment? It would certainly help in understanding a little bit more of what's really happening under Brook's hood!
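For reference, the host side of this example is only a few lines. A rough sketch, assuming the classic Brook+ streamRead/streamWrite idiom from the SDK samples (the array names and the size 64 are illustrative):

// Illustrative host code for the times2 kernel (names and sizes are made up).
int main(void) {
    float in_data[64], out_data[64];
    int i;
    for (i = 0; i < 64; ++i) in_data[i] = (float)i;

    float input<64>;                 // input stream
    float output<64>;                // output stream

    streamRead(input, in_data);      // host array -> stream
    times2(output, input);           // run the kernel on the GPU
    streamWrite(output, out_data);   // stream -> host array

    return 0;
}

Judging from the descriptor at the bottom of the generated file, the three tech*_pass0 strings appear to be the plain, output-address-translated, and input+output-address-translated variants of the same kernel (note the .output_address_translation()/.input_address_translation() flags), which is presumably where much of the apparent bloat comes from.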

 

Generated code:

namespace {
    using namespace ::brook::desc;
    const char __times2_cal_desc_tech0_pass0[] = "il_ps_2_0\n"
        "dcl_literal l0,0x00000000,0x00000000,0x00000000,0x00000000\n"
        "dcl_literal l1,0x00000001,0x00000001,0x00000001,0x00000001\n"
        "dcl_literal l2,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF\n"
        "dcl_literal l3,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF\n"
        "dcl_literal l4,0x7F800000,0x7F800000,0x7F800000,0x7F800000\n"
        "dcl_literal l5,0x80000000,0x80000000,0x80000000,0x80000000\n"
        "dcl_literal l6,0x3E9A209B,0x3E9A209B,0x3E9A209B,0x3E9A209B\n"
        "dcl_literal l7,0x3F317218,0x3F317218,0x3F317218,0x3F317218\n"
        "dcl_literal l8,0x40490FDB,0x40490FDB,0x40490FDB,0x40490FDB\n"
        "dcl_literal l9,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB\n"
        "dcl_literal l10,0x00000003,0x00000003,0x00000003,0x00000003\n"
        "dcl_literal l11,0x00000002,0x00000002,0x00000002,0x00000002\n"
        "dcl_literal l12,0x40000000,0x40000000,0x40000000,0x40000000\n"
        "dcl_output_usage(color) o0.xyzw\n"
        "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
        "dcl_input_usage(generic) v0.xyzw\n"
        "mov r308.xy__,v0.xyzw\n"
        "call 37 \n"
        "call 0 \n"
        "endmain\n"
        "\n"
        "func 0\n"
        "mov o0.xyzw,r307.xyzw\n"
        "ret\n"
        "\n"
        "func 2\n"
        "ieq r0.x___,r17.x000,l0.x000\n"
        "if_logicalnz r0.x000\n"
        "sample_l_resource(0)_sampler(0) r19.xyzw,r18.xy00,r18.0000\n"
        "endif\n"
        "mov r16.x___,r19.x000\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 35\n"
        "mul_ieee r267.x___,l12.x000,r266.x000\n"
        "mov r265.x___,r267.x000\n"
        "ret\n"
        "\n"
        "func 37\n"
        "mov r17.x___,l0.x000\n"
        "mov r18.xy__,r308.xy00\n"
        "call 2 \n"
        "mov r312.x___,r16.x000\n"
        "mov r310.x___,r312.x000\n"
        "mov r266.x___,r310.x000\n"
        "call 35 \n"
        "mov r309.x___,r265.x000\n"
        "mov r311.x___,r309.x000\n"
        "mov r311._y__,l0.0x00\n"
        "mov r311.__z_,l0.00x0\n"
        "mov r311.___w,l0.000x\n"
        "mov r307.xyzw,r311.xyzw\n"
        "ret\n"
        "\n"
        "end\n"
        "";

    const char __times2_cal_desc_tech1_pass0[] = "il_ps_2_0\n"
        "dcl_literal l0,0x00000000,0x00000000,0x00000000,0x00000000\n"
        "dcl_literal l1,0x00000001,0x00000001,0x00000001,0x00000001\n"
        "dcl_literal l2,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF\n"
        "dcl_literal l3,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF\n"
        "dcl_literal l4,0x7F800000,0x7F800000,0x7F800000,0x7F800000\n"
        "dcl_literal l5,0x80000000,0x80000000,0x80000000,0x80000000\n"
        "dcl_literal l6,0x3E9A209B,0x3E9A209B,0x3E9A209B,0x3E9A209B\n"
        "dcl_literal l7,0x3F317218,0x3F317218,0x3F317218,0x3F317218\n"
        "dcl_literal l8,0x40490FDB,0x40490FDB,0x40490FDB,0x40490FDB\n"
        "dcl_literal l9,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB\n"
        "dcl_literal l10,0x00000003,0x00000003,0x00000003,0x00000003\n"
        "dcl_literal l11,0x00000002,0x00000002,0x00000002,0x00000002\n"
        "dcl_literal l12,0x3F000000,0x3F000000,0x3F000000,0x3F000000\n"
        "dcl_literal l13,0x40000000,0x40000000,0x40000000,0x40000000\n"
        "dcl_output_usage(color) o0.xyzw\n"
        "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
        "dcl_input_usage(generic) v0.xyzw\n"
        "dcl_input_usage(generic) v1.xyzw\n"
        "dcl_cb cb0[13]\n"
        "mov r392.xy__,v0.xyzw\n"
        "mov r393.xy__,v1.xyzw\n"
        "mov r387.xyzw,cb0[l0.x + 0].xyzw\n"
        "mov r388.xyzw,cb0[l0.x + 1].xyzw\n"
        "mov r389.xyzw,cb0[l0.x + 2].xyzw\n"
        "mov r390.xyzw,cb0[l0.x + 3].xyzw\n"
        "mov r391.xyzw,cb0[l0.x + 4].xyzw\n"
        "mov r394.xyzw,cb0[l0.x + 5].xyzw\n"
        "mov r395.xyzw,cb0[l0.x + 6].xyzw\n"
        "mov r396.xyzw,cb0[l0.x + 7].xyzw\n"
        "mov r397.xyzw,cb0[l0.x + 8].xyzw\n"
        "mov r398.xyzw,cb0[l0.x + 9].xyzw\n"
        "mov r399.xyzw,cb0[l0.x + 10].xyzw\n"
        "mov r400.xyzw,cb0[l0.x + 11].xyzw\n"
        "mov r401.xyzw,cb0[l0.x + 12].xyzw\n"
        "call 41 \n"
        "call 0 \n"
        "endmain\n"
        "\n"
        "func 0\n"
        "mov o0.xyzw,r386.xyzw\n"
        "ret\n"
        "\n"
        "func 2\n"
        "ieq r0.x___,r17.x000,l0.x000\n"
        "if_logicalnz r0.x000\n"
        "sample_l_resource(0)_sampler(0) r19.xyzw,r18.xy00,r18.0000\n"
        "endif\n"
        "mov r16.x___,r19.x000\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 32\n"
        "add r262.xyzw,r255.xyzw,r256.xyzw\n"
        "dp4_ieee r263.x___,r262.xyzw,r257.xyzw\n"
        "add r264.x___,r263.x000,l12.x000\n"
        "mov r259.x___,r264.x000\n"
        "mul_ieee r265.x___,r259.x000,r258.x000\n"
        "round_neginf r266.x___,r265.x000\n"
        "mov r260._y__,r266.0x00\n"
        "mov r409.x___,r260.y000\n"
        "mov r410.x___,r258.z000\n"
        "mul_ieee r267.x___,r409.x000,r410.x000\n"
        "sub r268.x___,r259.x000,r267.x000\n"
        "round_neginf r269.x___,r268.x000\n"
        "mov r260.x___,r269.x000\n"
        "mov r411.xy__,l12.xx00\n"
        "add r270.xy__,r260.xy00,r411.xy00\n"
        "mov r261.xy__,r270.xy00\n"
        "mov r254.xy__,r261.xy00\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 33\n"
        "round_neginf r284.xy__,r272.xy00\n"
        "mov r280.xy__,r284.xy00\n"
        "dp2_ieee r285.x___,r280.xy00,r273.xy00\n"
        "mov r281.x___,r285.x000\n"
        "add r286.x___,r281.x000,l12.x000\n"
        "mov r412.xyzw,r286.xxxx\n"
        "mul_ieee r287.xyzw,r412.xyzw,r275.xyzw\n"
        "round_neginf r288.xyzw,r287.xyzw\n"
        "mov r282.xyzw,r288.xyzw\n"
        "mul_ieee r289.xyzw,r282.xyzw,r274.xyzw\n"
        "mov r413.xyzw,r281.xxxx\n"
        "sub r290.xyzw,r413.xyzw,r289.xyzw\n"
        "mov r283.xyzw,r290.xyzw\n"
        "mov r414.xyzw,l12.xxxx\n"
        "add r291.xyzw,r283.xyzw,r414.xyzw\n"
        "mul_ieee r292.xyzw,r291.xyzw,r276.xyzw\n"
        "sub r293.xyzw,r292.xyzw,r277.xyzw\n"
        "round_neginf r294.xyzw,r293.xyzw\n"
        "mov r279.xyzw,r294.xyzw\n"
        "mov r415.xyzw,l0.xxxx\n"
        "itof r416.xyzw,r415.xyzw\n"
        "lt r295.xyzw,r279.xyzw,r416.xyzw\n"
        "ior r296.xy__,r295.xy00,r295.zy00\n"
        "ior r296.x___,r296.x000,r296.y000\n"
        "if_logicalnz r296.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ge r297.xyzw,r279.xyzw,r278.xyzw\n"
        "ior r298.xy__,r297.xy00,r297.zy00\n"
        "ior r298.x___,r298.x000,r298.y000\n"
        "if_logicalnz r298.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ret\n"
        "\n"
        "func 39\n"
        "mul_ieee r346.x___,l13.x000,r345.x000\n"
        "mov r344.x___,r346.x000\n"
        "ret\n"
        "\n"
        "func 41\n"
        "mov r272.xy__,r393.xy00\n"
        "mov r417.xy__,r394.xyzw\n"
        "mov r273.xy__,r417.xy00\n"
        "mov r274.xyzw,r395.xyzw\n"
        "mov r275.xyzw,r396.xyzw\n"
        "mov r276.xyzw,r397.xyzw\n"
        "mov r277.xyzw,r398.xyzw\n"
        "mov r278.xyzw,r399.xyzw\n"
        "call 33 \n"
        "mov r407.xyzw,r279.xyzw\n"
        "mov r403.xyzw,r407.xyzw\n"
        "mov r405.xyzw,r407.xyzw\n"
        "mov r406.xy__,r392.xy00\n"
        "mov r17.x___,l0.x000\n"
        "mov r18.xy__,r406.xy00\n"
        "call 2 \n"
        "mov r418.x___,r16.x000\n"
        "mov r404.x___,r418.x000\n"
        "mov r345.x___,r404.x000\n"
        "call 39 \n"
        "mov r402.x___,r344.x000\n"
        "mov r408.x___,r402.x000\n"
        "mov r408._y__,l0.0x00\n"
        "mov r408.__z_,l0.00x0\n"
        "mov r408.___w,l0.000x\n"
        "mov r386.xyzw,r408.xyzw\n"
        "ret\n"
        "\n"
        "end\n"
        "";

    const char __times2_cal_desc_tech2_pass0[] = "il_ps_2_0\n"
        "dcl_literal l0,0x00000000,0x00000000,0x00000000,0x00000000\n"
        "dcl_literal l1,0x00000001,0x00000001,0x00000001,0x00000001\n"
        "dcl_literal l2,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF\n"
        "dcl_literal l3,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF\n"
        "dcl_literal l4,0x7F800000,0x7F800000,0x7F800000,0x7F800000\n"
        "dcl_literal l5,0x80000000,0x80000000,0x80000000,0x80000000\n"
        "dcl_literal l6,0x3E9A209B,0x3E9A209B,0x3E9A209B,0x3E9A209B\n"
        "dcl_literal l7,0x3F317218,0x3F317218,0x3F317218,0x3F317218\n"
        "dcl_literal l8,0x40490FDB,0x40490FDB,0x40490FDB,0x40490FDB\n"
        "dcl_literal l9,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB\n"
        "dcl_literal l10,0x00000003,0x00000003,0x00000003,0x00000003\n"
        "dcl_literal l11,0x00000002,0x00000002,0x00000002,0x00000002\n"
        "dcl_literal l12,0x3F000000,0x3F000000,0x3F000000,0x3F000000\n"
        "dcl_literal l13,0x40000000,0x40000000,0x40000000,0x40000000\n"
        "dcl_output_usage(color) o0.xyzw\n"
        "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
        "dcl_input_usage(generic) v0.xyzw\n"
        "dcl_input_usage(generic) v1.xyzw\n"
        "dcl_cb cb0[13]\n"
        "mov r392.xy__,v0.xyzw\n"
        "mov r393.xy__,v1.xyzw\n"
        "mov r387.xyzw,cb0[l0.x + 0].xyzw\n"
        "mov r388.xyzw,cb0[l0.x + 1].xyzw\n"
        "mov r389.xyzw,cb0[l0.x + 2].xyzw\n"
        "mov r390.xyzw,cb0[l0.x + 3].xyzw\n"
        "mov r391.xyzw,cb0[l0.x + 4].xyzw\n"
        "mov r394.xyzw,cb0[l0.x + 5].xyzw\n"
        "mov r395.xyzw,cb0[l0.x + 6].xyzw\n"
        "mov r396.xyzw,cb0[l0.x + 7].xyzw\n"
        "mov r397.xyzw,cb0[l0.x + 8].xyzw\n"
        "mov r398.xyzw,cb0[l0.x + 9].xyzw\n"
        "mov r399.xyzw,cb0[l0.x + 10].xyzw\n"
        "mov r400.xyzw,cb0[l0.x + 11].xyzw\n"
        "mov r401.xyzw,cb0[l0.x + 12].xyzw\n"
        "call 41 \n"
        "call 0 \n"
        "endmain\n"
        "\n"
        "func 0\n"
        "mov o0.xyzw,r386.xyzw\n"
        "ret\n"
        "\n"
        "func 2\n"
        "ieq r0.x___,r17.x000,l0.x000\n"
        "if_logicalnz r0.x000\n"
        "sample_l_resource(0)_sampler(0) r19.xyzw,r18.xy00,r18.0000\n"
        "endif\n"
        "mov r16.x___,r19.x000\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 31\n"
        "mul_ieee r249.xyzw,r246.xyzw,r247.xyzw\n"
        "mov r409.xyzw,l12.xxxx\n"
        "add r250.xyzw,r249.xyzw,r409.xyzw\n"
        "mul_ieee r251.xyzw,r250.xyzw,r248.xyzw\n"
        "round_neginf r252.xyzw,r251.xyzw\n"
        "mov r245.xyzw,r252.xyzw\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 32\n"
        "add r262.xyzw,r255.xyzw,r256.xyzw\n"
        "dp4_ieee r263.x___,r262.xyzw,r257.xyzw\n"
        "add r264.x___,r263.x000,l12.x000\n"
        "mov r259.x___,r264.x000\n"
        "mul_ieee r265.x___,r259.x000,r258.x000\n"
        "round_neginf r266.x___,r265.x000\n"
        "mov r260._y__,r266.0x00\n"
        "mov r410.x___,r260.y000\n"
        "mov r411.x___,r258.z000\n"
        "mul_ieee r267.x___,r410.x000,r411.x000\n"
        "sub r268.x___,r259.x000,r267.x000\n"
        "round_neginf r269.x___,r268.x000\n"
        "mov r260.x___,r269.x000\n"
        "mov r412.xy__,l12.xx00\n"
        "add r270.xy__,r260.xy00,r412.xy00\n"
        "mov r261.xy__,r270.xy00\n"
        "mov r254.xy__,r261.xy00\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 33\n"
        "round_neginf r284.xy__,r272.xy00\n"
        "mov r280.xy__,r284.xy00\n"
        "dp2_ieee r285.x___,r280.xy00,r273.xy00\n"
        "mov r281.x___,r285.x000\n"
        "add r286.x___,r281.x000,l12.x000\n"
        "mov r413.xyzw,r286.xxxx\n"
        "mul_ieee r287.xyzw,r413.xyzw,r275.xyzw\n"
        "round_neginf r288.xyzw,r287.xyzw\n"
        "mov r282.xyzw,r288.xyzw\n"
        "mul_ieee r289.xyzw,r282.xyzw,r274.xyzw\n"
        "mov r414.xyzw,r281.xxxx\n"
        "sub r290.xyzw,r414.xyzw,r289.xyzw\n"
        "mov r283.xyzw,r290.xyzw\n"
        "mov r415.xyzw,l12.xxxx\n"
        "add r291.xyzw,r283.xyzw,r415.xyzw\n"
        "mul_ieee r292.xyzw,r291.xyzw,r276.xyzw\n"
        "sub r293.xyzw,r292.xyzw,r277.xyzw\n"
        "round_neginf r294.xyzw,r293.xyzw\n"
        "mov r279.xyzw,r294.xyzw\n"
        "mov r416.xyzw,l0.xxxx\n"
        "itof r417.xyzw,r416.xyzw\n"
        "lt r295.xyzw,r279.xyzw,r417.xyzw\n"
        "ior r296.xy__,r295.xy00,r295.zy00\n"
        "ior r296.x___,r296.x000,r296.y000\n"
        "if_logicalnz r296.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ge r297.xyzw,r279.xyzw,r278.xyzw\n"
        "ior r298.xy__,r297.xy00,r297.zy00\n"
        "ior r298.x___,r298.x000,r298.y000\n"
        "if_logicalnz r298.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ret\n"
        "\n"
        "func 39\n"
        "mul_ieee r346.x___,l13.x000,r345.x000\n"
        "mov r344.x___,r346.x000\n"
        "ret\n"
        "\n"
        "func 41\n"
        "mov r272.xy__,r393.xy00\n"
        "mov r418.xy__,r394.xyzw\n"
        "mov r273.xy__,r418.xy00\n"
        "mov r274.xyzw,r395.xyzw\n"
        "mov r275.xyzw,r396.xyzw\n"
        "mov r276.xyzw,r397.xyzw\n"
        "mov r277.xyzw,r398.xyzw\n"
        "mov r278.xyzw,r399.xyzw\n"
        "call 33 \n"
        "mov r407.xyzw,r279.xyzw\n"
        "mov r403.xyzw,r407.xyzw\n"
        "mov r246.xyzw,r407.xyzw\n"
        "mov r247.xyzw,r387.xyzw\n"
        "mov r248.xyzw,r388.xyzw\n"
        "call 31 \n"
        "mov r419.xyzw,r245.xyzw\n"
        "mov r405.xyzw,r419.xyzw\n"
        "mov r255.xyzw,r405.xyzw\n"
        "mov r256.xyzw,r391.xyzw\n"
        "mov r257.xyzw,r389.xyzw\n"
        "mov r258.xyzw,r390.xyzw\n"
        "call 32 \n"
        "mov r420.xy__,r254.xy00\n"
        "mov r406.xy__,r420.xy00\n"
        "mov r17.x___,l0.x000\n"
        "mov r18.xy__,r406.xy00\n"
        "call 2 \n"
        "mov r421.x___,r16.x000\n"
        "mov r404.x___,r421.x000\n"
        "mov r345.x___,r404.x000\n"
        "call 39 \n"
        "mov r402.x___,r344.x000\n"
        "mov r408.x___,r402.x000\n"
        "mov r408._y__,l0.0x00\n"
        "mov r408.__z_,l0.00x0\n"
        "mov r408.___w,l0.000x\n"
        "mov r386.xyzw,r408.xyzw\n"
        "ret\n"
        "\n"
        "end\n"
        "";

    static const gpu_kernel_desc __times2_cal_desc = gpu_kernel_desc()
        .technique( gpu_technique_desc()
            .pass( gpu_pass_desc( __times2_cal_desc_tech0_pass0 )
                .interpolant(2, kStreamInterpolant_Position)
                .output(1, 0)
                .sampler(2, 0)
            )
        )
        .technique( gpu_technique_desc()
            .output_address_translation()
            .pass( gpu_pass_desc( __times2_cal_desc_tech1_pass0 )
                .constant(2, kStreamConstant_ATIndexofNumer)
                .constant(2, kStreamConstant_ATIndexofDenom)
                .constant(2, kStreamConstant_ATLinearize)
                .constant(2, kStreamConstant_ATTextureShape)
                .constant(2, kStreamConstant_ATDomainMin)
                .constant(0, kGlobalConstant_ATOutputLinearize)
                .constant(0, kGlobalConstant_ATOutputStride)
                .constant(0, kGlobalConstant_ATOutputInvStride)
                .constant(0, kGlobalConstant_ATOutputInvExtent)
                .constant(0, kGlobalConstant_ATOutputDomainMin)
                .constant(0, kGlobalConstant_ATOutputDomainSize)
                .constant(0, kGlobalConstant_ATOutputInvShape)
                .constant(0, kGlobalConstant_ATHackConstant)
                .interpolant(0, kGlobalInterpolant_ATOutputTex)
                .interpolant(0, kGlobalInterpolant_ATOutputAddress)
                .output(1, 0)
                .sampler(2, 0)
            )
        )
        .technique( gpu_technique_desc()
            .output_address_translation()
            .input_address_translation()
            .pass( gpu_pass_desc( __times2_cal_desc_tech2_pass0 )
                .constant(2, kStreamConstant_ATIndexofNumer)
                .constant(2, kStreamConstant_ATIndexofDenom)
                .constant(2, kStreamConstant_ATLinearize)
                .constant(2, kStreamConstant_ATTextureShape)
                .constant(2, kStreamConstant_ATDomainMin)
                .constant(0, kGlobalConstant_ATOutputLinearize)
                .constant(0, kGlobalConstant_ATOutputStride)
                .constant(0, kGlobalConstant_ATOutputInvStride)
                .constant(0, kGlobalConstant_ATOutputInvExtent)
                .constant(0, kGlobalConstant_ATOutputDomainMin)
                .constant(0, kGlobalConstant_ATOutputDomainSize)
                .constant(0, kGlobalConstant_ATOutputInvShape)
                .constant(0, kGlobalConstant_ATHackConstant)
                .interpolant(0, kGlobalInterpolant_ATOutputTex)
                .interpolant(0, kGlobalInterpolant_ATOutputAddress)
                .output(1, 0)
                .sampler(2, 0)
            )
        );
    static const void* __times2_cal = &__times2_cal_desc;
}

0 Likes


Hi there,

Just two quick notes about the Brook+ IL code: it comes out simpler if you disable address virtualization (brcc -r, I think), and despite appearing complex it nonetheless often leads to "simple" final GPU ISA code once further optimized and compiled.

I have mainly focused on getting to grips with IL so far, but I like the look of Brook+ for tasks that map clearly onto the streaming paradigm, and I plan to have more of a go at it sometime soon.

Best,
Steven.
0 Likes

dukeleto
Adept I

Re the first post,
I have ported a finite-difference-based Navier-Stokes solver in 2D (double precision) to Brook+, and found it does quite well, though perhaps a little less well than I hoped.
I have not yet managed to measure performance of the Brook+ code directly, as the gpuanalyzer software refuses to compile my (long) kernels.
Thus the only comparison I have for the moment is with my reference Fortran version of the same code, which, for large grids, runs at around 1.5 GFLOPS on a single core of a dual-core 3.2 GHz Xeon. For large grids, I get a speedup of around 12x with an HD3870 + Core 2 Duo 7200.
A naive interpretation of this would be that the code runs at around 18 GFLOPS on the graphics card.
I suspect my performance at the moment is limited by memory speed, as the computational intensity of my kernels is not very high.
Regards
Olivier

0 Likes

Originally posted by: dukeleto

Re the first post,

I have ported a finite-difference-based Navier-Stokes solver in 2D (double precision) to Brook+, and found it does quite well, though perhaps a little less well than I hoped.

I have not yet managed to measure performance of the Brook+ code directly, as the gpuanalyzer software refuses to compile my (long) kernels.

Thus the only comparison I have for the moment is with my reference Fortran version of the same code, which, for large grids, runs at around 1.5 GFLOPS on a single core of a dual-core 3.2 GHz Xeon. For large grids, I get a speedup of around 12x with an HD3870 + Core 2 Duo 7200.

A naive interpretation of this would be that the code runs at around 18 GFLOPS on the graphics card.

I suspect my performance at the moment is limited by memory speed, as the computational intensity of my kernels is not very high.

Regards

Olivier


I have a colleague who did a nonlinear LBM Navier-Stokes algorithm in CUDA. It took him only a few days, and he got large speedups: 100x+, with optimizations and 100% occupancy. Just an FYI.

0 Likes

Hope this thread is not too old 🙂

I can say that I have had success with Brook+ 🙂

My application (BarsWF, a hash brute-forcer) runs at near 100% of the real theoretical performance on Brook+, CUDA, and SSE2.

Making it work efficiently was quite challenging with Brook, but it is definitely possible. Right now my application shows that performance/$ (I mean the performance of my program) on AMD cards is about twice as good as on NVIDIA cards:

The 4870 does ~1286 MHash/sec.

The GTX280 does ~710 MHash/sec while being more expensive (and this is not a development issue; competitors are even slower).

0 Likes
nberger
Adept I

Just to add to the successes: My partial wave analysis fit runs about 150 times faster on a 4870 than the reference FORTRAN implementation on the same Core Duo machine.
0 Likes
bayoumi
Journeyman III

I would like to disagree about avoiding "embarrassingly parallel" or "non-simple". I think the GPU from any manufacturer is just a SIMD machine. If you have a part of your problem which is either explicitly parallel or "vector-like", and it takes a good portion of the computation time, then the GPU is the perfect call. The GPU is just one piece of the infrastructure, and we should not be obsessed with putting everything on it.
0 Likes

Originally posted by: bayoumi

I would like to disagree about avoiding "embarrassingly parallel" or "non-simple". I think the GPU from any manufacturer is just a SIMD machine. If you have a part of your problem which is either explicitly parallel or "vector-like", and it takes a good portion of the computation time, then the GPU is the perfect call. The GPU is just one piece of the infrastructure, and we should not be obsessed with putting everything on it.


Well, you can't really disagree with a question, can you??

I asked a question about who has had success with complex (non-simple/embarrassingly parallel) problems.

I agree that non-simple and embarrassingly parallel problems are the best for the GPU and are also the easiest to port (which is why I don't care about them). A problem that is nicely laid out has already done the hard part for you, and there is really no "problem" there to solve.

If you are looking for results and have a non-simple/EP problem, then YES, by all means the GPU is a great solution... if you are only interested in results. If you are interested in GPU solutions themselves, then this is not the case, which is why I posed the question.
0 Likes

I started this thread so I just wanted to give an update:

Since 1.3 I have started back with Brook+; I have been working on my LBM code on and off for ~3 weeks.

I was comparing my version to a colleague's CUDA version, on both the 8800GTX and the GTX280.

Currently, my implementation on the 4850 runs ~1 sec faster than his CUDA version (which has 100% occupancy and little/low uncoalesced global/shared reads/writes) on the 8800GTX, but is still significantly slower than the GTX280. I am shooting for it to run somewhere between the 8800GTX and the GTX280.

Compared to the CPU (I'm still not sure why people do this but ok) I get ~26x speedup.
0 Likes

Originally posted by: ryta1203 I started this thread so I just wanted to give an update: Since 1.3 I have started back with Brook+; I have been working on my LBM code on and off for ~3 weeks. I was comparing my version to a colleague's CUDA version, both for the 8800GTX and the GTX280. Currently, my implementation on the 4850 runs ~1 sec faster than his CUDA version (which has 100% occupancy and little/low uncoalesced global/shared reads/writes) on the 8800GTX, but is still significantly slower than the GTX280. I am shooting for it to run somewhere between the 8800GTX and the GTX280. Compared to the CPU (I'm still not sure why people do this but ok) I get ~26x speedup.


Thanks for the update, and nice work thus far! Keep us posted.

0 Likes

Just another little update on that same project, with Brook+.

I am now ~6 seconds faster than the 8800GTX with the 4850. The 8800GTX solution is ~57.3 seconds and the 4850 solution is ~51.2 seconds.

A couple of optimizations since last time I posted have gained me 5 seconds, giving me ~28x speedup so far.
0 Likes

Originally posted by: ryta1203 Just another little update on that same project, with Brook+. I am now ~6 seconds faster than the 8800GTX with the 4850. The 8800GTX solution is ~57.3 seconds and the 4850 solution is ~51.2 seconds. A couple of optimizations since last time I posted have gained me 5 seconds, giving me ~28x speedup so far.


Just thought I'd give one final update... I haven't messed with this code in A LONG TIME but the results I posted earlier were also not the final results.

My "final" code (possibly could be optimized further but I'm not going to do it) on the 4870 runs ~3 seconds slower over 5000 iterations for a domain of 1024x1024 than an "optimized" CUDA implementation on the GTX280.

Just an FYI in case anyone cares. I have recently gotten a 5870... I'm excited to see the code run on the 5870.

0 Likes

Run it then, ryta; I want to see if adding more stream processing units is worth the price.

Maybe your code needs more threads rather than more ALUs.

0 Likes

Originally posted by: riza.guntur Run it then, ryta; I want to see if adding more stream processing units is worth the price.

Maybe your code needs more threads rather than more ALUs.

Some of the kernels are certainly memory bound, but I do have a few that are ALU bound.

Also, the clock speeds will help some regardless.

The only problem is that right now I can't get the 5870 installed in my machine to work... it won't display anything. The computer runs and everything looks fine, but none of the display outputs display anything... put my 4870 back in, and everything's fine. Figures.

0 Likes

Originally posted by: riza.guntur Run it then, ryta; I want to see if adding more stream processing units is worth the price.

Maybe your code needs more threads rather than more ALUs.

Well, two things:

1. There are more ALU units

2. Clock speeds are faster.

So even if I had no kernels that were truly ALU bound, and didn't have enough WFs (wavefronts) running to stay ALU bound or at least hide some fetch latencies, I would see some improvement from the clock speeds alone.

The same code runs at ~22 (give or take a few tenths of a second) seconds on the 5870 for 5000 iterations at 1024x1024. Just thought I'd let you know since you asked.

BTW, I had to move the 5870 to another machine to get it to work. I'm not sure if the problem is my PSU (850W, so shouldn't be the problem) or my MB (MSI K9A Platinum)....

the machine I moved the card to has an MSI K9A2 Platinum and a 750W PSU, so I'm guessing that this card won't run on some legacy motherboards (not that the K9A is all that "legacy", really).

0 Likes

ryta, this code on the HD4870 was running in 33 seconds, right?

Have you already figured out which is the bottleneck in the new card? If all the units get used, could it be L2 bandwidth?

 

0 Likes

Yes, that is correct with the 4870.

It depends on the kernel (there are 4 kernels altogether).

ALSO, I underclocked the 5870 to 750/900 (which I believe is what the 4870 is clocked at) and I got ~27.5 seconds... so about half the speedup is due to the increased clocks and the other half is due to the increased number of SIMD engines... this makes sense to me.
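As a quick sanity check of that "half and half" claim, here is the arithmetic on the timings quoted in this thread (~33 s on the 4870, ~27.5 s on the underclocked 5870, ~22 s on the stock 5870), as a throwaway C snippet:

#include <stdio.h>

/* Back-of-the-envelope check of the "half clocks / half SIMDs" claim,
 * using the timings quoted in this thread (5000 iterations, 1024x1024). */
int main(void)
{
    const double t_4870            = 33.0;  /* HD4870                     */
    const double t_5870_underclock = 27.5;  /* HD5870 at 4870-like clocks */
    const double t_5870            = 22.0;  /* HD5870 at stock clocks     */

    double arch_gain  = t_4870 / t_5870_underclock;  /* extra SIMD engines only */
    double clock_gain = t_5870_underclock / t_5870;  /* clock increase only     */
    double total_gain = t_4870 / t_5870;             /* both effects combined   */

    printf("arch x%.2f, clock x%.2f, total x%.2f\n",
           arch_gain, clock_gain, total_gain);
    /* Roughly: arch x1.20, clock x1.25, total x1.50 -- the two factors
     * multiply, and each contributes about half of the overall speedup. */
    return 0;
}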

Since there are more SIMD engines, it makes sense to me (correct me if I am wrong here) that you are getting more overall computing power... so it's not just about the ALUs either; you also have more TEX units (though the same number per SIMD engine), so you can operate on more data at once, whether it's fetching or ALU operations. I'm not 100% sure how the increase in SIMD engines affects global writes, so I won't speak on that.

Overall though, for 5000 iterations I got a 10-second improvement (without underclocking), and since at 1024x1024 the total process takes ~88000 iterations, that's a decent speedup if you were running a lot of tests at large domain sizes, IMO.

0 Likes

Originally posted by: ryta1203 Since there are more SIMD engines, it makes sense to me (correct me if I am wrong here) that you are getting more overall computing power... so it's not just about the ALUs either; you also have more TEX units (though the same number per SIMD engine), so you can operate on more data at once, whether it's fetching or ALU operations. I'm not 100% sure how the increase in SIMD engines affects global writes, so I won't speak on that.

You are correct.

 

About global writes and the rest, that's the point... I've heard many guys complaining about the poor gain of the HD5870 over the HD4870, but why?

 

0 Likes

Well eduardo,

If you look at the SCUG section "Estimating Performance" and at how the memory bottleneck is calculated, you won't see any variable/parameter that has anything to do with the number of texture units or ALUs (essentially nothing to do with having more SIMD engines)...

... so... if your kernels are memory bound, you are only going to benefit from the increase in memory clock speed from the 4870 to the 5870... which probably isn't going to be that significant a performance increase.
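As a rough sketch of that kind of estimate (a simplification in the spirit of the SCUG's "Estimating Performance" discussion, not the guide's exact formula; the bandwidth figures below are ballpark numbers, not measurements), the floor on a memory-bound kernel's runtime is just bytes moved divided by memory bandwidth, so the number of SIMD engines never appears:

#include <stdio.h>

/* Rough memory-bound estimate: only memory bandwidth enters; the number of
 * SIMD engines, ALUs or TEX units never appears in the formula. */
static double memory_bound_time_ms(double bytes_read, double bytes_written,
                                   double bandwidth_gb_per_s)
{
    double total_bytes = bytes_read + bytes_written;
    return total_bytes / (bandwidth_gb_per_s * 1e9) * 1e3;  /* milliseconds */
}

int main(void)
{
    /* Example: one pass over a 1024x1024 grid of floats, read once, written once. */
    double bytes = 1024.0 * 1024.0 * 4.0;
    printf("~115 GB/s card: %.3f ms per pass\n", memory_bound_time_ms(bytes, bytes, 115.0));
    printf("~154 GB/s card: %.3f ms per pass\n", memory_bound_time_ms(bytes, bytes, 154.0));
    return 0;
}

So a memory-bound kernel only improves in proportion to the bandwidth increase, which is exactly the point being made above.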

 

0 Likes

Occasionally I go back to this program...

... just an update: I squeezed an additional ~2 seconds/5000 iterations out of the code.

So my optimized Brook+ code running on a single 5870 runs about 10 seconds faster over 5000 iterations than our optimized Cuda code running on a single GTX280.

The CUDA code is pretty optimized, BTW; the guy who coded it knows what he's doing, in case anyone was wondering.

0 Likes

Which version of Brook+ are you guys using, and where did you get it? I'd like to get it up and running on my PC too.

0 Likes

I think I'm using SDK 1.4 still.

0 Likes

Glad to see that you are able to get things to work.
Just for a comparison of where you should be running on the 4850:
http://www.tomshardware.com/ch...3DMark-Score,794.html

Once you better optimize for the chip, you should be able to push into the GTX260 area.
0 Likes