ryta1203
Journeyman III

Success with Brook+?

I just wanted to ask how many users have had real success using Brook+ (NOT CAL). By "real success" I mean a complex (non-trivial, not embarrassingly parallel) real-world application written in Brook+ that delivers a performance improvement.

I think it would be very interesting to see, and I welcome any and all posts. Please post the application and the speedup if you are willing; that would be great!

Personally, I have had little to no success with an LBM code using Brook+. I have switched to CUDA/Cell for the time being while I wait for Brook+ to become more mature and better documented.
jean-claude
Journeyman III

Hi,

An interesting question indeed, and one I have been asking myself for several weeks.

1) I originally had some trouble getting the whole environment running properly on my configuration...

2) Since the documentation is not precise, I originally thought I was missing something...

3) After playing around with some AMD examples, I decided to do some real Brook programming by converting a full video recognition application into a GPU-accelerated version.

4) Simple kernels such as frame deinterlacing or dilate/erode functions worked quite well and gave an order-of-magnitude speedup over an implementation running on a Core 2 Duo CPU. (I'm not mentioning some problems I ran into with the Brook compiler at this stage.)

5) BUT alas: when programming some basic modules such as a sort or a median filter, I discovered that the performance improvements were not clear at all.

I went through a lot of trial and error seeking the best way to use:

- indexof

- stream.domain

- "gather" versus smart reordering of streams

In some cases the runtime system would generate unexplained bugs.

Last but not least, trying to dig into the generated CAL code and the Brook runtime, I'm amazed at the overhead that seems to be added to perform even simple actions (try, for instance, decompiling the SUM example...)

Part of this may come from logical to physical domain translations...

Who knows?

Another point is that some very useful features that exist in CAL are apparently not directly accessible from Brook...

It might be advisable to use Brook to compile the kernel and then use CAL directly to handle GPU memory management and stream preparation.

Bottom line:

- I like the Brook+ concept and would enjoy using it in a professional environment

- But: the documentation is still poor and imprecise

- Bugs exist in the compiler

- Bugs exist in the runtime

- No guidance or how-to tutorial seems to be available on best practices for getting a performance improvement over CPU code.

I do think Brook+ is still in its infancy, and that our friends at AMD/ATI should speed up developing it into a professional-grade tool and provide answers to their beta testers...

Kind regards

JC

jean-claude
Journeyman III

As an example, just have a look at what is generated for a simple kernel whose only aim is to compute output = 2*input...

 

kernel void times2(out float output<>, float input<>) {
    output=2.0f*input;
}

Looking at what follows, I have to say I'm puzzled by the complexity...

Could somebody from ATI please comment? It would certainly help in understanding a little more of what's really happening under Brook's hood!

 

Generated code:

namespace {
    using namespace ::brook::desc;
    const char __times2_cal_desc_tech0_pass0[] = "il_ps_2_0\n"
        "dcl_literal l0,0x00000000,0x00000000,0x00000000,0x00000000\n"
        "dcl_literal l1,0x00000001,0x00000001,0x00000001,0x00000001\n"
        "dcl_literal l2,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF\n"
        "dcl_literal l3,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF\n"
        "dcl_literal l4,0x7F800000,0x7F800000,0x7F800000,0x7F800000\n"
        "dcl_literal l5,0x80000000,0x80000000,0x80000000,0x80000000\n"
        "dcl_literal l6,0x3E9A209B,0x3E9A209B,0x3E9A209B,0x3E9A209B\n"
        "dcl_literal l7,0x3F317218,0x3F317218,0x3F317218,0x3F317218\n"
        "dcl_literal l8,0x40490FDB,0x40490FDB,0x40490FDB,0x40490FDB\n"
        "dcl_literal l9,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB\n"
        "dcl_literal l10,0x00000003,0x00000003,0x00000003,0x00000003\n"
        "dcl_literal l11,0x00000002,0x00000002,0x00000002,0x00000002\n"
        "dcl_literal l12,0x40000000,0x40000000,0x40000000,0x40000000\n"
        "dcl_output_usage(color) o0.xyzw\n"
        "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
        "dcl_input_usage(generic) v0.xyzw\n"
        "mov r308.xy__,v0.xyzw\n"
        "call 37 \n"
        "call 0 \n"
        "endmain\n"
        "\n"
        "func 0\n"
        "mov o0.xyzw,r307.xyzw\n"
        "ret\n"
        "\n"
        "func 2\n"
        "ieq r0.x___,r17.x000,l0.x000\n"
        "if_logicalnz r0.x000\n"
        "sample_l_resource(0)_sampler(0) r19.xyzw,r18.xy00,r18.0000\n"
        "endif\n"
        "mov r16.x___,r19.x000\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 35\n"
        "mul_ieee r267.x___,l12.x000,r266.x000\n"
        "mov r265.x___,r267.x000\n"
        "ret\n"
        "\n"
        "func 37\n"
        "mov r17.x___,l0.x000\n"
        "mov r18.xy__,r308.xy00\n"
        "call 2 \n"
        "mov r312.x___,r16.x000\n"
        "mov r310.x___,r312.x000\n"
        "mov r266.x___,r310.x000\n"
        "call 35 \n"
        "mov r309.x___,r265.x000\n"
        "mov r311.x___,r309.x000\n"
        "mov r311._y__,l0.0x00\n"
        "mov r311.__z_,l0.00x0\n"
        "mov r311.___w,l0.000x\n"
        "mov r307.xyzw,r311.xyzw\n"
        "ret\n"
        "\n"
        "end\n"
        "";

    const char __times2_cal_desc_tech1_pass0[] = "il_ps_2_0\n"
        "dcl_literal l0,0x00000000,0x00000000,0x00000000,0x00000000\n"
        "dcl_literal l1,0x00000001,0x00000001,0x00000001,0x00000001\n"
        "dcl_literal l2,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF\n"
        "dcl_literal l3,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF\n"
        "dcl_literal l4,0x7F800000,0x7F800000,0x7F800000,0x7F800000\n"
        "dcl_literal l5,0x80000000,0x80000000,0x80000000,0x80000000\n"
        "dcl_literal l6,0x3E9A209B,0x3E9A209B,0x3E9A209B,0x3E9A209B\n"
        "dcl_literal l7,0x3F317218,0x3F317218,0x3F317218,0x3F317218\n"
        "dcl_literal l8,0x40490FDB,0x40490FDB,0x40490FDB,0x40490FDB\n"
        "dcl_literal l9,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB\n"
        "dcl_literal l10,0x00000003,0x00000003,0x00000003,0x00000003\n"
        "dcl_literal l11,0x00000002,0x00000002,0x00000002,0x00000002\n"
        "dcl_literal l12,0x3F000000,0x3F000000,0x3F000000,0x3F000000\n"
        "dcl_literal l13,0x40000000,0x40000000,0x40000000,0x40000000\n"
        "dcl_output_usage(color) o0.xyzw\n"
        "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
        "dcl_input_usage(generic) v0.xyzw\n"
        "dcl_input_usage(generic) v1.xyzw\n"
        "dcl_cb cb0[13]\n"
        "mov r392.xy__,v0.xyzw\n"
        "mov r393.xy__,v1.xyzw\n"
        "mov r387.xyzw,cb0[l0.x + 0].xyzw\n"
        "mov r388.xyzw,cb0[l0.x + 1].xyzw\n"
        "mov r389.xyzw,cb0[l0.x + 2].xyzw\n"
        "mov r390.xyzw,cb0[l0.x + 3].xyzw\n"
        "mov r391.xyzw,cb0[l0.x + 4].xyzw\n"
        "mov r394.xyzw,cb0[l0.x + 5].xyzw\n"
        "mov r395.xyzw,cb0[l0.x + 6].xyzw\n"
        "mov r396.xyzw,cb0[l0.x + 7].xyzw\n"
        "mov r397.xyzw,cb0[l0.x + 8].xyzw\n"
        "mov r398.xyzw,cb0[l0.x + 9].xyzw\n"
        "mov r399.xyzw,cb0[l0.x + 10].xyzw\n"
        "mov r400.xyzw,cb0[l0.x + 11].xyzw\n"
        "mov r401.xyzw,cb0[l0.x + 12].xyzw\n"
        "call 41 \n"
        "call 0 \n"
        "endmain\n"
        "\n"
        "func 0\n"
        "mov o0.xyzw,r386.xyzw\n"
        "ret\n"
        "\n"
        "func 2\n"
        "ieq r0.x___,r17.x000,l0.x000\n"
        "if_logicalnz r0.x000\n"
        "sample_l_resource(0)_sampler(0) r19.xyzw,r18.xy00,r18.0000\n"
        "endif\n"
        "mov r16.x___,r19.x000\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 32\n"
        "add r262.xyzw,r255.xyzw,r256.xyzw\n"
        "dp4_ieee r263.x___,r262.xyzw,r257.xyzw\n"
        "add r264.x___,r263.x000,l12.x000\n"
        "mov r259.x___,r264.x000\n"
        "mul_ieee r265.x___,r259.x000,r258.x000\n"
        "round_neginf r266.x___,r265.x000\n"
        "mov r260._y__,r266.0x00\n"
        "mov r409.x___,r260.y000\n"
        "mov r410.x___,r258.z000\n"
        "mul_ieee r267.x___,r409.x000,r410.x000\n"
        "sub r268.x___,r259.x000,r267.x000\n"
        "round_neginf r269.x___,r268.x000\n"
        "mov r260.x___,r269.x000\n"
        "mov r411.xy__,l12.xx00\n"
        "add r270.xy__,r260.xy00,r411.xy00\n"
        "mov r261.xy__,r270.xy00\n"
        "mov r254.xy__,r261.xy00\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 33\n"
        "round_neginf r284.xy__,r272.xy00\n"
        "mov r280.xy__,r284.xy00\n"
        "dp2_ieee r285.x___,r280.xy00,r273.xy00\n"
        "mov r281.x___,r285.x000\n"
        "add r286.x___,r281.x000,l12.x000\n"
        "mov r412.xyzw,r286.xxxx\n"
        "mul_ieee r287.xyzw,r412.xyzw,r275.xyzw\n"
        "round_neginf r288.xyzw,r287.xyzw\n"
        "mov r282.xyzw,r288.xyzw\n"
        "mul_ieee r289.xyzw,r282.xyzw,r274.xyzw\n"
        "mov r413.xyzw,r281.xxxx\n"
        "sub r290.xyzw,r413.xyzw,r289.xyzw\n"
        "mov r283.xyzw,r290.xyzw\n"
        "mov r414.xyzw,l12.xxxx\n"
        "add r291.xyzw,r283.xyzw,r414.xyzw\n"
        "mul_ieee r292.xyzw,r291.xyzw,r276.xyzw\n"
        "sub r293.xyzw,r292.xyzw,r277.xyzw\n"
        "round_neginf r294.xyzw,r293.xyzw\n"
        "mov r279.xyzw,r294.xyzw\n"
        "mov r415.xyzw,l0.xxxx\n"
        "itof r416.xyzw,r415.xyzw\n"
        "lt r295.xyzw,r279.xyzw,r416.xyzw\n"
        "ior r296.xy__,r295.xy00,r295.zy00\n"
        "ior r296.x___,r296.x000,r296.y000\n"
        "if_logicalnz r296.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ge r297.xyzw,r279.xyzw,r278.xyzw\n"
        "ior r298.xy__,r297.xy00,r297.zy00\n"
        "ior r298.x___,r298.x000,r298.y000\n"
        "if_logicalnz r298.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ret\n"
        "\n"
        "func 39\n"
        "mul_ieee r346.x___,l13.x000,r345.x000\n"
        "mov r344.x___,r346.x000\n"
        "ret\n"
        "\n"
        "func 41\n"
        "mov r272.xy__,r393.xy00\n"
        "mov r417.xy__,r394.xyzw\n"
        "mov r273.xy__,r417.xy00\n"
        "mov r274.xyzw,r395.xyzw\n"
        "mov r275.xyzw,r396.xyzw\n"
        "mov r276.xyzw,r397.xyzw\n"
        "mov r277.xyzw,r398.xyzw\n"
        "mov r278.xyzw,r399.xyzw\n"
        "call 33 \n"
        "mov r407.xyzw,r279.xyzw\n"
        "mov r403.xyzw,r407.xyzw\n"
        "mov r405.xyzw,r407.xyzw\n"
        "mov r406.xy__,r392.xy00\n"
        "mov r17.x___,l0.x000\n"
        "mov r18.xy__,r406.xy00\n"
        "call 2 \n"
        "mov r418.x___,r16.x000\n"
        "mov r404.x___,r418.x000\n"
        "mov r345.x___,r404.x000\n"
        "call 39 \n"
        "mov r402.x___,r344.x000\n"
        "mov r408.x___,r402.x000\n"
        "mov r408._y__,l0.0x00\n"
        "mov r408.__z_,l0.00x0\n"
        "mov r408.___w,l0.000x\n"
        "mov r386.xyzw,r408.xyzw\n"
        "ret\n"
        "\n"
        "end\n"
        "";

    const char __times2_cal_desc_tech2_pass0[] = "il_ps_2_0\n"
        "dcl_literal l0,0x00000000,0x00000000,0x00000000,0x00000000\n"
        "dcl_literal l1,0x00000001,0x00000001,0x00000001,0x00000001\n"
        "dcl_literal l2,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF,0xFFFFFFFF\n"
        "dcl_literal l3,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF,0x7FFFFFFF\n"
        "dcl_literal l4,0x7F800000,0x7F800000,0x7F800000,0x7F800000\n"
        "dcl_literal l5,0x80000000,0x80000000,0x80000000,0x80000000\n"
        "dcl_literal l6,0x3E9A209B,0x3E9A209B,0x3E9A209B,0x3E9A209B\n"
        "dcl_literal l7,0x3F317218,0x3F317218,0x3F317218,0x3F317218\n"
        "dcl_literal l8,0x40490FDB,0x40490FDB,0x40490FDB,0x40490FDB\n"
        "dcl_literal l9,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB,0x3FC90FDB\n"
        "dcl_literal l10,0x00000003,0x00000003,0x00000003,0x00000003\n"
        "dcl_literal l11,0x00000002,0x00000002,0x00000002,0x00000002\n"
        "dcl_literal l12,0x3F000000,0x3F000000,0x3F000000,0x3F000000\n"
        "dcl_literal l13,0x40000000,0x40000000,0x40000000,0x40000000\n"
        "dcl_output_usage(color) o0.xyzw\n"
        "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
        "dcl_input_usage(generic) v0.xyzw\n"
        "dcl_input_usage(generic) v1.xyzw\n"
        "dcl_cb cb0[13]\n"
        "mov r392.xy__,v0.xyzw\n"
        "mov r393.xy__,v1.xyzw\n"
        "mov r387.xyzw,cb0[l0.x + 0].xyzw\n"
        "mov r388.xyzw,cb0[l0.x + 1].xyzw\n"
        "mov r389.xyzw,cb0[l0.x + 2].xyzw\n"
        "mov r390.xyzw,cb0[l0.x + 3].xyzw\n"
        "mov r391.xyzw,cb0[l0.x + 4].xyzw\n"
        "mov r394.xyzw,cb0[l0.x + 5].xyzw\n"
        "mov r395.xyzw,cb0[l0.x + 6].xyzw\n"
        "mov r396.xyzw,cb0[l0.x + 7].xyzw\n"
        "mov r397.xyzw,cb0[l0.x + 8].xyzw\n"
        "mov r398.xyzw,cb0[l0.x + 9].xyzw\n"
        "mov r399.xyzw,cb0[l0.x + 10].xyzw\n"
        "mov r400.xyzw,cb0[l0.x + 11].xyzw\n"
        "mov r401.xyzw,cb0[l0.x + 12].xyzw\n"
        "call 41 \n"
        "call 0 \n"
        "endmain\n"
        "\n"
        "func 0\n"
        "mov o0.xyzw,r386.xyzw\n"
        "ret\n"
        "\n"
        "func 2\n"
        "ieq r0.x___,r17.x000,l0.x000\n"
        "if_logicalnz r0.x000\n"
        "sample_l_resource(0)_sampler(0) r19.xyzw,r18.xy00,r18.0000\n"
        "endif\n"
        "mov r16.x___,r19.x000\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 31\n"
        "mul_ieee r249.xyzw,r246.xyzw,r247.xyzw\n"
        "mov r409.xyzw,l12.xxxx\n"
        "add r250.xyzw,r249.xyzw,r409.xyzw\n"
        "mul_ieee r251.xyzw,r250.xyzw,r248.xyzw\n"
        "round_neginf r252.xyzw,r251.xyzw\n"
        "mov r245.xyzw,r252.xyzw\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 32\n"
        "add r262.xyzw,r255.xyzw,r256.xyzw\n"
        "dp4_ieee r263.x___,r262.xyzw,r257.xyzw\n"
        "add r264.x___,r263.x000,l12.x000\n"
        "mov r259.x___,r264.x000\n"
        "mul_ieee r265.x___,r259.x000,r258.x000\n"
        "round_neginf r266.x___,r265.x000\n"
        "mov r260._y__,r266.0x00\n"
        "mov r410.x___,r260.y000\n"
        "mov r411.x___,r258.z000\n"
        "mul_ieee r267.x___,r410.x000,r411.x000\n"
        "sub r268.x___,r259.x000,r267.x000\n"
        "round_neginf r269.x___,r268.x000\n"
        "mov r260.x___,r269.x000\n"
        "mov r412.xy__,l12.xx00\n"
        "add r270.xy__,r260.xy00,r412.xy00\n"
        "mov r261.xy__,r270.xy00\n"
        "mov r254.xy__,r261.xy00\n"
        "ret_dyn\n"
        "ret\n"
        "\n"
        "func 33\n"
        "round_neginf r284.xy__,r272.xy00\n"
        "mov r280.xy__,r284.xy00\n"
        "dp2_ieee r285.x___,r280.xy00,r273.xy00\n"
        "mov r281.x___,r285.x000\n"
        "add r286.x___,r281.x000,l12.x000\n"
        "mov r413.xyzw,r286.xxxx\n"
        "mul_ieee r287.xyzw,r413.xyzw,r275.xyzw\n"
        "round_neginf r288.xyzw,r287.xyzw\n"
        "mov r282.xyzw,r288.xyzw\n"
        "mul_ieee r289.xyzw,r282.xyzw,r274.xyzw\n"
        "mov r414.xyzw,r281.xxxx\n"
        "sub r290.xyzw,r414.xyzw,r289.xyzw\n"
        "mov r283.xyzw,r290.xyzw\n"
        "mov r415.xyzw,l12.xxxx\n"
        "add r291.xyzw,r283.xyzw,r415.xyzw\n"
        "mul_ieee r292.xyzw,r291.xyzw,r276.xyzw\n"
        "sub r293.xyzw,r292.xyzw,r277.xyzw\n"
        "round_neginf r294.xyzw,r293.xyzw\n"
        "mov r279.xyzw,r294.xyzw\n"
        "mov r416.xyzw,l0.xxxx\n"
        "itof r417.xyzw,r416.xyzw\n"
        "lt r295.xyzw,r279.xyzw,r417.xyzw\n"
        "ior r296.xy__,r295.xy00,r295.zy00\n"
        "ior r296.x___,r296.x000,r296.y000\n"
        "if_logicalnz r296.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ge r297.xyzw,r279.xyzw,r278.xyzw\n"
        "ior r298.xy__,r297.xy00,r297.zy00\n"
        "ior r298.x___,r298.x000,r298.y000\n"
        "if_logicalnz r298.x000\n"
        "discard_logicalz l0.xyzw\n"
        "endif\n"
        "ret\n"
        "\n"
        "func 39\n"
        "mul_ieee r346.x___,l13.x000,r345.x000\n"
        "mov r344.x___,r346.x000\n"
        "ret\n"
        "\n"
        "func 41\n"
        "mov r272.xy__,r393.xy00\n"
        "mov r418.xy__,r394.xyzw\n"
        "mov r273.xy__,r418.xy00\n"
        "mov r274.xyzw,r395.xyzw\n"
        "mov r275.xyzw,r396.xyzw\n"
        "mov r276.xyzw,r397.xyzw\n"
        "mov r277.xyzw,r398.xyzw\n"
        "mov r278.xyzw,r399.xyzw\n"
        "call 33 \n"
        "mov r407.xyzw,r279.xyzw\n"
        "mov r403.xyzw,r407.xyzw\n"
        "mov r246.xyzw,r407.xyzw\n"
        "mov r247.xyzw,r387.xyzw\n"
        "mov r248.xyzw,r388.xyzw\n"
        "call 31 \n"
        "mov r419.xyzw,r245.xyzw\n"
        "mov r405.xyzw,r419.xyzw\n"
        "mov r255.xyzw,r405.xyzw\n"
        "mov r256.xyzw,r391.xyzw\n"
        "mov r257.xyzw,r389.xyzw\n"
        "mov r258.xyzw,r390.xyzw\n"
        "call 32 \n"
        "mov r420.xy__,r254.xy00\n"
        "mov r406.xy__,r420.xy00\n"
        "mov r17.x___,l0.x000\n"
        "mov r18.xy__,r406.xy00\n"
        "call 2 \n"
        "mov r421.x___,r16.x000\n"
        "mov r404.x___,r421.x000\n"
        "mov r345.x___,r404.x000\n"
        "call 39 \n"
        "mov r402.x___,r344.x000\n"
        "mov r408.x___,r402.x000\n"
        "mov r408._y__,l0.0x00\n"
        "mov r408.__z_,l0.00x0\n"
        "mov r408.___w,l0.000x\n"
        "mov r386.xyzw,r408.xyzw\n"
        "ret\n"
        "\n"
        "end\n"
        "";

    static const gpu_kernel_desc __times2_cal_desc = gpu_kernel_desc()
        .technique( gpu_technique_desc()
            .pass( gpu_pass_desc( __times2_cal_desc_tech0_pass0 )
                .interpolant(2, kStreamInterpolant_Position)
                .output(1, 0)
                .sampler(2, 0)
            )
        )
        .technique( gpu_technique_desc()
            .output_address_translation()
            .pass( gpu_pass_desc( __times2_cal_desc_tech1_pass0 )
                .constant(2, kStreamConstant_ATIndexofNumer)
                .constant(2, kStreamConstant_ATIndexofDenom)
                .constant(2, kStreamConstant_ATLinearize)
                .constant(2, kStreamConstant_ATTextureShape)
                .constant(2, kStreamConstant_ATDomainMin)
                .constant(0, kGlobalConstant_ATOutputLinearize)
                .constant(0, kGlobalConstant_ATOutputStride)
                .constant(0, kGlobalConstant_ATOutputInvStride)
                .constant(0, kGlobalConstant_ATOutputInvExtent)
                .constant(0, kGlobalConstant_ATOutputDomainMin)
                .constant(0, kGlobalConstant_ATOutputDomainSize)
                .constant(0, kGlobalConstant_ATOutputInvShape)
                .constant(0, kGlobalConstant_ATHackConstant)
                .interpolant(0, kGlobalInterpolant_ATOutputTex)
                .interpolant(0, kGlobalInterpolant_ATOutputAddress)
                .output(1, 0)
                .sampler(2, 0)
            )
        )
        .technique( gpu_technique_desc()
            .output_address_translation()
            .input_address_translation()
            .pass( gpu_pass_desc( __times2_cal_desc_tech2_pass0 )
                .constant(2, kStreamConstant_ATIndexofNumer)
                .constant(2, kStreamConstant_ATIndexofDenom)
                .constant(2, kStreamConstant_ATLinearize)
                .constant(2, kStreamConstant_ATTextureShape)
                .constant(2, kStreamConstant_ATDomainMin)
                .constant(0, kGlobalConstant_ATOutputLinearize)
                .constant(0, kGlobalConstant_ATOutputStride)
                .constant(0, kGlobalConstant_ATOutputInvStride)
                .constant(0, kGlobalConstant_ATOutputInvExtent)
                .constant(0, kGlobalConstant_ATOutputDomainMin)
                .constant(0, kGlobalConstant_ATOutputDomainSize)
                .constant(0, kGlobalConstant_ATOutputInvShape)
                .constant(0, kGlobalConstant_ATHackConstant)
                .interpolant(0, kGlobalInterpolant_ATOutputTex)
                .interpolant(0, kGlobalInterpolant_ATOutputAddress)
                .output(1, 0)
                .sampler(2, 0)
            )
        );
    static const void* __times2_cal = &__times2_cal_desc;
}
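
A note for readers trying to decode the listing above: the dcl_literal lines hold raw 32-bit patterns, and several of them read more naturally as IEEE-754 single-precision floats. The helper below is only a small sketch for inspecting them on the host; the function names are made up for illustration and are not part of Brook+ or CAL.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern, as written in a dcl_literal line, as an
// IEEE-754 single-precision float. memcpy avoids the undefined behaviour
// of casting between unrelated pointer types.
float literal_to_float(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Sanity checks against two patterns that appear in the listing.
void check_listing_literals() {
    assert(literal_to_float(0x40000000u) == 2.0f);  // the 2.0f multiplier from the kernel
    assert(literal_to_float(0x3F000000u) == 0.5f);  // 0.5, presumably a texel-centre offset
}
```

Read this way, 0x40000000 is the 2.0f from the kernel source (l12 in the first technique, l13 in the other two), 0x3F000000 is 0.5, and 0x40490FDB is approximately 3.1415927. The integer-looking patterns such as 0x00000001 are most likely meant as integers rather than floats.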

sgratton
Adept I

Hi there,

Just two quick notes about the Brook+ IL code: it comes out simpler if you disable address virtualization (brcc -r, I think), and despite appearing complex it nonetheless often leads to "simple" final GPU ISA code once it is further optimized and compiled.
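
To give a rough idea of what address virtualization has to compute, here is a minimal sketch of mapping a logical 1D stream index onto a 2D texture. It is an illustration under assumptions, not the Brook+ runtime's actual code: the names are hypothetical, and the floating-point variant only mirrors the general pattern (multiply by a reciprocal, round toward negative infinity, subtract) that the longer functions in the generated IL appear to follow.

```cpp
#include <cassert>
#include <cmath>

struct Texel {
    int x;
    int y;
};

// Integer reference: logical 1D index -> (x, y) in a texture `width` texels wide.
Texel linear_to_texel(int index, int width) {
    assert(index >= 0 && width > 0);
    return Texel{index % width, index / width};
}

// Same mapping using only multiply, subtract and floor, the kind of
// arithmetic available to a pixel shader without integer division.
Texel linear_to_texel_float(int index, int width) {
    assert(index >= 0 && width > 0);
    const float w = static_cast<float>(width);
    const float inv_w = 1.0f / w;
    // The 0.5 offset keeps the value away from integer boundaries so that
    // rounding error in the reciprocal does not push floor() the wrong way.
    const float centred = static_cast<float>(index) + 0.5f;
    const float y = std::floor(centred * inv_w);
    const float x = std::floor(centred - y * w);
    return Texel{static_cast<int>(x), static_cast<int>(y)};
}
```

For example, index 7 in a texture 4 texels wide lands at x = 3, y = 1 with either function. Doing this per element, for inputs and outputs, is extra work that a kernel compiled without address virtualization would not need.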

I have mainly focussed on getting to grips with IL so far, but I like the look of Brook+ for tasks that map clearly onto the streaming paradigm, and I plan to have more of a go sometime soon.

Best,
Steven.

dukeleto
Adept I

Re the first post,
I have ported a 2D finite-difference Navier-Stokes solver (double precision) to Brook+, and found it does quite well, though perhaps a little less well than I had hoped.
I have not yet managed to measure the performance of the Brook+ code directly, as the gpuanalyzer software refuses to compile my (long) kernels.
Thus the only comparison I have for the moment is with my reference Fortran version of the same code, which, for large grids, runs at around 1.5 GFlops on a single core of a dual-core 3.2 GHz Xeon processor. For large grids, I get a speedup of around 12x with an HD3870 + core2duo7200.
A naive interpretation of this would be that the code runs at around 18 GFlops on the graphics card.
I suspect my performance is currently limited by memory bandwidth, as the computational intensity of my kernels is not very high.
Regards
Olivier
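
The naive estimate in the post above can be written out as a short sketch. It assumes the GPU port performs the same number of floating-point operations as the CPU reference, which is exactly what makes it naive; the function names are illustrative only, and any peak rate is left as a caller-supplied parameter because the post does not state one.

```cpp
#include <cassert>

// If the CPU reference sustains `cpu_gflops` and the GPU port finishes the
// same work `speedup` times faster, the GPU sustains cpu_gflops * speedup.
double estimated_gpu_gflops(double cpu_gflops, double speedup) {
    assert(cpu_gflops > 0.0 && speedup > 0.0);
    return cpu_gflops * speedup;
}

// Fraction of a given peak rate that an achieved rate represents.
// The peak figure must come from the caller; none is assumed here.
double fraction_of_peak(double achieved_gflops, double peak_gflops) {
    assert(achieved_gflops >= 0.0 && peak_gflops > 0.0);
    return achieved_gflops / peak_gflops;
}
```

With the figures quoted, estimated_gpu_gflops(1.5, 12.0) gives 18.0. Comparing that against the card's double-precision peak via fraction_of_peak would show how far a memory-bound kernel sits from the arithmetic limit.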

ryta1203
Journeyman III

I have a colleague who implemented a nonlinear LBM Navier-Stokes algorithm in CUDA. It took him only a few days and he got large speedups: 100x+, with optimizations and 100% occupancy. Just an FYI.

BarsMonster
Journeyman III

Hope this thread is not too old 🙂

I can say that I have had success with Brook+ 🙂

My application (BarsWF, a hash brute-forcer) runs at nearly 100% of the real theoretical peak performance on Brook+, CUDA and SSE2 alike.

Making it work efficiently was quite challenging with Brook, but it is definitely possible. Right now my application shows that performance/$ (I mean the performance of my program) on AMD cards is about twice as good as on nVidia cards:

4870 makes ~1286 MHash/sec

GTX280 makes ~710 MHash/sec while being more expensive (and this is not a development issue; competitors are even slower)
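
As a rough check on the performance/$ claim, the arithmetic can be sketched as follows. The post gives throughput figures but no prices, so prices are left as parameters; the function names are hypothetical and only for illustration.

```cpp
#include <cassert>

// Raw throughput of card A relative to card B.
double throughput_ratio(double a_mhash, double b_mhash) {
    assert(a_mhash > 0.0 && b_mhash > 0.0);
    return a_mhash / b_mhash;
}

// Performance per dollar of card A relative to card B.
// Prices are caller-supplied; the post does not state them.
double perf_per_dollar_ratio(double a_mhash, double a_price,
                             double b_mhash, double b_price) {
    assert(a_price > 0.0 && b_price > 0.0);
    return throughput_ratio(a_mhash, b_mhash) * (b_price / a_price);
}

// How many times more expensive card B must be than card A for A to
// reach `target` times B's performance per dollar.
double price_ratio_needed(double target, double a_mhash, double b_mhash) {
    assert(target > 0.0);
    return target / throughput_ratio(a_mhash, b_mhash);
}
```

With the quoted figures, throughput_ratio(1286.0, 710.0) is about 1.81, so "about twice" on a per-dollar basis holds if the GTX280 costs roughly 10% more than the 4870 (price_ratio_needed(2.0, 1286.0, 710.0) is about 1.10).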

nberger
Adept I

Just to add to the successes: My partial wave analysis fit runs about 150 times faster on a 4870 than the reference FORTRAN implementation on the same Core Duo machine.
bayoumi
Journeyman III

I would like to disagree about avoiding "embarrassingly parallel" or "non-simple". I think the GPU from any manufacturer is just a SIMD machine. If you have a part of your problem which is either explicitly parallel or "vector-like", and takes a good portion of the computation time then the GPU is the perfect call. The GPU is just one piece of the infrastructure, and we should not be obsessed by putting everything on it.