48 Replies Latest reply on Mar 2, 2011 4:16 AM by himanshu.gautam

    Performance Comparison ATI-NVidia

    eci

      Hi

      I'm currently comparing the performance of OpenCL on different platforms. I'm especially interested in comparing NVidia and ATI graphics cards. The cards I'm currently using are an NVidia Quadro FX 5600 and an ATI Radeon HD 5870.

      The task I use for comparison is a backprojection algorithm.

      After running some experiments, the ATI card is two times slower than the NVidia card, while from a theoretical point of view it should be at least twice as fast. That's a very disappointing result, and I'm curious what the reason for this poor performance is.

       

      The problem is as follows:

      I reconstruct a volume from projection data. The volume size is 512^3 and I have 400 projections. For every projection, one kernel launch is performed. The task of the kernel is to compute, for every voxel, a position in the current projection image and use that value to increment the voxel value. For the projection images I am using image2d_t with a sampler for linear interpolation.

      On the NVidia graphics card I am using a 2D problem over the x-z directions of the volume. Every work-item runs one line in the y-direction, and work-groups are aligned along the x-direction. This way memory access is coalesced and I get very good performance.
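      In rough outline, that looped variant looks like this (a simplified sketch reconstructed from the description, not my exact code; everything that does not depend on y is hoisted out of the loop):

      const sampler_t linSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

      __kernel void RoadRunnerLine(
          __global float* volume,
          __read_only image2d_t projection,
          int L, float R_L, float O_L,
          float m0, float m1, float m2, float m3, float m4,  float m5,
          float m6, float m7, float m8, float m9, float m10, float m11)
      {
          size_t ix = get_global_id(0);   // x: innermost volume index, so stores coalesce
          size_t iz = get_global_id(1);   // z
          float x = O_L + (float)ix * R_L;
          float z = O_L + (float)iz * R_L;
          // terms that do not depend on y, computed once per line
          float wBase = m2 * x + m8 * z + m11;
          float uBase = m0 * x + m6 * z + m9;
          float vBase = m1 * x + m7 * z + m10;
          for (int iy = 0; iy < L; ++iy) {            // one line in y per work-item
              float y   = O_L + (float)iy * R_L;
              float w_n = wBase + m5 * y;
              float u_n = (uBase + m3 * y) / w_n;
              float v_n = (vBase + m4 * y) / w_n;
              volume[iz * L * L + iy * L + ix] += 1.0f / (w_n * w_n) *
                  read_imagef(projection, linSampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x;
          }
      }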

      On the ATI graphics card I tried the same approach, but performance was devastating. So I went back to a 3D problem. I experimented with work-group sizes, and alignment along the x-direction seems to be beneficial here too. This type of implementation currently yields the best performance on ATI, but as stated it still takes double the time of the NVidia card.

      I tried different versions of the kernel and I'm pretty sure that memory access is the limiting factor. But why? Am I missing something?

      One more question: is there a way to get around the memory restrictions with the current Stream SDK? I'm already using the environment variables to get access to the full 1GB of memory, but I can still only allocate 256MB in one block, which is very annoying!
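      For reference, the per-allocation ceiling can be checked at run time with a standard clGetDeviceInfo query. A minimal host-side sketch:

      #include <stdio.h>
      #include <CL/cl.h>

      static void print_mem_limits(cl_device_id dev)
      {
          cl_ulong max_alloc = 0, global_mem = 0;
          clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
          clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
          printf("global memory: %llu MB, largest single allocation: %llu MB\n",
                 (unsigned long long)(global_mem >> 20),
                 (unsigned long long)(max_alloc >> 20));
      }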

      I attached the simplest version of my kernel code. For NVidia, the inner part is surrounded by a loop; for my current ATI version, every kernel processes two voxels in two different memory blocks because of the 256MB limit on a single memory block in the current Stream SDK.

       

      Thanks for your support!

      Greetings

      Christian

      const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

      __kernel void RoadRunner(
          __global float* volume,
          __read_only image2d_t projection,
          int L, int S_x, int S_y,
          float R_L, float O_L,
          float m0, float m1, float m2, float m3, float m4,  float m5,
          float m6, float m7, float m8, float m9, float m10, float m11)
      {
          size_t id1 = get_global_id(0);
          size_t id2 = get_global_id(1);
          size_t id3 = get_global_id(2);

          float z = O_L + (float)id1 * R_L;
          float y = O_L + (float)id2 * R_L;
          float x = O_L + (float)id3 * R_L;

          float w_n = m2 * x + m5 * y + m8 * z + m11;
          float u_n = (m0 * x + m3 * y + m6 * z + m9 ) / w_n;
          float v_n = (m1 * x + m4 * y + m7 * z + m10) / w_n;

          volume[id1 * L * L + id2 * L + id3] += 1.0f / (w_n * w_n) *
              read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x;
      }

        • Performance Comparison ATI-NVidia
          cjang

          Some suggestions:

          1. sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;

          2. Vectorize memory access, so: __global floatX* volume, where X is 2 or 4. Scalar calculation is OK inside the kernel; just the gathers and scatters should be done with float2 or float4 (see the sketch after this list).

          3. Lay out image data to maximize use of the texture cache. Right now your code is only using the first "x" component, while each read_imagef returns a quad of four floats.
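          A minimal sketch of what I mean by suggestion 2 (illustrative only, not your kernel): float4-wide loads and stores with scalar arithmetic.

          __kernel void scale_volume(__global float4* volume, float gain)
          {
              size_t i = get_global_id(0);
              float4 v = volume[i];   // one 128-bit gather instead of four 32-bit loads
              v.x *= gain;            // scalar ALU work per component is fine
              v.y *= gain;
              v.z *= gain;
              v.w *= gain;
              volume[i] = v;          // one 128-bit scatter
          }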

          I don't mean this purely as a shameless plug; I've done a lot of optimization from an application level. Here's something I wrote that may give you some idea of how different the vectorized cores of ATI are from the scalar cores of NVIDIA: http://golem5.org/gatlas/CaseStudyGATLAS.htm . Kernel design can differ between vendors (and between architectures).

            • Performance Comparison ATI-NVidia
              zeland

              A simple way to optimize is to reduce expensive operations like division.

              And consider using __constant for m1, m2, etc.

              float w_n = 1.0f / (m2 * x + m5 * y + m8 * z + m11);
              float u_n = (m0 * x + m3 * y + m6 * z + m9 ) * w_n;
              float v_n = (m1 * x + m4 * y + m7 * z + m10) * w_n;
              volume[id1 * L * L + id2 * L + id3] += (w_n * w_n) *
                  read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x;

                • Performance Comparison ATI-NVidia
                  debdatta.basu

                  Does vectorizing solve the problem? If it does, the compiler is REALLY bad, because it wasn't able to vectorize three consecutive, identical statements...

                  Are you sure the compiler is set to fully optimize? In any case, try the vectorized version and share the details.

                  Cheers!

                  -Debdatta Basu.

                    • Performance Comparison ATI-NVidia
                      cjang

                      To clarify: I suggest vectorizing the gather/scatter memory access but not the calculations. Leave those as scalar operations. eci's speculation that "memory access is the limiting factor" is almost certainly correct.

                      debdatta.basu, my experience is that the static analysis required for a compiler to auto-vectorize and schedule is non-trivial. It's a hard problem in general.

                        • Performance Comparison ATI-NVidia
                          n0thing

                          Try this kernel; it might pass correctness checking.

                          The number of global threads should now be 1/4 of the original kernel's.

                          const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

                          __kernel void RoadRunner(
                              __global float4* volume,
                              __read_only image2d_t projection,
                              float R_L, float O_L,
                              float4 m2m5m8m11,
                              float4 m0m3m6m9,
                              float4 m1m4m7m10,
                              float4 L_eq)   // (L^2, L, 1, 0)
                          {
                              float4 id  = (float4)(get_global_id(0), get_global_id(1), get_global_id(2), 0);
                              float4 pos = O_L + id * R_L;

                              float w_n     = dot(pos, m2m5m8m11);
                              float w_n_rcp = native_recip(w_n);
                              float u_n     = dot(pos, m0m3m6m9)  * w_n_rcp;
                              float v_n     = dot(pos, m1m4m7m10) * w_n_rcp;

                              int write_pos = convert_int(dot(id, L_eq));
                              volume[write_pos] += w_n_rcp * w_n_rcp *
                                  read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f));
                          }

                            • Performance Comparison ATI-NVidia
                              eduardoschardong

                              I put the original code into SKA and got the attached ISA. Can I blame the compiler now?

                              ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(122) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 t: MULLO_INT ____, R1.z, KC0[1].z 1 z: ADD_INT ____, R0.z, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 x: ADD_INT R1.x, PV1.z, KC0[6].z w: ADD_INT ____, R0.y, PS1 t: MULLO_INT ____, R1.x, KC0[1].x 3 x: ADD_INT ____, R0.x, PS2 y: ADD_INT T3.y, PV2.w, KC0[6].y t: U_TO_F T0.x, PV2.x 4 y: ADD_INT T2.y, PV3.x, KC0[6].x t: U_TO_F ____, PV3.y 5 y: MULADD_e T1.y, KC1[6].x, PS4, KC1[7].x w: MULADD_e T0.w, KC1[6].x, T0.x, KC1[7].x t: U_TO_F T0.x, PV4.y 6 x: MUL_e T1.x, KC1[9].x, PV5.w z: MUL_e ____, KC1[10].x, PV5.w 7 x: MUL_e T2.x, KC1[8].x, T0.w y: MULADD_e T0.y, KC1[13].x, T1.y, PV6.z 8 w: MULADD_e T1.w, KC1[6].x, T0.x, KC1[7].x 9 z: MULADD_e ____, KC1[16].x, PV8.w, T0.y w: MULADD_e T0.w, KC1[12].x, T1.y, T1.x 10 y: ADD ____, KC1[19].x, PV9.z w: MULADD_e ____, KC1[11].x, T1.y, T2.x 11 x: MOV T2.x, PV10.y y: MULADD_e ____, KC1[14].x, T1.w, PV10.w z: MULADD_e ____, KC1[15].x, T1.w, T0.w w: MOV T0.w, PV10.y t: MUL_e R0.y, PV10.y, PV10.y 12 x: AND_INT ____, PV11.x, (0x807FFFFF, -1.175494211e-38f).x y: ADD T0.y, KC1[18].x, PV11.z z: AND_INT T2.z, PV11.x, (0x7F800000, 1.#INFf).y w: ADD T3.w, KC1[17].x, PV11.y t: AND_INT T0.z, PV11.x, (0x80000000, -0.0f).z 13 x: SETE_INT T0.x, PV12.z, 0.0f y: AND_INT T1.y, PV12.y, (0x807FFFFF, -1.175494211e-38f).x z: OR_INT ____, PV12.x, (0x3F800000, 1.0f).y w: AND_INT R0.w, PV12.y, (0x7F800000, 1.#INFf).z t: AND_INT R1.w, PV12.w, (0x7F800000, 1.#INFf).z 14 x: SETE_INT T1.x, T2.z, (0x7F800000, 1.#INFf).x y: AND_INT ____, T3.w, (0x807FFFFF, -1.175494211e-38f).y z: CNDE_INT T0.z, PV13.x, PV13.z, T0.z w: SETE_INT T2.w, PV13.w, 0.0f t: AND_INT T1.w, T0.y, (0x80000000, -0.0f).z 15 x: SETE_INT T3.x, R1.w, 0.0f y: OR_INT ____, T1.y, (0x3F800000, 1.0f).x z: AND_INT T1.z, T0.w, (0x7F800000, 1.#INFf).y VEC_120 w: AND_INT ____, T3.w, (0x80000000, -0.0f).z VEC_201 t: OR_INT ____, PV14.y, (0x3F800000, 1.0f).x 16 x: CNDE_INT T2.x, T1.x, T0.z, T2.x y: SETE_INT ____, R0.w, (0x7F800000, 1.#INFf).x z: CNDE_INT T0.z, PV15.x, PS15, PV15.w w: CNDE_INT ____, T2.w, PV15.y, T1.w VEC_102 t: SETE_INT T1.w, PV15.z, (0x7F800000, 1.#INFf).x 17 x: SETE_INT ____, R1.w, (0x7F800000, 1.#INFf).x y: SETE_INT ____, T1.z, 0.0f z: CNDE_INT T1.z, PV16.y, PV16.w, T0.y w: OR_INT ____, PV16.y, T1.x VEC_021 t: OR_INT ____, T2.w, T0.x 18 x: OR_INT ____, T1.w, PV17.x y: CNDE_INT ____, PV17.x, T0.z, T3.w z: OR_INT T0.z, PV17.w, PS17 w: OR_INT ____, PV17.y, T3.x t: RCP_e ____, T2.x 19 x: SUB_INT ____, R0.w, T2.z y: MUL_e T0.y, T1.z, PS18 z: OR_INT R2.z, PV18.x, PV18.w w: SUB_INT ____, R1.w, T2.z VEC_120 t: MUL_e T3.w, PS18, PV18.y 20 x: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PV19.y y: CNDE_INT T2.y, PV19.z, PV19.w, 0.0f z: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PS19 w: CNDE_INT T1.w, T0.z, PV19.x, 0.0f t: MULLO_INT ____, T2.y, KC1[3].x 21 x: ASHR ____, PV20.z, (0x00000017, 3.222986468e-44f).x y: ASHR ____, PV20.w, (0x00000017, 3.222986468e-44f).x z: ASHR ____, PV20.x, (0x00000017, 3.222986468e-44f).x w: ADD_INT ____, T3.y, PS20 t: ASHR ____, PV20.y, (0x00000017, 3.222986468e-44f).x 22 x: AND_INT T3.x, (0x80000000, -0.0f).x, T0.y y: ADD_INT T0.y, PV21.x, PS21 z: ADD_INT T2.z, T1.w, T0.y w: ADD_INT ____, PV21.z, PV21.y t: MULLO_INT ____, PV21.w, KC1[3].x 23 x: SETGE_INT T2.x, PV22.w, (0x000000FF, 3.573311084e-43f).x y: ADD_INT T2.y, R1.x, PS22 z: SETGE_INT T1.z, 0.0f, PV22.w w: AND_INT R2.w, (0x80000000, -0.0f).y, T3.w t: ADD_INT R5.z, T2.y, T3.w 24 x: AND_INT ____, R0.y, (0x7F800000, 
1.#INFf).x y: AND_INT T0.y, R0.y, (0x80000000, -0.0f).y z: SETGE_INT R4.z, 0.0f, T0.y w: AND_INT ____, R0.y, (0x807FFFFF, -1.175494211e-38f).z t: SETGE_INT R3.z, T0.y, (0x000000FF, 3.573311084e-43f).w 25 x: SUB_INT R1.x, (0x3F800000, 1.0f).x, PV24.x y: SETE_INT ____, PV24.x, 0.0f z: SETE_INT R0.z, PV24.x, (0x7F800000, 1.#INFf).y w: OR_INT ____, PV24.w, (0x3F800000, 1.0f).x t: LSHL R0.w, T2.y, (0x00000002, 2.802596929e-45f).z 26 x: OR_INT R0.x, PV25.z, PV25.y y: CNDE_INT R1.y, T1.z, T2.z, T3.x z: CNDE_INT R1.z, PV25.y, PV25.w, T0.y w: CNDE_INT R1.w, T0.z, T2.x, 0.0f VEC_201 t: OR_INT R2.y, T3.x, (0x7F800000, 1.#INFf).x 01 ALU: ADDR(154) CNT(24) KCACHE0(CB1:0-15) 27 x: ADD_INT T2.x, KC0[0].x, R0.w y: CNDE_INT ____, R0.z, R1.z, R0.y z: CNDE_INT T0.z, R0.x, R1.x, 0.0f w: OR_INT T3.w, R2.w, (0x7F800000, 1.#INFf).x t: CNDE_INT ____, R1.w, R1.y, R2.y 28 x: CNDE_INT T3.x, R2.z, R3.z, 0.0f y: ASHR T0.y, PV27.z, (0x00000017, 3.222986468e-44f).x z: ADD R2.z, PS27, 0.5 t: RCP_e ____, PV27.y 29 x: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PS28 y: AND_INT R0.y, (0x80000000, -0.0f).y, PS28 z: ADD_INT R4.z, T0.z, PS28 w: CNDE_INT ____, R4.z, R5.z, R2.w VEC_120 t: LSHR R3.x, T2.x, (0x00000002, 2.802596929e-45f).z 30 x: CNDE_INT ____, T3.x, PV29.w, T3.w y: OR_INT R1.y, PV29.y, (0x7F800000, 1.#INFf).x z: ASHR ____, PV29.x, (0x00000017, 3.222986468e-44f).y 31 x: ADD R2.x, PV30.x, 0.5 w: ADD_INT R2.w, PV30.z, T0.y 02 TEX: ADDR(192) CNT(2) 32 SAMPLE R2.x___, R2.xz0x, t0, s0 UNNORM(XYZW) 33 VFETCH R1.x___, R3.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 03 ALU: ADDR(178) CNT(7) 34 x: SETGE_INT ____, R2.w, (0x000000FF, 3.573311084e-43f).x z: SETGE_INT ____, 0.0f, R2.w 35 z: CNDE_INT ____, PV34.z, R4.z, R0.y w: CNDE_INT ____, R0.x, PV34.x, 0.0f 36 y: CNDE_INT ____, PV35.w, PV35.z, R1.y 37 x: MULADD_e R1.x, R2.x, PV36.y, R1.x 04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R3].x___, R1, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

                    • Performance Comparison ATI-NVidia
                      MicahVillmow
                      eduardoschardong,
                      What do you see as inefficient in the ISA generation? The compiler looks like it is packing pretty well into the ALU slots. It looks like you have ~150 ALU instructions being packed into 35 ALU bundles for a packing ratio of ~4.28.

                        • Performance Comparison ATI-NVidia
                          nou

                          Try using the profiler to see what the bottleneck is.

                          • Performance Comparison ATI-NVidia
                            eduardoschardong

                            Micah, what I'm seeing is 17 multiplications, 17 additions and 3 divisions becoming ~150 instructions. It is doing much more work than what is visible in that small piece of code, and likely much more work than on NVidia GPUs.

                            Well... with performance this far below expectations, something must be wrong, right?

                              • Performance Comparison ATI-NVidia
                                eci

                                First of all thanks for your replies so far!

                                I forgot to mention in my initial post that I already experimented with vectorization. The vectorized version performs roughly the same, maybe a little worse. But I'm with debdatta.basu on this topic: the compiler should really be able to do this!

                                Another thing I discovered lately while testing is the variation in runtimes on the ATI card: I get runtimes between 22.5s and 31s (with a mean just below 30s). On the NVidia card, runtimes vary much less (around 17s).

                                What I want to add here is that I am totally aware that the code I'm using isn't optimized to the maximum. But that isn't really what I'm looking for. I want to compare OpenCL performance on different platforms.

                                @n0thing: I'm not sure I really get the code you posted. The image2d_t projection is a CL_R, CL_FLOAT image. In the last line you sample the projection image at position (u_n, v_n) and use the resulting vector to update four positions in the volume. But with a CL_R image, the y, z and w components of that vector don't contain image data (they come back as 0, 0 and 1), so I'm pretty sure this isn't working. I'm aware that I could load four of my projection images into one CL_RGBA image to speed up the execution of my problem. But I am working within another framework, so it is easier for me this way, and the NVidia implementation works the same way.
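                                For illustration, the CL_RGBA packing I mention would look roughly like this on the host side (a hypothetical sketch; the helper and its names are made up):

                                #include <stdlib.h>
                                #include <CL/cl.h>

                                /* Hypothetical helper: interleave four width*height CL_R/CL_FLOAT projections
                                   into one CL_RGBA/CL_FLOAT image, so each read_imagef returns four useful
                                   samples instead of one. */
                                cl_mem pack_rgba_projection(cl_context ctx, const float* planes[4],
                                                            size_t width, size_t height, cl_int* err)
                                {
                                    float* rgba = (float*)malloc(width * height * 4 * sizeof(float));
                                    for (size_t i = 0; i < width * height; ++i)
                                        for (int c = 0; c < 4; ++c)
                                            rgba[4 * i + c] = planes[c][i];   /* x, y, z, w channels */

                                    cl_image_format fmt = { CL_RGBA, CL_FLOAT };
                                    cl_mem img = clCreateImage2D(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                                                 &fmt, width, height, 0, rgba, err);
                                    free(rgba);   /* CL_MEM_COPY_HOST_PTR: the runtime keeps its own copy */
                                    return img;
                                }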

                                  • Performance Comparison ATI-NVidia
                                    eduardoschardong

                                    Hello eci,

                                     

                                    The code is well packed, as Micah pointed out, so vectorization won't help as it is; reducing the number of instructions first could make vectorization more profitable. One of the operations generating a lot of instructions is the division. I'm not sure how different native_recip is, but it should give you correct results. Finally, writing x + y * z + w * v is better than y * z + w * v + x, because the compiler won't reorder operations to emit MULADDs; in your case this will only help in the vectorized version, and only by a small bit. You can try the attached code first, and try to vectorize it too; that may help. Profiling would also help a lot.

                                     

                                    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

                                    __kernel void RoadRunner(
                                        __global float* volume,
                                        __read_only image2d_t projection,
                                        int L, int S_x, int S_y,
                                        float R_L, float O_L,
                                        float m0, float m1, float m2, float m3, float m4,  float m5,
                                        float m6, float m7, float m8, float m9, float m10, float m11)
                                    {
                                        size_t id1 = get_global_id(0);
                                        size_t id2 = get_global_id(1);
                                        size_t id3 = get_global_id(2);

                                        float z = O_L + (float)id1 * R_L;
                                        float y = O_L + (float)id2 * R_L;
                                        float x = O_L + (float)id3 * R_L;

                                        float w_n   = m11 + m2 * x + m5 * y + m8 * z;
                                        float w_n_r = native_recip(w_n);
                                        float u_n   = (m9  + m0 * x + m3 * y + m6 * z) * w_n_r;
                                        float v_n   = (m10 + m1 * x + m4 * y + m7 * z) * w_n_r;

                                        volume[id1 * L * L + id2 * L + id3] += w_n_r * w_n_r *
                                            read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x;
                                    }

                                      • Performance Comparison ATI-NVidia
                                        debdatta.basu

                                        @cjang... It's definitely not trivial... but it is well studied, and almost every modern compiler (gcc, intel... you name it) could vectorize that piece of code.

                                        But as it turns out, vectorizing didn't solve it. I'm waiting to see how eduardo's kernel performs...

                                        • Performance Comparison ATI-NVidia
                                          MicahVillmow
                                          eduardo,
                                          Yeah, I missed the division. That is where the large increase in instructions is coming from.
                                          Here is the CL/IL/ISA that gets generated for a simple division shader. As you can see, the division accounts for the majority of the instructions.

                                          CL: kernel void div_test(global float* a) { a[0] = a[1] / a[2]; } IL: mdef(209)_out(1)_in(2) mov r0, in0 mov r1, in1 dcl_literal l1, 0x7f800000, 0x7f800000, 0x807fffff, 0x807fffff dcl_literal l2, 0x7f800000, 0x7f800000, 0, 0 dcl_literal l3, 0x80000000, 0x80000000, 0x80000000, 0x80000000 dcl_literal l4, 0x3f800000, 0x3f800000, 0, 0 dcl_literal l5, 0, 0, 0, 0 dcl_literal l6, 0x7fffffff, 0x80000000, 0x7fffffff, 0x80000000 dcl_literal l7, 0x00800000, 0x00800000, 0x00800000, 0x00800000 dcl_literal l8, 0x00000017, 0x00000017, 0x00000017, 0x00000017 dcl_literal l9, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff mov r2.x___, r0.x mov r2._y__, r1.x and r3.xyzw, r2.xyxy, l1 ieq r4.xyzw, r3.xyxy, l2 and r5.xy__, r2.xy, l3 ior r3.__zw, r3.zwzw, l4.xyxy cmov_logical r3.__zw, r4.zwzw, r5.xyxy, r3.zwzw cmov_logical r3.__zw, r4.xyxy, r2.xyxy, r3.zwzw ior r5.xy__, r4.xz, r4.yw ior r5.x___, r5.x, r5.y inegate r5.__z_, r3.yyyy iadd r3.x___, r3.x, r5.z cmov_logical r3.x___, r5.xxxx, l5, r3.xxxx rcp_zeroop(infinity) r2._y__, r3.ww mul_ieee r2.x___, r3.z, r2.y and r2.__zw, r2.xxxx, l6.xyzw ishr r6.x___, r2.z, l8 ishr r6._y__, r3.xxxx, l8 iadd r2.xy__, r2.xzxz, r3.xxxx iadd r6.x___, r6.x, r6.y ige r4.__z_, l5, r6.x ior r4._y__, r2.wwww, l1 ige r4.x, r6.x, l9 cmov_logical r4.x, r5.x, l5, r4.x cmov_logical r2.x, r4.z, r2.w, r2.x cmov_logical r0.x, r4.x, r4.y, r2.x mov out0, r0 mend il_cs_2_0 dcl_cb cb0[9] ; Constant buffer that holds ABI data dcl_literal l0, 4, 1, 2, 3 dcl_literal l1, 0x00FFFFFF, -1, -2, -3 dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC dcl_literal l3, 24, 16, 8, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0, 4, 8, 12 dcl_literal l6, 32, 32, 32, 32 dcl_literal l7, 24, 31, 16, 31 mov r1013, cb0[8].x call 1024;$ endmain func 1024 ; __OpenCL_div_test_kernel mov r1019, l1.0 dcl_max_thread_per_group 256 dcl_raw_uav_id(0) dcl_arena_uav_id(8) mov r0.z, vThreadGrpIdFlat.x mov r1022.xyz0, vTidInGrp.xyz mov r1023.xyz0, vThreadGrpId.xyz imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0 iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.w, r0.z ishl r1023.w, r1023.w, l0.z mov r1018.x, l0.0 mov r1015.x, cb0[8].x dcl_literal l9, 0x00000004, 0x00000004, 0x00000004, 0x00000004; f32:i32 4 dcl_literal l8, 0x00000008, 0x00000008, 0x00000008, 0x00000008; f32:i32 8 dcl_cb cb1[1] ; Kernel arg setup: a mov r1, cb1[0] call 2 ; div_test ret endfunc ; __OpenCL_div_test_kernel ;ARGSTART:__OpenCL_div_test_kernel ;version:1:4:53 ;device:cayman ;uniqueid:1024 ;memory:hwprivate:0 ;memory:hwlocal:0 ;version:1:4:53 ;pointer:a:float:1:1:0:uav:0:4 ;function:1:2 ;uavid:8 ;ARGEND:__OpenCL_div_test_kernel func 2 ; div_test ; @__OpenCL_div_test_kernel ; BB#0: ; %entry mov r258, l8 iadd r258.x___, r1.xxxx, r258.xxxx mov r1010.x___, r258.xxxx uav_raw_load_id(0) r1011.x___, r1010.x mov r258, r1011 mov r259, l9 iadd r259.x___, r1.xxxx, r259.xxxx mov r1010.x___, r259.xxxx uav_raw_load_id(0) r1011.x___, r1010.x mov r259, r1011 ;__fdiv_f32 mcall(209) (r258),(r259, r258) mov r1011, r258 mov r1010.x___, r1.x uav_raw_store_id(0) mem0.x___, r1010.x, r1011.x ret_dyn endfunc ; div_test ;ARGSTART:div_test ;uniqueid:2 ;memory:hwlocal:0 ;intrinsic:1:209 ;ARGEND:div_test end ISA: ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: 
u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHR R0.x, KC0[0].x, (0x00000002, 2.802596929e-45f).x y: ADD_INT R0.y, KC0[0].x, (0x00000004, 5.605193857e-45f).y z: ADD_INT R0.z, KC0[0].x, (0x00000008, 1.121038771e-44f).z 1 x: LSHR R1.x, PV0.y, (0x00000002, 2.802596929e-45f).x w: LSHR R0.w, PV0.z, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(96) CNT(2) 2 VFETCH R0.__x_, R0.w, fc154 FETCH_TYPE(NO_INDEX_OFFSET) 3 VFETCH R1.x___, R1.x, fc154 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(40) CNT(48) 4 x: AND_INT R2.x, R1.x, (0x807FFFFF, -1.175494211e-38f).x y: AND_INT R0.y, R1.x, (0x80000000, -0.0f).y z: AND_INT R0.z, R1.x, (0x7F800000, 1.#INFf).z w: MOV R0.w, R0.z 5 x: AND_INT R3.x, PV4.w, (0x80000000, -0.0f).x y: AND_INT R1.y, PV4.w, (0x7F800000, 1.#INFf).y z: SETE_INT R1.z, PV4.z, 0.0f w: AND_INT R1.w, PV4.w, (0x807FFFFF, -1.175494211e-38f).z 6 x: OR_INT R2.x, PV5.w, (0x3F800000, 1.0f).x y: SETE_INT R2.y, PV5.y, (0x7F800000, 1.#INFf).y z: OR_INT R2.z, R2.x, (0x3F800000, 1.0f).x w: SETE_INT R1.w, PV5.y, 0.0f 7 x: SETE_INT R2.x, R0.z, (0x7F800000, 1.#INFf).x y: CNDE_INT R0.y, PV6.w, PV6.x, R3.x z: CNDE_INT R1.z, R1.z, PV6.z, R0.y VEC_120 w: OR_INT R1.w, R1.z, PV6.w VEC_120 8 x: CNDE_INT R1.x, PV7.x, PV7.z, R1.x y: SUB_INT R1.y, R0.z, R1.y z: OR_INT R1.z, PV7.x, R2.y VEC_021 w: CNDE_INT R0.w, R2.y, PV7.y, R0.w 9 x: RCP_e ____, PV8.w y: RCP_e R0.y, PV8.w z: RCP_e ____, PV8.w w: OR_INT R0.w, PV8.z, R1.w 10 x: MUL_e R1.x, R1.x, PV9.y z: CNDE_INT R0.z, PV9.w, R1.y, 0.0f 11 x: ASHR R1.x, PV10.z, (0x00000017, 3.222986468e-44f).x y: ADD_INT R0.y, PV10.z, PV10.x z: AND_INT R0.z, (0x80000000, -0.0f).y, PV10.x w: AND_INT R1.w, (0x7FFFFFFF, 1.#QNANf).z, PV10.x 12 x: OR_INT R2.x, PV11.z, (0x7F800000, 1.#INFf).x y: ASHR R1.y, PV11.w, (0x00000017, 3.222986468e-44f).y 13 z: ADD_INT R2.z, PV12.y, R1.x 14 y: SETGE_INT R2.y, 0.0f, PV13.z w: SETGE_INT R2.w, PV13.z, (0x000000FF, 3.573311084e-43f).x 15 y: CNDE_INT R0.y, PV14.y, R0.y, R0.z z: CNDE_INT R0.z, R0.w, PV14.w, 0.0f 16 x: CNDE_INT R1.x, PV15.z, PV15.y, R2.x 03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R0].x___, R1, ARRAY_SIZE(4) MARK VPM 04 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 800;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 
0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x1 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

                                            • Performance Comparison ATI-NVidia
                                              n0thing

                                              @eci Sorry, didn't realize that the texture format in your case was scalar, but changing the image format to RGBA shouldn't be that hard. If you compare the ISA generated for my kernel vs. eduardo's kernel, you can see that the manual vectorizing results in fewer instructions and better ALU packing.

                                               

                                              ; -------- my kernel -------------------- 00 ALU: ADDR(32) CNT(44) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 w: MOV T0.w, 0.0f t: MULLO_INT ____, R1.x, KC0[1].x 1 y: ADD_INT ____, R0.x, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 z: ADD_INT ____, R0.y, PS1 w: ADD_INT ____, PV1.y, KC0[6].x t: MULLO_INT ____, R1.z, KC0[1].z 3 x: ADD_INT ____, R0.z, PS2 y: ADD_INT ____, PV2.z, KC0[6].y t: U_TO_F T0.x, PV2.w 4 w: ADD_INT ____, PV3.x, KC0[6].z t: U_TO_F T0.y, PV3.y 5 x: MULADD_e T1.x, KC1[3].x, T0.x, KC1[4].x y: MULADD_e T1.y, KC1[3].x, PS4, KC1[4].x w: MULADD_e T0.w, KC1[3].x, T0.w, KC1[4].x t: U_TO_F T0.z, PV4.w 6 x: DOT4 ____, KC1[8].x, T0.x y: DOT4 ____, KC1[8].y, T0.y z: DOT4 ____, KC1[8].z, PS5 w: DOT4 ____, (0x80000000, -0.0f).x, 0.0f 7 z: MULADD_e T0.z, KC1[3].x, T0.z, KC1[4].x t: F_TO_I ____, PV6.x 8 x: DOT4 ____, KC1[5].x, T1.x y: DOT4 ____, KC1[5].y, T1.y z: DOT4 ____, KC1[5].z, PV7.z w: DOT4 ____, KC1[5].w, T0.w t: LSHL T1.z, PS7, (0x00000004, 5.605193857e-45f).x 9 x: DOT4 ____, KC1[6].x, T1.x y: DOT4 ____, KC1[6].y, T1.y z: DOT4 ____, KC1[6].z, T0.z w: DOT4 ____, KC1[6].w, T0.w t: RCP_e T0.x, PV8.x 10 x: DOT4 ____, KC1[7].x, T1.x y: DOT4 ____, KC1[7].y, T1.y z: DOT4 ____, KC1[7].z, T0.z w: DOT4 ____, KC1[7].w, T0.w t: MULADD_e R0.x, PS9, PV9.x, 0.5 11 x: MUL_e R1.x, T0.x, T0.x y: ADD_INT ____, KC1[0].x, T1.z w: MULADD_e R0.w, T0.x, PV10.x, 0.5 12 x: LSHR R3.x, PV11.y, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(80) CNT(2) 13 VFETCH R2, R3.x, fc156 FORMAT(32_32_32_32_FLOAT) MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 14 SAMPLE R0, R0.xw0x, t0, s0 UNNORM(XYZW) 02 ALU: ADDR(76) CNT(4) 15 x: MULADD_e R0.x, R0.x, R1.x, R2.x y: MULADD_e R0.y, R0.y, R1.x, R2.y z: MULADD_e R0.z, R0.z, R1.x, R2.z w: MULADD_e R0.w, R0.w, R1.x, R2.w 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R3], R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM ; -------- eduardo's kernel -------------------- 00 ALU: ADDR(32) CNT(37) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 t: MULLO_INT ____, R1.z, KC0[1].z 1 z: ADD_INT ____, R0.z, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 x: ADD_INT ____, R0.y, PS1 y: ADD_INT T1.y, PV1.z, KC0[6].z t: MULLO_INT ____, R1.x, KC0[1].x 3 x: ADD_INT T1.x, PV2.x, KC0[6].y w: ADD_INT ____, R0.x, PS2 t: U_TO_F T0.w, PV2.y 4 x: ADD_INT T0.x, PV3.w, KC0[6].x t: U_TO_F T0.z, PV3.x 5 t: MULLO_INT ____, PV4.x, KC1[3].x 6 x: MULADD_e T1.x, KC1[6].x, T0.z, KC1[7].x y: MULADD_e T0.y, KC1[6].x, T0.w, KC1[7].x w: ADD_INT T1.w, T1.x, PS5 t: U_TO_F T0.w, T0.x 7 x: MULADD_e T0.x, KC1[10].x, PV6.y, KC1[19].x 8 z: MULADD_e T0.z, KC1[9].x, T0.y, KC1[18].x 9 y: MULADD_e T2.y, KC1[6].x, T0.w, KC1[7].x 10 w: MULADD_e T1.w, KC1[13].x, T1.x, T0.x t: MULLO_INT ____, T1.w, KC1[3].x 11 x: MULADD_e T0.x, KC1[8].x, T0.y, KC1[17].x z: ADD_INT ____, T1.y, PS10 12 x: MULADD_e ____, KC1[12].x, T1.x, T0.z y: MULADD_e ____, KC1[16].x, T2.y, T1.w z: LSHL T0.z, PV11.z, (0x00000002, 2.802596929e-45f).x 13 z: MULADD_e ____, KC1[15].x, T2.y, PV12.x w: MULADD_e ____, KC1[11].x, T1.x, T0.x t: RCP_e T1.z, PV12.y 14 x: MUL_e R1.x, PS13, PS13 y: MULADD_e R0.y, PV13.z, PS13, 0.5 z: MULADD_e ____, KC1[14].x, T2.y, PV13.w w: ADD_INT ____, KC1[0].x, T0.z 15 x: MULADD_e R0.x, T1.z, PV14.z, 0.5 t: LSHR R2.x, PV14.w, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(80) CNT(2) 16 SAMPLE R0.___x, R0.xy0x, t0, s0 UNNORM(XYZW) 17 VFETCH R0.x___, R2.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(69) CNT(2) 18 t: RCP_e ____, R0.w 19 x: MULADD_e R0.x, PS18, R1.x, R0.x 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R2].x___, R0, 
ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

                                                • Performance Comparison ATI-NVidia
                                                  eci

                                                  My current kernel is attached. The reason for the split-up statements is that I can reuse uTmp, vTmp and wTmp and thus reduce instructions. The reason for the two volumes is the limitation of the current Stream SDK, which in my case only allows allocating 256MB of contiguous memory. I wasn't aware that I would find such well-founded help here, so I wanted to keep things simple.

                                                  I tried the reordering of instructions eduardo suggested, but sadly this also didn't do the trick.

                                                  @n0thing: I cannot use vectorization the way you suggested. The problem is that your kernel takes the volume position of the first voxel and computes its projection into the projection image (which is a float greyscale image, by the way). What really has to be done is to compute the appropriate position in the projection image for each of the four voxel positions.

                                                  I'm still investigating whether I missed something big in the way I access memory! Is the access layout I use flawed in any way?

                                                  I also dug a little into the ISA code and am not quite sure I fully get it yet. If I start from the kernel in my first post, it has 37 operations. The compiler turns them into roughly 150 operations. It then packs them into these bundles of 5 instructions which can run in parallel, which gets me back to around 37 bundles. So if I understand correctly, I reach at best 1/5 of the card's peak performance!

                                                  constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

                                                  __kernel void RoadRunner(
                                                      __global float* volume1,
                                                      __global float* volume2,
                                                      __read_only image2d_t projection,
                                                      int L, int S_x, int S_y,
                                                      float R_L, float O_L,
                                                      float m0, float m1, float m2, float m3, float m4,  float m5,
                                                      float m6, float m7, float m8, float m9, float m10, float m11)
                                                  {
                                                      size_t id1 = get_global_id(0);
                                                      size_t id2 = get_global_id(1);
                                                      size_t id3 = get_global_id(2);

                                                      float z = O_L + (float)id3 * R_L;
                                                      float y = O_L + (float)(id2 * 2) * R_L;
                                                      float x = O_L + (float)id1 * R_L;

                                                      int tmp = id3 * L * L / 2 + id2 * L + id1;

                                                      float wTmp = m2 * x + m8 * z + m11;
                                                      float uTmp = m0 * x + m6 * z + m9;
                                                      float vTmp = m1 * x + m7 * z + m10;

                                                      float w_n = wTmp + m5 * y;
                                                      float w   = native_recip(w_n);
                                                      float u_n = (uTmp + (m3 * y)) * w;
                                                      float v_n = (vTmp + (m4 * y)) * w;
                                                      volume1[tmp] += read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x * w * w;

                                                      y += R_L;
                                                      w_n = wTmp + m5 * y;
                                                      w   = native_recip(w_n);
                                                      u_n = (uTmp + (m3 * y)) * w;
                                                      v_n = (vTmp + (m4 * y)) * w;
                                                      volume2[tmp] += read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x * w * w;
                                                  }

                                                    • Performance Comparison ATI-NVidia
                                                      MicahVillmow
                                                      eci,
                                                      1 CL instruction does not equal 1 ISA instruction. Your new kernel with our upcoming SDK release has 78 ALU instructions packed into 25 ALU bundles, for a packing ratio a little greater than 3. So there is room for improvement in packing, but you are definitely using more than 1/5th of the chip; that would imply a packing ratio of 1.
                                                      The ISA is as follows.

                                                      ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(66) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 x: MULLO_INT ____, R1.z, KC0[1].z y: MULLO_INT ____, R1.z, KC0[1].z z: MULLO_INT ____, R1.z, KC0[1].z w: MULLO_INT R0.w, R1.z, KC0[1].z 1 x: ADD_INT R2.x, R0.z, PV0.w 2 x: MULLO_INT ____, R1.x, KC0[1].x y: MULLO_INT R2.y, R1.x, KC0[1].x z: MULLO_INT ____, R1.x, KC0[1].x w: MULLO_INT ____, R1.x, KC0[1].x 3 y: ADD_INT R2.y, R2.x, KC0[6].z z: ADD_INT R0.z, R0.x, PV2.y VEC_120 4 x: MULLO_INT R0.x, R1.y, KC0[1].y y: MULLO_INT ____, R1.y, KC0[1].y z: MULLO_INT ____, R1.y, KC0[1].y w: MULLO_INT ____, R1.y, KC0[1].y 5 y: ADD_INT R1.y, R0.y, PV4.x z: U_TO_F R1.z, R2.y VEC_120 w: ADD_INT R1.w, R0.z, KC0[6].x 6 x: U_TO_F R1.x, PV5.w z: ADD_INT R2.z, PV5.y, KC0[6].y 7 x: LSHL R2.x, PV6.z, 1 y: MULADD_e R1.y, KC1[7].x, R1.z, KC1[8].x w: MULADD_e R2.w, KC1[7].x, PV6.x, KC1[8].x 8 x: MUL_e R2.x, KC1[16].x, PV7.y y: MUL_e R0.y, KC1[17].x, PV7.y z: U_TO_F R1.z, PV7.x 9 x: MULLO_INT R1.x, KC1[4].x, KC1[4].x y: MULLO_INT ____, KC1[4].x, KC1[4].x z: MULLO_INT ____, KC1[4].x, KC1[4].x w: MULLO_INT ____, KC1[4].x, KC1[4].x 10 x: MULLO_INT ____, PV9.x, R2.y y: MULLO_INT ____, PV9.x, R2.y z: MULLO_INT ____, PV9.x, R2.y w: MULLO_INT R0.w, PV9.x, R2.y 11 x: MULADD_e R1.x, KC1[11].x, R2.w, R0.y y: MUL_e R1.y, KC1[15].x, R1.y z: LSHR R0.z, PV10.w, 1 12 x: MULLO_INT ____, R2.z, KC1[4].x y: MULLO_INT ____, R2.z, KC1[4].x z: MULLO_INT R2.z, R2.z, KC1[4].x w: MULLO_INT ____, R2.z, KC1[4].x 13 x: ADD R1.x, KC1[20].x, R1.x y: ADD_INT R0.y, R1.w, PV12.z w: MULADD_e R1.w, KC1[10].x, R2.w, R2.x 14 y: ADD_INT R2.y, PV13.y, R0.z w: MULADD_e R0.w, KC1[7].x, R1.z, KC1[8].x VEC_021 15 x: LSHL R0.x, PV14.y, (0x00000002, 2.802596929e-45f).x z: ADD R0.z, KC1[19].x, R1.w w: MULADD_e R1.w, KC1[9].x, R2.w, R1.y VEC_021 16 x: ADD R2.x, KC1[18].x, PV15.w z: MULADD_e R1.z, KC1[14].x, R0.w, R1.x 17 z: MULADD_e R2.z, KC1[13].x, R0.w, R0.z w: MULADD_e R1.w, KC1[12].x, R0.w, PV16.x 18 x: RCP_e ____, R1.z y: RCP_e R0.y, R1.z z: RCP_e ____, R1.z w: ADD_INT R2.w, KC1[0].x, R0.x 19 x: MULADD_e R3.x, PV18.y, R1.w, 0.5 y: MULADD_e R3.y, R2.z, PV18.y, 0.5 z: ADD R2.z, KC1[7].x, R0.w VEC_021 w: ADD_INT R0.w, KC1[1].x, R0.x 20 x: LSHR R0.x, R2.w, (0x00000002, 2.802596929e-45f).x z: MULADD_e R0.z, KC1[13].x, PV19.z, R0.z w: MULADD_e R2.w, KC1[14].x, PV19.z, R1.x 01 TEX: ADDR(112) CNT(2) 21 SAMPLE R2._x__, R3.xy0x, t0, s0 UNNORM(XYZW) 22 VFETCH R1.___x, R0.x, fc155 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(98) CNT(10) KCACHE0(CB1:0-15) 23 x: LSHR R1.x, R0.w, (0x00000002, 2.802596929e-45f).x z: MULADD_e R1.z, KC0[12].x, R2.z, R2.x w: MUL_e R0.w, R0.y, 
R2.y 24 x: RCP_e R2.x, R2.w y: RCP_e ____, R2.w z: RCP_e ____, R2.w 25 x: MULADD_e R3.x, R0.y, R0.w, R1.w y: MULADD_e R0.y, R0.z, PV24.x, 0.5 z: MULADD_e R0.z, PV24.x, R1.z, 0.5 03 MEM_RAT_CACHELESS_STORE_DWORD__NI_ACK: RAT(1)[R0].x___, R3, ARRAY_SIZE(4) MARK VPM 04 TEX: ADDR(116) CNT(1) 26 SAMPLE R1.__x_, R0.zy0z, t0, s0 UNNORM(XYZW) 05 WAIT_ACK: Outstanding_acks <= 0 06 TEX: ADDR(118) CNT(1) 27 VFETCH R1._x__, R1.x, fc155 FETCH_TYPE(NO_INDEX_OFFSET) 07 ALU: ADDR(108) CNT(2) 28 y: MUL_e R2.y, R2.x, R1.z 29 x: MULADD_e R0.x, R2.x, PV28.y, R1.y 08 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM 09 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 960;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x2 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

                                                        • Performance Comparison ATI-NVidia
                                                          eci

                                                          OK, then I got it basically right. I only looked at the code that was posted further up, which is why I got the packing ratio a little wrong.

                                                          That leaves the question of whether I'm making a major mistake in the way I access memory.

                                                            • Performance Comparison ATI-NVidia
                                                              eduardoschardong

                                                               

                                                              Originally posted by: eci That leaves the question of whether I'm making a major mistake in the way I access memory.


                                                              Maybe, profiling output would help.

                                                               

                                                              Without this info the best I can do is guess, and a bad guess at that, since I've never used texture sampling (I don't care about imaging at all). Well... get two bad guesses for the price of one:

                                                               

                                                              1) IIRC, textures on GPUs aren't organized in row-major order or any other "linear" way; they are stored in Z-order (Morton order). If your kernel IDs are linear, the sampler will read from very sparse memory locations, limiting performance (see the sketch after point 2).

                                                              2) I'm not sure how wavefronts are formed for 3D kernels; if they aren't all from the same row, there may be stores to sparse addresses, or a pattern that causes the GPU to serialize accesses, again limiting performance a lot.
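                                                              To illustrate guess 1: a Z-order (Morton) index interleaves the bits of the two coordinates, so texels that are close in 2D stay close in memory. A sketch of the standard construction (not necessarily ATI's exact layout):

                                                              /* Morton index for a 2D coordinate, interleaving the low 16 bits of x and y. */
                                                              unsigned int morton2d(unsigned int x, unsigned int y)
                                                              {
                                                                  unsigned int z = 0;
                                                                  for (unsigned int b = 0; b < 16; ++b) {
                                                                      z |= (x & (1u << b)) << b;         /* bit b of x -> result bit 2b   */
                                                                      z |= (y & (1u << b)) << (b + 1);   /* bit b of y -> result bit 2b+1 */
                                                                  }
                                                                  return z;
                                                              }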

                                                               

                                                               

                                                              TIP: In your last kernel, avoid breaking clauses. The compiler apparently failed to realize that the two loads/stores are independent. Do all the loads/sampling in a sequence, calculate everything for both stores, and then store; this will reduce the number of clauses and increase packing. Something like the sketch below:
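                                                              Applied to your last kernel, that restructuring would look roughly like this (an untested sketch):

                                                              constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

                                                              __kernel void RoadRunnerGrouped(
                                                                  __global float* volume1,
                                                                  __global float* volume2,
                                                                  __read_only image2d_t projection,
                                                                  int L, float R_L, float O_L,
                                                                  float m0, float m1, float m2, float m3, float m4,  float m5,
                                                                  float m6, float m7, float m8, float m9, float m10, float m11)
                                                              {
                                                                  size_t id1 = get_global_id(0);
                                                                  size_t id2 = get_global_id(1);
                                                                  size_t id3 = get_global_id(2);

                                                                  float z  = O_L + (float)id3 * R_L;
                                                                  float y0 = O_L + (float)(id2 * 2) * R_L;
                                                                  float y1 = y0 + R_L;
                                                                  float x  = O_L + (float)id1 * R_L;
                                                                  int tmp  = id3 * L * L / 2 + id2 * L + id1;

                                                                  float wTmp = m2 * x + m8 * z + m11;
                                                                  float uTmp = m0 * x + m6 * z + m9;
                                                                  float vTmp = m1 * x + m7 * z + m10;
                                                                  float w0 = native_recip(wTmp + m5 * y0);
                                                                  float w1 = native_recip(wTmp + m5 * y1);

                                                                  // all loads and samples issued back to back...
                                                                  float a0 = volume1[tmp];
                                                                  float a1 = volume2[tmp];
                                                                  float s0 = read_imagef(projection, smp,
                                                                      (float2)((uTmp + m3 * y0) * w0 + 0.5f, (vTmp + m4 * y0) * w0 + 0.5f)).x;
                                                                  float s1 = read_imagef(projection, smp,
                                                                      (float2)((uTmp + m3 * y1) * w1 + 0.5f, (vTmp + m4 * y1) * w1 + 0.5f)).x;

                                                                  // ...then all ALU work, then both stores at the end
                                                                  volume1[tmp] = a0 + s0 * w0 * w0;
                                                                  volume2[tmp] = a1 + s1 * w1 * w1;
                                                              }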

                                                               

                                            • Performance Comparison ATI-NVidia
                                              iya

                                               

                                              Originally posted by: eci Hi

                                               

                                               

                                              I tried different versions of the kernel and I'm pretty sure that memory access is the limiting factor. But why? Am I missing something?

                                               

                                              ...

                                              For NVidia, the inner part is surrounded by a loop; for my current ATI version, every kernel processes two voxels in two different memory blocks.

                                               

                                              The points from the paper above are even more important for ATI.

                                              For example:

                                              • Do more work per thread.
                                              • Use more registers per thread.

                                              Having no loop in the kernel is not good: you're redoing all the constant calculations every time. Try to find a loop, and if necessary unroll it manually until you get high enough register usage, as in the sketch below.
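                                              A generic sketch of the idea (not eci's kernel; assumes the trip count n is a multiple of 4): a loop manually unrolled by 4, so each thread carries more independent work and uses more registers.

                                              __kernel void accumulate(__global float* dst, __global const float* src, int n)
                                              {
                                                  size_t i      = get_global_id(0);
                                                  size_t stride = get_global_size(0);
                                                  float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;  // independent accumulators
                                                  for (int j = 0; j < n; j += 4) {                   // manually unrolled by 4
                                                      a0 += src[(j + 0) * stride + i];               // column-wise: coalesced
                                                      a1 += src[(j + 1) * stride + i];
                                                      a2 += src[(j + 2) * stride + i];
                                                      a3 += src[(j + 3) * stride + i];
                                                  }
                                                  dst[i] = a0 + a1 + a2 + a3;
                                              }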

                                                • Performance Comparison ATI-NVidia
                                                  eci

                                                  I would love to profile the code, but the profiler is not working for me. It only gives me the following message:

                                                  ATI Stream Profiler

                                                  The program failed to profile. Try to compile and run the active project manually and make sure you have a write access permission to the projects's directory.

                                                  I can run the program manually and I have write access. Does somebody know in more detail what's behind this message?

                                                    • Performance Comparison ATI-NVidia
                                                      himanshu.gautam

                                                      eci,

                                                      Does the profiler give the same message when running any other code?

                                                      Does it show any profiling data for the SDK samples?

                                                      You can read this document to get help regarding the profiler:

                                                      C:\Program Files (x86)\ATI Stream\Tools\ATI Stream Profiler 2.1\Help

                                                        • Performance Comparison ATI-NVidia
                                                          eci

                                                          I finally got time to work on this issue again.

                                                          I downloaded the Stream SDK samples and the profiler works with them.

                                                            For my own project it still won't work; still the same error I mentioned above. I have to add that the project I'm working with has several subprojects, and the actual OpenCL code is loaded as a .dll into the main program. Is there a way of making this work?

                                                            • Performance Comparison ATI-NVidia
                                                              nou

                                                              You can try profiling it manually via GPUPerfAPI: http://developer.amd.com/gpu/GPUPerfAPI/Pages/default.aspx

                                                                • Performance Comparison ATI-NVidia
                                                                  eci

                                                                  Thanks for that tip. I'll look into that.

                                                                   

                                                                  In the meantime I have tried profiling it a little by hand and stumbled upon something interesting:

                                                                  I've put QueryPerformanceCounter calls around clEnqueueNDRangeKernel + clFinish (see the sketch below), and what I can't explain right now is why approximately the first 100 kernel executions take around 145ms each, while every execution after that takes between 10 and 11ms.

                                                                  Do you know how to explain that? Every kernel does basically the same work. I would understand some fluctuation in runtimes due to the projection geometries, but I can't explain this pattern...
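                                                                  For reference, this is roughly how I'm timing it (a simplified sketch of my host code):

                                                                  #include <windows.h>
                                                                  #include <CL/cl.h>

                                                                  /* Wall-clock time of one kernel launch, in milliseconds. */
                                                                  double time_kernel_ms(cl_command_queue q, cl_kernel k,
                                                                                        const size_t* gws, const size_t* lws)
                                                                  {
                                                                      LARGE_INTEGER freq, t0, t1;
                                                                      QueryPerformanceFrequency(&freq);
                                                                      QueryPerformanceCounter(&t0);
                                                                      clEnqueueNDRangeKernel(q, k, 3, NULL, gws, lws, 0, NULL, NULL);
                                                                      clFinish(q);   /* block until the kernel has actually completed */
                                                                      QueryPerformanceCounter(&t1);
                                                                      return (double)(t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart;
                                                                  }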

                                                                • Performance Comparison ATI-NVidia
                                                                  lbin

                                                                  eci

                                                                  Which version of the Stream Profiler were you using? A new version is coming very soon.

                                                                  When you were using the Stream Profiler to profile your app, did your app start to run, or did it simply not start, with the error message appearing immediately after you clicked the profile button?

                                                                   


                                                                    • Performance Comparison ATI-NVidia
                                                                      eci

                                                                      I'm using the newest version. I needed to download it because I'm using a 64-bit build.

                                                                      It's hard to say if the app starts. The command box pops up, but none of my program output shows up. Then the error is shown.

                                                                        • Performance Comparison ATI-NVidia
                                                                          lbin

                                                                          Can you try the command-line version of the Stream Profiler?

                                                                          sprofile -w [workingDir] [app] [app args]

                                                                          You can refer to the release notes for more detailed usage.

                                                                          If it is possible, please send your visual studio project file (.vcproj) to gputools.support@amd.com

                                                                            • Performance Comparison ATI-NVidia
                                                                              eci

                                                                              OK, the command-line version of the profiler works, and after statically linking my whole project, even the normal way of running the profiler from Visual Studio works.

                                                                              I still can't get any profiler results because of the memory consumption. When I use the profiler, the size of my pagefile goes through the roof and execution gets very, very slow. Is there a solution for that?

                                                                                • Performance Comparison ATI-NVidia
                                                                                  lbin

                                                                                  Running the application under the profiler will always require more memory than the original application’s memory requirement.  This is because the profiler needs to save memory states in order to collect all the GPU performance counters correctly. 
                                                                                  The memory requirement under the profiler directly translates to the number of clSetKernelArg calls for buffers created with the read-and-write flag, the size of these buffers, and the number of enqueued kernel commands. You can reduce the memory requirements under the profiler by reducing these calls.
                                                                                  In an upcoming profiler version, we have performed some optimizations to improve the memory consumption. 

                                                                                    • Performance Comparison ATI-NVidia
                                                                                      eci

                                                                                      Hi

                                                                                      I can provide profiler results now:

                                                                                      Method , ExecutionOrder , GlobalWorkSize , GroupWorkSize , Time , LDSSize , DataTransferSize , GPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , ALUStalledByLDS , LDSBankConflict
                                                                                      WriteBuffer ,     1 , , ,       219.35086 , ,    262144.00 , , ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,
                                                                                      WriteBuffer ,     2 , , ,       170.11653 , ,    262144.00 , , ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,
                                                                                      WriteImage2D ,     3 , , ,         0.00042 , ,      4680.00 , , ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,
                                                                                      RoadRunner__k1_Cypress1 ,     4 , {    512     512     512} , {   64     1     1} ,       141.96102 ,           0 , ,     7 ,     0 ,     1 ,   2097152.00 ,        25.00 ,         2.00 ,         1.00 ,         0.00 ,         0.00 ,         8.58 ,        12.50 ,        38.00 ,    270610.31 ,        98.38 ,         5.85 ,         5.85 ,        51.11 ,    524301.88 ,         0.00 ,       100.00 ,         0.00 ,         0.00
                                                                                      WriteImage2D ,     5 , , ,         0.00042 , ,      4680.00 , , ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,
                                                                                      RoadRunner__k1_Cypress1 ,     6 , {    512     512     512} , {   64     1     1} ,       142.21886 ,           0 , ,     7 ,     0 ,     1 ,   2097152.00 ,        25.00 ,         2.00 ,         1.00 ,         0.00 ,         0.00 ,         8.50 ,        12.50 ,        38.00 ,    269159.88 ,        98.27 ,         5.75 ,         5.75 ,        50.83 ,    524302.38 ,         0.00 ,       100.00 ,         0.00 ,         0.00
                                                                                      WriteImage2D ,     7 , , ,         0.00042 , ,      4680.00 , , ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,
                                                                                      RoadRunner__k1_Cypress1 ,     8 , {    512     512     512} , {   64     1     1} ,       136.65246 ,           0 , ,     7 ,     0 ,     1 ,   2097152.00 ,        25.00 ,         2.00 ,         1.00 ,         0.00 ,         0.00 ,         8.66 ,        12.50 ,        38.00 ,    265210.19 ,        99.15 ,         5.79 ,         5.79 ,        50.81 ,    524290.75 ,         0.00 ,       100.00 ,         0.00 ,         0.00
                                                                                      WriteImage2D ,     9 , , ,         0.00042 , ,      4680.00 , , ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,  ,

                                                                                      The current version of my kernel is attached. Total runtime is 4.5 times longer than on an NVidia Quadro FX 5600. Can you give me any pointers as to why this is the case? Based on some experiments, I would guess it has something to do with global memory access.

                                                                                      I'm working on a second kernel (for another problem) that works only on images, and that one performs around 2 times faster than on the NVidia card.

                                                                                      Anyone?

                                                                                       

                                                                                      constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

                                                                                      __kernel void RoadRunner(
                                                                                          __global float* volume1,
                                                                                          __global float* volume2,
                                                                                          __read_only image2d_t projection,
                                                                                          int L, int S_x, int S_y,
                                                                                          float R_L, float O_L,
                                                                                          float m0, float m1, float m2, float m3, float m4,  float m5,
                                                                                          float m6, float m7, float m8, float m9, float m10, float m11)
                                                                                      {
                                                                                          size_t id1 = get_global_id(0);
                                                                                          size_t id2 = get_global_id(1);
                                                                                          size_t id3 = get_global_id(2);

                                                                                          float z = O_L + (float)id3 * R_L;
                                                                                          float y = O_L + (float)id2 * R_L;
                                                                                          float x = O_L + (float)id1 * R_L;

                                                                                          //int tmp = id3 * L * L/2 + id2 * L + id1;

                                                                                          float wTmp = m2 * x + m8 * z + m11;
                                                                                          float uTmp = m0 * x + m6 * z + m9;
                                                                                          float vTmp = m1 * x + m7 * z + m10;

                                                                                          float w_n = wTmp + m5 * y;
                                                                                          float w   = native_recip(w_n);
                                                                                          float u_n = (uTmp + (m3 * y)) * w;
                                                                                          float v_n = (vTmp + (m4 * y)) * w;

                                                                                          if (id2 < L / 2)
                                                                                              volume1[id3 * L * L / 2 + id2 * L + id1] +=
                                                                                                  read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x * w * w;
                                                                                          else
                                                                                              volume2[id3 * L * L / 2 + (id2 - L / 2) * L + id1] +=
                                                                                                  read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x * w * w;
                                                                                      }

                                                                                        • Performance Comparison ATI-NVidia
                                                                                          eci

                                                                                          Maybe the profiler output is more readable when attached as code:

                                                                                          Method , ExecutionOrder , GlobalWorkSize , GroupWorkSize , Time , LDSSize , DataTransferSize , GPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , ALUStalledByLDS , LDSBankConflict
                                                                                          WriteBuffer , 1 , , , 219.35086 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                          WriteBuffer , 2 , , , 170.11653 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                          WriteImage2D , 3 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                          RoadRunner__k1_Cypress1 , 4 , {512 512 512} , {64 1 1} , 141.96102 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.58 , 12.50 , 38.00 , 270610.31 , 98.38 , 5.85 , 5.85 , 51.11 , 524301.88 , 0.00 , 100.00 , 0.00 , 0.00
                                                                                          WriteImage2D , 5 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                          RoadRunner__k1_Cypress1 , 6 , {512 512 512} , {64 1 1} , 142.21886 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.50 , 12.50 , 38.00 , 269159.88 , 98.27 , 5.75 , 5.75 , 50.83 , 524302.38 , 0.00 , 100.00 , 0.00 , 0.00
                                                                                          WriteImage2D , 7 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                          RoadRunner__k1_Cypress1 , 8 , {512 512 512} , {64 1 1} , 136.65246 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.66 , 12.50 , 38.00 , 265210.19 , 99.15 , 5.79 , 5.79 , 50.81 , 524290.75 , 0.00 , 100.00 , 0.00 , 0.00
                                                                                          WriteImage2D , 9 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,

                                                                                          • Performance Comparison ATI-NVidia
                                                                                            kbrafford

                                                                                            @eci, would your RoadRunner kernel go faster if you changed it as follows?


                                                                                            From this:

                                                                                                if (id2 < L/2)
                                                                                                    volume1[id3 * L * L/2 + id2 * L + id1] +=
                                                                                                        read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w;
                                                                                                else
                                                                                                    volume2[id3 * L * L/2 + (id2-L/2) * L + id1] +=
                                                                                                        read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w;

                                                                                            To this:

                                                                                                /* hoist the image fetch and the common part of the index
                                                                                                   out of the branch so both paths share them */
                                                                                                float temp = read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w;
                                                                                                int index = id3 * L * L/2 + id1;
                                                                                                if (id2 < L/2)
                                                                                                    volume1[index + id2 * L] += temp;
                                                                                                else
                                                                                                    volume2[index + (id2-L/2) * L] += temp;

                                                                                              • Performance Comparison ATI-NVidia
                                                                                                bubu

                                                                                                I think your kernel is slower because the 5870's shaders run at 850 MHz, while the Quadro FX5600's shaders run at 1350 MHz.

                                                                                                The 5870 has quite a bit more raw SIMD power (like 2x, due to its large wavefronts and VLIW design), but if your kernel is not very well optimized and vectorised, then the higher clock will win.

                                                                                                 

                                                                                                And, yep, NVIDIA's OpenCL/CUDA drivers might be a bit more optimised. Remember that the Quadro is a super-high-end card.
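
                                                                                                To put rough numbers on that, here is a back-of-the-envelope using the commonly quoted peak figures for the two cards (counting a MAD as 2 FLOPs; treat these as ballpark values):

                                                                                                #include <stdio.h>

                                                                                                int main(void)
                                                                                                {
                                                                                                    /* Quadro FX 5600 (G80): 128 scalar shaders at 1.35 GHz */
                                                                                                    double fx5600_gflops = 128 * 1.35 * 2;    /* ~346 GFLOPS */
                                                                                                    /* Radeon HD 5870 (Cypress): 1600 ALUs at 0.85 GHz */
                                                                                                    double hd5870_gflops = 1600 * 0.85 * 2;   /* ~2720 GFLOPS */
                                                                                                    printf("FX 5600 ~%.0f GFLOPS, HD 5870 ~%.0f GFLOPS (ratio ~%.1fx)\n",
                                                                                                           fx5600_gflops, hd5870_gflops,
                                                                                                           hd5870_gflops / fx5600_gflops);
                                                                                                    return 0;
                                                                                                }

                                                                                                So the raw peak gap is large, but the HD 5870 only gets near its peak with well-packed, vectorised code.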

                                                                                                  • Performance Comparison ATI-NVidia
                                                                                                    Jawed

                                                                                                    I noticed in the stats you posted, back in December, that ALU Busy is ~9%.

                                                                                                    And the write unit is stalled for 51% of the time.

                                                                                                    OUCH.

                                                                                                    Each kernel launch is only doing about 3.4 billion ALU operations (512^3 work-items x 25 ALUInsts). If you were ALU limited, that would take less than 1.5 milliseconds at the card's peak rate. (I don't know why the ALUInsts column says 25 - I thought you were reporting more - so perhaps 2 ms.)

                                                                                                    You might like to try using a 3D local work size, e.g. 4x4x4, as in the sketch below. This might change the memory access patterns, reducing the write bottleneck. ATI isn't good at intensive scatter with lots of clashes, which is what this kernel seems to be doing. The HD 6970 is meant to be better.
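
                                                                                                    Something like this on the host side (purely a sketch; queue and kernel come from the existing setup code):

                                                                                                    #include <CL/cl.h>

                                                                                                    /* Same 512^3 global size, but a 4x4x4 local size (still 64
                                                                                                       work-items, i.e. one wavefront), to see whether the write
                                                                                                       pattern improves. */
                                                                                                    cl_int enqueue_4x4x4(cl_command_queue queue, cl_kernel kernel)
                                                                                                    {
                                                                                                        size_t global[3] = { 512, 512, 512 };
                                                                                                        size_t local[3]  = { 4, 4, 4 };
                                                                                                        return clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                                                                                                                      global, local, 0, NULL, NULL);
                                                                                                    }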

                                                                                                    I got better performance with 3D images than with 2D images simulating 3D images, so you might like to try that (though I was read-bottlenecked and you're write-bottlenecked). I'm able to allocate a 512MB 3D image (actually I can allocate two on a 1GB HD 5870).

                                                                                                    You need to use:

                                                                                                    #pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable

                                                                                                    (Hmm, not sure about SDK 2.3; it's definitely in SDK 2.2.)

                                                                                                    You can read and write 3D images from one kernel. If you're really sneaky you can read and write the same 3D image (though OpenCL says this is not supported), but I don't think that's relevant here.
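
                                                                                                    A minimal sketch of that pattern (illustrative only, not your actual kernel; assumes the extension is supported and reads one image while writing another):

                                                                                                    #pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable

                                                                                                    __constant sampler_t nearest = CLK_NORMALIZED_COORDS_FALSE |
                                                                                                                                   CLK_ADDRESS_CLAMP_TO_EDGE |
                                                                                                                                   CLK_FILTER_NEAREST;

                                                                                                    /* Read a voxel from src, add a value, write it to dst. */
                                                                                                    __kernel void accumulate3d(__read_only image3d_t src,
                                                                                                                               __write_only image3d_t dst,
                                                                                                                               float delta)
                                                                                                    {
                                                                                                        int4 p = (int4)(get_global_id(0), get_global_id(1),
                                                                                                                        get_global_id(2), 0);
                                                                                                        float4 v = read_imagef(src, nearest, p);
                                                                                                        write_imagef(dst, p, v + delta);  /* scalar widens to float4 */
                                                                                                    }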

                                                                                                • Performance Comparison ATI-NVidia
                                                                                                  himanshu.gautam

                                                                                                   

                                                                                                  Originally posted by: eci
                                                                                                  Hi

                                                                                                  I can provide profiler results now:

                                                                                                  Method , ExecutionOrder , GlobalWorkSize , GroupWorkSize , Time , LDSSize , DataTransferSize , GPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , ALUStalledByLDS , LDSBankConflict
                                                                                                  WriteBuffer , 1 , , , 219.35086 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                                  WriteBuffer , 2 , , , 170.11653 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                                  WriteImage2D , 3 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                                  RoadRunner__k1_Cypress1 , 4 , {512 512 512} , {64 1 1} , 141.96102 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.58 , 12.50 , 38.00 , 270610.31 , 98.38 , 5.85 , 5.85 , 51.11 , 524301.88 , 0.00 , 100.00 , 0.00 , 0.00
                                                                                                  WriteImage2D , 5 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                                  RoadRunner__k1_Cypress1 , 6 , {512 512 512} , {64 1 1} , 142.21886 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.50 , 12.50 , 38.00 , 269159.88 , 98.27 , 5.75 , 5.75 , 50.83 , 524302.38 , 0.00 , 100.00 , 0.00 , 0.00
                                                                                                  WriteImage2D , 7 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
                                                                                                  RoadRunner__k1_Cypress1 , 8 , {512 512 512} , {64 1 1} , 136.65246 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.66 , 12.50 , 38.00 , 265210.19 , 99.15 , 5.79 , 5.79 , 50.81 , 524290.75 , 0.00 , 100.00 , 0.00 , 0.00
                                                                                                  WriteImage2D , 9 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,


                                                                                                  The current version of my kernel is attached. Total runtime is 4.5 times longer than on a NVidia Quadro FX 5600. Can you give me any pointers as to why this is the case? Based on some experiments, I would guess it has something to do with global memory access.

                                                                                                  I'm working on a second kernel (for another problem) that only works on images, and that one performs around 2 times faster than on the NVidia card.

                                                                                                  Anyone?

                                                                                                   

                                                                                                  I guess the Quadro FX 5600 and Cypress do not have similar compute power. Is that so? Can someone quote the compute power of the Quadro FX 5600?

                                                                                                  In addition, the kernel doesn't seem to be optimized for the AMD device. There seem to be a large number of memory accesses, and I think they could be vectorized, which would be a lot more efficient.
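
                                                                                                  For illustration, a vectorized variant could look something like the sketch below (hypothetical and untested; it ignores the two-buffer split for brevity, assumes L is a multiple of 4, and shrinks the global size in dimension 0 to L/4):

                                                                                                  __constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                                                                                                                                 CLK_ADDRESS_CLAMP_TO_EDGE |
                                                                                                                                 CLK_FILTER_LINEAR;

                                                                                                  /* Each work-item handles four consecutive voxels along x, so the
                                                                                                     store becomes a single float4 write. */
                                                                                                  __kernel void RoadRunner_vec4(__global float4* volume,
                                                                                                                                __read_only image2d_t projection,
                                                                                                                                int L, float R_L, float O_L,
                                                                                                                                float m0, float m1, float m2,  float m3,
                                                                                                                                float m4, float m5, float m6,  float m7,
                                                                                                                                float m8, float m9, float m10, float m11)
                                                                                                  {
                                                                                                      size_t id1 = get_global_id(0);   /* x / 4 */
                                                                                                      size_t id2 = get_global_id(1);   /* y     */
                                                                                                      size_t id3 = get_global_id(2);   /* z     */

                                                                                                      float  z = O_L + (float)id3 * R_L;
                                                                                                      float  y = O_L + (float)id2 * R_L;
                                                                                                      float4 x = O_L + ((float)(4 * id1) + (float4)(0.0f, 1.0f, 2.0f, 3.0f)) * R_L;

                                                                                                      float4 w_n = m2 * x + (m5 * y + m8 * z + m11);
                                                                                                      float4 w   = native_recip(w_n);
                                                                                                      float4 u_n = (m0 * x + (m3 * y + m6 * z + m9 )) * w;
                                                                                                      float4 v_n = (m1 * x + (m4 * y + m7 * z + m10)) * w;

                                                                                                      float4 s;
                                                                                                      s.x = read_imagef(projection, sampler, (float2)(u_n.x + 0.5f, v_n.x + 0.5f)).x;
                                                                                                      s.y = read_imagef(projection, sampler, (float2)(u_n.y + 0.5f, v_n.y + 0.5f)).x;
                                                                                                      s.z = read_imagef(projection, sampler, (float2)(u_n.z + 0.5f, v_n.z + 0.5f)).x;
                                                                                                      s.w = read_imagef(projection, sampler, (float2)(u_n.w + 0.5f, v_n.w + 0.5f)).x;

                                                                                                      /* one coalesced 16-byte store per work-item */
                                                                                                      volume[(id3 * L * L + id2 * L) / 4 + id1] += s * w * w;
                                                                                                  }

                                                                                                  The four image fetches per work-item remain, but the writes become 128-bit and the per-voxel arithmetic packs better into the VLIW slots.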