Archives Discussions

eci · ‎11-19-2010

Hi

I'm currently comparing performance of OpenCL on different platforms. I'm especially interested in comparing NVidia and ATI graphics cards. The cards I'm currently using are a NVidia Quadro FX5600 and an ATI Radeon HD 5870.

The task I use for comparison is a backprojection algorithm.

After running some experiments the ATI card is two times slower then the card from NVidia while from a theoretical point of view it should be at least twice as fast. That's a very disappointing result and I'm curious what the reason for this poor performance is.

The Problem is as follows:

I reconstruct a volume from projection data. The volume size is 512ˆ3 and I have 400 Projections. For every projection one kernel-run is launched. The task of the kernels is to compute for every voxel a position in the current projection image and take this value to increment the voxel value. For the projection images I am using image_2d with a sampler for linear interpolation.

On the NVidia graphics card I am using a 2D problem over the x-z-direction of the volume. Every kernel runs one line in y-direction and work-groups are aligned along the x-direction. This way memory access is coalesced and I get very good performance.

On the ATI graphics card I tried the same approach, but performance was devastating. So I went back to a 3D problem. I experimented with work-group sizes and alignment along the x-direction seems to be beneficial here too. This type of implementation currently yields the best performance on ATI, but as stated it takes double the time of the NVidia card.

I tried different versions of the kernel and I'm pretty sure, that memory access is the limiting factor. But why? Do I miss something?

One more question: Is there a way to get around the memory restrictions with the current Stream SDK? I'm already using the environment variables to get access to the full 1GB of memory, but can still only allocate 256MB of memory in one block which is very annoying!

I attached the simples version of my kernel code. For NVidia the inner part is surrounded by a loop and for my current ATI version every kernel processes two voxels on two different memory blocks because of the limitation of the maximum memory block size of 256MB on the current Stream SDK.

Thanks for your support!

Greetings

Christian

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR; __kernel void RoadRunner( __global float* volume, __read_only image2d_t projection, int L, int S_x, int S_y, float R_L, float O_L, float m0, float m1, float m2, float m3, float m4, float m5, float m6, float m7, float m8, float m9, float m10, float m11) { size_t id1 = get_global_id(0); size_t id2 = get_global_id(1); size_t id3 = get_global_id(2); float z = O_L + (float)id1 * R_L; float y = O_L + (float)id2 * R_L; float x = O_L + (float)id3 * R_L; float w_n = m2 * x + m5 * y + m8 * z + m11; float u_n = (m0 * x + m3 * y + m6 * z + m9 ) / w_n; float v_n = (m1 * x + m4 * y + m7 * z + m10) / w_n; volume[id1 * L * L + id2 * L + id3] += (float)(1.0 / (w_n * w_n) * read_imagef(projection, sampler, (float2)(u_n+0.5, v_n+0.5)).x); return; }

cjang · ‎11-19-2010

Some suggestions:

1. sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;

2. vectorize memory access so: __global floatX * volume, where X is 2 or 4, scalar calculation is ok inside the kernel, just gather and scatter should be with float2 or float4

3. layout image data to maximize use of texture cache, right now your code is only using the first "x" component while each read_imagef returns a quad of four floats

I don't mean this purely as a shameless plug. I've done a lot of optimization from an applications level. Here's something I wrote that may give you some ideas of how different the vectorized cores of ATI are from the scalar cores of NVIDIA: http://golem5.org/gatlas/CaseStudyGATLAS.htm . Kernel design can be different between vendors (and different architectures).

zeland · ‎11-19-2010

Simple way to optimize is reducing of operation like /

And consider to use __constant for m1,m2 etc.

float w_n = 1/( m2 * x + m5 * y + m8 * z + m11); float u_n = (m0 * x + m3 * y + m6 * z + m9 ) * w_n; float v_n = (m1 * x + m4 * y + m7 * z + m10) * w_n; volume[id1 * L * L + id2 * L + id3] += (float)((w_n * w_n) / read_imagef(projection, sampler, (float2)(u_n+0.5, v_n+0.5)).x);

debdatta_basu · ‎11-19-2010

Does vectorizing solve the problem? If it does, the compiler is REALLY bad coz it wasn't able to vectorize three consecutive identical statements....

Are you sure the compiler is set to fully optimize? In any case... try the vectorized version and share the details..

Cheers!

-Debdatta Basu.

cjang · ‎11-19-2010

To clarify: I suggest vectorizing the gather/scatter memory access but not the calculations. Leave those as scalar operations. eci's speculation that "memory access is the limiting factor" is almost certainly correct.

debdatta.basu, my experience is that the static analysis required for a compiler to auto-vectorize and schedule is non-trivial. It's a hard problem in general.

n0thing · ‎11-19-2010

Try this kernel, it might pass correctness .

The number of global threads should now be 1/4th of original kernel.

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR; __kernel void RoadRunner( __global float4* volume, __read_only image2d_t projection, float R_L, float O_L, float4 m2m5m8m11, float4 m0m3m6m9, float4 m1m4m7m10, float4 L_eq) // (L^2, L, 1, 0) { float4 id = (float4)(get_global_id(0), get_global_id(1), get_global_id(2), 0); float4 pos = O_L + id * R_L; float w_n = dot(pos, m2m5m8m11); float w_n_rcp = native_recip(w_n); float u_n = dot(pos, m0m3m6m9) * w_n_rcp; float v_n = dot(pos, m1m4m7m10) * w_n_rcp; int write_pos = convert_int(dot(id, L_eq)); volume[write_pos] += w_n_rcp * w_n_rcp * read_imagef(projection, sampler, (float2)(u_n+0.5, v_n+0.5)); }

eduardoschardong · ‎11-19-2010

I put the original code on SKA and got the attached ISA, can I blame the compiler now?

; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(122) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 t: MULLO_INT ____, R1.z, KC0[1].z 1 z: ADD_INT ____, R0.z, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 x: ADD_INT R1.x, PV1.z, KC0[6].z w: ADD_INT ____, R0.y, PS1 t: MULLO_INT ____, R1.x, KC0[1].x 3 x: ADD_INT ____, R0.x, PS2 y: ADD_INT T3.y, PV2.w, KC0[6].y t: U_TO_F T0.x, PV2.x 4 y: ADD_INT T2.y, PV3.x, KC0[6].x t: U_TO_F ____, PV3.y 5 y: MULADD_e T1.y, KC1[6].x, PS4, KC1[7].x w: MULADD_e T0.w, KC1[6].x, T0.x, KC1[7].x t: U_TO_F T0.x, PV4.y 6 x: MUL_e T1.x, KC1[9].x, PV5.w z: MUL_e ____, KC1[10].x, PV5.w 7 x: MUL_e T2.x, KC1[8].x, T0.w y: MULADD_e T0.y, KC1[13].x, T1.y, PV6.z 8 w: MULADD_e T1.w, KC1[6].x, T0.x, KC1[7].x 9 z: MULADD_e ____, KC1[16].x, PV8.w, T0.y w: MULADD_e T0.w, KC1[12].x, T1.y, T1.x 10 y: ADD ____, KC1[19].x, PV9.z w: MULADD_e ____, KC1[11].x, T1.y, T2.x 11 x: MOV T2.x, PV10.y y: MULADD_e ____, KC1[14].x, T1.w, PV10.w z: MULADD_e ____, KC1[15].x, T1.w, T0.w w: MOV T0.w, PV10.y t: MUL_e R0.y, PV10.y, PV10.y 12 x: AND_INT ____, PV11.x, (0x807FFFFF, -1.175494211e-38f).x y: ADD T0.y, KC1[18].x, PV11.z z: AND_INT T2.z, PV11.x, (0x7F800000, 1.#INFf).y w: ADD T3.w, KC1[17].x, PV11.y t: AND_INT T0.z, PV11.x, (0x80000000, -0.0f).z 13 x: SETE_INT T0.x, PV12.z, 0.0f y: AND_INT T1.y, PV12.y, (0x807FFFFF, -1.175494211e-38f).x z: OR_INT ____, PV12.x, (0x3F800000, 1.0f).y w: AND_INT R0.w, PV12.y, (0x7F800000, 1.#INFf).z t: AND_INT R1.w, PV12.w, (0x7F800000, 1.#INFf).z 14 x: SETE_INT T1.x, T2.z, (0x7F800000, 1.#INFf).x y: AND_INT ____, T3.w, (0x807FFFFF, -1.175494211e-38f).y z: CNDE_INT T0.z, PV13.x, PV13.z, T0.z w: SETE_INT T2.w, PV13.w, 0.0f t: AND_INT T1.w, T0.y, (0x80000000, -0.0f).z 15 x: SETE_INT T3.x, R1.w, 0.0f y: OR_INT ____, T1.y, (0x3F800000, 1.0f).x z: AND_INT T1.z, T0.w, (0x7F800000, 1.#INFf).y VEC_120 w: AND_INT ____, T3.w, (0x80000000, -0.0f).z VEC_201 t: OR_INT ____, PV14.y, (0x3F800000, 1.0f).x 16 x: CNDE_INT T2.x, T1.x, T0.z, T2.x y: SETE_INT ____, R0.w, (0x7F800000, 1.#INFf).x z: CNDE_INT T0.z, PV15.x, PS15, PV15.w w: CNDE_INT ____, T2.w, PV15.y, T1.w VEC_102 t: SETE_INT T1.w, PV15.z, (0x7F800000, 1.#INFf).x 17 x: SETE_INT ____, R1.w, (0x7F800000, 1.#INFf).x y: SETE_INT ____, T1.z, 0.0f z: CNDE_INT T1.z, PV16.y, PV16.w, T0.y w: OR_INT ____, PV16.y, T1.x VEC_021 t: OR_INT ____, T2.w, T0.x 18 x: OR_INT ____, T1.w, PV17.x y: CNDE_INT ____, PV17.x, T0.z, T3.w z: OR_INT T0.z, PV17.w, PS17 w: OR_INT ____, PV17.y, T3.x t: RCP_e ____, T2.x 19 x: SUB_INT ____, R0.w, T2.z y: MUL_e T0.y, T1.z, PS18 z: OR_INT R2.z, PV18.x, PV18.w w: SUB_INT ____, R1.w, T2.z VEC_120 t: MUL_e T3.w, PS18, PV18.y 20 x: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PV19.y y: CNDE_INT T2.y, PV19.z, PV19.w, 0.0f z: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PS19 w: CNDE_INT T1.w, T0.z, PV19.x, 0.0f t: MULLO_INT ____, T2.y, KC1[3].x 21 x: ASHR ____, PV20.z, (0x00000017, 3.222986468e-44f).x y: ASHR ____, PV20.w, (0x00000017, 3.222986468e-44f).x z: ASHR ____, PV20.x, (0x00000017, 3.222986468e-44f).x w: ADD_INT ____, T3.y, PS20 t: ASHR ____, PV20.y, (0x00000017, 3.222986468e-44f).x 22 x: AND_INT T3.x, (0x80000000, -0.0f).x, T0.y y: ADD_INT T0.y, PV21.x, PS21 z: ADD_INT T2.z, T1.w, T0.y w: ADD_INT ____, PV21.z, PV21.y t: MULLO_INT ____, PV21.w, KC1[3].x 23 x: SETGE_INT T2.x, PV22.w, (0x000000FF, 3.573311084e-43f).x y: ADD_INT T2.y, R1.x, PS22 z: SETGE_INT T1.z, 0.0f, PV22.w w: AND_INT R2.w, (0x80000000, -0.0f).y, T3.w t: ADD_INT R5.z, T2.y, T3.w 24 x: AND_INT ____, R0.y, (0x7F800000, 1.#INFf).x y: AND_INT T0.y, R0.y, (0x80000000, -0.0f).y z: SETGE_INT R4.z, 0.0f, T0.y w: AND_INT ____, R0.y, (0x807FFFFF, -1.175494211e-38f).z t: SETGE_INT R3.z, T0.y, (0x000000FF, 3.573311084e-43f).w 25 x: SUB_INT R1.x, (0x3F800000, 1.0f).x, PV24.x y: SETE_INT ____, PV24.x, 0.0f z: SETE_INT R0.z, PV24.x, (0x7F800000, 1.#INFf).y w: OR_INT ____, PV24.w, (0x3F800000, 1.0f).x t: LSHL R0.w, T2.y, (0x00000002, 2.802596929e-45f).z 26 x: OR_INT R0.x, PV25.z, PV25.y y: CNDE_INT R1.y, T1.z, T2.z, T3.x z: CNDE_INT R1.z, PV25.y, PV25.w, T0.y w: CNDE_INT R1.w, T0.z, T2.x, 0.0f VEC_201 t: OR_INT R2.y, T3.x, (0x7F800000, 1.#INFf).x 01 ALU: ADDR(154) CNT(24) KCACHE0(CB1:0-15) 27 x: ADD_INT T2.x, KC0[0].x, R0.w y: CNDE_INT ____, R0.z, R1.z, R0.y z: CNDE_INT T0.z, R0.x, R1.x, 0.0f w: OR_INT T3.w, R2.w, (0x7F800000, 1.#INFf).x t: CNDE_INT ____, R1.w, R1.y, R2.y 28 x: CNDE_INT T3.x, R2.z, R3.z, 0.0f y: ASHR T0.y, PV27.z, (0x00000017, 3.222986468e-44f).x z: ADD R2.z, PS27, 0.5 t: RCP_e ____, PV27.y 29 x: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PS28 y: AND_INT R0.y, (0x80000000, -0.0f).y, PS28 z: ADD_INT R4.z, T0.z, PS28 w: CNDE_INT ____, R4.z, R5.z, R2.w VEC_120 t: LSHR R3.x, T2.x, (0x00000002, 2.802596929e-45f).z 30 x: CNDE_INT ____, T3.x, PV29.w, T3.w y: OR_INT R1.y, PV29.y, (0x7F800000, 1.#INFf).x z: ASHR ____, PV29.x, (0x00000017, 3.222986468e-44f).y 31 x: ADD R2.x, PV30.x, 0.5 w: ADD_INT R2.w, PV30.z, T0.y 02 TEX: ADDR(192) CNT(2) 32 SAMPLE R2.x___, R2.xz0x, t0, s0 UNNORM(XYZW) 33 VFETCH R1.x___, R3.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 03 ALU: ADDR(178) CNT(7) 34 x: SETGE_INT ____, R2.w, (0x000000FF, 3.573311084e-43f).x z: SETGE_INT ____, 0.0f, R2.w 35 z: CNDE_INT ____, PV34.z, R4.z, R0.y w: CNDE_INT ____, R0.x, PV34.x, 0.0f 36 y: CNDE_INT ____, PV35.w, PV35.z, R1.y 37 x: MULADD_e R1.x, R2.x, PV36.y, R1.x 04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R3].x___, R1, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

MicahVillmow · ‎11-19-2010

eduardoschardong,
What do you see as inefficient in the ISA generation? The compiler looks like it is packing pretty well into the ALU slots. It looks like you have ~150 ALU instructions being packed into 35 ALU bundles for a packing ratio of ~4.28.

nou · ‎11-20-2010

try use profiler to see what is a bottlneck.

eduardoschardong · ‎11-20-2010

Micah, what I'm seeing is 17 multiplications, 17 additions and 3 divisions becoming ~150 instructions, it is doing much more work than what is seen in that small piece of code and likely much more work than on nVidia GPUs

Well... With such below expected performance something should be wrong, right?

eci · ‎11-21-2010

First of all thanks for your replies so far!

If forgot to mention in my initial post that I already experimented with vectorization. The vectorized version is performing roughly the same, maybe a little bit worse. But I'm with debdatt.basu on this topic. The compiler should really be able to do this!

Another thing I discovered lately while testing some things out is the variations in runtimes on the ATI card. I get runtimes between 22.5s and 31s (with a mean somewhere just below 30s). When I run the problem on the NVidia card runtimes vary much less (somewhere around 17s).

What I want to add here is that I am totally aware that the code I'm using isn't optimized to the maximum. But that isn't really what I'm looking for. I want to compare OpenCL performance on different platforms.

@n0thing: I'm not sure if I really get the code you posted. The image2d_t projection is a CL_R, CL_FLOAT image. In the last line you access the projection image on position u_n and v_n and use the resulting vector to update four positions in the volume. But with an CL_R image y, z and w of this vector are not really defined. So I'm pretty sure this isn't working. I'm aware of the fact that I could load four of my projection images into an CL_RGBA image to speed up the execution of my problem. But I am working with another framework so it is easier for me this way and the NVidia implementation works the same way.

eduardoschardong · ‎11-21-2010

Hello eci,

The code is well packed as Micah pointed, vectorization won't help as it is, reducing the number of instructions first could make vectorization more profitable, one of the operations that are generating a lot of instructions is the division, I'm not sure how different the native_recip is but it should give you correct results, finally doing x + y * z + w * v is better than y * z + w * v + x because the compiler won't reorder operations to emit MULLADDs, in your case will only help in the vectored version and by a small bit, you can try the attached code first, try to vector it too, may help, also profiling would help a lot.

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR; __kernel void RoadRunner( __global float* volume, __read_only image2d_t projection, int L, int S_x, int S_y, float R_L, float O_L, float m0, float m1, float m2, float m3, float m4, float m5, float m6, float m7, float m8, float m9, float m10, float m11) { size_t id1 = get_global_id(0); size_t id2 = get_global_id(1); size_t id3 = get_global_id(2); float z = O_L + (float)id1 * R_L; float y = O_L + (float)id2 * R_L; float x = O_L + (float)id3 * R_L; float w_n = m11 + m2 * x + m5 * y + m8 * z; float w_n_r = native_recip(w_n); float u_n = ( m9 + m0 * x + m3 * y + m6 * z) * w_n_r; float v_n = (m10 + m1 * x + m4 * y + m7 * z) * w_n_r; volume[id1 * L * L + id2 * L + id3] += w_n_r * w_n_r * native_recip(read_imagef(projection, sampler, (float2)(u_n+0.5, v_n+0.5)).x); }

debdatta_basu · ‎11-22-2010

@cjang... Its definitely not trivial.... But it is well studied, and almost every modern compiler (gcc, intel.. you name it) could vectorize that piece of code...

But as it turns out, vectorizing didnt solve it. Im waiting to see how eduard's kernel performs...

zeland · ‎11-22-2010

May be good way for achieve better performance is multi output per thread.

Like it is described in the Volkov paper:

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

MicahVillmow · ‎11-22-2010

eduardo,
Yeah, I missed the division. That is where the large increase in instructions is coming from.
Here is the CL/IL/ISA that gets generated for a simple Division shader. As you can see, the division is a majority of the instructions.

CL: kernel void div_test(global float* a) { a[0] = a[1] / a[2]; } IL: mdef(209)_out(1)_in(2) mov r0, in0 mov r1, in1 dcl_literal l1, 0x7f800000, 0x7f800000, 0x807fffff, 0x807fffff dcl_literal l2, 0x7f800000, 0x7f800000, 0, 0 dcl_literal l3, 0x80000000, 0x80000000, 0x80000000, 0x80000000 dcl_literal l4, 0x3f800000, 0x3f800000, 0, 0 dcl_literal l5, 0, 0, 0, 0 dcl_literal l6, 0x7fffffff, 0x80000000, 0x7fffffff, 0x80000000 dcl_literal l7, 0x00800000, 0x00800000, 0x00800000, 0x00800000 dcl_literal l8, 0x00000017, 0x00000017, 0x00000017, 0x00000017 dcl_literal l9, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff mov r2.x___, r0.x mov r2._y__, r1.x and r3.xyzw, r2.xyxy, l1 ieq r4.xyzw, r3.xyxy, l2 and r5.xy__, r2.xy, l3 ior r3.__zw, r3.zwzw, l4.xyxy cmov_logical r3.__zw, r4.zwzw, r5.xyxy, r3.zwzw cmov_logical r3.__zw, r4.xyxy, r2.xyxy, r3.zwzw ior r5.xy__, r4.xz, r4.yw ior r5.x___, r5.x, r5.y inegate r5.__z_, r3.yyyy iadd r3.x___, r3.x, r5.z cmov_logical r3.x___, r5.xxxx, l5, r3.xxxx rcp_zeroop(infinity) r2._y__, r3.ww mul_ieee r2.x___, r3.z, r2.y and r2.__zw, r2.xxxx, l6.xyzw ishr r6.x___, r2.z, l8 ishr r6._y__, r3.xxxx, l8 iadd r2.xy__, r2.xzxz, r3.xxxx iadd r6.x___, r6.x, r6.y ige r4.__z_, l5, r6.x ior r4._y__, r2.wwww, l1 ige r4.x, r6.x, l9 cmov_logical r4.x, r5.x, l5, r4.x cmov_logical r2.x, r4.z, r2.w, r2.x cmov_logical r0.x, r4.x, r4.y, r2.x mov out0, r0 mend il_cs_2_0 dcl_cb cb0[9] ; Constant buffer that holds ABI data dcl_literal l0, 4, 1, 2, 3 dcl_literal l1, 0x00FFFFFF, -1, -2, -3 dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC dcl_literal l3, 24, 16, 8, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0, 4, 8, 12 dcl_literal l6, 32, 32, 32, 32 dcl_literal l7, 24, 31, 16, 31 mov r1013, cb0[8].x call 1024;$ endmain func 1024 ; __OpenCL_div_test_kernel mov r1019, l1.0 dcl_max_thread_per_group 256 dcl_raw_uav_id(0) dcl_arena_uav_id(8) mov r0.z, vThreadGrpIdFlat.x mov r1022.xyz0, vTidInGrp.xyz mov r1023.xyz0, vThreadGrpId.xyz imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0 iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.w, r0.z ishl r1023.w, r1023.w, l0.z mov r1018.x, l0.0 mov r1015.x, cb0[8].x dcl_literal l9, 0x00000004, 0x00000004, 0x00000004, 0x00000004; f32:i32 4 dcl_literal l8, 0x00000008, 0x00000008, 0x00000008, 0x00000008; f32:i32 8 dcl_cb cb1[1] ; Kernel arg setup: a mov r1, cb1[0] call 2 ; div_test ret endfunc ; __OpenCL_div_test_kernel ;ARGSTART:__OpenCL_div_test_kernel ;version:1:4:53 ;device:cayman ;uniqueid:1024 ;memory:hwprivate:0 ;memory:hwlocal:0 ;version:1:4:53 ;pointer:a:float:1:1:0:uav:0:4 ;function:1:2 ;uavid:8 ;ARGEND:__OpenCL_div_test_kernel func 2 ; div_test ; @__OpenCL_div_test_kernel ; BB#0: ; %entry mov r258, l8 iadd r258.x___, r1.xxxx, r258.xxxx mov r1010.x___, r258.xxxx uav_raw_load_id(0) r1011.x___, r1010.x mov r258, r1011 mov r259, l9 iadd r259.x___, r1.xxxx, r259.xxxx mov r1010.x___, r259.xxxx uav_raw_load_id(0) r1011.x___, r1010.x mov r259, r1011 ;__fdiv_f32 mcall(209) (r258),(r259, r258) mov r1011, r258 mov r1010.x___, r1.x uav_raw_store_id(0) mem0.x___, r1010.x, r1011.x ret_dyn endfunc ; div_test ;ARGSTART:div_test ;uniqueid:2 ;memory:hwlocal:0 ;intrinsic:1:209 ;ARGEND:div_test end ISA: ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHR R0.x, KC0[0].x, (0x00000002, 2.802596929e-45f).x y: ADD_INT R0.y, KC0[0].x, (0x00000004, 5.605193857e-45f).y z: ADD_INT R0.z, KC0[0].x, (0x00000008, 1.121038771e-44f).z 1 x: LSHR R1.x, PV0.y, (0x00000002, 2.802596929e-45f).x w: LSHR R0.w, PV0.z, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(96) CNT(2) 2 VFETCH R0.__x_, R0.w, fc154 FETCH_TYPE(NO_INDEX_OFFSET) 3 VFETCH R1.x___, R1.x, fc154 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(40) CNT(48) 4 x: AND_INT R2.x, R1.x, (0x807FFFFF, -1.175494211e-38f).x y: AND_INT R0.y, R1.x, (0x80000000, -0.0f).y z: AND_INT R0.z, R1.x, (0x7F800000, 1.#INFf).z w: MOV R0.w, R0.z 5 x: AND_INT R3.x, PV4.w, (0x80000000, -0.0f).x y: AND_INT R1.y, PV4.w, (0x7F800000, 1.#INFf).y z: SETE_INT R1.z, PV4.z, 0.0f w: AND_INT R1.w, PV4.w, (0x807FFFFF, -1.175494211e-38f).z 6 x: OR_INT R2.x, PV5.w, (0x3F800000, 1.0f).x y: SETE_INT R2.y, PV5.y, (0x7F800000, 1.#INFf).y z: OR_INT R2.z, R2.x, (0x3F800000, 1.0f).x w: SETE_INT R1.w, PV5.y, 0.0f 7 x: SETE_INT R2.x, R0.z, (0x7F800000, 1.#INFf).x y: CNDE_INT R0.y, PV6.w, PV6.x, R3.x z: CNDE_INT R1.z, R1.z, PV6.z, R0.y VEC_120 w: OR_INT R1.w, R1.z, PV6.w VEC_120 8 x: CNDE_INT R1.x, PV7.x, PV7.z, R1.x y: SUB_INT R1.y, R0.z, R1.y z: OR_INT R1.z, PV7.x, R2.y VEC_021 w: CNDE_INT R0.w, R2.y, PV7.y, R0.w 9 x: RCP_e ____, PV8.w y: RCP_e R0.y, PV8.w z: RCP_e ____, PV8.w w: OR_INT R0.w, PV8.z, R1.w 10 x: MUL_e R1.x, R1.x, PV9.y z: CNDE_INT R0.z, PV9.w, R1.y, 0.0f 11 x: ASHR R1.x, PV10.z, (0x00000017, 3.222986468e-44f).x y: ADD_INT R0.y, PV10.z, PV10.x z: AND_INT R0.z, (0x80000000, -0.0f).y, PV10.x w: AND_INT R1.w, (0x7FFFFFFF, 1.#QNANf).z, PV10.x 12 x: OR_INT R2.x, PV11.z, (0x7F800000, 1.#INFf).x y: ASHR R1.y, PV11.w, (0x00000017, 3.222986468e-44f).y 13 z: ADD_INT R2.z, PV12.y, R1.x 14 y: SETGE_INT R2.y, 0.0f, PV13.z w: SETGE_INT R2.w, PV13.z, (0x000000FF, 3.573311084e-43f).x 15 y: CNDE_INT R0.y, PV14.y, R0.y, R0.z z: CNDE_INT R0.z, R0.w, PV14.w, 0.0f 16 x: CNDE_INT R1.x, PV15.z, PV15.y, R2.x 03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R0].x___, R1, ARRAY_SIZE(4) MARK VPM 04 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 800;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x1 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

n0thing · ‎11-22-2010

@eci Sorry, didn't realize that the texture format in your case was scalar , but changing the image format to RGBA shouldn't be that hard. If you compare the ISA generated with my kernel vs eduardo's kernel, you can see the manual vectorizing is resulting in less number of instructions and ALU packing is better.

; -------- my kernel -------------------- 00 ALU: ADDR(32) CNT(44) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 w: MOV T0.w, 0.0f t: MULLO_INT ____, R1.x, KC0[1].x 1 y: ADD_INT ____, R0.x, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 z: ADD_INT ____, R0.y, PS1 w: ADD_INT ____, PV1.y, KC0[6].x t: MULLO_INT ____, R1.z, KC0[1].z 3 x: ADD_INT ____, R0.z, PS2 y: ADD_INT ____, PV2.z, KC0[6].y t: U_TO_F T0.x, PV2.w 4 w: ADD_INT ____, PV3.x, KC0[6].z t: U_TO_F T0.y, PV3.y 5 x: MULADD_e T1.x, KC1[3].x, T0.x, KC1[4].x y: MULADD_e T1.y, KC1[3].x, PS4, KC1[4].x w: MULADD_e T0.w, KC1[3].x, T0.w, KC1[4].x t: U_TO_F T0.z, PV4.w 6 x: DOT4 ____, KC1[8].x, T0.x y: DOT4 ____, KC1[8].y, T0.y z: DOT4 ____, KC1[8].z, PS5 w: DOT4 ____, (0x80000000, -0.0f).x, 0.0f 7 z: MULADD_e T0.z, KC1[3].x, T0.z, KC1[4].x t: F_TO_I ____, PV6.x 8 x: DOT4 ____, KC1[5].x, T1.x y: DOT4 ____, KC1[5].y, T1.y z: DOT4 ____, KC1[5].z, PV7.z w: DOT4 ____, KC1[5].w, T0.w t: LSHL T1.z, PS7, (0x00000004, 5.605193857e-45f).x 9 x: DOT4 ____, KC1[6].x, T1.x y: DOT4 ____, KC1[6].y, T1.y z: DOT4 ____, KC1[6].z, T0.z w: DOT4 ____, KC1[6].w, T0.w t: RCP_e T0.x, PV8.x 10 x: DOT4 ____, KC1[7].x, T1.x y: DOT4 ____, KC1[7].y, T1.y z: DOT4 ____, KC1[7].z, T0.z w: DOT4 ____, KC1[7].w, T0.w t: MULADD_e R0.x, PS9, PV9.x, 0.5 11 x: MUL_e R1.x, T0.x, T0.x y: ADD_INT ____, KC1[0].x, T1.z w: MULADD_e R0.w, T0.x, PV10.x, 0.5 12 x: LSHR R3.x, PV11.y, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(80) CNT(2) 13 VFETCH R2, R3.x, fc156 FORMAT(32_32_32_32_FLOAT) MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 14 SAMPLE R0, R0.xw0x, t0, s0 UNNORM(XYZW) 02 ALU: ADDR(76) CNT(4) 15 x: MULADD_e R0.x, R0.x, R1.x, R2.x y: MULADD_e R0.y, R0.y, R1.x, R2.y z: MULADD_e R0.z, R0.z, R1.x, R2.z w: MULADD_e R0.w, R0.w, R1.x, R2.w 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R3], R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM ; -------- eduardo's kernel -------------------- 00 ALU: ADDR(32) CNT(37) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 t: MULLO_INT ____, R1.z, KC0[1].z 1 z: ADD_INT ____, R0.z, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 x: ADD_INT ____, R0.y, PS1 y: ADD_INT T1.y, PV1.z, KC0[6].z t: MULLO_INT ____, R1.x, KC0[1].x 3 x: ADD_INT T1.x, PV2.x, KC0[6].y w: ADD_INT ____, R0.x, PS2 t: U_TO_F T0.w, PV2.y 4 x: ADD_INT T0.x, PV3.w, KC0[6].x t: U_TO_F T0.z, PV3.x 5 t: MULLO_INT ____, PV4.x, KC1[3].x 6 x: MULADD_e T1.x, KC1[6].x, T0.z, KC1[7].x y: MULADD_e T0.y, KC1[6].x, T0.w, KC1[7].x w: ADD_INT T1.w, T1.x, PS5 t: U_TO_F T0.w, T0.x 7 x: MULADD_e T0.x, KC1[10].x, PV6.y, KC1[19].x 8 z: MULADD_e T0.z, KC1[9].x, T0.y, KC1[18].x 9 y: MULADD_e T2.y, KC1[6].x, T0.w, KC1[7].x 10 w: MULADD_e T1.w, KC1[13].x, T1.x, T0.x t: MULLO_INT ____, T1.w, KC1[3].x 11 x: MULADD_e T0.x, KC1[8].x, T0.y, KC1[17].x z: ADD_INT ____, T1.y, PS10 12 x: MULADD_e ____, KC1[12].x, T1.x, T0.z y: MULADD_e ____, KC1[16].x, T2.y, T1.w z: LSHL T0.z, PV11.z, (0x00000002, 2.802596929e-45f).x 13 z: MULADD_e ____, KC1[15].x, T2.y, PV12.x w: MULADD_e ____, KC1[11].x, T1.x, T0.x t: RCP_e T1.z, PV12.y 14 x: MUL_e R1.x, PS13, PS13 y: MULADD_e R0.y, PV13.z, PS13, 0.5 z: MULADD_e ____, KC1[14].x, T2.y, PV13.w w: ADD_INT ____, KC1[0].x, T0.z 15 x: MULADD_e R0.x, T1.z, PV14.z, 0.5 t: LSHR R2.x, PV14.w, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(80) CNT(2) 16 SAMPLE R0.___x, R0.xy0x, t0, s0 UNNORM(XYZW) 17 VFETCH R0.x___, R2.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(69) CNT(2) 18 t: RCP_e ____, R0.w 19 x: MULADD_e R0.x, PS18, R1.x, R0.x 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

eci · ‎11-22-2010

You see my current kernel attached. The reason for the split up of statements is the fact that I can reuse uTmp, vTmp and wTmp and thus reduce instructions. The reason for the two volumes is the limitation of the current Stream SDK which only allows to allocate 256MB of continuous memory in my case. I wasn't aware that I would find such well-founded help here and wanted to keep things simple.

I tried the reordering of instructions eduardo suggested, but sadly this also didn't do the trick.

@nothing: I cannot use the vectorization the way you suggested. The problem is, that in your kernel you use the volume position of the first voxel and compute the projection into the projection image (which is a float greyscale image by the way). What really has to be done is compute for all four voxel positions the appropriate position in the projection image.

I'm still researching if I missed something big in the way I access memory! Is the access layout I use in any way flawed?

I also digged a little bit into the ISA code and am not quite sure if I get it by now. If I start from the kernel from my first post, the kernel has 37 operations. The compiler makes roundabout 150 operations out of them. Then he packs them into these units of 5 instructions which can be run in parallel. That gets me back to around 37 operations. So if I get it right I access at best 1/5 of the cards peak performance!

constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR; __kernel void RoadRunner( __global float* volume1, __global float* volume2, __read_only image2d_t projection, int L, int S_x, int S_y, float R_L, float O_L, float m0, float m1, float m2, float m3, float m4, float m5, float m6, float m7, float m8, float m9, float m10, float m11) { size_t id1 = get_global_id(0); size_t id2 = get_global_id(1); size_t id3 = get_global_id(2); float z = O_L + (float)id3 * R_L; float y = O_L + (float)(id2*2) * R_L; float x = O_L + (float)id1 * R_L; int tmp = id3 * L * L/2 + id2 * L + id1; float wTmp = m2 * x + m8 * z + m11; float uTmp = (m0 * x + m6 * z + m9 ); float vTmp = (m1 * x + m7 * z + m10); float w_n = wTmp + m5 * y; float w = native_recip(w_n); float u_n = (uTmp + (m3 * y)) * w; float v_n = (vTmp + (m4 * y)) * w; volume1[tmp] += read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w; y += R_L; w_n = wTmp + m5 * y; w = native_recip(w_n); u_n = (uTmp + (m3 * y)) * w; v_n = (vTmp + (m4 * y)) * w; volume2[tmp] += read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w; return; }

MicahVillmow · ‎11-22-2010

eci,
1 CL instruction does not equal 1 ISA instruction. Your new kernel with our upcoming SDK release has 78 ALU instructions packed into 25 ALU bundles for a little greater than 3 packing ratio. So there is room for improvement in packing, but you definitely are using more than 1/5th the chip, that would imply your packing ratio is 1.
The ISA is as follows.

ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(66) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 x: MULLO_INT ____, R1.z, KC0[1].z y: MULLO_INT ____, R1.z, KC0[1].z z: MULLO_INT ____, R1.z, KC0[1].z w: MULLO_INT R0.w, R1.z, KC0[1].z 1 x: ADD_INT R2.x, R0.z, PV0.w 2 x: MULLO_INT ____, R1.x, KC0[1].x y: MULLO_INT R2.y, R1.x, KC0[1].x z: MULLO_INT ____, R1.x, KC0[1].x w: MULLO_INT ____, R1.x, KC0[1].x 3 y: ADD_INT R2.y, R2.x, KC0[6].z z: ADD_INT R0.z, R0.x, PV2.y VEC_120 4 x: MULLO_INT R0.x, R1.y, KC0[1].y y: MULLO_INT ____, R1.y, KC0[1].y z: MULLO_INT ____, R1.y, KC0[1].y w: MULLO_INT ____, R1.y, KC0[1].y 5 y: ADD_INT R1.y, R0.y, PV4.x z: U_TO_F R1.z, R2.y VEC_120 w: ADD_INT R1.w, R0.z, KC0[6].x 6 x: U_TO_F R1.x, PV5.w z: ADD_INT R2.z, PV5.y, KC0[6].y 7 x: LSHL R2.x, PV6.z, 1 y: MULADD_e R1.y, KC1[7].x, R1.z, KC1[8].x w: MULADD_e R2.w, KC1[7].x, PV6.x, KC1[8].x 8 x: MUL_e R2.x, KC1[16].x, PV7.y y: MUL_e R0.y, KC1[17].x, PV7.y z: U_TO_F R1.z, PV7.x 9 x: MULLO_INT R1.x, KC1[4].x, KC1[4].x y: MULLO_INT ____, KC1[4].x, KC1[4].x z: MULLO_INT ____, KC1[4].x, KC1[4].x w: MULLO_INT ____, KC1[4].x, KC1[4].x 10 x: MULLO_INT ____, PV9.x, R2.y y: MULLO_INT ____, PV9.x, R2.y z: MULLO_INT ____, PV9.x, R2.y w: MULLO_INT R0.w, PV9.x, R2.y 11 x: MULADD_e R1.x, KC1[11].x, R2.w, R0.y y: MUL_e R1.y, KC1[15].x, R1.y z: LSHR R0.z, PV10.w, 1 12 x: MULLO_INT ____, R2.z, KC1[4].x y: MULLO_INT ____, R2.z, KC1[4].x z: MULLO_INT R2.z, R2.z, KC1[4].x w: MULLO_INT ____, R2.z, KC1[4].x 13 x: ADD R1.x, KC1[20].x, R1.x y: ADD_INT R0.y, R1.w, PV12.z w: MULADD_e R1.w, KC1[10].x, R2.w, R2.x 14 y: ADD_INT R2.y, PV13.y, R0.z w: MULADD_e R0.w, KC1[7].x, R1.z, KC1[8].x VEC_021 15 x: LSHL R0.x, PV14.y, (0x00000002, 2.802596929e-45f).x z: ADD R0.z, KC1[19].x, R1.w w: MULADD_e R1.w, KC1[9].x, R2.w, R1.y VEC_021 16 x: ADD R2.x, KC1[18].x, PV15.w z: MULADD_e R1.z, KC1[14].x, R0.w, R1.x 17 z: MULADD_e R2.z, KC1[13].x, R0.w, R0.z w: MULADD_e R1.w, KC1[12].x, R0.w, PV16.x 18 x: RCP_e ____, R1.z y: RCP_e R0.y, R1.z z: RCP_e ____, R1.z w: ADD_INT R2.w, KC1[0].x, R0.x 19 x: MULADD_e R3.x, PV18.y, R1.w, 0.5 y: MULADD_e R3.y, R2.z, PV18.y, 0.5 z: ADD R2.z, KC1[7].x, R0.w VEC_021 w: ADD_INT R0.w, KC1[1].x, R0.x 20 x: LSHR R0.x, R2.w, (0x00000002, 2.802596929e-45f).x z: MULADD_e R0.z, KC1[13].x, PV19.z, R0.z w: MULADD_e R2.w, KC1[14].x, PV19.z, R1.x 01 TEX: ADDR(112) CNT(2) 21 SAMPLE R2._x__, R3.xy0x, t0, s0 UNNORM(XYZW) 22 VFETCH R1.___x, R0.x, fc155 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(98) CNT(10) KCACHE0(CB1:0-15) 23 x: LSHR R1.x, R0.w, (0x00000002, 2.802596929e-45f).x z: MULADD_e R1.z, KC0[12].x, R2.z, R2.x w: MUL_e R0.w, R0.y, R2.y 24 x: RCP_e R2.x, R2.w y: RCP_e ____, R2.w z: RCP_e ____, R2.w 25 x: MULADD_e R3.x, R0.y, R0.w, R1.w y: MULADD_e R0.y, R0.z, PV24.x, 0.5 z: MULADD_e R0.z, PV24.x, R1.z, 0.5 03 MEM_RAT_CACHELESS_STORE_DWORD__NI_ACK: RAT(1)[R0].x___, R3, ARRAY_SIZE(4) MARK VPM 04 TEX: ADDR(116) CNT(1) 26 SAMPLE R1.__x_, R0.zy0z, t0, s0 UNNORM(XYZW) 05 WAIT_ACK: Outstanding_acks <= 0 06 TEX: ADDR(118) CNT(1) 27 VFETCH R1._x__, R1.x, fc155 FETCH_TYPE(NO_INDEX_OFFSET) 07 ALU: ADDR(108) CNT(2) 28 y: MUL_e R2.y, R2.x, R1.z 29 x: MULADD_e R0.x, R2.x, PV28.y, R1.y 08 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM 09 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 960;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x2 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

eci · ‎11-22-2010

OK, then I got it basically right. I only looked at the code that was posted further up so that's why I got the packing ratio a little bit wrong

That leaves the question if I do a major mistake in the way I access memory?

eduardoschardong · ‎11-22-2010

Originally posted by: eci That leaves the question if I do a major mistake in the way I access memory?

Maybe, profiling output would help.

Without this info the best I can do is guess, and a bad guess since I never used texture sampling, I don't care about imaging at all, well... Get two bad guesses by the price of one:

1) IIRC textures on GPUs aren't organized in row-major order or any other "linear" way, it's z-order, if your kernel ids are linear the sampler will read from very sparse memory locations limiting performance.

2) I'm not sure how 3D kernels wavefronts are formed, if they aren't all from the same row there may be stores to sparse address or a pattern causing the GPU to serialize accesses, again limiting a lot performance.

TIP: In your last kernel avoid breaking clauses, the compiler apparently failed to realize the two loads/stores are independent, make all loads/sampling in a sequence, calculate everything from both stores and then store, this will allow to reduce the number of clauses and increasing packing.

iya · ‎11-23-2010

Originally posted by: eci Hi

I tried different versions of the kernel and I'm pretty sure, that memory access is the limiting factor. But why? Do I miss something?

...
For NVidia the inner part is surrounded by a loop and for my current ATI version every kernel processes two voxels on two different memory blocks.

The points from above paper are even more important for ATI.

For example:

Do more work per thread.
Use more registers per thread.

Having no loop in the kernel is not good. You're doing all the constant calculations each time. Try to find a loop, and if necessary unroll it manually until you get a high enough register usage.

eci · ‎11-23-2010

I would love to profile the code, but the profiler is not working for me. It only gives me the following message:

ATI Stream Profiler

The program failed to profile. Try to compile and run the active project manually and make sure you have a write access permission to the projects's directory.

I can run the program manually and I have write access. Does somebody know in more detail what's behind this message?

himanshu_gautam · ‎11-28-2010

eci,

Does profiler give the same message while running any other code?

Does it show any profiling data for samples?

You can read this document to get help regarding the profiler.

C:\Program Files (x86)\ATI Stream\Tools\ATI Stream Profiler 2.1\Help

eci · ‎12-03-2010

I finally got time to work on this issue again.

I downloaded the Stream SDK samples and the profiler works with them.

For my own project it still won't work. Still the same error I mentioned above. I have to add, that the project I'm working with has several subprojects and the actual OpenCL-code is loaded as a .dll into the main program. Is there a way of making this work?

nou · ‎12-03-2010

you can try profile it manualy via GPUPerfAPI. http://developer.amd.com/gpu/GPUPerfAPI/Pages/default.aspx

eci · ‎12-05-2010

Thanks for that tip. I'll look into that.

In the mean time I have tried profiling it a little bit by hand and stumbled upon something interesting:

I've put QueryPerformanceCounter's around clEnqueueNDRangeKernel + clFinish and what I can't explain right now is why approximately the first 100 Kernel executions take around 145ms and every execution after that is between 10 and 11ms.

Do you know how to explain that? Every kernel does basically the same work and I would understand some fluctuations in runtimes due to the projection geometries, but I can't explain this pattern...

lbin · ‎12-06-2010

eci

Which version of StreamProfiler were you using? A new version is coming very soon.

When you were using StreamProfiler to profile your app, did your app start to run or the app simply didn't start and you got the error message immediately after you click on the profile button?

--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

eci · ‎12-06-2010

I'm using the newest version. I needed to download it because I'm using a 64bit build.

It's hard to say if the app starts. The command-box pops up, but non of my program output shows up. Then the error is shown.

lbin · ‎12-06-2010

Can you try with command line version of Stream Profiler?

sprofile -w [workingDir] [app] [app args]

You can refer to release note for more detailed usage.

If it is possible, please send your visual studio project file (.vcproj) to gputools.support@amd.com

eci · ‎12-09-2010

OK, the command line version of the compiler works and after statically linking my whole project even the normal way of using the compiler out of Visual Studio works.

I still can't get any profiler results because of the memory consumption. When I use the compile the size of my pagefile goes through the roof and the execution gets very very slow. Is there a solution for that?

lbin · ‎12-09-2010

Running the application under the profiler will always require more memory than the original application’s memory requirement. This is because the profiler needs to save memory states in order to collect all the GPU performance counters correctly.
The memory requirement under the profiler directly translates to the number of clSetKernelArgs for buffers that are created with read and write flag, size of these buffers, and the number of enqueue kernel commands. You can reduce the memory requirements under the profiler by reducing these calls.
In an upcoming profiler version, we have performed some optimizations to improve the memory consumption.

eci · ‎12-15-2010

Hi

I can provide profiler results now:

Method , ExecutionOrder , GlobalWorkSize , GroupWorkSize , Time , LDSSize , DataTransferSize , GPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , ALUStalledByLDS , LDSBankConflict
WriteBuffer ,     1 , , ,       219.35086 , ,    262144.00 , , , , , , , , , , , , , , , , , , , , , ,
WriteBuffer ,     2 , , ,       170.11653 , ,    262144.00 , , , , , , , , , , , , , , , , , , , , , ,
WriteImage2D ,     3 , , ,         0.00042 , ,      4680.00 , , , , , , , , , , , , , , , , , , , , , ,
RoadRunner__k1_Cypress1 ,     4 , {    512     512     512} , {   64     1     1} ,       141.96102 ,           0 , ,     7 ,     0 ,     1 ,   2097152.00 ,        25.00 ,         2.00 ,         1.00 ,         0.00 ,         0.00 ,         8.58 ,        12.50 ,        38.00 ,    270610.31 ,        98.38 ,         5.85 ,         5.85 ,        51.11 ,    524301.88 ,         0.00 ,       100.00 ,         0.00 ,         0.00
WriteImage2D ,     5 , , ,         0.00042 , ,      4680.00 , , , , , , , , , , , , , , , , , , , , , ,
RoadRunner__k1_Cypress1 ,     6 , {    512     512     512} , {   64     1     1} ,       142.21886 ,           0 , ,     7 ,     0 ,     1 ,   2097152.00 ,        25.00 ,         2.00 ,         1.00 ,         0.00 ,         0.00 ,         8.50 ,        12.50 ,        38.00 ,    269159.88 ,        98.27 ,         5.75 ,         5.75 ,        50.83 ,    524302.38 ,         0.00 ,       100.00 ,         0.00 ,         0.00
WriteImage2D ,     7 , , ,         0.00042 , ,      4680.00 , , , , , , , , , , , , , , , , , , , , , ,
RoadRunner__k1_Cypress1 ,     8 , {    512     512     512} , {   64     1     1} ,       136.65246 ,           0 , ,     7 ,     0 ,     1 ,   2097152.00 ,        25.00 ,         2.00 ,         1.00 ,         0.00 ,         0.00 ,         8.66 ,        12.50 ,        38.00 ,    265210.19 ,        99.15 ,         5.79 ,         5.79 ,        50.81 ,    524290.75 ,         0.00 ,       100.00 ,         0.00 ,         0.00
WriteImage2D ,     9 , , ,         0.00042 , ,      4680.00 , , , , , , , , , , , , , , , , , , , , , ,

The current version of my kernel is attached. Total runtime is 4.5 times longer than on a NVidia Quadro FX 5600. Can you give me any pointers why this is the case? Based on some experiments I would guess it has something to do with global memory access.

I'm working on a second kernel (for another problem) that only works on images and that one performs abound 2 times faster than the NVidia card.

Anyone?

constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR; __kernel void RoadRunner( __global float* volume1, __global float* volume2, __read_only image2d_t projection, int L, int S_x, int S_y, float R_L, float O_L, float m0, float m1, float m2, float m3, float m4, float m5, float m6, float m7, float m8, float m9, float m10, float m11) { size_t id1 = get_global_id(0); size_t id2 = get_global_id(1); size_t id3 = get_global_id(2); float z = O_L + (float)id3 * R_L; float y = O_L + (float)id2 * R_L; float x = O_L + (float)id1 * R_L; //int tmp = id3 * L * L/2 + id2 * L + id1; float wTmp = m2 * x + m8 * z + m11; float uTmp = (m0 * x + m6 * z + m9 ); float vTmp = (m1 * x + m7 * z + m10); float w_n = wTmp + m5 * y; float w = native_recip(w_n); float u_n = (uTmp + (m3 * y)) * w; float v_n = (vTmp + (m4 * y)) * w; if(id2 < L/2) volume1[id3 * L * L/2 + id2 * L + id1] += read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w; else volume2[id3 * L * L/2 + (id2-L/2) * L + id1] += read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w; return; }

eci · ‎12-15-2010

Maybe the profiler output is more readable as attached Code:

Method , ExecutionOrder , GlobalWorkSize , GroupWorkSize , Time , LDSSize , DataTransferSize , GPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , ALUStalledByLDS , LDSBankConflict WriteBuffer , 1 , , , 219.35086 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , , WriteBuffer , 2 , , , 170.11653 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , , WriteImage2D , 3 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , , RoadRunner__k1_Cypress1 , 4 , { 512 512 512} , { 64 1 1} , 141.96102 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.58 , 12.50 , 38.00 , 270610.31 , 98.38 , 5.85 , 5.85 , 51.11 , 524301.88 , 0.00 , 100.00 , 0.00 , 0.00 WriteImage2D , 5 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , , RoadRunner__k1_Cypress1 , 6 , { 512 512 512} , { 64 1 1} , 142.21886 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.50 , 12.50 , 38.00 , 269159.88 , 98.27 , 5.75 , 5.75 , 50.83 , 524302.38 , 0.00 , 100.00 , 0.00 , 0.00 WriteImage2D , 7 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , , RoadRunner__k1_Cypress1 , 8 , { 512 512 512} , { 64 1 1} , 136.65246 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.66 , 12.50 , 38.00 , 265210.19 , 99.15 , 5.79 , 5.79 , 50.81 , 524290.75 , 0.00 , 100.00 , 0.00 , 0.00 WriteImage2D , 9 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,

Lev · ‎12-16-2010

How many thread blocks and work groups do you spawn?

eci · ‎12-16-2010

Like you can see from the profiler output there is a total of 512*512*512 = 134 million work-items. Every work-group consists of 64 work-items, so there is a total of around 2 million work-groups.

After trying serveral configurations this one gives me the best performance so far.

Lev · ‎12-16-2010

And how many kernell launches?

eci · ‎12-16-2010

In total I launch the kernel 496 times. Because of the memory issues of the profiler I am only able to profile the first few.

Lev · ‎12-16-2010

I think the difference between nv and ani may come from block management and kernell launch overhead, also try to use bigger work group size, 128.

eci · ‎12-16-2010

I tried different work group sizes it this one performed the best.

On NVidia there are also 496 kernel launches. On NVidia every Kernel computes a whole line of the volume, so the number of work-items is significantly smaller. But if I try the same on the ATI it performs even worse than the implementation I posted here.

Lev · ‎12-16-2010

So do you launch 1 block per one work item? 512 blocks instead of one on nv? Here is comes the difference.

/*You are lucky, you can speed up your execution on ATI by 128 times.

This was worng, I am sorry

Archives Discussions

Performance Comparison ATI-NVidia