

eci
Journeyman III

Performance Comparison ATI-NVidia

First of all thanks for your replies so far!

I forgot to mention in my initial post that I already experimented with vectorization. The vectorized version performs roughly the same, maybe a little worse. But I'm with debdatt.basu on this topic: the compiler should really be able to do this!

Another thing I discovered lately while testing things out is the variation in runtimes on the ATI card. I get runtimes between 22.5s and 31s (with a mean somewhere just below 30s). When I run the problem on the NVidia card, runtimes vary much less (somewhere around 17s).

What I want to add here is that I am totally aware that the code I'm using isn't optimized to the maximum. But that isn't really what I'm looking for. I want to compare OpenCL performance on different platforms.

@n0thing: I'm not sure I really get the code you posted. The image2d_t projection is a CL_R, CL_FLOAT image. In the last line you access the projection image at position (u_n, v_n) and use the resulting vector to update four positions in the volume. But with a CL_R image, the y, z and w components of that vector are just the constants (0, 0, 1), not projection data, so I'm pretty sure this isn't working. I'm aware that I could load four of my projection images into a CL_RGBA image to speed up execution. But I am working with another framework, so it is easier for me this way, and the NVidia implementation works the same way.
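For illustration, the RGBA packing mentioned above could be done host-side with a small helper along these lines (a hedged sketch; `pack_rgba` and its argument names are hypothetical, not part of any framework discussed here). Each texel of the packed image then carries one sample from each of four projections:

```c
#include <stddef.h>

/* Hypothetical host-side helper: interleave four single-channel (CL_R)
 * projection slices into one RGBA-ordered buffer, so that a single
 * read_imagef() on a CL_RGBA image returns one sample from each of
 * the four projections at once. */
void pack_rgba(const float *p0, const float *p1, const float *p2,
               const float *p3, float *rgba, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; ++i) {
        rgba[4 * i + 0] = p0[i];  /* -> .x */
        rgba[4 * i + 1] = p1[i];  /* -> .y */
        rgba[4 * i + 2] = p2[i];  /* -> .z */
        rgba[4 * i + 3] = p3[i];  /* -> .w */
    }
}
```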

eduardoschardong
Journeyman III


Hello eci,

 

The code is well packed, as Micah pointed out, so vectorization won't help as it is; reducing the number of instructions first could make vectorization more profitable. One of the operations generating a lot of instructions is the division. I'm not sure how different native_recip is, but it should give you correct results. Finally, writing x + y * z + w * v is better than y * z + w * v + x, because the compiler won't reorder operations to emit MULADDs; in your case this will only help in the vectorized version, and only by a small bit. You can try the attached code first, then try to vectorize it too; it may help. Profiling would also help a lot.

 

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunner(
    __global float* volume,
    __read_only image2d_t projection,
    int L, int S_x, int S_y,
    float R_L, float O_L,
    float m0, float m1, float m2,  float m3, float m4,  float m5,
    float m6, float m7, float m8,  float m9, float m10, float m11)
{
    size_t id1 = get_global_id(0);
    size_t id2 = get_global_id(1);
    size_t id3 = get_global_id(2);

    float z = O_L + (float)id1 * R_L;
    float y = O_L + (float)id2 * R_L;
    float x = O_L + (float)id3 * R_L;

    float w_n   = m11 + m2 * x + m5 * y + m8 * z;
    float w_n_r = native_recip(w_n);
    float u_n   = ( m9 + m0 * x + m3 * y + m6 * z) * w_n_r;
    float v_n   = (m10 + m1 * x + m4 * y + m7 * z) * w_n_r;

    volume[id1 * L * L + id2 * L + id3] +=
        w_n_r * w_n_r * native_recip(read_imagef(projection, sampler,
                                     (float2)(u_n + 0.5, v_n + 0.5)).x);
}

debdatta_basu
Journeyman III


@cjang... It's definitely not trivial... But it is well studied, and almost every modern compiler (gcc, intel... you name it) can vectorize that piece of code...

But as it turns out, vectorizing didn't solve it. I'm waiting to see how eduardo's kernel performs...

zeland
Journeyman III


Maybe a good way to achieve better performance is to compute multiple outputs per thread,

as described in the Volkov paper:

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
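To sketch the idea from the paper in plain C (a serial stand-in for what each work-item would do; `scale_four_per_thread` and the 4-wide unroll are illustrative choices, not code from the thread): each "thread" produces four independent results, amortizing the per-thread index math and giving the scheduler more independent instructions to pack.

```c
/* Volkov-style multi-output sketch: one "work-item" per loop iteration
 * computes four independent outputs instead of one. */
void scale_four_per_thread(const float *in, float *out, int n, float scale)
{
    for (int tid = 0; tid < n / 4; ++tid) {
        int base = tid * 4;
        out[base + 0] = in[base + 0] * scale;  /* four independent     */
        out[base + 1] = in[base + 1] * scale;  /* results per item,    */
        out[base + 2] = in[base + 2] * scale;  /* sharing the index    */
        out[base + 3] = in[base + 3] * scale;  /* arithmetic above     */
    }
}
```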

MicahVillmow
Staff


eduardo,
Yeah, I missed the division. That is where the large increase in instructions is coming from.
Here is the CL/IL/ISA that gets generated for a simple division shader. As you can see, the division makes up the majority of the instructions.

CL: kernel void div_test(global float* a) { a[0] = a[1] / a[2]; } IL: mdef(209)_out(1)_in(2) mov r0, in0 mov r1, in1 dcl_literal l1, 0x7f800000, 0x7f800000, 0x807fffff, 0x807fffff dcl_literal l2, 0x7f800000, 0x7f800000, 0, 0 dcl_literal l3, 0x80000000, 0x80000000, 0x80000000, 0x80000000 dcl_literal l4, 0x3f800000, 0x3f800000, 0, 0 dcl_literal l5, 0, 0, 0, 0 dcl_literal l6, 0x7fffffff, 0x80000000, 0x7fffffff, 0x80000000 dcl_literal l7, 0x00800000, 0x00800000, 0x00800000, 0x00800000 dcl_literal l8, 0x00000017, 0x00000017, 0x00000017, 0x00000017 dcl_literal l9, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff mov r2.x___, r0.x mov r2._y__, r1.x and r3.xyzw, r2.xyxy, l1 ieq r4.xyzw, r3.xyxy, l2 and r5.xy__, r2.xy, l3 ior r3.__zw, r3.zwzw, l4.xyxy cmov_logical r3.__zw, r4.zwzw, r5.xyxy, r3.zwzw cmov_logical r3.__zw, r4.xyxy, r2.xyxy, r3.zwzw ior r5.xy__, r4.xz, r4.yw ior r5.x___, r5.x, r5.y inegate r5.__z_, r3.yyyy iadd r3.x___, r3.x, r5.z cmov_logical r3.x___, r5.xxxx, l5, r3.xxxx rcp_zeroop(infinity) r2._y__, r3.ww mul_ieee r2.x___, r3.z, r2.y and r2.__zw, r2.xxxx, l6.xyzw ishr r6.x___, r2.z, l8 ishr r6._y__, r3.xxxx, l8 iadd r2.xy__, r2.xzxz, r3.xxxx iadd r6.x___, r6.x, r6.y ige r4.__z_, l5, r6.x ior r4._y__, r2.wwww, l1 ige r4.x, r6.x, l9 cmov_logical r4.x, r5.x, l5, r4.x cmov_logical r2.x, r4.z, r2.w, r2.x cmov_logical r0.x, r4.x, r4.y, r2.x mov out0, r0 mend il_cs_2_0 dcl_cb cb0[9] ; Constant buffer that holds ABI data dcl_literal l0, 4, 1, 2, 3 dcl_literal l1, 0x00FFFFFF, -1, -2, -3 dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC dcl_literal l3, 24, 16, 8, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0, 4, 8, 12 dcl_literal l6, 32, 32, 32, 32 dcl_literal l7, 24, 31, 16, 31 mov r1013, cb0[8].x call 1024;$ endmain func 1024 ; __OpenCL_div_test_kernel mov r1019, l1.0 dcl_max_thread_per_group 256 dcl_raw_uav_id(0) dcl_arena_uav_id(8) mov r0.z, vThreadGrpIdFlat.x mov r1022.xyz0, vTidInGrp.xyz mov r1023.xyz0, 
vThreadGrpId.xyz imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0 iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.w, r0.z ishl r1023.w, r1023.w, l0.z mov r1018.x, l0.0 mov r1015.x, cb0[8].x dcl_literal l9, 0x00000004, 0x00000004, 0x00000004, 0x00000004; f32:i32 4 dcl_literal l8, 0x00000008, 0x00000008, 0x00000008, 0x00000008; f32:i32 8 dcl_cb cb1[1] ; Kernel arg setup: a mov r1, cb1[0] call 2 ; div_test ret endfunc ; __OpenCL_div_test_kernel ;ARGSTART:__OpenCL_div_test_kernel ;version:1:4:53 ;device:cayman ;uniqueid:1024 ;memory:hwprivate:0 ;memory:hwlocal:0 ;version:1:4:53 ;pointer:a:float:1:1:0:uav:0:4 ;function:1:2 ;uavid:8 ;ARGEND:__OpenCL_div_test_kernel func 2 ; div_test ; @__OpenCL_div_test_kernel ; BB#0: ; %entry mov r258, l8 iadd r258.x___, r1.xxxx, r258.xxxx mov r1010.x___, r258.xxxx uav_raw_load_id(0) r1011.x___, r1010.x mov r258, r1011 mov r259, l9 iadd r259.x___, r1.xxxx, r259.xxxx mov r1010.x___, r259.xxxx uav_raw_load_id(0) r1011.x___, r1010.x mov r259, r1011 ;__fdiv_f32 mcall(209) (r258),(r259, r258) mov r1011, r258 mov r1010.x___, r1.x uav_raw_store_id(0) mem0.x___, r1010.x, r1011.x ret_dyn endfunc ; div_test ;ARGSTART:div_test ;uniqueid:2 ;memory:hwlocal:0 ;intrinsic:1:209 ;ARGEND:div_test end ISA: ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP 
SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHR R0.x, KC0[0].x, (0x00000002, 2.802596929e-45f).x y: ADD_INT R0.y, KC0[0].x, (0x00000004, 5.605193857e-45f).y z: ADD_INT R0.z, KC0[0].x, (0x00000008, 1.121038771e-44f).z 1 x: LSHR R1.x, PV0.y, (0x00000002, 2.802596929e-45f).x w: LSHR R0.w, PV0.z, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(96) CNT(2) 2 VFETCH R0.__x_, R0.w, fc154 FETCH_TYPE(NO_INDEX_OFFSET) 3 VFETCH R1.x___, R1.x, fc154 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(40) CNT(48) 4 x: AND_INT R2.x, R1.x, (0x807FFFFF, -1.175494211e-38f).x y: AND_INT R0.y, R1.x, (0x80000000, -0.0f).y z: AND_INT R0.z, R1.x, (0x7F800000, 1.#INFf).z w: MOV R0.w, R0.z 5 x: AND_INT R3.x, PV4.w, (0x80000000, -0.0f).x y: AND_INT R1.y, PV4.w, (0x7F800000, 1.#INFf).y z: SETE_INT R1.z, PV4.z, 0.0f w: AND_INT R1.w, PV4.w, (0x807FFFFF, -1.175494211e-38f).z 6 x: OR_INT R2.x, PV5.w, (0x3F800000, 1.0f).x y: SETE_INT R2.y, PV5.y, (0x7F800000, 1.#INFf).y z: OR_INT R2.z, R2.x, (0x3F800000, 1.0f).x w: SETE_INT R1.w, PV5.y, 0.0f 7 x: SETE_INT R2.x, R0.z, (0x7F800000, 1.#INFf).x y: CNDE_INT R0.y, PV6.w, PV6.x, R3.x z: CNDE_INT R1.z, R1.z, PV6.z, R0.y VEC_120 w: OR_INT R1.w, R1.z, PV6.w VEC_120 8 x: CNDE_INT R1.x, PV7.x, PV7.z, R1.x y: SUB_INT R1.y, R0.z, R1.y z: OR_INT R1.z, PV7.x, R2.y VEC_021 w: CNDE_INT R0.w, R2.y, PV7.y, R0.w 9 x: RCP_e ____, PV8.w y: RCP_e R0.y, PV8.w z: RCP_e ____, PV8.w w: OR_INT R0.w, PV8.z, R1.w 10 x: MUL_e R1.x, R1.x, PV9.y z: CNDE_INT R0.z, PV9.w, R1.y, 0.0f 11 x: ASHR R1.x, PV10.z, (0x00000017, 3.222986468e-44f).x y: ADD_INT R0.y, PV10.z, PV10.x z: AND_INT R0.z, (0x80000000, -0.0f).y, PV10.x w: AND_INT R1.w, (0x7FFFFFFF, 1.#QNANf).z, PV10.x 12 x: OR_INT R2.x, PV11.z, (0x7F800000, 1.#INFf).x y: ASHR R1.y, PV11.w, (0x00000017, 3.222986468e-44f).y 13 z: 
ADD_INT R2.z, PV12.y, R1.x 14 y: SETGE_INT R2.y, 0.0f, PV13.z w: SETGE_INT R2.w, PV13.z, (0x000000FF, 3.573311084e-43f).x 15 y: CNDE_INT R0.y, PV14.y, R0.y, R0.z z: CNDE_INT R0.z, R0.w, PV14.w, 0.0f 16 x: CNDE_INT R1.x, PV15.z, PV15.y, R2.x 03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R0].x___, R1, ARRAY_SIZE(4) MARK VPM 04 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 800;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x1 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

n0thing
Journeyman III


@eci Sorry, I didn't realize that the texture format in your case was scalar, but changing the image format to RGBA shouldn't be that hard. If you compare the ISA generated from my kernel vs. eduardo's kernel, you can see that the manual vectorization results in fewer instructions and better ALU packing.

 

; -------- my kernel -------------------- 00 ALU: ADDR(32) CNT(44) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 w: MOV T0.w, 0.0f t: MULLO_INT ____, R1.x, KC0[1].x 1 y: ADD_INT ____, R0.x, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 z: ADD_INT ____, R0.y, PS1 w: ADD_INT ____, PV1.y, KC0[6].x t: MULLO_INT ____, R1.z, KC0[1].z 3 x: ADD_INT ____, R0.z, PS2 y: ADD_INT ____, PV2.z, KC0[6].y t: U_TO_F T0.x, PV2.w 4 w: ADD_INT ____, PV3.x, KC0[6].z t: U_TO_F T0.y, PV3.y 5 x: MULADD_e T1.x, KC1[3].x, T0.x, KC1[4].x y: MULADD_e T1.y, KC1[3].x, PS4, KC1[4].x w: MULADD_e T0.w, KC1[3].x, T0.w, KC1[4].x t: U_TO_F T0.z, PV4.w 6 x: DOT4 ____, KC1[8].x, T0.x y: DOT4 ____, KC1[8].y, T0.y z: DOT4 ____, KC1[8].z, PS5 w: DOT4 ____, (0x80000000, -0.0f).x, 0.0f 7 z: MULADD_e T0.z, KC1[3].x, T0.z, KC1[4].x t: F_TO_I ____, PV6.x 8 x: DOT4 ____, KC1[5].x, T1.x y: DOT4 ____, KC1[5].y, T1.y z: DOT4 ____, KC1[5].z, PV7.z w: DOT4 ____, KC1[5].w, T0.w t: LSHL T1.z, PS7, (0x00000004, 5.605193857e-45f).x 9 x: DOT4 ____, KC1[6].x, T1.x y: DOT4 ____, KC1[6].y, T1.y z: DOT4 ____, KC1[6].z, T0.z w: DOT4 ____, KC1[6].w, T0.w t: RCP_e T0.x, PV8.x 10 x: DOT4 ____, KC1[7].x, T1.x y: DOT4 ____, KC1[7].y, T1.y z: DOT4 ____, KC1[7].z, T0.z w: DOT4 ____, KC1[7].w, T0.w t: MULADD_e R0.x, PS9, PV9.x, 0.5 11 x: MUL_e R1.x, T0.x, T0.x y: ADD_INT ____, KC1[0].x, T1.z w: MULADD_e R0.w, T0.x, PV10.x, 0.5 12 x: LSHR R3.x, PV11.y, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(80) CNT(2) 13 VFETCH R2, R3.x, fc156 FORMAT(32_32_32_32_FLOAT) MEGA(16) FETCH_TYPE(NO_INDEX_OFFSET) 14 SAMPLE R0, R0.xw0x, t0, s0 UNNORM(XYZW) 02 ALU: ADDR(76) CNT(4) 15 x: MULADD_e R0.x, R0.x, R1.x, R2.x y: MULADD_e R0.y, R0.y, R1.x, R2.y z: MULADD_e R0.z, R0.z, R1.x, R2.z w: MULADD_e R0.w, R0.w, R1.x, R2.w 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R3], R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM ; -------- eduardo's kernel -------------------- 00 ALU: ADDR(32) CNT(37) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 t: MULLO_INT ____, R1.z, KC0[1].z 1 z: ADD_INT ____, 
R0.z, PS0 t: MULLO_INT ____, R1.y, KC0[1].y 2 x: ADD_INT ____, R0.y, PS1 y: ADD_INT T1.y, PV1.z, KC0[6].z t: MULLO_INT ____, R1.x, KC0[1].x 3 x: ADD_INT T1.x, PV2.x, KC0[6].y w: ADD_INT ____, R0.x, PS2 t: U_TO_F T0.w, PV2.y 4 x: ADD_INT T0.x, PV3.w, KC0[6].x t: U_TO_F T0.z, PV3.x 5 t: MULLO_INT ____, PV4.x, KC1[3].x 6 x: MULADD_e T1.x, KC1[6].x, T0.z, KC1[7].x y: MULADD_e T0.y, KC1[6].x, T0.w, KC1[7].x w: ADD_INT T1.w, T1.x, PS5 t: U_TO_F T0.w, T0.x 7 x: MULADD_e T0.x, KC1[10].x, PV6.y, KC1[19].x 8 z: MULADD_e T0.z, KC1[9].x, T0.y, KC1[18].x 9 y: MULADD_e T2.y, KC1[6].x, T0.w, KC1[7].x 10 w: MULADD_e T1.w, KC1[13].x, T1.x, T0.x t: MULLO_INT ____, T1.w, KC1[3].x 11 x: MULADD_e T0.x, KC1[8].x, T0.y, KC1[17].x z: ADD_INT ____, T1.y, PS10 12 x: MULADD_e ____, KC1[12].x, T1.x, T0.z y: MULADD_e ____, KC1[16].x, T2.y, T1.w z: LSHL T0.z, PV11.z, (0x00000002, 2.802596929e-45f).x 13 z: MULADD_e ____, KC1[15].x, T2.y, PV12.x w: MULADD_e ____, KC1[11].x, T1.x, T0.x t: RCP_e T1.z, PV12.y 14 x: MUL_e R1.x, PS13, PS13 y: MULADD_e R0.y, PV13.z, PS13, 0.5 z: MULADD_e ____, KC1[14].x, T2.y, PV13.w w: ADD_INT ____, KC1[0].x, T0.z 15 x: MULADD_e R0.x, T1.z, PV14.z, 0.5 t: LSHR R2.x, PV14.w, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(80) CNT(2) 16 SAMPLE R0.___x, R0.xy0x, t0, s0 UNNORM(XYZW) 17 VFETCH R0.x___, R2.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(69) CNT(2) 18 t: RCP_e ____, R0.w 19 x: MULADD_e R0.x, PS18, R1.x, R0.x 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

eci
Journeyman III


You see my current kernel attached. The reason for splitting up the statements is that I can reuse uTmp, vTmp and wTmp and thus reduce instructions. The reason for the two volumes is a limitation of the current Stream SDK, which in my case only allows allocating 256MB of contiguous memory. I wasn't aware that I would find such well-founded help here and wanted to keep things simple.

I tried the reordering of instructions eduardo suggested, but sadly this also didn't do the trick.

@n0thing: I cannot use the vectorization the way you suggested. The problem is that your kernel takes the volume position of the first voxel and computes its projection into the projection image (which is a float greyscale image, by the way). What really has to be done is to compute the appropriate position in the projection image for all four voxel positions.

I'm still researching whether I missed something big in the way I access memory! Is the access layout I use flawed in any way?

I also dug a little into the ISA code and am not quite sure I get it yet. If I start from the kernel in my first post, the kernel has 37 operations. The compiler turns them into roughly 150 operations, then packs them into bundles of 5 instructions which can run in parallel. That gets me back to around 37 operations. So if I understand correctly, I reach at best 1/5 of the card's peak performance!

constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunner(
    __global float* volume1,
    __global float* volume2,
    __read_only image2d_t projection,
    int L, int S_x, int S_y,
    float R_L, float O_L,
    float m0, float m1, float m2,  float m3, float m4,  float m5,
    float m6, float m7, float m8,  float m9, float m10, float m11)
{
    size_t id1 = get_global_id(0);
    size_t id2 = get_global_id(1);
    size_t id3 = get_global_id(2);

    float z = O_L + (float)id3 * R_L;
    float y = O_L + (float)(id2 * 2) * R_L;
    float x = O_L + (float)id1 * R_L;

    int tmp = id3 * L * L / 2 + id2 * L + id1;

    float wTmp = m2 * x + m8 * z + m11;
    float uTmp = (m0 * x + m6 * z + m9 );
    float vTmp = (m1 * x + m7 * z + m10);

    float w_n = wTmp + m5 * y;
    float w   = native_recip(w_n);
    float u_n = (uTmp + (m3 * y)) * w;
    float v_n = (vTmp + (m4 * y)) * w;
    volume1[tmp] += read_imagef(projection, sampler,
                                (float2)(u_n + 0.5f, v_n + 0.5f)).x * w * w;

    y += R_L;
    w_n = wTmp + m5 * y;
    w   = native_recip(w_n);
    u_n = (uTmp + (m3 * y)) * w;
    v_n = (vTmp + (m4 * y)) * w;
    volume2[tmp] += read_imagef(projection, sampler,
                                (float2)(u_n + 0.5f, v_n + 0.5f)).x * w * w;

    return;
}

MicahVillmow
Staff


eci,
One CL instruction does not equal one ISA instruction. Your new kernel with our upcoming SDK release has 78 ALU instructions packed into 25 ALU bundles, for a packing ratio a little greater than 3. So there is room for improvement in packing, but you are definitely using more than 1/5th of the chip; that would imply a packing ratio of 1.
The ISA is as follows.

ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(66) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31) 0 x: MULLO_INT ____, R1.z, KC0[1].z y: MULLO_INT ____, R1.z, KC0[1].z z: MULLO_INT ____, R1.z, KC0[1].z w: MULLO_INT R0.w, R1.z, KC0[1].z 1 x: ADD_INT R2.x, R0.z, PV0.w 2 x: MULLO_INT ____, R1.x, KC0[1].x y: MULLO_INT R2.y, R1.x, KC0[1].x z: MULLO_INT ____, R1.x, KC0[1].x w: MULLO_INT ____, R1.x, KC0[1].x 3 y: ADD_INT R2.y, R2.x, KC0[6].z z: ADD_INT R0.z, R0.x, PV2.y VEC_120 4 x: MULLO_INT R0.x, R1.y, KC0[1].y y: MULLO_INT ____, R1.y, KC0[1].y z: MULLO_INT ____, R1.y, KC0[1].y w: MULLO_INT ____, R1.y, KC0[1].y 5 y: ADD_INT R1.y, R0.y, PV4.x z: U_TO_F R1.z, R2.y VEC_120 w: ADD_INT R1.w, R0.z, KC0[6].x 6 x: U_TO_F R1.x, PV5.w z: ADD_INT R2.z, PV5.y, KC0[6].y 7 x: LSHL R2.x, PV6.z, 1 y: MULADD_e R1.y, KC1[7].x, R1.z, KC1[8].x w: MULADD_e R2.w, KC1[7].x, PV6.x, KC1[8].x 8 x: MUL_e R2.x, KC1[16].x, PV7.y y: MUL_e R0.y, KC1[17].x, PV7.y z: U_TO_F R1.z, PV7.x 9 x: MULLO_INT R1.x, KC1[4].x, KC1[4].x y: MULLO_INT ____, KC1[4].x, KC1[4].x z: MULLO_INT ____, KC1[4].x, KC1[4].x w: MULLO_INT ____, KC1[4].x, KC1[4].x 10 x: 
MULLO_INT ____, PV9.x, R2.y y: MULLO_INT ____, PV9.x, R2.y z: MULLO_INT ____, PV9.x, R2.y w: MULLO_INT R0.w, PV9.x, R2.y 11 x: MULADD_e R1.x, KC1[11].x, R2.w, R0.y y: MUL_e R1.y, KC1[15].x, R1.y z: LSHR R0.z, PV10.w, 1 12 x: MULLO_INT ____, R2.z, KC1[4].x y: MULLO_INT ____, R2.z, KC1[4].x z: MULLO_INT R2.z, R2.z, KC1[4].x w: MULLO_INT ____, R2.z, KC1[4].x 13 x: ADD R1.x, KC1[20].x, R1.x y: ADD_INT R0.y, R1.w, PV12.z w: MULADD_e R1.w, KC1[10].x, R2.w, R2.x 14 y: ADD_INT R2.y, PV13.y, R0.z w: MULADD_e R0.w, KC1[7].x, R1.z, KC1[8].x VEC_021 15 x: LSHL R0.x, PV14.y, (0x00000002, 2.802596929e-45f).x z: ADD R0.z, KC1[19].x, R1.w w: MULADD_e R1.w, KC1[9].x, R2.w, R1.y VEC_021 16 x: ADD R2.x, KC1[18].x, PV15.w z: MULADD_e R1.z, KC1[14].x, R0.w, R1.x 17 z: MULADD_e R2.z, KC1[13].x, R0.w, R0.z w: MULADD_e R1.w, KC1[12].x, R0.w, PV16.x 18 x: RCP_e ____, R1.z y: RCP_e R0.y, R1.z z: RCP_e ____, R1.z w: ADD_INT R2.w, KC1[0].x, R0.x 19 x: MULADD_e R3.x, PV18.y, R1.w, 0.5 y: MULADD_e R3.y, R2.z, PV18.y, 0.5 z: ADD R2.z, KC1[7].x, R0.w VEC_021 w: ADD_INT R0.w, KC1[1].x, R0.x 20 x: LSHR R0.x, R2.w, (0x00000002, 2.802596929e-45f).x z: MULADD_e R0.z, KC1[13].x, PV19.z, R0.z w: MULADD_e R2.w, KC1[14].x, PV19.z, R1.x 01 TEX: ADDR(112) CNT(2) 21 SAMPLE R2._x__, R3.xy0x, t0, s0 UNNORM(XYZW) 22 VFETCH R1.___x, R0.x, fc155 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(98) CNT(10) KCACHE0(CB1:0-15) 23 x: LSHR R1.x, R0.w, (0x00000002, 2.802596929e-45f).x z: MULADD_e R1.z, KC0[12].x, R2.z, R2.x w: MUL_e R0.w, R0.y, R2.y 24 x: RCP_e R2.x, R2.w y: RCP_e ____, R2.w z: RCP_e ____, R2.w 25 x: MULADD_e R3.x, R0.y, R0.w, R1.w y: MULADD_e R0.y, R0.z, PV24.x, 0.5 z: MULADD_e R0.z, PV24.x, R1.z, 0.5 03 MEM_RAT_CACHELESS_STORE_DWORD__NI_ACK: RAT(1)[R0].x___, R3, ARRAY_SIZE(4) MARK VPM 04 TEX: ADDR(116) CNT(1) 26 SAMPLE R1.__x_, R0.zy0z, t0, s0 UNNORM(XYZW) 05 WAIT_ACK: Outstanding_acks <= 0 06 TEX: ADDR(118) CNT(1) 27 VFETCH R1._x__, R1.x, fc155 FETCH_TYPE(NO_INDEX_OFFSET) 07 ALU: ADDR(108) CNT(2) 28 y: MUL_e 
R2.y, R2.x, R1.z 29 x: MULADD_e R0.x, R2.x, PV28.y, R1.y 08 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM 09 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 960;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x2 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

eci
Journeyman III


OK, then I got it basically right. I only looked at the code that was posted further up, which is why I got the packing ratio a little wrong.

That leaves the question of whether I'm making a major mistake in the way I access memory.

eduardoschardong
Journeyman III


Originally posted by: eci That leaves the question of whether I'm making a major mistake in the way I access memory.


Maybe, profiling output would help.

 

Without this info the best I can do is guess, and a bad guess at that, since I have never used texture sampling and don't care about imaging at all. Well... get two bad guesses for the price of one:

 

1) IIRC, textures on GPUs aren't organized in row-major order or any other "linear" way; they use a Z-order layout. If your kernel ids are linear, the sampler will read from very sparse memory locations, limiting performance.

2) I'm not sure how wavefronts are formed for 3D kernels; if they aren't all from the same row, there may be stores to sparse addresses, or a pattern causing the GPU to serialize accesses, again limiting performance a lot.
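The Z-order (Morton) layout from point 1 interleaves the bits of the x and y texel coordinates, so texels that are close in 2D stay close in memory. A small sketch of the index computation (illustrative only; the exact tiling a given GPU uses is hardware-specific and not necessarily pure Morton order):

```c
#include <stdint.h>

/* Morton (Z-order) index for a 2D texel: interleave the bits of x and y.
 * Nearby (x, y) pairs map to nearby indices, which is why linear kernel
 * ids walking a row can still scatter across a Z-ordered texture. */
uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t z = 0;
    for (int i = 0; i < 16; ++i) {
        z |= (uint32_t)((x >> i) & 1u) << (2 * i);      /* even bits: x */
        z |= (uint32_t)((y >> i) & 1u) << (2 * i + 1);  /* odd bits:  y */
    }
    return z;
}
```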

 

 

TIP: In your last kernel, avoid breaking clauses. The compiler apparently failed to realize that the two loads/stores are independent. Do all loads/sampling in a sequence, calculate everything for both stores, and then store; this will reduce the number of clauses and increase packing.
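The restructuring described in the tip can be sketched in plain C (hypothetical names; the samples and weights stand in for the kernel's read_imagef results and w*w factors): both loads are issued up front, all arithmetic runs together, and both stores come last, instead of the load/compute/store, load/compute/store sequence of the posted kernel.

```c
/* Clause-friendly ordering: group the two independent loads, then all
 * ALU work, then the two stores, so the memory operations can share
 * fetch/store clauses instead of alternating with ALU clauses. */
void update_two(float *vol1, float *vol2, int idx,
                float sample1, float sample2, float w1, float w2)
{
    float a = vol1[idx];                 /* load 1                  */
    float b = vol2[idx];                 /* load 2 (independent)    */
    float r1 = a + sample1 * w1 * w1;    /* all arithmetic together */
    float r2 = b + sample2 * w2 * w2;
    vol1[idx] = r1;                      /* store 1                 */
    vol2[idx] = r2;                      /* store 2                 */
}
```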

 
