Archives Discussions

riza_guntur · ‎11-10-2009

The subject speaks for itself

Haven't seen one till now

gaurav_garg · ‎11-10-2009

I think the reason lies in BrookGPU's design decision to be a high-level language similar to C that does not expose architecture-specific instructions.

I think the responsibility for converting a mul-add operation into a mad lies with brcc, and this is currently a limitation on the brcc side. There is, however, an easy workaround to generate mad for mul-add operations.

emuller · ‎11-18-2009

OpenCL has an explicit mad instruction. I think it's not so ugly to add a mad primitive to brook ... when performance matters, an explicit mad avoids the need to go down to the IL and check that the compiler did it right.

@gaurav

Would a patch adding the mad primitive be considered for addition to brook?

gaurav_garg · ‎11-18-2009

The easier way of doing it would be to generate the mul instruction instead of mul_ieee in the HLSL compiler. Changing the HLSL codegen allows a mad instruction to be generated at the ISA level.
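
A rough illustration of how mul and mul_ieee can differ. This sketch assumes (it is not stated in this thread) that the non-IEEE multiply follows the older shader rule that 0 times anything is 0, while mul_ieee follows IEEE 754, where 0 * inf is NaN. The function names are hypothetical, and Python floats stand in for GPU registers.

```python
import math

def mul_ieee(a, b):
    # IEEE 754 multiply: 0 * inf (or 0 * nan) yields nan.
    return a * b

def mul_legacy(a, b):
    # ASSUMPTION: non-IEEE multiply returns 0 whenever either operand is 0.
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

def mad(a, b, c, mul):
    # Multiply-add built from the chosen multiply rule.
    return mul(a, b) + c

if __name__ == "__main__":
    inf = float("inf")
    print(mad(0.0, inf, 1.0, mul_ieee))    # nan under IEEE rules
    print(mad(0.0, inf, 1.0, mul_legacy))  # 1.0 under the assumed legacy rule
```

For ordinary finite inputs the two rules agree, so under this assumption the difference only shows up in edge cases such as 0 * inf.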

riza_guntur · ‎11-18-2009

But actually the performance is not bad at all even though the mul and add are separate; maybe the driver does the job. The funny thing is that performance has increased a lot compared with 6 months ago; maybe the faster matrix multiplication program discussed on Beyond3D led to some changes in the ISA compiler.

After following some instructions about changing mul_ieee to mul, performance increased by 10% on a computationally intensive kernel, but did not increase at all on a bandwidth-intensive one.

@gaurav

Has any fix to brook+ 1.4.1 on sourceforge been applied to the installer yet? I've seen you post fixes to various files, but the upload date is still the same.

gaurav_garg · ‎11-18-2009

We have not updated the installers yet. The changes committed till now are in runtime, you can build the source at your end.

emuller · ‎11-19-2009

@ riza.guntur

Looking closely at the ISA from a cal "mad w,x,y,z" and a brook w=x*y+z, I have to agree with you that brook is doing a pretty good job, and there seems to be no need for a brook mad. Please see the attached code+output.

The ISA are basically the same, except brook has this stuff:

      9 t: I_TO_F      ____, PV8.y
     10 x: NOP         ____
         t: F_TO_I      ____, PS9

Which seems to me to be superfluous, is it not?

Also brook uses MULADD_e instead of MULADD for CAL. Anyone know what is the diff between these two?

import brook

kernel_code = """
Attribute[GroupSize (64)]
kernel void test_mad(float4 in1[], float4 in2[], float4 in3[], out float4 out_s[])
{
    out_s[instance().x] = in1[instance().x]*in2[instance().x]+in3[instance().x];
}
"""

kernels = brook.compiler.build_kernel(kernel_code,virtualization=True, with_il=True)

#print kernels.test.IL

# get brook kernel ISA
k = kernels.test_mad
brook_isa = k.getISA(0)

# compare to IL program
il_code = """
il_cs_2_0
dcl_num_thread_per_group 64
dcl_resource_id(0)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
load_resource(0) r0, vaTid.x
load_resource(1) r1, vaTid.x
load_resource(2) r2, vaTid.x
mad g[vaTid.x], r0, r1, r2
end
"""

from pygwa.amdcal import *
from time import sleep
#from numpy import *

CalInit()
info = CalDeviceGetInfo(0)
target = info['target']

# compile kernel.test.IL[0]
obj = CalCode(CAL_LANG_IL, target, kernels.test_mad.IL[0])
brook_isa2 = obj.Disassemble()

# these two should be the same
assert brook_isa2==brook_isa

print brook_isa

obj = CalCode(CAL_LANG_IL, target, il_code)
cal_isa = obj.Disassemble()
print cal_isa

######################### output (1st is brook, 2nd is CAL) #########################

In [45]: execfile('getISA_mad.py')
ShaderType = 3
TargetChip = c
;SC Dep components
NumClauseTemps = 4
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(5)
      0 x: LSHL R2.x, R0.z, (0x00000006, 8.407790786e-45f).x
        y: LSHL R2.y, R0.y, (0x00000006, 8.407790786e-45f).x
        z: MOV R4.z, 0.0f
        w: MOV R0.w, 0.0f
01 TEX: ADDR(64) CNT(1)
      1 VFETCH R3.xy__, R0.w, fc147 MEGA(8) FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(37) CNT(16)
      2 z: ADD_INT ____, R2.x, R2.y
        t: MULLO_UINT ____, R1.z, R3.x
      3 y: ADD_INT T0.y, R0.x, PV2.z
        t: MULLO_UINT T0.x, PS2, R3.y
      4 t: MULLO_UINT ____, R1.y, R3.x
      5 w: ADD_INT ____, T0.x, PS4
      6 z: ADD_INT ____, R1.x, PV5.w
      7 x: LSHL ____, PV6.z, (0x00000006, 8.407790786e-45f).x
      8 y: ADD_INT ____, T0.y, PV7.x
      9 t: I_TO_F ____, PV8.y
     10 x: NOP ____
        t: F_TO_I ____, PS9
     11 x: LSHL R0.x, PS10, (0x00000002, 2.802596929e-45f).x
        t: I_TO_F R4.x, PS10
03 TEX: ADDR(66) CNT(3)
     12 SAMPLE R1, R4.xz0x, t0, s0 UNNORM(XYZW)
     13 SAMPLE R3, R4.xz0x, t1, s0 UNNORM(XYZW)
     14 SAMPLE R4, R4.xz0x, t2, s0 UNNORM(XYZW)
04 ALU: ADDR(53) CNT(4)
     15 x: MULADD_e R1.x, R1.x, R3.x, R4.x
        y: MULADD_e R1.y, R1.y, R3.y, R4.y
        z: MULADD_e R1.z, R1.z, R3.z, R4.z
        w: MULADD_e R1.w, R1.w, R3.w, R4.w
05 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R0.x], R1, ELEM_SIZE(3) VPM
END_OF_PROGRAM
; ----------------- CS Data ------------------------
; Input Semantic Mappings
; No input mappings
GprPoolSize = 0
CodeLen = 576;Bytes
PGM_END_CF = 0; words(64 bit)
PGM_END_ALU = 0; words(64 bit)
PGM_END_FETCH = 0; words(64 bit)
MaxScratchRegsNeeded = 0
;AluPacking = 0.0
;AluClauses = 0
;PowerThrottleRate = 0.0
; texResourceUsage[0] = 0x00000000
; texResourceUsage[1] = 0x00000000
; texResourceUsage[2] = 0x00000000
; texResourceUsage[3] = 0x00000000
; fetch4ResourceUsage[0] = 0x00000000
; fetch4ResourceUsage[1] = 0x00000000
; fetch4ResourceUsage[2] = 0x00000000
; fetch4ResourceUsage[3] = 0x00000000
; texSamplerUsage = 0x00000000
; constBufUsage = 0x00000000
ResourcesAffectAlphaOutput[0] = 0x00000000
ResourcesAffectAlphaOutput[1] = 0x00000000
ResourcesAffectAlphaOutput[2] = 0x00000000
ResourcesAffectAlphaOutput[3] = 0x00000000
;SQ_PGM_RESOURCES = 0x30000005
SQ_PGM_RESOURCES:NUM_GPRS = 5
SQ_PGM_RESOURCES:STACK_SIZE = 0
SQ_PRM_RESOURCES:PRIME_CACHE_ENABLE = 1
;SQ_PGM_RESOURCES_2 = 0x000000C0
; NumThreadPerGroupFlattened = 64
; NumThreadPerGroup_x = 64
; NumThreadPerGroup_y = 1
; NumThreadPerGroup_z = 1
; SetBufferForNumGroup = true

ShaderType = 3
TargetChip = c
;SC Dep components
NumClauseTemps = 4
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(4)
      0 x: MOV R2.x, 0.0f
        y: LSHL R2.y, R0.z, (0x00000006, 8.407790786e-45f).x
        z: LSHL R2.z, R0.y, (0x00000006, 8.407790786e-45f).x
01 TEX: ADDR(64) CNT(1)
      1 VFETCH R3.xy__, R2.x, fc147 MEGA(8) FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(36) CNT(10)
      2 w: ADD_INT ____, R2.y, R2.z
        t: MULLO_UINT ____, R1.z, R3.x
      3 z: ADD_INT T1.z, R0.x, PV2.w
        t: MULLO_UINT T0.z, PS2, R3.y
      4 t: MULLO_UINT ____, R1.y, R3.x
      5 x: ADD_INT ____, T0.z, PS4
      6 w: ADD_INT ____, R1.x, PV5.x
      7 z: LSHL ____, PV6.w, (0x00000006, 8.407790786e-45f).x
      8 y: ADD_INT R2.y, T1.z, PV7.z
03 TEX: ADDR(66) CNT(3)
      9 LD R1, R2.yy0y, t0, s0 UNNORM(XYZW)
     10 LD R3, R2.yy0y, t1, s0 UNNORM(XYZW)
     11 LD R0, R2.yy0y, t2, s0 UNNORM(XYZW)
04 ALU: ADDR(46) CNT(6)
     12 x: MULADD R0.x, R1.x, R3.x, R0.x
        y: MULADD R0.y, R1.y, R3.y, R0.y
        z: MULADD R0.z, R1.z, R3.z, R0.z
        w: MULADD R0.w, R1.w, R3.w, R0.w
     13 x: LSHL R1.x, R2.y, (0x00000002, 2.802596929e-45f).x
05 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x], R0, ELEM_SIZE(3) VPM
END_OF_PROGRAM
; ----------------- CS Data ------------------------
; Input Semantic Mappings
; No input mappings
GprPoolSize = 0
CodeLen = 576;Bytes
PGM_END_CF = 0; words(64 bit)
PGM_END_ALU = 0; words(64 bit)
PGM_END_FETCH = 0; words(64 bit)
MaxScratchRegsNeeded = 0
;AluPacking = 0.0
;AluClauses = 0
;PowerThrottleRate = 0.0
; texResourceUsage[0] = 0x00000000
; texResourceUsage[1] = 0x00000000
; texResourceUsage[2] = 0x00000000
; texResourceUsage[3] = 0x00000000
; fetch4ResourceUsage[0] = 0x00000000
; fetch4ResourceUsage[1] = 0x00000000
; fetch4ResourceUsage[2] = 0x00000000
; fetch4ResourceUsage[3] = 0x00000000
; texSamplerUsage = 0x00000000
; constBufUsage = 0x00000000
ResourcesAffectAlphaOutput[0] = 0x00000000
ResourcesAffectAlphaOutput[1] = 0x00000000
ResourcesAffectAlphaOutput[2] = 0x00000000
ResourcesAffectAlphaOutput[3] = 0x00000000
;SQ_PGM_RESOURCES = 0x30000004
SQ_PGM_RESOURCES:NUM_GPRS = 4
SQ_PGM_RESOURCES:STACK_SIZE = 0
SQ_PRM_RESOURCES:PRIME_CACHE_ENABLE = 1
;SQ_PGM_RESOURCES_2 = 0x000000C0
; NumThreadPerGroupFlattened = 64
; NumThreadPerGroup_x = 64
; NumThreadPerGroup_y = 1
; NumThreadPerGroup_z = 1
; SetBufferForNumGroup = true

In [46]:

emuller · ‎11-19-2009

BTW, the ISA output above is for brook CAL "technique 0". The ISA for technique 1 has 4 times more instruction slots (attached).

When is tech 0 used and when is tech 1? Obviously I want to make sure it's always using tech 0 in this case.

ShaderType = 3
TargetChip = c
;SC Dep components
NumClauseTemps = 4
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(6) KCACHE0(CB0:0-15)
      0 x: LSHL R2.x, R0.z, (0x00000006, 8.407790786e-45f).x
        y: LSHL R2.y, R0.y, (0x00000006, 8.407790786e-45f).x
        z: MOV R2.z, 0.0f
        w: SUB_INT R0.w, 0.0f, KC0[6].x
        t: SUB_INT R3.x, 0.0f, KC0[1].x
01 TEX: ADDR(192) CNT(1)
      1 VFETCH R4.xy__, R2.z, fc147 MEGA(8) FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(38) CNT(110) KCACHE0(CB0:0-15)
      2 x: ADD_INT ____, R2.x, R2.y
        y: SUB_INT T0.y, 0.0f, KC0[3].x
        z: MAX_INT T1.z, KC0[6].x, R0.w
        t: MULLO_UINT ____, R1.z, R4.x
      3 x: SUB_INT ____, 0.0f, KC0[5].x
        y: MAX_INT R2.y, KC0[1].x, R3.x
        w: ADD_INT T1.w, R0.x, PV2.x
        t: MULLO_UINT T0.w, PS2, R4.y
      4 y: MAX_INT R3.y, KC0[3].x, T0.y
        z: MAX_INT R2.z, KC0[5].x, PV3.x
        t: MULLO_UINT ____, R1.y, R4.x
      5 x: ADD_INT ____, T0.w, PS4
        t: RCP_UINT T0.x, T1.z
      6 y: ADD_INT ____, R1.x, PV5.x
        t: MULLO_UINT T0.y, T1.z, PS5
      7 z: SUB_INT ____, 0.0f, PS6
        w: LSHL ____, PV6.y, (0x00000006, 8.407790786e-45f).x
        t: MULHI_UINT T0.w, T1.z, T0.x
      8 x: CNDE_INT R123.x, PS7, PV7.z, T0.y VEC_021
        z: ADD_INT T3.z, T1.w, PV7.w
        t: RCP_UINT T1.w, R2.y
      9 x: SUB_INT ____, 0.0f, PV8.z
        y: XOR_INT ____, PV8.z, KC0[6].x
        t: MULHI_UINT ____, PV8.x, T0.x
     10 x: AND_INT T2.x, PV9.y, (0x80000000, -0.0f).x
        y: MAX_INT T0.y, T3.z, PV9.x
        z: ADD_INT ____, T0.x, PS9
        w: SUB_INT ____, T0.x, PS9
        t: RCP_UINT T3.x, R3.y
     11 x: CNDE_INT R123.x, T0.w, PV10.z, PV10.w
        t: RCP_UINT R4.x, R2.z
     12 t: MULHI_UINT T1.y, PV11.x, T0.y
     13 x: ADD_INT T1.x, -1, PS12
        y: ADD_INT T2.y, PS12, 1
        t: MULLO_UINT ____, PS12, T1.z
     14 x: SETGE_UINT T0.x, T0.y, PS13
        z: SUB_INT ____, T0.y, PS13
        t: MULLO_UINT T0.z, R2.y, T1.w
     15 y: SUB_INT T0.y, 0.0f, PS14
        w: SETGE_UINT ____, PV14.z, T1.z
        t: MULLO_UINT T0.w, R3.y, T3.x
     16 y: SUB_INT T3.y, 0.0f, PS15
        z: AND_INT ____, T0.x, PV15.w
        t: MULLO_UINT R1.y, R2.z, R4.x
     17 x: CNDE_INT R123.x, PV16.z, T1.y, T2.y VEC_201
        z: SUB_INT T2.z, 0.0f, PS16
        t: MULHI_UINT R1.x, R2.y, T1.w
     18 x: CNDE_INT T0.x, PS17, T0.y, T0.z
        w: CNDE_INT R123.w, T0.x, T1.x, PV17.x VEC_021
        t: MULHI_UINT T0.z, R3.y, T3.x
     19 x: CNDE_INT T1.x, PS18, T3.y, T0.w
        z: CNDE_INT T1.z, T1.z, -1, PV18.w
        t: MULHI_UINT T0.w, R2.z, R4.x
     20 x: SUB_INT ____, 0.0f, PV19.z
        y: CNDE_INT T3.y, PS19, T2.z, R1.y
        t: MULHI_UINT ____, T0.x, T1.w
     21 x: SUB_INT ____, T1.w, PS20
        y: ADD_INT ____, T1.w, PS20
        z: CNDE_INT R123.z, T2.x, T1.z, PV20.x
        t: MULHI_UINT ____, T1.x, T3.x
     22 y: ADD_INT ____, T3.x, PS21
        z: SUB_INT ____, T3.x, PS21
        w: CNDE_INT T1.w, R1.x, PV21.y, PV21.x VEC_120
        t: MULLO_INT ____, PV21.z, KC0[6].x
     23 x: CNDE_INT T3.x, T0.z, PV22.y, PV22.z
        y: SUB_INT R5.y, T3.z, PS22 VEC_120
        w: SUB_INT ____, PS22, T3.z
        t: MULHI_UINT ____, T3.y, R4.x
     24 x: MAX_INT T2.x, PV23.y, PV23.w
        y: XOR_INT ____, KC0[1].x, PV23.y
        z: ADD_INT ____, R4.x, PS23
        w: SUB_INT ____, R4.x, PS23
        t: XOR_INT ____, KC0[3].x, PV23.y
     25 x: CNDE_INT T1.x, T0.w, PV24.z, PV24.w
        y: XOR_INT ____, KC0[5].x, R5.y
        z: AND_INT R3.z, PV24.y, (0x80000000, -0.0f).x
        w: AND_INT R2.w, PS24, (0x80000000, -0.0f).x
        t: MULHI_UINT T0.z, T1.w, PV24.x
     26 x: ADD_INT R4.x, -1, PS25
        z: ADD_INT T3.z, PS25, 1
        w: AND_INT R3.w, PV25.y, (0x80000000, -0.0f).x
        t: MULHI_UINT T1.w, T3.x, T2.x
     27 x: ADD_INT R1.x, -1, PS26
        w: ADD_INT T2.w, PS26, 1
        t: MULHI_UINT R1.y, T1.x, T2.x
     28 x: ADD_INT R3.x, -1, PS27
        y: ADD_INT R4.y, PS27, 1
        t: MULLO_UINT ____, T0.z, R2.y
     29 x: LSHL R6.x, R5.y, (0x00000002, 2.802596929e-45f).x
        y: SUB_INT ____, T2.x, PS28
        w: SETGE_UINT R0.w, T2.x, PS28
        t: MULLO_UINT ____, T1.w, R3.y
     30 x: SETGE_UINT ____, PV29.y, R2.y
        y: SUB_INT ____, T2.x, PS29
        z: SETGE_UINT R0.z, T2.x, PS29
        t: MULLO_UINT ____, R1.y, R2.z
     31 x: SETGE_UINT ____, PV30.y, R3.y
        y: SETGE_UINT R0.y, T2.x, PS30
        z: SUB_INT ____, T2.x, PS30
        w: AND_INT T0.w, R0.w, PV30.x
     32 y: AND_INT ____, R0.z, PV31.x
        w: SETGE_UINT ____, PV31.z, R2.z
     33 x: CNDE_INT R0.x, PV32.y, T1.w, T2.w
        z: AND_INT R1.z, R0.y, PV32.w
        w: CNDE_INT R1.w, T0.w, T0.z, T3.z
03 ALU: ADDR(148) CNT(31) KCACHE0(CB0:0-15)
     34 x: CNDE_INT R123.x, R1.z, R1.y, R4.y
        y: CNDE_INT R123.y, R0.w, R4.x, R1.w
        z: CNDE_INT R123.z, R0.z, R1.x, R0.x VEC_120
     35 x: CNDE_INT T2.x, R3.y, -1, PV34.z
        y: CNDE_INT T3.y, R2.y, -1, PV34.y VEC_120
        w: CNDE_INT R123.w, R0.y, R3.x, PV34.x VEC_201
     36 x: SUB_INT ____, 0.0f, PV35.x
        z: CNDE_INT T0.z, R2.z, -1, PV35.w
        w: SUB_INT ____, 0.0f, PV35.y
     37 x: SUB_INT ____, 0.0f, PV36.z
        y: CNDE_INT T3.y, R3.z, T3.y, PV36.w
        z: CNDE_INT T1.z, R2.w, T2.x, PV36.x
     38 z: CNDE_INT T2.z, R3.w, T0.z, PV37.x
        t: MULLO_INT ____, PV37.y, KC0[1].x
     39 z: SUB_INT T0.z, R5.y, PS38
        t: MULLO_INT ____, T1.z, KC0[3].x
     40 z: SUB_INT T3.z, R5.y, PS39
        t: MULLO_INT ____, T2.z, KC0[5].x
     41 y: SUB_INT T0.y, R5.y, PS40
        t: I_TO_F ____, T0.z
     42 x: ADD R5.x, PS41, 0.5
        t: I_TO_F ____, T3.y
     43 y: ADD R5.y, PS42, 0.5
        t: I_TO_F ____, T3.z
     44 x: ADD R0.x, PS43, 0.5
        t: I_TO_F ____, T1.z
     45 y: ADD R0.y, PS44, 0.5
        t: I_TO_F ____, T0.y
     46 x: ADD R2.x, PS45, 0.5
        t: I_TO_F ____, T2.z
     47 y: ADD R2.y, PS46, 0.5
04 TEX: ADDR(194) CNT(3)
     48 SAMPLE R5, R5.xy0x, t0, s0 UNNORM(XYZW)
     49 SAMPLE R0, R0.xy0x, t1, s0 UNNORM(XYZW)
     50 SAMPLE R2, R2.xy0x, t2, s0 UNNORM(XYZW)
05 ALU: ADDR(179) CNT(4)
     51 x: MULADD_e R0.x, R5.x, R0.x, R2.x
        y: MULADD_e R0.y, R5.y, R0.y, R2.y
        z: MULADD_e R0.z, R5.z, R0.z, R2.z
        w: MULADD_e R0.w, R5.w, R0.w, R2.w
06 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R6.x], R0, ELEM_SIZE(3) VPM
END_OF_PROGRAM
; ----------------- CS Data ------------------------
; Input Semantic Mappings
; No input mappings
GprPoolSize = 0
CodeLen = 1600;Bytes
PGM_END_CF = 0; words(64 bit)
PGM_END_ALU = 0; words(64 bit)
PGM_END_FETCH = 0; words(64 bit)
MaxScratchRegsNeeded = 0
;AluPacking = 0.0
;AluClauses = 0
;PowerThrottleRate = 0.0
; texResourceUsage[0] = 0x00000000
; texResourceUsage[1] = 0x00000000
; texResourceUsage[2] = 0x00000000
; texResourceUsage[3] = 0x00000000
; fetch4ResourceUsage[0] = 0x00000000
; fetch4ResourceUsage[1] = 0x00000000
; fetch4ResourceUsage[2] = 0x00000000
; fetch4ResourceUsage[3] = 0x00000000
; texSamplerUsage = 0x00000000
; constBufUsage = 0x00000000
ResourcesAffectAlphaOutput[0] = 0x00000000
ResourcesAffectAlphaOutput[1] = 0x00000000
ResourcesAffectAlphaOutput[2] = 0x00000000
ResourcesAffectAlphaOutput[3] = 0x00000000
;SQ_PGM_RESOURCES = 0x30000007
SQ_PGM_RESOURCES:NUM_GPRS = 7
SQ_PGM_RESOURCES:STACK_SIZE = 0
SQ_PRM_RESOURCES:PRIME_CACHE_ENABLE = 1
;SQ_PGM_RESOURCES_2 = 0x000000C0
; NumThreadPerGroupFlattened = 64
; NumThreadPerGroup_x = 64
; NumThreadPerGroup_y = 1
; NumThreadPerGroup_z = 1
; SetBufferForNumGroup = true
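
To put rough numbers on the comparison, here is a small sketch that sums the CNT(...) values of the ALU clause headers in the three dumps posted in this thread. The header strings are copied from those dumps; the helper name alu_slots is made up for illustration, and treating CNT as a count of ALU slots is an assumption about the disassembler output.

```python
import re

# ALU clause headers copied from the three ISA dumps in this thread.
BROOK_TECH0 = ["00 ALU: ADDR(32) CNT(5)",
               "02 ALU: ADDR(37) CNT(16)",
               "04 ALU: ADDR(53) CNT(4)"]
CAL_MAD = ["00 ALU: ADDR(32) CNT(4)",
           "02 ALU: ADDR(36) CNT(10)",
           "04 ALU: ADDR(46) CNT(6)"]
BROOK_TECH1 = ["00 ALU: ADDR(32) CNT(6) KCACHE0(CB0:0-15)",
               "02 ALU: ADDR(38) CNT(110) KCACHE0(CB0:0-15)",
               "03 ALU: ADDR(148) CNT(31) KCACHE0(CB0:0-15)",
               "05 ALU: ADDR(179) CNT(4)"]

ALU_HEADER = re.compile(r"ALU: ADDR\(\d+\) CNT\((\d+)\)")

def alu_slots(headers):
    # Sum the CNT(n) field over all ALU clause headers (hypothetical helper).
    return sum(int(ALU_HEADER.search(h).group(1)) for h in headers)

if __name__ == "__main__":
    for name, hdrs in [("brook tech 0", BROOK_TECH0),
                       ("CAL mad", CAL_MAD),
                       ("brook tech 1", BROOK_TECH1)]:
        print(name, alu_slots(hdrs))
```

By this count the totals are 25 (brook tech 0), 20 (CAL) and 151 (brook tech 1), while the reported CodeLen values are 576, 576 and 1600 bytes.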

gaurav_garg · ‎11-19-2009

These type conversions are because of the way indexing differs between Brook+ and IL. Brook+ exposes integer indexing into arrays, whereas in IL textures must be indexed using floating-point variables.

Technique 1 is used if address translation is required in your kernel. Address translation is enabled only if CAL is not able to handle the dimensions of your input/output streams, i.e. when using 3D streams or 1D streams with size > 8192 * 8192.

riza_guntur · ‎11-19-2009

That's why brook+ should stay as a performance language for ATI

Anyway, gaurav, I want to make sure: how do I use float indexing? If I convert an integer to float, use it as the index, and increment it by one in each iteration, it acts the same as integer indexing. Is it safe?

gaurav_garg · ‎11-19-2009

Originally posted by: riza.guntur
Anyway, gaurav, I want to make sure: how do I use float indexing? If I convert an integer to float, use it as the index, and increment it by one in each iteration, it acts the same as integer indexing. Is it safe?

It should be safe, and it should also be faster if you use floating-point indexing directly.

Originally posted by: emuller
So, if my compute shader is 1d and dim_x>8192 I get tech 1, right?

If dim_x < 8192^2, and I therefore make my CS 2d and do my own virtualization then I stay with tech 0 ? This should give a big performance improvement over 1d, no?

Sorry, I forgot one thing: a compute shader always uses the AT technique. The reason is that compute shaders require linear address calculation, and Brook+ has tried to virtualize it for developers in the AT code.

emuller · ‎11-19-2009

Do I understand correctly that the CAL equivalent of the scatter out of a brook compute shader is the global buffer g[], which is linearly addressed, and that the index of this g[] in a CAL program is not restricted to <8192? (otherwise, how would a CAL compute shader output to more than 8192 DWORDS?)

Then assuming the input dims are <(8192,8192), one should be able to avoid AT for a compute shader for significant performance boosts.

I need a compute shader for lds and scatter out, but I don't need 3/4ths of my instruction slots spent on AT ... since all of the above applies.

riza_guntur · ‎11-24-2009

Originally posted by: gaurav.garg
Anyway, gaurav, I want to make sure: how do I use float indexing? If I convert an integer to float, use it as the index, and increment it by one in each iteration, it acts the same as integer indexing. Is it safe?

It should be safe, and it should also be faster if you use floating-point indexing directly.

Using the old indexof() you mean?

gaurav_garg · ‎11-24-2009

indexof won't help. It too will cause conversions between float and int.

I meant that if you need to index your gather streams in a loop and the indexing depends on the loop iteration, it would be good to use a floating-point loop iterator.
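
On the "is it safe?" part of the question, a quick check of how far a single-precision loop counter can be incremented by one before it stops matching an integer counter. This is only a sketch in Python, rounding through struct to emulate 32-bit floats; it assumes the GPU index is an IEEE single-precision value, and the helper names are made up.

```python
import struct

def f32(x):
    # Round a Python float to the nearest IEEE single-precision value.
    return struct.unpack("f", struct.pack("f", x))[0]

def float_counter_matches_int(n_steps, start=0):
    # Increment a float32 counter and an int counter side by side.
    f = f32(float(start))
    i = start
    for _ in range(n_steps):
        f = f32(f + 1.0)
        i += 1
        if f != float(i):
            return False
    return True

LIMIT = 2 ** 24  # beyond this, float32 cannot represent every integer

if __name__ == "__main__":
    print(float_counter_matches_int(8192))               # True
    print(f32(float(LIMIT) + 1.0) == float(LIMIT))       # True: the counter stalls here
```

Indices up to the 8192 stream limit mentioned in this thread are far below 2^24, so under this assumption a float iterator incremented by 1.0 tracks the integer index exactly in that range.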

emuller · ‎11-19-2009

So, if my compute shader is 1d and dim_x>8192 I get tech 1, right?

If dim_x < 8192^2, and I therefore make my CS 2d and do my own virtualization then I stay with tech 0 ? This should give a big performance improvement over 1d, no?
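
One possible reading of "do my own virtualization" is mapping a linear element index onto a 2D layout whose width stays within the 8192 limit. The sketch below shows only that index arithmetic in Python; the function names and the choice of a fixed width of 8192 are assumptions for illustration, not Brook+ or CAL API.

```python
MAX_DIM = 8192  # per-dimension limit discussed in this thread

def shape_2d(n, width=MAX_DIM):
    # Smallest (width, height) layout that holds n elements; hypothetical helper.
    if n > width * width:
        raise ValueError("n does not fit in a width x width 2D layout")
    height = (n + width - 1) // width  # ceiling division
    return width, height

def to_2d(i, width=MAX_DIM):
    # Linear index -> (x, y) in a row-major 2D layout.
    return i % width, i // width

def to_1d(x, y, width=MAX_DIM):
    # (x, y) -> linear index, the inverse of to_2d.
    return y * width + x

if __name__ == "__main__":
    print(shape_2d(100000))        # (8192, 13)
    print(to_2d(100000))           # (1696, 12)
    print(to_1d(*to_2d(100000)))   # 100000
```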

riza_guntur · ‎11-19-2009

Originally posted by: emuller So, if my compute shader is 1d and dim_x>8192 I get tech 1, right?

If dim_x < 8192^2, and I therefore make my CS 2d and do my own virtualization then I stay with tech 0 ? This should give a big performance improvement over 1d, no?

what do you mean do your own virtualization?

create array of stream? >.<


Why I don't see any mad operations in Brook+ IL at all?