cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

frankas
Journeyman III

Can I use BFI_INT directly from IL ?

I need to use the BFI_INT instruction directly from some optimization. According to the Evergreen docs it performs:

dst = (src1 & src0) | (src2 & ~src0)

But the closest I can find in the IL docs are UBIT_INSERT.

Alas ubit_insert doesn't allow me to specify the bitmask src0 directly, only to the level of widths and offsets. So if I want a mask such as 0x5555aaaa ubit_insert is no use.

Have I overlooked something in the docs? Or is there an undocumented feature I can use to utilize this assembly level operation directly ?

 

0 Likes
61 Replies

I was also curious about BFI_INT (as it can fully replace several operation for one round of MD5) but looks like it didn't exposed at IL level at all.

0 Likes

Same story here, BFI_INT will also speed up SHA-1

http://en.wikipedia.org/wiki/SHA-1

 

BFI_INT is equivalent to the (bit)vector_select function in this optimization:

(0 ≤ i ≤ 19): f = vec_sel(d, c, b)

But I also found another optimization where it is useful

(40 ≤ i ≤ 59): f = vec_sel(b, (c or d), (c and d))

This brings this round function down to 3 cycles, which is faster than any of the alternatives listed on the Wikipedia page.

 

0 Likes

The IL instruction ubit_insert is translated to a BFM_INT+BFI_INT pair, but there is no instruction to expose the powerful functionality of BFI_INT alone. However there is a way to use it: after compiling the IL code with calclCompile() it is possible to dynamically patch the binary CAL object in memory by scanning its opcodes to replace some of them with BFI_INT. My open source whitepixel v2 project does this, feel free to read the code to see it in action:

 

http://blog.zorinaq.com/?e=43

 

empty_knapsack: actually BFI_INT is useful for two rounds in MD5 (F() and G()) 😉

0 Likes

Yeah, I was curious enough to do it today too . Although speed-up is a bit lower than expected (~16%) as compiler cannot pack BFI_INT in a 5x way efficiently. For my kernel utilization somewhere around 91-92%. May be 6x-8x packing can produce even better results.

0 Likes

Mrbpix: Thanks, your hack is pretty awesome.

However, at this point I think at this point it would be appropriate that ATI comment on what appears to be an oversight in the IL language, and how they intent to remedy the situation.

Can we expect BFI_INT support in future SDK versions, or do we have to go down the path Mrbpix describes in order to eek out the last 15% of performance ?

 

0 Likes

Funny thing that because of IL missing instruction for BFI_INT even OpenCL's bitselect() can't be translated directly into BFI_INT.

 

Anyway, these ~15% only applies to single MD5 where F & G rounds takes 32/45 of all calculations. SHA1 requires a lot more instructions and so removing 20*(3-1)=40 of them (or even +20 from 40-59 rounds) with BFI doesn't produce a huge boost -- I've got about 3% speed-up with BFI_INT usage.

 

And of course ATI should add this function to IL but, as it usually happens, if they'll answer it'll 99% that "this feature will be added in next release" and "we have 3 months development cycle", etc. So I won't expect this will happens anytime soon.

 

Btw, 5970+AMD Stream 2.3+Catalyst 10.12 == still broken...

0 Likes

*bump*

I guess adding BFI_INT to IL (and also mapping OpenCL's bitselect() to it) would be easy yet very useful to all of us.

 

Is that scheduled for upcomming SDK 2.4 ?

0 Likes

* bump * again

Can't find it in the March 2011 IL specs. Should be easy for AMD to fix this oversight, or atleast reply to questions about it.

0 Likes

Those are the ones that I am confident will make it in 2.6. I got pulled onto some more pressing matters so I can't add any more patterns at this time.
Another pattern that I was working on is (C ^ (A & (B ^ C)).

Anyway, these ~15% only applies to single MD5 where F & G rounds takes 32/45 of all calculations. SHA1 requires a lot more instructions and so removing 20*(3-1)=40 of them (or even +20 from 40-59 rounds) with BFI doesn't produce a huge boost -- I've got about 3% speed-up with BFI_INT usage.

hindi songs

0 Likes

I reinstate that OpenCL (as of 11.9 and SDK2.5) does generate BFI_INT code when certain patterns occur in the kernel (and yes, that's not mapped to bitselect()). I have my own routine for binary patching BYTEALIGN with BFI_INT, I have tried disabling it and replacing amd_bytealign() with the bitwise expression and noticed BFI_INTs in the resulting ISA dump. This might have been some mistake or the backend might be able to generate BFI only in certain cases, e.g one of the arguments being a constant (which occurs in the first few steps of MD5 where A/B/C/D or part of them still equal H0...H4 which are constants). Anyway, I am still using my binary patching codepath.

BTW, BFI_INT might give you a bit more with MD5 when you do partial reversals to step 43 instead of complete to step 45 _AND_ when you do the (b^c) trick in round3.

 

With SHA1,it's also a bit more than 3% if you do an early check in step 76 rather than comparing values after step 80 and also precalculate some of the w values in host code rather than doing those calculations post step 16.

 

0 Likes

Originally posted by: gat3way I reinstate that OpenCL (as of 11.9 and SDK2.5) does generate BFI_INT code when certain patterns occur in the kernel (and yes, that's not mapped to bitselect()).

What platform are you using?  I checked rechecked, and rechecked again the vista and linux packages, and it is nowhere to be found, so unless they starting packing their dlls for some reason, it simply is not there.  Since OpenCL generates IL code, it can't even be that its just skipping a step to ISA. 

Only thing I can think of is you must be using XP, and/or the 32 bit only package (not sure if they'd be different).  Well, that or you're the only one with access to the "real" 11.9 driver package...

Honestly Micah, you mention cheking the builds, and you have 2 people on here saying its missing.  Please, go double check all the parameters?  You will find something is missing for sure, it won't be a waste of your time, and you'll make a lot of us happy here!  I've specifically confirmed the vista x64 package does not contain it in 11.9, or 11.10, and the 32 bit binary distributed with the vista x64 version does not contain it.  Same goes for linux, 11.9 (there's only 1 package) x64 binaries show no bfi, nor do the 32 bit versions.  As I said, I bet there is something like -DWITH_BFI not being set....build it and check the command line output, check the spelling, check the spelling in the code.  I don't like my time being wasted, so I'm not going to waste anyone elses. 

I don't know of anything else to say on this, 2 independant people confirmed its not in there.  I reported versions/platforms checked.  I can't make this any easier for you...

 

0 Likes

On, now I'm totally calling BS.  Just grabbed the XP, and XP64 builds, and did the same, so I've checked every version of 11.10 for windows, and 11.9 for linux.  Its not there....just to prove my point, in the code window you'll see the strings output from the xp64 aticaldd64.dll

I don't know what version you have gat3way, but its not the one released to the masses...I don't doubt your eyesight, and I see no reason that you would get on here and lie, which leaves only the aforementioned explanation, that you have some non-public version of the driver...

 Edit:  Apparently the strings dump is too large for the forums, so I'll just include the relevant sections, surrounded by some of the trash so you can see the output is from strings....

 

Another edit, interestingly enough, there is an IL_OP_BFI, but no corrosponding static text with which to parse it.  So it looks like it could generate it, if it knew what it was looking for...

USE_VTX_POINT_SIZE = %u USE_VTX_EDGE_FLAG = %u USE_VTX_RENDER_TARGET_INDX = %u USE_VTX_VIEWPORT_INDX = %u USE_VTX_KILL_FLAG = %u VS_OUT_MISC_VEC_ENA = %u VS_OUT_MISC_SIDE_BUS_ENA = %u VS_OUT_CCDIST0_VEC_ENA = %u VS_OUT_CCDIST1_VEC_ENA = %u ; ----------------- GS Data ------------------------ VGT_GS_OUT_PRIM_TYPE = 0x%08X ; OUTPRIM_TYPE = %u ; MergeFetchFlags = 0x%08X ; ----------------- PS Data ------------------------ ; SPI_PS_IN_CONTROL_0 = 0x%08X SPI0:NUM_INTERP = %u SPI0:POSITION_ENA = %u SPI0:POSITION_CENTROID = %u SPI0:POSITION_ADDR = %u SPI0:PARAM_GEN = %u SPI0:PARAM_GEN_ADDR = %u SPI0:BARYC_SAMPLE_CNTL = %u SPI0:PERSP_GRADIENT_ENA = %u SPI0:LINEAR_GRADIENT_ENA = %u SPI0:POSITION_SAMPLE = %u SPI0:BARYC_SAMPLE_ENA = %u ; SPI_PS_IN_CONTROL_1 = 0x%08X SPI1:GEN_INDEX_PIX = %u SPI1:FIXED_PT_POSITION_ENA = %u SPI1:FIXED_PT_POSITION_ADDR = %u SPI1:FRONT_FACE_ENA = %u SPI1:FRONT_FACE_ADDR = %u SPI1:FRONT_FACE_CHAN = %u SPI1:FOG_ADDR = %u SPI1:GEN_INDEX_PIX_ADDR = %u ; SPI_INPUT_Z SPI:PROVIDE_Z_TO_SPI = %u ; CB_SHADER_MASK = 0x%08X CB:OUTPUT0_ENABLE = %u CB:OUTPUT1_ENABLE = %u CB:OUTPUT2_ENABLE = %u CB:OUTPUT3_ENABLE = %u CB:OUTPUT4_ENABLE = %u CB:OUTPUT5_ENABLE = %u CB:OUTPUT6_ENABLE = %u CB:OUTPUT7_ENABLE = %u CB_SHADER_CONTROL:bitmap = %u%u%u%u%u%u%u%u ; DB_SHADER_CONTROL = 0x%08X DB:Z_EXPORT_ENABLE = %u DB:STENCIL_REF_EXPORT_ENABLE = %u DB:MASK_EXPORT_ENABLE = %u DB:ALPHA_TO_MASK_DISABLE = %u DB:Z_ORDER = %u DB:KILL_ENABLE = %u ; SQ_PGM_EXPORTS_PS SQ_PGM_EXPORTS_PS:PS_EXPORT_MODE = 0x%08X ; (%u color + Z ; bHasFogMerge = 0x%08X ; ----------------- CS Data ------------------------ ; NumSharedGprUser = %d ; NumSharedGprTotal = %d ; CS Setup Mode = Slow (i.e setup R0.xyzw) ; CS Setup Mode = Fast (i.e setup R0.x) ; NumThreadPerGroup = %d ; TotalNumThreadGroup = %d ; NumWavefrontPerSIMD = %d ; IsMaxNumWavePerSIMD = true ; IsMaxNumWavePerSIMD = false ; SetBufferForNumGroup = true ; SetBufferForNumGroup = false R6PLUS_ALU_MOVA_DST_CLAUSE_GLOBAL_B3 R6PLUS_ALU_MOVA_DST_CLAUSE_GLOBAL_B2 R6PLUS_ALU_MOVA_DST_CLAUSE_GLOBAL_B1 R6PLUS_ALU_MOVA_DST_CLAUSE_GLOBAL_B0 R6PLUS_ALU_MOVA_DST_CF_IDX1 R6PLUS_ALU_MOVA_DST_CF_IDX0 R6PLUS_ALU_MOVA_DST_CF_PC R6PLUS_ALU_MOVA_DST_AR_X R6PLUS_EARLY_Z_THEN_RE_Z R6PLUS_RE_Z R6PLUS_EARLY_Z_THEN_LATE_Z R6PLUS_LATE_Z R6PLUS_UNDEF R6PLUS_CENTROIDS_AND_CENTERS R6PLUS_CENTERS_ONLY R6PLUS_CENTROIDS_ONLY R6PLUS_TRISTRIP R6PLUS_LINESTRIP R6PLUS_POINTLIST R6PLUS_SPRITE_EN R6PLUS_GS_SCENARIO_C R6PLUS_GS_SCENARIO_G R6PLUS_GS_SCENARIO_B R6PLUS_GS_SCENARIO_A R6PLUS_GS_OFF R6PLUS_TVX_FMT_RESERVED_63 R6PLUS_TVX_FMT_CTX1 R6PLUS_TVX_FMT_APC7 R6PLUS_TVX_FMT_APC6 R6PLUS_TVX_FMT_APC5 R6PLUS_TVX_FMT_APC4 R6PLUS_TVX_FMT_APC3 R6PLUS_TVX_FMT_APC2 R6PLUS_TVX_FMT_APC1 R6PLUS_TVX_FMT_APC0 R6PLUS_TVX_FMT_BC5 R6PLUS_TVX_FMT_BC4 R6PLUS_TVX_FMT_BC3 R6PLUS_TVX_FMT_BC2 R6PLUS_TVX_FMT_BC1 R6PLUS_TVX_FMT_32_32_32_FLOAT R6PLUS_TVX_FMT_32_32_32 R6PLUS_TVX_FMT_16_16_16_FLOAT R6PLUS_TVX_FMT_16_16_16 R6PLUS_TVX_FMT_8_8_8 R6PLUS_TVX_FMT_5_9_9_9_SHAREDEXP R6PLUS_TVX_FMT_32_AS_8_8 R6PLUS_TVX_FMT_32_AS_8 R6PLUS_TVX_FMT_BG_RG R6PLUS_TVX_FMT_GB_GR R6PLUS_TVX_FMT_1_REVERSED R6PLUS_TVX_FMT_1 R6PLUS_TVX_FMT_RESERVED_36 R6PLUS_TVX_FMT_32_32_32_32_FLOAT R6PLUS_TVX_FMT_32_32_32_32 R6PLUS_TVX_FMT_RESERVED_33 R6PLUS_TVX_FMT_16_16_16_16_FLOAT R6PLUS_TVX_FMT_16_16_16_16 R6PLUS_TVX_FMT_32_32_FLOAT R6PLUS_TVX_FMT_32_32 R6PLUS_TVX_FMT_X24_8_32_FLOAT R6PLUS_TVX_FMT_10_10_10_2 R6PLUS_TVX_FMT_8_8_8_8 R6PLUS_TVX_FMT_2_10_10_10 R6PLUS_TVX_FMT_11_11_10_FLOAT R6PLUS_TVX_FMT_11_11_10 R6PLUS_TVX_FMT_10_11_11_FLOAT R6PLUS_TVX_FMT_10_11_11 R6PLUS_TVX_FMT_24_8_FLOAT R6PLUS_TVX_FMT_24_8 R6PLUS_TVX_FMT_8_24_FLOAT R6PLUS_TVX_FMT_8_24 R6PLUS_TVX_FMT_16_16_FLOAT R6PLUS_TVX_FMT_16_16 R6PLUS_TVX_FMT_32_FLOAT R6PLUS_TVX_FMT_32 R6PLUS_TVX_FMT_5_5_5_1 R6PLUS_TVX_FMT_4_4_4_4 R6PLUS_TVX_FMT_1_5_5_5 R6PLUS_TVX_FMT_6_5_5 R6PLUS_TVX_FMT_5_6_5 R6PLUS_TVX_FMT_8_8 R6PLUS_TVX_FMT_16_FLOAT R6PLUS_TVX_FMT_16 R6PLUS_TVX_FMT_RESERVED_4 R6PLUS_TVX_FMT_3_3_2 R6PLUS_TVX_FMT_4_4 R6PLUS_TVX_FMT_8 R6PLUS_TVX_FMT_INVALID R6PLUS_TVX_DstSel_Mask R6PLUS_TVX_DstSel_RESERVED_6 R6PLUS_TVX_DstSel_1f R6PLUS_TVX_DstSel_0f R6PLUS_TVX_DstSel_W R6PLUS_TVX_DstSel_Z R6PLUS_TVX_DstSel_Y R6PLUS_TVX_DstSel_X R6PLUS_SRF_MODE_NO_ZERO R6PLUS_SRF_MODE_ZERO_CLAMP_MINUS_ONE R6PLUS_NUM_FORMAT_SCALED R6PLUS_NUM_FORMAT_INT R6PLUS_NUM_FORMAT_NORM R6PLUS_TVX_EndianSwap_RESERVED_3 R6PLUS_TVX_EndianSwap_8in32 R6PLUS_TVX_EndianSwap_8in16 R6PLUS_TVX_EndianSwap_None R6PLUS_DSR_MUX_DWORD_SELECT R6PLUS_DSR_MUX_FFT_PERMUTE R6PLUS_DSR_MUX_NONE ATOMIC_ORDERED_ALLOC_RET USHORT_READ_RET SHORT_READ_RET UBYTE_READ_RET BYTE_READ_RET READWRITE_RET READ2_RET READ_REL_RET READ_RET CMP_XCHG_SPF_RET CMP_XCHG_RET XCHG2_RET XCHG_REL_RET XCHG_RET MSKOR_RET XOR_RET OR_RET AND_RET MAX_UINT_RET MIN_UINT_RET MAX_INT_RET MIN_INT_RET DEC_RET INC_RET RSUB_RET SUB_RET ADD_RET SHORT_WRITE BYTE_WRITE CMP_STORE_SPF CMP_STORE WRITE2 WRITE_REL WRITE DEC_UINT_RTN INC_UINT_RTN MSKOR_RTN XOR_RTN OR_RTN AND_RTN MAX_UINT_RTN MAX_INT_RTN MIN_UINT_RTN MIN_INT_RTN RSUB_RTN SUB_RTN ADD_RTN CMPXCHG_FDENORM_RTN CMPXCHG_FLT_RTN CMPXCHG_INT_RTN XCHG_FDENORM_RTN XCHG_RTN NOP_RTN STORE_BYTE__NI STORE_SHORT__NI STORE_DWORD__NI DEC_UINT INC_UINT MSKOR RSUB CMPXCHG_FDENORM CMPXCHG_FLT CMPXCHG_INT STORE_RAW_FDENORM STORE_RAW STORE_TYPED R6PLUS_VTX_FETCH_NO_INDEX_OFFSET R6PLUS_VTX_FETCH_INSTANCE_DATA R6PLUS_VTX_FETCH_VERTEX_DATA R6PLUS_TEX_NORMALIZED R6PLUS_TEX_UNNORMALIZED R6PLUS_VTX_INST_GET_BUFFER_RESINFO R6PLUS_VTX_INST_MEM R6PLUS_VTX_INST_SEMANTIC R6PLUS_VTX_INST_FETCH R6PLUS_FORMAT_COMP_UNSIGNED_BIASED R6PLUS_FORMAT_COMP_SIGNED R6PLUS_FORMAT_COMP_UNSIGNED R6PLUS_ENDIAN_8IN32 R6PLUS_ENDIAN_8IN16 R6PLUS_ENDIAN_NONE R6PLUS_ALU_EXECUTE_MASK_OP_KILL R6PLUS_ALU_EXECUTE_MASK_OP_CONTINUE R6PLUS_ALU_EXECUTE_MASK_OP_BREAK R6PLUS_ALU_EXECUTE_MASK_OP_DEACTIVATE R6PLUS_ALU_SCL_221 R6PLUS_ALU_SCL_212 R6PLUS_ALU_SCL_122 R6PLUS_ALU_SCL_210 R6PLUS_ALU_VEC_210 R6PLUS_ALU_VEC_201 R6PLUS_ALU_VEC_102 R6PLUS_ALU_VEC_120 R6PLUS_ALU_VEC_021 R6PLUS_ALU_VEC_012 R6PLUS_ALU_OMOD_D2 R6PLUS_ALU_OMOD_M4 R6PLUS_ALU_OMOD_M2 R6PLUS_ALU_OMOD_OFF R6PLUS_PRED_SEL_ONE R6PLUS_PRED_SEL_ZERO R6PLUS_PRED_SEL_RESERVED_1 R6PLUS_PRED_SEL_OFF R6PLUS_CF_JUMPTABLE_SEL_INDEX_1 R6PLUS_CF_JUMPTABLE_SEL_INDEX_0 R6PLUS_CF_JUMPTABLE_SEL_CONST_D R6PLUS_CF_JUMPTABLE_SEL_CONST_C R6PLUS_CF_JUMPTABLE_SEL_CONST_B R6PLUS_CF_JUMPTABLE_SEL_CONST_A R6PLUS_CF_PIXEL_Z R6PLUS_CF_PIXEL_MRT7 R6PLUS_CF_PIXEL_MRT6 R6PLUS_CF_PIXEL_MRT5 R6PLUS_CF_PIXEL_MRT4 R6PLUS_CF_PIXEL_MRT3 R6PLUS_CF_PIXEL_MRT2 R6PLUS_CF_PIXEL_MRT1 R6PLUS_CF_PIXEL_MRT0 R6PLUS_CF_POS_3 R6PLUS_CF_POS_2 R6PLUS_CF_POS_1 R6PLUS_CF_POS_0 R6PLUS_CF_INVALID R6PLUS_CF_INDEX_1 R6PLUS_CF_INDEX_0 R6PLUS_CF_INDEX_NONE R6PLUS_CF_KCACHE_LOCK_LOOP_INDEX R6PLUS_CF_KCACHE_LOCK_2 R6PLUS_CF_KCACHE_LOCK_1 R6PLUS_CF_KCACHE_NOP R6PLUS_EXPORT_WRITE_IND_ACK R6PLUS_EXPORT_WRITE_ACK R6PLUS_EXPORT_WRITE_IND R6PLUS_EXPORT_WRITE R6PLUS_EXPORT_PARAM R6PLUS_EXPORT_POS R6PLUS_EXPORT_PIXEL R6PLUS_CF_COND_NOT_BOOL R6PLUS_CF_COND_BOOL R6PLUS_CF_COND_FALSE R6PLUS_CF_COND_ACTIVE R6PLUS_CF_ENCODING_INST_ALU1 R6PLUS_CF_ENCODING_INST_ALU0 R6PLUS_CF_ENCODING_INST_ALLOC_EXPORT R6PLUS_CF_ENCODING_INST_CF R6PLUS_REL_GLOBAL R6PLUS_REL_LOOP R6PLUS_REL_NONE R6PLUS_INDEX_GLOBAL_AR_X R6PLUS_INDEX_GLOBAL R6PLUS_INDEX_LOOP R6PLUS_INDEX_AR_W R6PLUS_INDEX_AR_Z R6PLUS_INDEX_AR_Y R6PLUS_INDEX_AR_X R6PLUS_RELATIVE R6PLUS_ABSOLUTE R6PLUS_SEL_MASK R6PLUS_SEL_RESERVED_6 R6PLUS_SEL_1 R6PLUS_SEL_0 R6PLUS_SEL_W R6PLUS_SEL_Z R6PLUS_SEL_Y R6PLUS_SEL_X R6PLUS_CHAN_W R6PLUS_CHAN_Z R6PLUS_CHAN_Y R6PLUS_CHAN_X R6PLUS_MEM_INST_MEM ;SQ_PGM_RESOURCES_2 = 0x%08X VGT_STRMOUT_CONFIG = 0x%x VGT_STRMOUT_CONFIG:RAST_STREAM = %u VGT_STRMOUT_CONFIG:STREAMOUT_0_EN = %u VGT_STRMOUT_CONFIG:STREAMOUT_1_EN = %u VGT_STRMOUT_CONFIG:STREAMOUT_2_EN = %u VGT_STRMOUT_CONFIG:STREAMOUT_3_EN = %u u32LsStride = %d u32HsNumInputCP = %d u32HsNumOutputCP = %d u32HsNumPatchConst = %d u32HsCPStride = %d u32HsNumThread = %d u32HsTessFactorStride= %d HsTessFactorBufferTFMajor = %d u32TsDomain = %d u32TsPartition = %d u32TsOutputPrimitive = %d f32TsMaxTessFactor = %g u32PrimIdExportSlot = %d ; UavRtnBufInfoTbl[%d] stride = %d isTypedUav = %d dataType = %d ; GlobalRtnBufSlot = 0x%X ; GlobalRtnBufSlotShort = 0x%X ; GlobalRtnBufSlotByte = 0x%X ; RatOpIsUsed = 0x%X ; RatAtomicOpIsUsed = 0x%X VGT_GS_INSTANCE_CNT = 0x%08X ; ENABLE = %u ; CNT = %u SQ_LDS_ALLOC_PS:SIZE = 0x%08X ; SPI_PS_IN_CONTROL_2 = 0x%08X SPI2:LINE_STIPPLE_TEX_ENA = %u SPI2:LINE_STIPPLE_TEX_ADDR = %u ; SPI_BARYC_CNTL = 0x%08X SPI_BARYC_CNTL:PERSP_CENTER_ENA = %u SPI_BARYC_CNTL:PERSP_CENTROID_ENA = %u SPI_BARYC_CNTL:PERSP_SAMPLE_ENA = %u SPI_BARYC_CNTL:PERSP_PULL_MODEL_ENA = %u SPI_BARYC_CNTL:LINEAR_CENTER_ENA = %u SPI_BARYC_CNTL:LINEAR_CENTROID_ENA = %u SPI_BARYC_CNTL:LINEAR_SAMPLE_ENA = %u DB:DB_SOURCE_FORMAT = %u DB:CONSERVATIVE_Z_EXPORT = %u DB:DEPTH_BEFORE_SHADER = %u DB:EXEC_ON_HIER_FAIL = %u DB:EXEC_ON_NOOP = %u SQ_LDS_ALLOC:SIZE = 0x%08X ; NumThreadPerGroupFlattened = %d ; NumThreadPerGroup_x = %d ; NumThreadPerGroup_y = %d ; NumThreadPerGroup_z = %d _addr _unroll _size _matrix _coordtype _aoffimmi _compselect _sampler _resource _relop _resourcetype _type _fmtw _fmtz _fmty _fmtx _nrm3 _zeroop dbg_temploc dbg_line dbg_string srv_struct_load_ext srv_raw_load_ext append_buf_consume append_buf_alloc uav_short_store uav_short_store_ext uav_byte_store uav_byte_store_ext uav_ushort_load uav_ushort_load_ext uav_ubyte_load uav_ubyte_load_ext uav_short_load uav_short_load_ext uav_byte_load uav_byte_load_ext uav_read_udec uav_read_udec_ext uav_read_uinc uav_read_uinc_ext uav_udec uav_udec_ext uav_uinc uav_uinc_ext uav_read_cmp_xchg uav_read_cmp_xchg_ext uav_read_xchg uav_read_xchg_ext uav_read_xor uav_read_xor_ext uav_read_or uav_read_or_ext uav_read_and uav_read_and_ext uav_read_umax uav_read_umax_ext uav_read_umin uav_read_umin_ext uav_read_max uav_read_max_ext uav_read_min uav_read_min_ext uav_read_rsub uav_read_rsub_ext uav_read_sub uav_read_sub_ext uav_read_add uav_read_add_ext uav_cmp uav_cmp_ext uav_xor uav_xor_ext uav_or uav_or_ext uav_and uav_and_ext uav_umax uav_umax_ext uav_umin uav_umin_ext uav_max uav_max_ext uav_min uav_min_ext uav_rsub uav_rsub_ext uav_sub uav_sub_ext uav_add uav_add_ext uav_arena_store uav_arena_load uav_struct_store uav_struct_store_ext uav_raw_store uav_raw_store_ext uav_store uav_store_ext uav_struct_load uav_struct_load_ext uav_raw_load uav_raw_load_ext uav_load uav_load_ext gds_read_cmp_xchg gds_read_xchg gds_read_mskor gds_read_xor gds_read_or gds_read_and gds_read_umax gds_read_umin gds_read_max gds_read_min gds_read_dec gds_read_inc gds_read_rsub gds_read_sub gds_read_add gds_cmp_store gds_mskor gds_xor gds_or gds_and gds_umax gds_umin gds_max gds_min gds_dec gds_inc gds_rsub gds_sub gds_add gds_store gds_load lds_store_short lds_store_byte lds_load_ushort lds_load_ubyte lds_load_short lds_load_byte lds_read_cmp_xchg lds_read_xchg lds_read_mskor lds_read_xor lds_read_or lds_read_and lds_read_umax lds_read_umin lds_read_max lds_read_min lds_read_rsub lds_read_sub lds_read_dec lds_read_inc lds_read_add lds_cmp lds_mskor lds_xor lds_or lds_and lds_umax lds_umin lds_max lds_min lds_rsub lds_sub lds_dec lds_inc lds_add lds_store_vec lds_load_vec lds_store lds_load mqsad qsad msad sad4 sadhi u4lerp bytealign bitalign unpack0 unpack1 unpack2 unpack3 f2u4 f_2_u4 mova mova_round invariant_mov invariant_move ftoi_flr ftoi_rpi utod itod dtou dtoi f162f f2f16_plus_inf f2f16_neg_inf f2f16_near f2f16 utof itof ftou ftoi dtrig_preop ddiv_fixup ddiv_fmas ddiv_scale dclass dmovc dmov drsq drcp dsqrt dmin dmax dmad dfrac ldexp dldexp ddiv dmul dadd dfrexp_mant dfrexp_exp frexp dfrexp fdiv_fixup fdiv_fmas fdiv_scale class max3 med3 min3 frexp_mant frexp_exp fldexp rcp_vec transpose sqrt_vec sqrt cos_vec sin_vec sincos rsq_vec round_z round_plusinf round_neginf round_nearest pireduce mmul logp log_vec fwidth faceforward expp exp_vec dxsincos dp2add dist colorclamp cmov_logical cmov clamp atan asin acos u64mod i64mod u64div i64div u64mul i64mul u64shr u64min u64max u64lt u64ge i64shr i64shl i64negate i64ne i64min i64max i64lt i64ge i64eq i64sub i64add umax3 umed3 umin3 imul24 umul24_high umul24 umad24 umul_high umul umin umax umad umod udiv ushr imax3 imed3 imin3 imad24 imul24_high iborrow icarry ishr ishl inegate imad imul_high imul imin imax iadd ubit_insert icbits ubit_reverser ubit_reverse ubit_extract ibit_extract inot ixor iand stream_id wave_id cu_id eval_centroid eval_sample_index eval_snapped emit_cut_sream emit_stream cut_stream sample_return_code sample_c_b_ext sample_c_b sample_c_l_ext sample_c_l sample_c_g_ext sample_c_g sample_c_ext sample_c sample_c_lz_ext sample_c_lz sample_l_ext sample_l sample_g_ext sample_g fetch4poc_ext fetch4poc fetch4c_ext fetch4c fetch4po_ext fetch4po fetch4_ext fetch4 sample_b_ext sample_b sample_ext sample samplepos_ext samplepos sampleinfo_ext sampleinfo bufinfo_ext bufinfo resinfo_ext resinfo getlod load_fptr_ext load_fptr load_ext load emitcut emit discard_logicalnz discard_logicalz dcl_resource dcl_global_flags dcl_stream dcl_num_instances dcl_lds_size_per_thread dclarray endphase hs_join_phase hs_fork_phase hs_cp_phase ret_logicalnz ret_logicalz ret_dyn switch whileloop loop_rep loop if_logicalnz if_logicalz ifnz func endloop endif endfunc endmain endswitch else default continue_logicalnz continue_logicalz continuec continue case call_logicalnz call_logicalz callnz call break_logicalnz break_logicalz breakc break none cubemaparray cubemap_plus_w 2d_plus_w 2darraymsaa 2darray 1darray buffer 2dmsaa cubemap No Error Non fragment programs not supported Invalid target architecture Unsupported program type Error in Source binary Error getting encoding count Couldn't find appropriate il binary source Invalid target Error Initializing compiler Memory allocation failure Error Creating program info Error Creating constants Error Creating UAV buffer Error encoding binary Error packing binary Invalid architecture Invalid machine type IL_SHADER_PIXEL IL_SHADER_COMPUTE ShaderType = %s TargetChip = %c Parse errors in converting assembly program No Error Reported! Fatal Error: Internal error encountered in back-end! Fatal Error: Back-end out of memory! Fatal Error: Invalid parameters passed to back-end! Fatal Error: Unsupported program construct detected in back-end! Fatal Error: Compilation error reported by back-end! Fatal Error: Invalid operation for this architecture Fatal Error: An unknown error occured in back-end! IL_DBG_TEMPLOC IL_DBG_LINE IL_DBG_STRING IL_OP_BFM IL_OP_BFI IL_OP_STREAM_ID IL_OP_MQSAD_U8 IL_OP_QSAD_U8 IL_OP_MSAD_U8 IL_OP_D_TRIG_PREOP IL_OP_D_DIV_FIXUP IL_OP_D_DIV_FMAS IL_OP_D_DIV_SCALE IL_OP_DIV_FIXUP IL_OP_DIV_FMAS IL_OP_DIV_SCALE IL_OP_SEMAPHORE_WAIT IL_OP_SEMAPHORE_SIGNAL IL_OP_SEMAPHORE_INIT IL_DCL_SEMAPHORE IL_DCL_GWS_THREAD_COUNT IL_OP_U64_MOD IL_OP_I64_MOD IL_OP_U64_DIV IL_OP_I64_DIV IL_OP_I64_SUB IL_OP_WAVE_ID IL_OP_CU_ID IL_OP_SAMPLE_RETURN_CODE IL_OP_D_CLASS IL_OP_CLASS IL_OP_U_MAX3 IL_OP_U_MED3 IL_OP_U_MIN3 IL_OP_I_MAX3 IL_OP_I_MED3 IL_OP_I_MIN3 IL_OP_MAX3 IL_OP_MED3 IL_OP_MIN3 IL_OP_FTOI_FLR IL_OP_FTOI_RPI IL_OP_UTOD IL_OP_ITOD IL_OP_DTOU IL_OP_DTOI IL_OP_D_FREXP_MANT IL_OP_D_FREXP_EXP IL_OP_FREXP_MANT IL_OP_FREXP_EXP IL_OP_LDEXP IL_OP_U64_MUL IL_OP_I64_MUL IL_OP_F_2_F16_PLUS_INF IL_OP_F_2_F16_NEG_INF IL_OP_F_2_F16_NEAR IL_OP_LDS_READ_MSKOR IL_OP_LDS_MSKOR IL_OP_LDS_READ_DEC IL_OP_LDS_READ_INC IL_OP_LDS_DEC IL_OP_LDS_INC IL_OP_U_MUL24_HIGH IL_OP_I_MUL24_HIGH IL_OP_DCL_TYPELESS_UAV IL_OP_DCL_TYPED_UAV IL_OP_U64_SHR IL_OP_U64_MIN IL_OP_U64_MAX IL_OP_U64_LT IL_OP_U64_GE IL_OP_I64_SHR IL_OP_I64_SHL IL_OP_I64_NEGATE IL_OP_I64_NE IL_OP_I64_MIN IL_OP_I64_MAX IL_OP_I64_LT IL_OP_I64_GE IL_OP_I64_EQ IL_OP_I64_ADD IL_OP_UAV_SHORT_STORE IL_OP_UAV_BYTE_STORE IL_OP_UAV_USHORT_LOAD IL_OP_UAV_UBYTE_LOAD IL_OP_UAV_SHORT_LOAD IL_OP_UAV_BYTE_LOAD IL_OP_LDS_STORE_SHORT IL_OP_LDS_STORE_BYTE IL_OP_LDS_LOAD_USHORT IL_OP_LDS_LOAD_UBYTE IL_OP_LDS_LOAD_SHORT IL_OP_LDS_LOAD_BYTE IL_OP_UAV_READ_UDEC IL_OP_UAV_READ_UINC IL_OP_I_MUL24 IL_OP_I_MAD24 IL_OP_UAV_UDEC IL_OP_UAV_UINC IL_OP_FMA IL_OP_U_MUL24 IL_OP_U_MAD24 IL_OP_GDS_READ_CMP_XCHG IL_OP_GDS_READ_XCHG IL_OP_GDS_READ_MSKOR IL_OP_GDS_READ_XOR IL_OP_GDS_READ_OR IL_OP_GDS_READ_AND IL_OP_GDS_READ_UMAX IL_OP_GDS_READ_UMIN IL_OP_GDS_READ_MAX IL_OP_GDS_READ_MIN IL_OP_GDS_READ_DEC IL_OP_GDS_READ_INC IL_OP_GDS_READ_RSUB IL_OP_GDS_READ_SUB IL_OP_GDS_READ_ADD IL_OP_GDS_CMP_STORE IL_OP_GDS_MSKOR IL_OP_GDS_XOR IL_OP_GDS_OR IL_OP_GDS_AND IL_OP_GDS_UMAX IL_OP_GDS_UMIN IL_OP_GDS_MAX IL_OP_GDS_MIN IL_OP_GDS_DEC IL_OP_GDS_INC IL_OP_GDS_RSUB IL_OP_GDS_SUB IL_OP_GDS_ADD IL_OP_GDS_STORE IL_OP_GDS_LOAD IL_DCL_STRUCT_GDS IL_DCL_GDS IL_OP_PREFIX IL_DCL_MAX_THREAD_PER_GROUP IL_OP_LOAD_FPTR IL_OP_RCP_VEC IL_DCL_GLOBAL_FLAGS IL_DCL_STREAM IL_OP_MACROCALL IL_OP_MACROEND IL_OP_MACRODEF IL_OP_D_RSQ IL_OP_D_RCP IL_OP_D_SQRT IL_OP_D_MOVC IL_OP_D_MOV IL_OP_EVAL_CENTROID IL_OP_EVAL_SAMPLE_INDEX IL_OP_EVAL_SNAPPED IL_OP_F_2_U4 IL_OP_SAD_4 IL_OP_SAD_HI IL_OP_SAD IL_OP_U4LERP IL_OP_BYTE_ALIGN IL_OP_BIT_ALIGN IL_OP_UNPACK3 IL_OP_UNPACK2 IL_OP_UNPACK1 IL_OP_UNPACK0 IL_OP_F16_2_F IL_OP_F_2_F16 IL_OP_DMIN IL_OP_DMAX IL_OP_FETCH4_PO_C IL_OP_FETCH4_PO IL_OP_FETCH4_C IL_OP_BUFINFO IL_OP_U_BIT_INSERT IL_OP_FCALL IL_OP_DCL_INTERFACE_PTR IL_OP_DCL_FUNCTION_TABLE IL_OP_DCL_FUNCTION_BODY IL_DCL_MAX_TESSFACTOR IL_DCL_TS_OUTPUT_PRIMITIVE IL_DCL_TS_PARTITION IL_DCL_TS_DOMAIN IL_OP_ENDPHASE IL_OP_HS_JOIN_PHASE IL_OP_HS_FORK_PHASE IL_OP_HS_CP_PHASE IL_DCL_NUM_INSTANCES IL_DCL_NUM_OCP IL_DCL_NUM_ICP IL_OP_U_BIT_REVERSE IL_OP_U_BIT_EXTRACT IL_OP_I_BIT_EXTRACT IL_OP_I_BORROW IL_OP_I_CARRY IL_OP_I_FIRSTBIT IL_OP_I_COUNTBITS IL_OP_SAMPLE_C_B IL_OP_SAMPLE_C_G IL_OP_SAMPLE_C_L IL_OP_EMIT_THEN_CUT_STREAM IL_OP_EMIT_STREAM IL_OP_CUT_STREAM IL_OP_LDS_READ_CMP_XCHG IL_OP_LDS_READ_XCHG IL_OP_LDS_READ_XOR IL_OP_LDS_READ_OR IL_OP_LDS_READ_AND IL_OP_LDS_READ_UMAX IL_OP_LDS_READ_UMIN IL_OP_LDS_READ_MAX IL_OP_LDS_READ_MIN IL_OP_LDS_READ_RSUB IL_OP_LDS_READ_SUB IL_OP_LDS_READ_ADD IL_OP_LDS_CMP IL_OP_LDS_XOR IL_OP_LDS_OR IL_OP_LDS_AND IL_OP_LDS_UMAX IL_OP_LDS_UMIN IL_OP_LDS_MAX IL_OP_LDS_MIN IL_OP_LDS_RSUB IL_OP_LDS_SUB IL_OP_LDS_ADD IL_OP_LDS_STORE IL_OP_LDS_LOAD IL_DCL_STRUCT_LDS IL_DCL_LDS IL_OP_SRV_STRUCT_LOAD IL_OP_SRV_RAW_LOAD IL_OP_DCL_STRUCT_SRV IL_OP_DCL_RAW_SRV IL_OP_APPEND_BUF_CONSUME IL_OP_APPEND_BUF_ALLOC IL_OP_UAV_READ_CMP_XCHG IL_OP_UAV_READ_XCHG IL_OP_UAV_READ_XOR IL_OP_UAV_READ_OR IL_OP_UAV_READ_AND IL_OP_UAV_READ_UMAX IL_OP_UAV_READ_UMIN IL_OP_UAV_READ_MAX IL_OP_UAV_READ_MIN IL_OP_UAV_READ_RSUB IL_OP_UAV_READ_SUB IL_OP_UAV_READ_ADD IL_OP_UAV_CMP IL_OP_UAV_XOR IL_OP_UAV_OR IL_OP_UAV_AND IL_OP_UAV_UMAX IL_OP_UAV_UMIN IL_OP_UAV_MAX IL_OP_UAV_MIN IL_OP_UAV_RSUB IL_OP_UAV_SUB IL_OP_UAV_ADD IL_OP_UAV_ARENA_STORE IL_OP_UAV_ARENA_LOAD IL_OP_DCL_ARENA_UAV IL_OP_UAV_STRUCT_STORE IL_OP_UAV_RAW_STORE IL_OP_UAV_STORE IL_OP_UAV_STRUCT_LOAD IL_OP_UAV_RAW_LOAD IL_OP_UAV_LOAD IL_OP_DCL_STRUCT_UAV IL_OP_DCL_RAW_UAV IL_OP_DCL_UAV IL_OP_LDS_STORE_VEC IL_OP_LDS_LOAD_VEC IL_OP_FENCE IL_OP_LDS_WRITE_VEC IL_OP_LDS_READ_VEC IL_OP_DCL_LDS_SHARING_MODE IL_OP_DCL_LDS_SIZE_PER_THREAD IL_OP_DCL_TOTAL_NUM_THREAD_GROUP IL_OP_DCL_NUM_THREAD_PER_GROUP IL_OP_INIT_SR_HELPER IL_OP_INIT_SR IL_OP_DCL_SHARED_TEMP IL_OP_D_DIV IL_OP_SAMPLEPOS IL_OP_DLT IL_OP_DGE IL_OP_DEQ IL_OP_DNE IL_DCL_PERSIST IL_OP_GETLOD IL_OP_SAMPLEINFO IL_OP_FETCH4 IL_OP_D_MULADD IL_OP_D_FRAC IL_OP_D_LDEXP IL_OP_F_2_D IL_OP_D_2_F IL_OP_D_MUL IL_OP_D_ADD IL_OP_D_FREXP IL_OP_SCATTER IL_OP_INV_MOV IL_OP_DP2 IL_OP_SQRT_VEC IL_OP_COS_VEC IL_OP_SIN_VEC IL_OP_RSQ_VEC IL_OP_ROUND_ZERO IL_OP_ROUND_PLUS_INF IL_OP_ROUND_NEG_INF IL_OP_ROUND_NEAR IL_OP_NE IL_OP_LT IL_OP_LOG_VEC IL_OP_GE IL_OP_EXP_VEC IL_OP_EQ IL_OP_CMOV_LOGICAL IL_OP_AND IL_OP_UTOF IL_OP_ITOF IL_OP_FTOU IL_OP_FTOI IL_OP_U_MUL_HIGH IL_OP_U_MUL IL_OP_U_GE IL_OP_U_LT IL_OP_U_MIN IL_OP_U_MAX IL_OP_U_MAD IL_OP_U_MOD IL_OP_U_DIV IL_OP_U_SHR IL_OP_I_SHR IL_OP_I_SHL IL_OP_I_NE IL_OP_I_NEGATE IL_OP_I_LT IL_OP_I_GE IL_OP_I_EQ IL_OP_I_MUL_HIGH IL_OP_I_MUL IL_OP_I_MIN IL_OP_I_MAX IL_OP_I_MAD IL_OP_I_ADD IL_OP_I_XOR IL_OP_I_OR IL_OP_I_NOT IL_OP_SAMPLE_C_LZ IL_OP_SAMPLE_C IL_OP_SAMPLE_L IL_OP_SAMPLE_G IL_OP_SAMPLE_B IL_OP_SAMPLE IL_OP_RESINFO IL_OP_LOAD IL_OP_EMIT_THEN_CUT IL_OP_EMIT IL_OP_DISCARD_LOGICALNZ IL_OP_DISCARD_LOGICALZ IL_OP_CUT IL_DCL_RESOURCE IL_DCL_VPRIM IL_DCL_INPUT IL_DCL_OUTPUT IL_DCL_OUTPUT_TOPOLOGY IL_DCL_ODEPTH IL_DCL_MAX_OUTPUT_VERTEX_COUNT IL_DCL_LITERAL IL_DCL_INPUT_PRIMITIVE IL_DCL_INDEXED_TEMP_ARRAY IL_DCL_CONST_BUFFER IL_OP_RET_LOGICALNZ IL_OP_RET_LOGICALZ IL_OP_RET_DYN IL_OP_SWITCH IL_OP_WHILE IL_OP_IF_LOGICALNZ IL_OP_IF_LOGICALZ IL_OP_ENDINLINEFUNC IL_OP_ENDSWITCH IL_OP_DEFAULT IL_OP_CONTINUE_LOGICALNZ IL_OP_CONTINUE_LOGICALZ IL_OP_CASE IL_OP_CALL_LOGICALNZ IL_OP_CALL_LOGICALZ IL_OP_BREAK_LOGICALNZ IL_OP_BREAK_LOGICALZ IL_OP_DXSINCOS IL_OP_TRC IL_OP_TRANSPOSE IL_OP_TEXWEIGHT IL_OP_TEXLDMS IL_OP_TEXLDD IL_OP_TEXLDB IL_OP_TEXLD IL_OP_TAN IL_OP_SUB IL_OP_SQRT IL_OP_SINCOS IL_OP_SIN IL_OP_SGN IL_OP_SET IL_OP_RSQ IL_OP_RND IL_OP_RET IL_OP_REFLECT IL_OP_RCP IL_OP_PROJECT IL_OP_PRECOMP IL_OP_POW IL_OP_PIREDUCE IL_OP_NRM IL_OP_NOP IL_OP_NOISE IL_OP_MUL IL_OP_MOVA IL_OP_MOV IL_OP_MOD IL_OP_MMUL IL_OP_MIN IL_OP_MEMIMPORT IL_OP_MEMEXPORT IL_OP_MAX IL_OP_MAD IL_OP_LRP IL_OP_LOOP IL_OP_LOGP IL_OP_LOG IL_OP_LOD IL_OP_LN IL_OP_LIT IL_OP_LEN IL_OP_KILL IL_OP_INITV IL_OP_IFNZ IL_OP_IFC IL_OP_FWIDTH IL_OP_FUNC IL_OP_FRC IL_OP_FLR IL_OP_FACEFORWARD IL_OP_EXPP IL_OP_EXP IL_OP_EXN IL_OP_ENDMAIN IL_OP_ENDLOOP IL_OP_ENDIF IL_OP_END IL_OP_ELSE IL_OP_DSY IL_OP_DSX IL_OP_DST IL_OP_DP4 IL_OP_DP3 IL_OP_DP2ADD IL_OP_DIV IL_OP_DIST IL_OP_DET IL_OP_DEFB IL_OP_DEF IL_OP_DCLVOUT IL_OP_DCLV IL_OP_DCLPT IL_OP_DCLPP IL_OP_DCLPIN IL_OP_DCLPI IL_OP_DCLDEF IL_OP_DCLARRAY IL_OP_CRS IL_OP_COS IL_OP_CONTINUEC IL_OP_CONTINUE IL_OP_COMMENT IL_OP_COLORCLAMP IL_OP_CMP IL_OP_CMOV IL_OP_CLG IL_OP_CLAMP IL_OP_CALLNZ IL_OP_CALL IL_OP_BREAKC IL_OP_BREAK IL_OP_ATAN IL_OP_ASIN IL_OP_ADD IL_OP_ACOS IL_OP_ABS IL_OP_UNKNOWN ILScanILBinary: Unknown opcode in IL Binary ILScanILBinary: Unsupported opcode for architecture ILScanILBinary: Unsupported opcode ILScanILBinary: Fatal Error: Non constant buffer constant detected param IsMaxNumWavePerSIMD NumWavefrontPerSIMD TotalNumThreadGroup NumThreadPerGroup Slow Fast CsSetupMode NumSharedGprTotal NumSharedGprUser SCENARIO_G SCENARIO_B SCENARIO_A GS_MODE MemExportSize writeMask outputSlot memOffset index STREAM CULL_DIST_ENA7 CULL_DIST_ENA6 CULL_DIST_ENA5 CULL_DIST_ENA4 CULL_DIST_ENA3 CULL_DIST_ENA2 CULL_DIST_ENA1 CULL_DIST_ENA0 CLIP_DIST_ENA7 CLIP_DIST_ENA6 CLIP_DIST_ENA5 CLIP_DIST_ENA4 CLIP_DIST_ENA3 CLIP_DIST_ENA2 CLIP_DIST_ENA1 CLIP_DIST_ENA0 MergeFlags VS_OUT_CCDIST1_VEC_ENA VS_OUT_CCDIST0_VEC_ENA VS_OUT_MISC_VEC_ENA USE_VTX_KILL_FLAG USE_VTX_VIEWPORT_INDX USE_VTX_RENDER_TARGET_INDX USE_VTX_EDGE_FLAG USE_VTX_POINT_SIZE PA_CL_VS_OUT_CNTL R600VSOUTPUT_USE_BEST_MODE R600VSOUTPUT_VECTOR_SEMANTICS R600VSOUTPUT_COMPONENT_SEMANTICS VsOutSemanticMode VS_EXPORT_COUNT SLOT StreamOutStride StreamOutDecls StreamOutEnable UsesPrimId MaxOutputVertexCount MemExportVertexSize VGT_GS_OUT_PRIM_TYPE SampleFreq MaxReductionBufferSize CB_SHADER_CONTROL:bitmap DB:KILL_ENABLE DB:Z_ORDER DB:ALPHA_TO_MASK_DISABLE DB:MASK_EXPORT_ENABLE DB:STENCIL_REF_EXPORT_ENABLE DB:Z_EXPORT_ENABLE SPI:PROVIDE_Z_TO_SPI SPI0:BARYC_SAMPLE_ENA SPI0:POSITION_SAMPLE SPI0:LINEAR_GRADIENT_ENA SPI0:PERSP_GRADIENT_ENA SPI0:BARYC_SAMPLE_CNTL SPI0:PARAM_GEN_ADDR SPI0:PARAM_GEN SPI0:POSITION_ADDR SPI0:POSITION_CENTROID SPI0:POSITION_ENA NumTexStages SPI0:NUM_INTERP TexCubeMaskBits SPI1:GEN_INDEX_PIX_ADDR SPI1:FOG_ADDR SPI1:FRONT_FACE_CHAN SPI1:FRONT_FACE_ADDR SPI1:FRONT_FACE_ENA SPI1:FIXED_PT_POSITION_ADDR SPI1:FIXED_PT_POSITION_ENA SPI1:GEN_INDEX_PIX SQ_PGM_EXPORTS_PS:PS_EXPORT_MODE GprPoolSize MaxScratchRegsNeeded SQ_PRM_RESOURCES:PRIME_CACHE_ENABLE SQ_PRM_RESOURCES:FETCH_CACHE_LINES SQ_PGM_RESOURCES:STACK_SIZE PGM_END_FETCH PGM_END_ALU PGM_END_CF SQ_PGM_END_FETCH SQ_PGM_END_ALU SQ_PGM_END_CF CodeLen SQ_PGM_RESOURCES:NUM_GPRS NumClauseTemps NumIntrlBConstants NumIntrlIConstants NumIntrlFConstants original IL_Unknown ResourcesAffectAlphaOutput fatal flex scanner internal error--no action found fatal flex scanner internal error--end of buffer missed input buffer overflow, can't enlarge buffer because scanner uses REJECT input in flex scanner failed out of dynamic memory in yy_create_buffer() out of dynamic memory in yy_scan_buffer() out of dynamic memory in yy_scan_bytes() bad buffer in yy_scan_bytes() out of memory expanding start-condition stack start-condition stack underflow RegSel FloatComment Register RelMode VecReg VecDstWriteMask RegisterAbs RegisterNeg SrcReg SrcRegList DestReg ALUOpcode0 ALUOpcode ALUProperty ALUProperties2 ALUProperties OutputMod ScalarOp ScalarOps ALUInstBlock ALUInst ALUClause TexParam TexParams TexParamsOpt TexInst TexOpcode VtxFetchDst VtxFetchPropOpt VtxFetchPropsOpt2 VtxFetchPropsOpt VtxFetchOpcode VtxFetchConstOpt VtxInst VtxClause TexClause VecSwiz1 PastSwizzle ExpectSwizzle VecSwiz cache CFPropertiesOpt CFPropListOpt2 CFPropListOpt CFExportInst CFTexInst CFVtxInst CFALUInst CFLoopInst CFInst vecptr1 cfmem vec_ptr cnd_kind CFClauseInst CFInstruction CFProgram HeaderItem PinPropOpt PinPropListOpt2 PinPropListOpt FooterItem FooterList HeaderList StartCopy CopyShader SHDissassembly VTX_FETCHTYPE VTX_WHOLE_QUAD VTX_NUM_FORMAT VTX_CONST_BUF VTX_SRF_MODE VTX_FORMAT_COMP VTX_ENDIAN_SWAP REG_INDEX VTX_FORMAT VTX_OFFSET FLOAT_SPECIAL FLOAT_LITERAL SR_REG L_BANK_SWIZZLE PS_REG FETCH_CONST SEM_ID PV_REG INT_REG CFILE_REG GPR_REG INTEGER_LITERAL L_COUNT CF_POPCNT CF_INST_BREAK CF_INST_ENDREP CF_INST_REP CF_INST_CALL CF_EXPORT_ESIZE CF_EXPORT_BRSTCNT SCALAR_ASSIGNMENT COORD_TYPE L_SAMPLER_ID L_RESOURCE_ID VTX_OPCODE TEX_OPCODE_NO_SRC TEX_OPCODE CHAN ALU_OPCODE0 ALU_OPCODE CF_EXPORT CF_CMD_IND CF_CMD CF_MEM CF_VTX CF_TEX CF_ALU L_CF_INST CF_JUMP L_CF_CONST CF_POP L_CALL_COUNT L_USES_WATERFALL CND_KIND1 L_ZOFFSET L_YOFFSET L_XOFFSET OMOD_D2 OMOD_M4 OMOD_M2 MINIFETCH MEGAFETCH L_VALID_PIXEL_MODE L_WHOLE_QUAD_MODE L_END_OF_PROGRAM L_FOGMERGE WRITE_MASK_INVERT L_UPDATE_PRED UPDATE_EXEC_MASK ALU_CLAMP L_KCACHE L_CB NO_BARRIER VEC_SWIZ EXPORT_REG CF_ADDR CF_COUNT L_KC L_VEC_PTR L_ARRAY_SIZE L_LINEAR L_CENTROID L_SAMPLE L_FLAT L_DEFAULT_VAL L_Usage V_REG L_RES_AFFECT_ALPHA L_CHAR L_TARGET_CHIP L_SHADER_TYPE L_IN L_ENABLE L_CBOUTPUT L_STREAM_STRIDE L_WRITE L_OUTPUTSLOT L_MEMOFFSET L_INDEX L_STREAM L_EsrcTypeCB L_EsrcLoop L_EsrcType_int_const L_START_COPY_SHADER L_VOUT L_ORIGINAL L_VIN L_CHDR L_DEP L_PHDR L_VHDR L_GHDR $undefined. error Miss expected %x got %x Starting parse Entering state %d Reading a token: Now at end of input. Next token is %d (%s Shifting token %d (%s), Reducing via rule %d (line %d), -> %s Special constant %f not supported! state stack now parse error Discarding token %d (%s). Error: state stack now Shifting error token, parser stack overflow Error: R600Asm(%d): parse error xVgi `Wgi REGTYPE_UNSET STENCIL_OP SAMPLE_RETURN_CODE LINE_STIPPLE TIMER NEW_PRIM_MASK_PIXEL NEW_PRIM_MASK_PIXQUAD LDS_PARAM_BASE Coverage_Mask EZGE EZLE LOAD_STORE_OFFSET THIS GS_INSTANCE_ID PHASE_IID OCP_ID DOMAIN BARY_COORD LDS_PARAM LDS_Q AC_MASK SIMD_ID RBUF CF_INDEX TF_BUF T_BIDF T_BID A_TIDF A_TID I_TIDF I_TID EPSFOG OMSK kc_al indexed_cb Call_RSC PRED M_RSC P_RSC IC_RSC ADDR LOOP PIVO PRIMT PRIMC GRAD SPRITE FACE INFOG INC1 INC0 INTEX QUAD PRIM BARY_HOS p=!i cb_flt-cb_flt cb_int-cb_int cb_int-lit cb_flt cb_int loop_bound65k loop_bound255 lod_bias kernel_size boolean_set constbuf_handle memexp0 memexp1 memexp2 memexp3 vertex fog factor adj shadow_fail viewport_z_far_plus_near viewport_bias_y viewport_bias_x viewport_z_far_minus_near viewport_scale_height_half viewport_scale_width_half tex_height_inv tex_width_inv src_bool src_float src_int bool 4.3f 0.5, 0.5, 0.5, 0.5 1.0, 1.0, 1.0, 1.0 0.0, 0.0, 0.0, 0.0 BARRIER 5%i SIMPLE ENTRY EXIT LOOP_FOOTER POST_LOOP_FOOTER IF_HEADER IF_HEADER_S IF_FOOTER IF_FOOTER_S JUMP_TABLE JUMP_TABLE_FOOTER BREAK CONTINUE INLINE_FUNC_END CALL_BLOCK REP_HEADER LOOP_HEADER 5%i 5%i 5%i 5%i 5%i 5%i 5%i 5%i

0 Likes
Alice_Sunny
Journeyman III

Wow. so interesting~
0 Likes

Thanks for the feedback. I will forward your request to the proper people.
0 Likes

Any word on if/when BFI_INT will be included in IL?
Also any word on exposing BFI_INT in OpenCL via bitselect() function?

It seems these issues in AMD OpenCL performance remain open and forgotten for months.

The sad thing is the hack around of using wrong instruction and then patching dynamically in memory is becoming the defacto standard for using AMD incomplete libraries.

http://blog.zorinaq.com/?e=43

http://comments.gmane.org/gmane.comp.security.openwall.john.user/3357

 

0 Likes

As far as I know, this instruction support should be in 11.9, maybe even in 11.8, but the IL doc won't be updated until SDK 2.6 timeframe.

Also, the optimization should be enabled in SDK 2.6.
0 Likes

Yes, it is present in 11.9, thanks 🙂

BTW as a side question, there is a BFE_UINT optimization now which is basically good but for some reason it generates an additional MOV instruction when it indexes an element in local memory, e.g:

a = tableinlocalmemory[(X>> 2)&0x3f];

would generate BFE_UINT and MOV. It's slower than what we had before that optimization was implemented, where we had just shr+and instead of bfe_uint + mov. 

I could write a simplified test-case, but I guess it's not hard to reproduce.

0 Likes

Is it available as separate IL instruction or just as an optimization ? Could you post simple code example which generates BFI_INT ?

0 Likes

So then what is the name of the IL instruction?  Are the parameters as one would expect?

I tried adding a bitselect to my junk kernel, on just 3 random array variables, then wrote the result back, so the optimizer wouldn't cut it out, and I have no BFI_INT.  I installed 11.9 yesterday.

 

0 Likes

delete...see my previous message....Forums double posted...

Suggest a new feature - Delete this post button for when the forums hiccup.

0 Likes

Originally posted by: MicahVillmow As far as I know, this instruction support should be in 11.9, maybe even in 11.8, but the IL doc won't be updated until SDK 2.6 timeframe. Also, the optimization should be enabled in SDK 2.6.


Well that is good to hear.  I will do some research.  

 

Just to clarify when you say "the optimization should be enabled in SDK 2.6" you mean BFI_INT will be accessible via OpenCL code?  Via OpenCL bitselect() function?  Via an extension in cl_amd_media_ops?

If so then that is even better news.  BFI_INT is an amazingly powerful function with many cryptographic uses.  For AMD perspective it is also one area that is a competitive advantage over NVidia as they have no equivelent function which results in simlar code taking 3 OPS vs 1 OP.

0 Likes

The instruction is called BFI in IL. I've been working on optimizing for certain patterns.
One is (A & C) | (B & ~C)
Another is (A & C) | (B & (C ^ -1))

Those are the ones that I am confident will make it in 2.6. I got pulled onto some more pressing matters so I can't add any more patterns at this time.
Another pattern that I was working on is (C ^ (A & (B ^ C)).

I've notified the library person to see if he has time to take advantage of BFI for 2.6.
[Update] 2.6 will use BFI for bitselect.
0 Likes

So what am I doing wrong then that would be having the compiler return parse error near b when I put bfi in my code? 

When I go into Kernel Analyzer options, it has an option for CAL version, "Use Latest Available (CAL 11.7) - v.157.2913 is selected.  Is kernel analyzer just not picking up the latest for some reason?  I tried it in my code anyways as all of the following just hoping to guess...

bfi r0, r1, r2
bfi r0, r1, r2, r3
ibit_bfi r0, r1, r2
ibit_bfi r0, r1, r2, r3
ubit_bfi r0, r1, r2
ubit_bfi r0, r1, r2, r3

Nothing worked 😞  I tried an uninstall of all AMD software from the system, and a reinstall, same version shows up in KernelAnalyzer.  Do I need to manually remove?  What am I doing wrong? 

Again, as usual, thanks for helping us low level people out as well!

0 Likes

Originally posted by: corry So what am I doing wrong then that would be having the compiler return parse error near b when I put bfi in my code? 

 

When I go into Kernel Analyzer options, it has an option for CAL version, "Use Latest Available (CAL 11.7) - v.157.2913 is selected.  Is kernel analyzer just not picking up the latest for some reason?  I tried it in my code anyways as all of the following just hoping to guess...



The issue seems to be CAL 11.7.  The latest CAL is 11.9.  Not sure why KA thinks latest version is 11.7.  Technically KA is correct.  There is not BFI_INT in 11.7 and thus can't compile.

My KA also only shows 11.7 = "latest".  I tried downloading and reinstalling SDK 2.5 with same outcome.  Is CAL support in Kernel Analyzer hard coded to CAL @ time of SDK release? Since the latest CAL when SKD 2.5 was released was 11.7 it will only see 11.7?

0 Likes

This instruction is not in CAL 11.7. You need to make sure that you are using CAL 11.9 from Catalyst 11.9.
0 Likes

So I take it then I have to wait for 2.6 for KernelAnalyzer to be able to make use of it then?  Seems it uses dlls in C:\Program Files (x86)\Common Files\AMD\GPU ShaderAnalyzer called GPUShaderAnalyzer_CAL_11_7.dll, GPUShaderAnalyzer_CAL_11_6.dll, etc 

I fired up depends, and it never showed KernelAnalyzer loading aticalrt.dll, so I take it those other dlls are it.  I checked, they aren't simply renamed aitcalrt.dll files, so yeah, given theres no 11_8 or 11_9 on my system, I take it I'm SOL there.

My program, however, loads aticalrt and aticalcl, which in the SysWOW64 and system32 directories are listed as version 6.14.10.1546, but details also show it was compiled 9/8/2011 at 1:09pm

Thats all I have...whats wrong?!should I uninstall, delete everything ati/amd that I can find, kill all registry entries with amd or ati and reinstall?  Or is there just some other seperate download I am missing?

0 Likes

corry,
Please post a query here: http://forums.amd.com/forum/ca...m?catid=347&zb=4687012 and maybe the Dev tools team can help.
0 Likes

I can see posting there for kernelanalyzer, but I seem to be using the latest aticalcl compiler dll, and still cannot compile with bfi included in the source.  Seems to be 2 seperate issues.  The ISA Docs seem to say there should be 3 source operands, and 1 dest.  So seems bfi r0, r1, r2, r3 should work, yet, all I get is the standard useless annoying "Failed to compile program with IL front-end compiler" ...I'm uninstalling, and manually deleting in the hopes of fixing this...Interesting to note, after uninstall, C:\windows\system32\aticaltrt64.dll still exists...hmmm...

0 Likes

I completly uninstalled, deleted those files, installed catalyst 11.9, and verified the associated .dlls version numbers.  They were identical to what I had before.

Has anyone seen bfi in their IL code, and had the cal compiler accept it, or is this another case of it works on our internal versions? 

On a seperate (but sorta related) note, is there any way to get the error messages like what SKA gives from CAL, or do they get it from ILAssembler.dll?  I'd sure like to know what the complaint is, something better than fatal error:  failed to compile....That always leaves me thinking, "Gee, really?  I think I figured out that part already!" 

0 Likes

Micah/gat3way, I am going to go ahead and call shenanigans on this.  I just treid from OpenCL.  I'll post the kernels and you tell me why I see no BFI (unless like I said this is some sort of shenanigans....)

 

//OpenCL Kernel Below.... //tried with and without this... //#pragma OPENCL EXTENSION cl_amd_media_ops : enable __kernel void Junk(__global unsigned int * output, __global unsigned int * input, const unsigned int multiplier) { uint tid = get_global_id(0); __global uint* mySpot=tid*8; uint t1, t2, t3, t4, t5; t1=input[3]; t2=input[8]; t2=input[14]; t5=(t1 & t2) | (t3 & ~t2) ; t4=bitselect(t1, t2, t3); mySpot[tid] = t4; mySpot[tid+1]=t5; } //Resulting IL below.... mdef(16383)_out(1)_in(2) mov r0, in0 mov r1, in1 div_zeroop(infinity) r0.x___, r0.x, r1.x mov out0, r0 mend il_cs_2_0 dcl_cb cb0[10] ; Constant buffer that holds ABI data dcl_literal l0, 4, 1, 2, 3 dcl_literal l1, 0x00FFFFFF, -1, -2, -3 dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC dcl_literal l3, 24, 16, 8, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0, 4, 8, 12 dcl_literal l6, 32, 32, 32, 32 dcl_literal l7, 24, 31, 16, 31 call 1024;$ endmain func 1024 ; __OpenCL_Junk_kernel mov r1013, cb0[8].x mov r1019, l1.0 dcl_max_thread_per_group 256 dcl_raw_uav_id(11) dcl_arena_uav_id(8) mov r0.z, vThreadGrpIdFlat.x mov r1022.xyz0, vTidInGrp.xyz mov r1023.xyz0, vThreadGrpId.xyz imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0 iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.w, r0.z ishl r1023.w, r1023.w, l0.z mov r1018.x, l0.0 udiv r1024.xyz, r1021.xyz, cb0[10].xyz imad r1025.xyz, r1023.xyz, cb0[1].xyz, r1022.xyz dcl_literal l9, 0x00000002, 0x00000002, 0x00000002, 0x00000002; f32:i32 2 dcl_literal l10, 0x00000003, 0x00000003, 0x00000003, 0x00000003; f32:i32 3 dcl_literal l13, 0x00000004, 0x00000004, 0x00000004, 0x00000004; f32:i32 4 dcl_literal l12, 0x0000000c, 0x0000000c, 0x0000000c, 0x0000000c; f32:i32 12 dcl_literal l11, 0x00000038, 0x00000038, 0x00000038, 0x00000038; f32:i32 56 dcl_cb cb1[3] ; Kernel arg setup: output mov r1, cb1[0] ; Kernel arg setup: input mov r2, cb1[1] ; Kernel arg setup: multiplier mov r3, cb1[2] call 1027 ; Junk ret endfunc ; __OpenCL_Junk_kernel ;ARGSTART:__OpenCL_Junk_kernel ;version:2:0:74 ;device:cayman ;uniqueid:1024 ;memory:hwprivate:0 ;memory:hwregion:0 ;memory:hwlocal:0 ;pointer:output:i32:1:1:0:uav:8:8 ;pointer:input:i32:1:1:16:uav:11:8 ;value:multiplier:i32:1:1:32 ;function:1:1027 ;uavid:11 ;ARGEND:__OpenCL_Junk_kernel func 1027 ; Junk ; @__OpenCL_Junk_kernel ; BB#0: ; %entry mov r254, r1021.xyz0 mov r254, r254.x000 mov r255, l9.xxxx ishl r255.x___, r254.xxxx, r255.xxxx mov r256, l10.xxxx ishl r254.x___, r254.xxxx, r256.xxxx iadd r254.x___, r254.xxxx, r255.xxxx mov r255, l11.xxxx iadd r255.x___, r2.xxxx, r255.xxxx mov r1010.x___, r255.xxxx uav_raw_load_id(11)_cached r1011.x___, r1010.xxxx mov r255.x___, r1011.xxxx mov r256, l12.xxxx iadd r253.x___, r2.xxxx, r256.xxxx mov r1010.x___, r253.xxxx uav_raw_load_id(11)_cached r1011.x___, r1010.xxxx mov r253.x___, r1011.xxxx mov r1011.x___, r253.xxxx mov r1010.x___, r254.xxxx uav_arena_store_id(8)_size(dword) r1010.x, r1011.x iand r253.x___, r255.xxxx, r253.xxxx mov r255, l13.xxxx iadd r254.x___, r254.xxxx, r255.xxxx mov r1011.x___, r253.xxxx mov r1010.x___, r254.xxxx uav_arena_store_id(8)_size(dword) r1010.x, r1011.x ret endfunc ; Junk ;ARGSTART:Junk ;uniqueid:1027 ;memory:hwregion:0 ;memory:hwlocal:0 ;ARGEND:Junk end //Resulting isa below.... ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020041 SCOption_R800_UAV_NONARRAY_FIXUP SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(10) KCACHE0(CB1:0-15) KCACHE1(CB0:0-15) 0 y: ADD_INT ____, KC0[1].x, 56 z: ADD_INT ____, KC0[1].x, 12 1 y: LSHR R0.y, PV0.z, 2 w: LSHR R0.w, PV0.y, 2 2 x: MULLO_INT R1.x, R1.x, KC1[1].x y: MULLO_INT ____, R1.x, KC1[1].x z: MULLO_INT ____, R1.x, KC1[1].x w: MULLO_INT ____, R1.x, KC1[1].x 01 TEX: ADDR(64) CNT(2) 3 VFETCH R2.x___, R0.w, fc153 FETCH_TYPE(NO_INDEX_OFFSET) 4 VFETCH R3.x___, R0.y, fc153 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(42) CNT(11) KCACHE0(CB0:0-15) 5 w: ADD_INT ____, R0.x, R1.x 6 x: AND_INT R0.x, R2.x, R3.x z: ADD_INT ____, PV5.w, KC0[6].x 7 y: LSHL ____, PV6.z, 2 w: LSHL ____, PV6.z, 3 8 z: ADD_INT ____, PV7.y, PV7.w 9 x: LSHR R2.x, PV8.z, 2 10 x: MOV R3.x, R3.x y: MOV R3.y, R0.x VEC_120 03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R2].xy__, R3, ARRAY_SIZE(4) MARK VPM 04 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 544;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; texResourceUsage[4] = 0x00000000 ; texResourceUsage[5] = 0x00000000 ; texResourceUsage[6] = 0x00000000 ; texResourceUsage[7] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[4] = 0x00000000 ; fetch4ResourceUsage[5] = 0x00000000 ; fetch4ResourceUsage[6] = 0x00000000 ; fetch4ResourceUsage[7] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ResourcesAffectAlphaOutput[4] = 0x00000000 ResourcesAffectAlphaOutput[5] = 0x00000000 ResourcesAffectAlphaOutput[6] = 0x00000000 ResourcesAffectAlphaOutput[7] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000104 SQ_PGM_RESOURCES:NUM_GPRS = 4 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x900 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

0 Likes

corry, looks like in your example, the code that you want to be optimized into a BFI is being optimized away before the BFI pattern can be generated because there is a typo and t3 is never initialized. Once you initialize t3 correctly, BFI gets generated.
0 Likes

Fixed, no bfi

You *SURE* this isn't an internal build only?  Or is it architecture specific?  Enabled for evergreens, but not caymans for some reason? 

Also fixed the pointer....still no BFI...

//OpenCL Below.... //tried with and without this... #pragma OPENCL EXTENSION cl_amd_media_ops : enable __kernel void Junk(__global unsigned int * output, __global unsigned int * input, const unsigned int multiplier) { uint tid = get_global_id(0); __global uint* mySpot=output+tid*8; uint t1, t2, t3, t4, t5; t1=input[3]; t2=input[8]; t3=input[14]; t5=(t1 & t2) | (t3 & ~t2) ; t4=bitselect(t1, t2, t3); mySpot[tid] = t4; mySpot[tid+1]=t5; } //IL Below.... mdef(16383)_out(1)_in(2) mov r0, in0 mov r1, in1 div_zeroop(infinity) r0.x___, r0.x, r1.x mov out0, r0 mend il_cs_2_0 dcl_cb cb0[10] ; Constant buffer that holds ABI data dcl_literal l0, 4, 1, 2, 3 dcl_literal l1, 0x00FFFFFF, -1, -2, -3 dcl_literal l2, 0x0000FFFF, 0xFFFFFFFE,0x000000FF,0xFFFFFFFC dcl_literal l3, 24, 16, 8, 0xFFFFFFFF dcl_literal l4, 0xFFFFFF00, 0xFFFF0000, 0xFF00FFFF, 0xFFFF00FF dcl_literal l5, 0, 4, 8, 12 dcl_literal l6, 32, 32, 32, 32 dcl_literal l7, 24, 31, 16, 31 call 1024;$ endmain func 1024 ; __OpenCL_Junk_kernel mov r1013, cb0[8].x mov r1019, l1.0 dcl_max_thread_per_group 256 dcl_raw_uav_id(11) dcl_arena_uav_id(8) mov r0.z, vThreadGrpIdFlat.x mov r1022.xyz0, vTidInGrp.xyz mov r1023.xyz0, vThreadGrpId.xyz imad r1021.xyz0, r1023.xyz0, cb0[1].xyz0, r1022.xyz0 iadd r1021.xyz0, r1021.xyz0, cb0[6].xyz0 iadd r1023.xyz0, r1023.xyz0, cb0[7].xyz0 mov r1023.w, r0.z ishl r1023.w, r1023.w, l0.z mov r1018.x, l0.0 udiv r1024.xyz, r1021.xyz, cb0[10].xyz imad r1025.xyz, r1023.xyz, cb0[1].xyz, r1022.xyz dcl_literal l15, 0x00000001, 0x00000001, 0x00000001, 0x00000001; f32:i32 1 dcl_literal l13, 0x00000002, 0x00000002, 0x00000002, 0x00000002; f32:i32 2 dcl_literal l12, 0x00000003, 0x00000003, 0x00000003, 0x00000003; f32:i32 3 dcl_literal l9, 0x0000000c, 0x0000000c, 0x0000000c, 0x0000000c; f32:i32 12 dcl_literal l10, 0x00000020, 0x00000020, 0x00000020, 0x00000020; f32:i32 32 dcl_literal l11, 0x00000038, 0x00000038, 0x00000038, 0x00000038; f32:i32 56 dcl_literal l14, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff; f32:i32 4294967295 dcl_cb cb1[3] ; Kernel arg setup: output mov r1, cb1[0] ; Kernel arg setup: input mov r2, cb1[1] ; Kernel arg setup: multiplier mov r3, cb1[2] call 1027 ; Junk ret endfunc ; __OpenCL_Junk_kernel ;ARGSTART:__OpenCL_Junk_kernel ;version:2:0:74 ;device:cayman ;uniqueid:1024 ;memory:hwprivate:0 ;memory:hwregion:0 ;memory:hwlocal:0 ;pointer:output:i32:1:1:0:uav:11:8 ;pointer:input:i32:1:1:16:uav:11:8 ;value:multiplier:i32:1:1:32 ;function:1:1027 ;uavid:11 ;ARGEND:__OpenCL_Junk_kernel func 1027 ; Junk ; @__OpenCL_Junk_kernel ; BB#0: ; %entry mov r255, l9.xxxx iadd r255.x___, r2.xxxx, r255.xxxx mov r256, r1021.xyz0 mov r1010.x___, r255.xxxx uav_raw_load_id(11)_cached r1011.x___, r1010.xxxx mov r255.x___, r1011.xxxx mov r257, l10.xxxx iadd r257.x___, r2.xxxx, r257.xxxx mov r1010.x___, r257.xxxx uav_raw_load_id(11)_cached r1011.x___, r1010.xxxx mov r257.x___, r1011.xxxx ixor r258.x___, r257.xxxx, r255.xxxx mov r259, l11.xxxx iadd r253.x___, r2.xxxx, r259.xxxx mov r1010.x___, r253.xxxx uav_raw_load_id(11)_cached r1011.x___, r1010.xxxx mov r253.x___, r1011.xxxx iand r258.x___, r258.xxxx, r253.xxxx ixor r258.x___, r258.xxxx, r255.xxxx mov r256, r256.x000 mov r259, l12.xxxx ishl r259.x___, r256.xxxx, r259.xxxx iadd r256.x___, r259.xxxx, r256.xxxx mov r259, l13.xxxx ishl r260.x___, r256.xxxx, r259.xxxx iadd r260.x___, r1.xxxx, r260.xxxx mov r1011.x___, r258.xxxx mov r1010.x___, r260.xxxx uav_raw_store_id(11) mem.x___, r1010.xxxx, r1011.xxxx iand r255.x___, r257.xxxx, r255.xxxx mov r258, l14.xxxx ixor r257.x___, r257.xxxx, r258.xxxx iand r253.x___, r253.xxxx, r257.xxxx ior r253.x___, r253.xxxx, r255.xxxx mov r255, l15.xxxx iadd r255.x___, r256.xxxx, r255.xxxx ishl r255.x___, r255.xxxx, r259.xxxx iadd r254.x___, r1.xxxx, r255.xxxx mov r1011.x___, r253.xxxx mov r1010.x___, r254.xxxx uav_raw_store_id(11) mem.x___, r1010.xxxx, r1011.xxxx ret endfunc ; Junk ;ARGSTART:Junk ;uniqueid:1027 ;memory:hwregion:0 ;memory:hwlocal:0 ;ARGEND:Junk end //ISA Below.... ShaderType = IL_SHADER_COMPUTE TargetChip = c ; ------------- SC_SRCSHADER Dump ------------------ SC_SHADERSTATE: u32NumIntVSConst = 0 SC_SHADERSTATE: u32NumIntPSConst = 0 SC_SHADERSTATE: u32NumIntGSConst = 0 SC_SHADERSTATE: u32NumBoolVSConst = 0 SC_SHADERSTATE: u32NumBoolPSConst = 0 SC_SHADERSTATE: u32NumBoolGSConst = 0 SC_SHADERSTATE: u32NumFloatVSConst = 0 SC_SHADERSTATE: u32NumFloatPSConst = 0 SC_SHADERSTATE: u32NumFloatGSConst = 0 fConstantsAvailable = 0 iConstantsAvailable = 0 bConstantsAvailable = 0 u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER u32SCOptions[2] = 0x00020041 SCOption_R800_UAV_NONARRAY_FIXUP SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1 SCOption_R900_BRANCH_IN_NESTED_LOOPS_WORKAROUND_BUG281276 ; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(13) KCACHE0(CB1:0-15) KCACHE1(CB0:0-15) 0 x: ADD_INT ____, KC0[1].x, 56 y: ADD_INT ____, KC0[1].x, 12 z: ADD_INT ____, KC0[1].x, 32 1 x: LSHR R2.x, PV0.x, 2 y: LSHR R0.y, PV0.z, 2 w: LSHR R0.w, PV0.y, 2 2 x: MULLO_INT R1.x, R1.x, KC1[1].x y: MULLO_INT ____, R1.x, KC1[1].x z: MULLO_INT ____, R1.x, KC1[1].x w: MULLO_INT ____, R1.x, KC1[1].x 01 TEX: ADDR(80) CNT(3) 3 VFETCH R4.x___, R0.w, fc153 FETCH_TYPE(NO_INDEX_OFFSET) 4 VFETCH R3.x___, R0.y, fc153 FETCH_TYPE(NO_INDEX_OFFSET) 5 VFETCH R2.x___, R2.x, fc153 FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(45) CNT(23) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 6 y: XOR_INT ____, -1, R3.x w: ADD_INT ____, R0.x, R1.x VEC_021 7 x: AND_INT R3.x, R4.x, R3.x y: AND_INT R0.y, R2.x, PV6.y VEC_201 z: ADD_INT R0.z, PV6.w, KC0[6].x w: XOR_INT ____, R4.x, R3.x 8 x: AND_INT ____, PV7.w, R2.x w: LSHL ____, PV7.z, 3 9 x: XOR_INT R4.x, R4.x, PV8.x z: ADD_INT ____, R0.z, PV8.w 10 x: OR_INT R2.x, R3.x, R0.y y: LSHL ____, PV9.z, 2 w: ADD_INT ____, PV9.z, 1 11 z: LSHL ____, PV10.w, 2 w: ADD_INT ____, KC1[0].x, PV10.y 12 x: LSHR R3.x, PV11.w, 2 y: ADD_INT ____, KC1[0].x, PV11.z 13 x: LSHR R0.x, PV12.y, 2 03 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R3].x___, R4, ARRAY_SIZE(4) MARK VPM 04 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0].x___, R2, ARRAY_SIZE(4) MARK VPM 05 END END_OF_PROGRAM ; ----------------- CS Data ------------------------ ; Input Semantic Mappings ; No input mappings GprPoolSize = 0 CodeLen = 688;Bytes PGM_END_CF = 0; words(64 bit) PGM_END_ALU = 0; words(64 bit) PGM_END_FETCH = 0; words(64 bit) MaxScratchRegsNeeded = 0 ;AluPacking = 0.0 ;AluClauses = 0 ;PowerThrottleRate = 0.0 ; texResourceUsage[0] = 0x00000000 ; texResourceUsage[1] = 0x00000000 ; texResourceUsage[2] = 0x00000000 ; texResourceUsage[3] = 0x00000000 ; texResourceUsage[4] = 0x00000000 ; texResourceUsage[5] = 0x00000000 ; texResourceUsage[6] = 0x00000000 ; texResourceUsage[7] = 0x00000000 ; fetch4ResourceUsage[0] = 0x00000000 ; fetch4ResourceUsage[1] = 0x00000000 ; fetch4ResourceUsage[2] = 0x00000000 ; fetch4ResourceUsage[3] = 0x00000000 ; fetch4ResourceUsage[4] = 0x00000000 ; fetch4ResourceUsage[5] = 0x00000000 ; fetch4ResourceUsage[6] = 0x00000000 ; fetch4ResourceUsage[7] = 0x00000000 ; texSamplerUsage = 0x00000000 ; constBufUsage = 0x00000000 ResourcesAffectAlphaOutput[0] = 0x00000000 ResourcesAffectAlphaOutput[1] = 0x00000000 ResourcesAffectAlphaOutput[2] = 0x00000000 ResourcesAffectAlphaOutput[3] = 0x00000000 ResourcesAffectAlphaOutput[4] = 0x00000000 ResourcesAffectAlphaOutput[5] = 0x00000000 ResourcesAffectAlphaOutput[6] = 0x00000000 ResourcesAffectAlphaOutput[7] = 0x00000000 ;SQ_PGM_RESOURCES = 0x30000105 SQ_PGM_RESOURCES:NUM_GPRS = 5 SQ_PGM_RESOURCES:STACK_SIZE = 1 SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE = 1 ;SQ_PGM_RESOURCES_2 = 0x000000C0 SQ_LDS_ALLOC:SIZE = 0x00000000 ; RatOpIsUsed = 0x800 ; NumThreadPerGroupFlattened = 256 ; SetBufferForNumGroup = true

0 Likes

The optimization to generate BFI will be enabled in 2.6. The IL instruction is there but only available at the CAL level.
0 Likes

Can you give me an example usage of the instruction?  I tried it at the opencl level because I could not get it working at the CAL level, and gat3way seemed to say he had success seeing the instruction, but from the post, seemed like the luck was with OpenCL.  I just want to see something that should work and test it on my end.  If you can't do that, could you install a machine fresh and verify with the catalyst release available on the website that it works?  If I have to I'll blow away my dev box completly, but I don't want to do so needlessly.  I came pretty close to that uninstalling, and deleting manually.  Depends still shows the sysWOW64 and system32 aitcalcl dlls being loaded, and the only place they came from was the 11.9 driver, so I don't know how it could be a configuration issue on my end, but I'm more than open to suggestions!

Anyone else want to pipe up with some code that uses bfi and compiles with the 11.9 cal compiler?

I tried as you said just bfi, then I figured WTH, and tried adding ibit_ and ubit_ following the pattern of some of the other IL instructions for bit operations to no avail.  should it literally be "bfi dst, src0, src1, src2"?

and yeah, if you are looking at the time of this post, you are reading the timestamp on this correctly, its 2:30am here, and yes, I seem to be fighting insomnia again.  No posting on the AMD developer forums isn't part of that battle.  Its more of a "tactical regrouping", (read: temporary retreat) so I can continue the fight later 🙂

0 Likes

Using "bfi r102,r2,r3,r4"  in IL kernel ( compiled using CAL, linux driver 11.9 ) gives compilation error.

0 Likes

Originally posted by: hazeman Using "bfi r102,r2,r3,r4"  in IL kernel ( compiled using CAL, linux driver 11.9 ) gives compilation error.

Thanks hazeman, I was beginning to think I had gone looney!  At least I now know that someone else out there has the same problem.  So either we're both insane, or it is an internal thing only still 🙂  If we get a third, random person from the forums, I'll call that consensus that we aren't insane 🙂

0 Likes

There is no "bfi" string in driver libraries so I think that we have been trolled ...

0 Likes

Originally posted by: hazeman There is no "bfi" string in driver libraries so I think that we have been trolled ...

That's certainly how I feel!

0 Likes

Saw someone else saying they had tried 11.10, and 11.11 for something else....couldn't find 11.11, but I checked 11.10, no bfi instruction.  I decided to take your approach hazeman, and sure enough, no bfi...found all the other instructions...might try to play with dbg_string and dbg_line a little....but no bfi  If there is an 11.11 preview, I'd check there, but I have my doubts...

 

0 Likes