Archives Discussions

quazee · ‎05-13-2012

Hello,

I am having trouble enabling the new cl_amd_media_ops2 extension with the latest SDK and drivers.

I tried the Windows OpenCL 1.2 beta drivers, 12.4 official drivers, and "leaked" 12.5 - none of them worked.

The hardware is Sapphire Radeon HD 7970, single card.

#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable

causes a compiler error "line 1: error: can't enable all OpenCL extensions or unrecognized OpenCL extension"

The reason I need cl_amd_media_ops2 to be enabled is the amd_sadw intrinsic (packed sum of 16-bit absolute differences) which appeared in the new SDK.

(See http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops2.txt and

http://developer.amd.com/sdks/amdappsdk/assets/AMD_APP_SDK_Release_Notes_Developer.pdf )

Here is the output of CL_DEVICE_EXTENSIONS variable:

cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing

Note that the cl_amd_media_ops2 is missing.

CL_DEVICE_OPENCL_C_VERSION = OpenCL C 1.2

CL_DEVICE_NAME = Tahiti

CL_DRIVER_VERSION = CAL 1.4.1720 (VM) <---- this is the "openCL 1.2 beta"

The amdocl.dll definitely contains "cl_amd_media_ops2" string, but does not report it in CL_DEVICE_EXTENSIONS for some reason.

There are definitely some drivers which do support the feature - but I am unable to find them anywhere.

See http://clbenchmark.com/sys-info.jsp?id=133582320595781685 - on that report cl_amd_media_ops2 is clearly present, and the driver version is different.

Am I using the wrong drivers, or do I need to enable some registry setting/environment variable before I can use the feature?

Thank you.

drallan · ‎05-14-2012

Great stuff.

Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.

The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.

The actual opcode is 0xD2B80000 & 0xFFFE0000. The lower byte with 0x01 is the destination register and can change.

The second byte has bits used mostly for float arguments so looking for 0xD2B80000&0xFFFFFF00 should be ok for integer instructions.

allan

View solution in original post

MicahVillmow · ‎05-14-2012

pass in -Dcl_amd_media_ops2 during compile time to enable it.

quazee · ‎05-14-2012

Thank you Micah,

It works now, but doesn't produce expected ISA code (that is, v_sad_u16 instruction) - it looks like it is emulated in the compiler.

Here is the kernel:

#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable

__kernel void test(__global uint* pIn, __global uint* pOut)

{

int z = get_global_id(0);

pOut = amd_sadw(pIn[z*2], pIn[z*2+1], 0);

}

And it produces the following software-emulated sequence, thus, even slower than non-SADW version (boilerplate code removed):

tbuffer_load_format_xy v[1:2], v1, s[8:11], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000044: EBD91000 80020101

s_waitcnt vmcnt(0) // 0000004C: BF8C1F70

v_lshrrev_b32 v3, 16, v2 // 00000050: 2C060490

v_and_b32 v2, 0x0000ffff, v2 // 00000054: 360404FF 0000FFFF

v_lshrrev_b32 v4, 16, v1 // 0000005C: 2C080290

v_and_b32 v1, 0x0000ffff, v1 // 00000060: 360202FF 0000FFFF

v_max_u32 v5, v2, v1 // 00000068: 280A0302

v_max_u32 v6, v3, v4 // 0000006C: 280C0903

v_min_u32 v1, v2, v1 // 00000070: 26020302

v_min_u32 v2, v3, v4 // 00000074: 26040903

v_sub_i32 v1, vcc, v5, v1 // 00000078: 4C020305

v_sub_i32 v2, vcc, v6, v2 // 0000007C: 4C040506

v_add_i32 v1, vcc, v1, v2 // 00000080: 4A020501

tbuffer_store_format_x v1, v0, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000084: EBA41000 80040100

It looks like newer drivers are needed - since the drivers don't report proper support for this extension

(but these ones do: http://clbenchmark.com/sys-info.jsp?id=133582320595781685 - I want to know where to get them )

MicahVillmow · ‎05-14-2012

quazee,

It looks like we are using the software emulated library instead of the hardware library. We are looking into what went wrong and will fix the problem.

quazee · ‎05-14-2012

Thank you, Micah.

drallan · ‎05-14-2012

I just checked and most of the media_ops2 functions are there, the few that are missing include

amd_sadw()

amd_sadd()

and amd_msad(), is encoded as v_mad_u32 (???)

sad, qsad, mqsad, and things like min3 use native instructions.

MicahVillmow · ‎05-14-2012

Yeah,

The library implementation went in before the hardware was available, so we were using a software version and did not update these functions correctly once the hardware became available. We are fixing it now and it should show up in a future catalyst release.

quazee · ‎05-14-2012

Looks like I found a temporary workaround for now - something like this at the beginning of the kernel:

#ifdef SADW_KLUDGE

#define amd_sadw amd_sad

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

#endif

Then I compile the kernel with "-fbin-exe -fno-bin-amdil -fno-bin-llvmir -fno-bin-source -DSADW_KLUDGE" options (extra options just to minimize the chance of corrupting the binary image when doing brute-force replace).

Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.

The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.

I tested the amd_sadw functionality for correctness on a bunch of random input - looks like it's doing exactly what I want.

Maybe someday I will do proper ELF parsing / instruction boundary detection to make it bullet proof - currently, the workaround will break if there is a similar constant in the code, etc.

Looking forward to the fixed Catalyst version - this stuff gives a whole new meaning to "kernel hacking" .

drallan · ‎05-14-2012

Great stuff.

Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.

The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.

The actual opcode is 0xD2B80000 & 0xFFFE0000. The lower byte with 0x01 is the destination register and can change.

The second byte has bits used mostly for float arguments so looking for 0xD2B80000&0xFFFFFF00 should be ok for integer instructions.

allan

quazee · ‎05-15-2012

Thank you for clarification.

It worked in a simple test case with the wrong mask because the dest register just happened to be v1 by luck.

Of course, if I didn't find the placeholder instruction later after actually putting this into the algorithm, it would have raised some red flags

Archives Discussions

Enabling cl_amd_media_ops2 extension on Tahiti