cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

quazee
Journeyman III

Enabling cl_amd_media_ops2 extension on Tahiti

Hello,

I am having trouble enabling the new cl_amd_media_ops2 extension with the latest SDK and drivers.

I tried the Windows OpenCL 1.2 beta drivers, 12.4 official drivers, and "leaked" 12.5 - none of them worked.

The hardware is Sapphire Radeon HD 7970, single card.

#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable

causes a compiler error "line 1: error: can't enable all OpenCL extensions or unrecognized OpenCL extension"

The reason I need cl_amd_media_ops2 to be enabled is the amd_sadw intrinsic (packed sum of 16-bit absolute differences) which appeared in the new SDK.

(See http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops2.txt and

http://developer.amd.com/sdks/amdappsdk/assets/AMD_APP_SDK_Release_Notes_Developer.pdf )

Here is the output of CL_DEVICE_EXTENSIONS variable:

cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing

Note that the cl_amd_media_ops2 is missing.

CL_DEVICE_OPENCL_C_VERSION = OpenCL C 1.2

CL_DEVICE_NAME = Tahiti

CL_DRIVER_VERSION = CAL 1.4.1720 (VM)  <---- this is the "openCL 1.2 beta"

The amdocl.dll definitely contains "cl_amd_media_ops2" string, but does not report it in CL_DEVICE_EXTENSIONS for some reason.

There are definitely some drivers which do support the feature - but I am unable to find them anywhere.

See http://clbenchmark.com/sys-info.jsp?id=133582320595781685  - on that report cl_amd_media_ops2 is clearly present, and the driver version is different.

Am I using the wrong drivers, or do I need to enable some registry setting/environment variable before I can use the feature?

Thank you.

0 Likes
1 Solution

Great stuff.


Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.

The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.

The actual opcode is 0xD2B80000 & 0xFFFE0000. The lower byte with 0x01 is the destination register and can change.

The second byte has bits used mostly for float arguments so looking for 0xD2B80000&0xFFFFFF00 should be ok for integer instructions.

allan

View solution in original post

0 Likes
9 Replies

pass in -Dcl_amd_media_ops2 during compile time to enable it.

Thank you Micah,

It works now, but doesn't produce expected ISA code (that is, v_sad_u16 instruction) - it looks like it is emulated in the compiler.

Here is the kernel:

#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable

__kernel void test(__global uint* pIn, __global uint* pOut)

{

int z = get_global_id(0);

pOut = amd_sadw(pIn[z*2], pIn[z*2+1], 0);

}

And it produces the following software-emulated sequence, thus, even slower than non-SADW version (boilerplate code removed):

tbuffer_load_format_xy  v[1:2], v1, s[8:11], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000044: EBD91000 80020101

  s_waitcnt     vmcnt(0)                                    // 0000004C: BF8C1F70

  v_lshrrev_b32  v3, 16, v2                                 // 00000050: 2C060490

  v_and_b32     v2, 0x0000ffff, v2                          // 00000054: 360404FF 0000FFFF

  v_lshrrev_b32  v4, 16, v1                                 // 0000005C: 2C080290

  v_and_b32     v1, 0x0000ffff, v1                          // 00000060: 360202FF 0000FFFF

  v_max_u32     v5, v2, v1                                  // 00000068: 280A0302

  v_max_u32     v6, v3, v4                                  // 0000006C: 280C0903

  v_min_u32     v1, v2, v1                                  // 00000070: 26020302

  v_min_u32     v2, v3, v4                                  // 00000074: 26040903

  v_sub_i32     v1, vcc, v5, v1                             // 00000078: 4C020305

  v_sub_i32     v2, vcc, v6, v2                             // 0000007C: 4C040506

  v_add_i32     v1, vcc, v1, v2                             // 00000080: 4A020501

  tbuffer_store_format_x  v1, v0, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000084: EBA41000 80040100

It looks like newer drivers are needed - since the drivers don't report proper support for this extension

(but these ones do: http://clbenchmark.com/sys-info.jsp?id=133582320595781685  - I want to know where to get them )

0 Likes

quazee,

It looks like we are using the software emulated library instead of the hardware library. We are looking into what went wrong and will fix the problem.

Thank you, Micah.

0 Likes

I just checked and most of the media_ops2 functions are there, the few that are missing include

amd_sadw()

amd_sadd()

and amd_msad(),  is encoded as v_mad_u32 (???)

sad, qsad, mqsad, and things like min3 use native instructions.

0 Likes

Yeah,

The library implementation went in before the hardware was available, so we were using a software version and did not update these functions correctly once the hardware became available. We are fixing it now and it should show up in a future catalyst release.

0 Likes
quazee
Journeyman III

Looks like I found a temporary workaround for now - something like this at the beginning of the kernel:

#ifdef SADW_KLUDGE

   #define amd_sadw amd_sad

   #pragma OPENCL EXTENSION cl_amd_media_ops : enable

#endif

Then I compile the kernel with "-fbin-exe -fno-bin-amdil -fno-bin-llvmir -fno-bin-source -DSADW_KLUDGE" options (extra options just to minimize the chance of corrupting the binary image when doing brute-force replace).

Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.

The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.

I tested the amd_sadw functionality for correctness on a bunch of random input - looks like it's doing exactly what I want.

Maybe someday I will do proper ELF parsing / instruction boundary detection to make it bullet proof - currently, the workaround will break if there is a similar constant in the code, etc.

Looking forward to the fixed Catalyst version - this stuff gives a whole new meaning to "kernel hacking" .

0 Likes

Great stuff.


Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.

The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.

The actual opcode is 0xD2B80000 & 0xFFFE0000. The lower byte with 0x01 is the destination register and can change.

The second byte has bits used mostly for float arguments so looking for 0xD2B80000&0xFFFFFF00 should be ok for integer instructions.

allan

0 Likes

Thank you for clarification.

It worked in a simple test case with the wrong mask because the dest register just happened to be v1 by luck.

Of course, if I didn't find the placeholder instruction later after actually putting this into the algorithm, it would have raised some red flags

0 Likes