Hello,
I am having trouble enabling the new cl_amd_media_ops2 extension with the latest SDK and drivers.
I tried the Windows OpenCL 1.2 beta drivers, 12.4 official drivers, and "leaked" 12.5 - none of them worked.
The hardware is Sapphire Radeon HD 7970, single card.
#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable
causes a compiler error "line 1: error: can't enable all OpenCL extensions or unrecognized OpenCL extension"
The reason I need cl_amd_media_ops2 to be enabled is the amd_sadw intrinsic (packed sum of 16-bit absolute differences) which appeared in the new SDK.
(See http://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops2.txt and
http://developer.amd.com/sdks/amdappsdk/assets/AMD_APP_SDK_Release_Notes_Developer.pdf )
Here is the output of CL_DEVICE_EXTENSIONS variable:
cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing
Note that the cl_amd_media_ops2 is missing.
CL_DEVICE_OPENCL_C_VERSION = OpenCL C 1.2
CL_DEVICE_NAME = Tahiti
CL_DRIVER_VERSION = CAL 1.4.1720 (VM) <---- this is the "openCL 1.2 beta"
The amdocl.dll definitely contains "cl_amd_media_ops2" string, but does not report it in CL_DEVICE_EXTENSIONS for some reason.
There are definitely some drivers which do support the feature - but I am unable to find them anywhere.
See http://clbenchmark.com/sys-info.jsp?id=133582320595781685 - on that report cl_amd_media_ops2 is clearly present, and the driver version is different.
Am I using the wrong drivers, or do I need to enable some registry setting/environment variable before I can use the feature?
Thank you.
Solved! Go to Solution.
Great stuff.
Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.
The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.
The actual opcode is 0xD2B80000 & 0xFFFE0000. The lower byte with 0x01 is the destination register and can change.
The second byte has bits used mostly for float arguments so looking for 0xD2B80000&0xFFFFFF00 should be ok for integer instructions.
allan
pass in -Dcl_amd_media_ops2 during compile time to enable it.
Thank you Micah,
It works now, but doesn't produce expected ISA code (that is, v_sad_u16 instruction) - it looks like it is emulated in the compiler.
Here is the kernel:
#pragma OPENCL EXTENSION cl_amd_media_ops2 : enable
__kernel void test(__global uint* pIn, __global uint* pOut)
{
int z = get_global_id(0);
pOut
}
And it produces the following software-emulated sequence, thus, even slower than non-SADW version (boilerplate code removed):
tbuffer_load_format_xy v[1:2], v1, s[8:11], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000044: EBD91000 80020101
s_waitcnt vmcnt(0) // 0000004C: BF8C1F70
v_lshrrev_b32 v3, 16, v2 // 00000050: 2C060490
v_and_b32 v2, 0x0000ffff, v2 // 00000054: 360404FF 0000FFFF
v_lshrrev_b32 v4, 16, v1 // 0000005C: 2C080290
v_and_b32 v1, 0x0000ffff, v1 // 00000060: 360202FF 0000FFFF
v_max_u32 v5, v2, v1 // 00000068: 280A0302
v_max_u32 v6, v3, v4 // 0000006C: 280C0903
v_min_u32 v1, v2, v1 // 00000070: 26020302
v_min_u32 v2, v3, v4 // 00000074: 26040903
v_sub_i32 v1, vcc, v5, v1 // 00000078: 4C020305
v_sub_i32 v2, vcc, v6, v2 // 0000007C: 4C040506
v_add_i32 v1, vcc, v1, v2 // 00000080: 4A020501
tbuffer_store_format_x v1, v0, s[16:19], 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT] // 00000084: EBA41000 80040100
It looks like newer drivers are needed - since the drivers don't report proper support for this extension
(but these ones do: http://clbenchmark.com/sys-info.jsp?id=133582320595781685 - I want to know where to get them )
quazee,
It looks like we are using the software emulated library instead of the hardware library. We are looking into what went wrong and will fix the problem.
Thank you, Micah.
I just checked and most of the media_ops2 functions are there, the few that are missing include
amd_sadw()
amd_sadd()
and amd_msad(), is encoded as v_mad_u32 (???)
sad, qsad, mqsad, and things like min3 use native instructions.
Yeah,
The library implementation went in before the hardware was available, so we were using a software version and did not update these functions correctly once the hardware became available. We are fixing it now and it should show up in a future catalyst release.
Looks like I found a temporary workaround for now - something like this at the beginning of the kernel:
#ifdef SADW_KLUDGE
#define amd_sadw amd_sad
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#endif
Then I compile the kernel with "-fbin-exe -fno-bin-amdil -fno-bin-llvmir -fno-bin-source -DSADW_KLUDGE" options (extra options just to minimize the chance of corrupting the binary image when doing brute-force replace).
Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.
The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.
I tested the amd_sadw functionality for correctness on a bunch of random input - looks like it's doing exactly what I want.
Maybe someday I will do proper ELF parsing / instruction boundary detection to make it bullet proof - currently, the workaround will break if there is a similar constant in the code, etc.
Looking forward to the fixed Catalyst version - this stuff gives a whole new meaning to "kernel hacking" .
Great stuff.
Then, I just replace "01 00 b4 d2" with "01 00 b8 d2" in the image before running it.
The opcode for amd_sadw is D2B80001 (v_sad_u16).
It's a long opcode (arguments go in the next 4 bytes), so it's easy to replace by brute force.
The actual opcode is 0xD2B80000 & 0xFFFE0000. The lower byte with 0x01 is the destination register and can change.
The second byte has bits used mostly for float arguments so looking for 0xD2B80000&0xFFFFFF00 should be ok for integer instructions.
allan
Thank you for clarification.
It worked in a simple test case with the wrong mask because the dest register just happened to be v1 by luck.
Of course, if I didn't find the placeholder instruction later after actually putting this into the algorithm, it would have raised some red flags