Originally posted by: Curious cat Originally posted by: genaganna Are you facing any problem with cl_amd_fp64 extension?
Yes, this:
OpenCL Compile Error: clBuildProgram failed (CL_BUILD_PROGRAM_FAILURE).
Line 10: error: can't enable all OpenCL extensions or unrecognized OpenCL extension #pragma OPENCL EXTENSION cl_amd_fp64 : enable ^
Could you please run the MatrixMulDouble sample that comes with the SDK and see whether or not it runs?
Originally posted by: genaganna Could you please run the MatrixMulDouble sample that comes with the SDK and see whether or not it runs?
Yes, it runs with "--device cpu" on the command line (am on a laptop right now, no AMD graphics). So maybe it's just the SKA 1.6 that's borked.
When I try to target x86 Assembly with the SKA, I get "OpenCL Compile Error: X86 asm output is not currently supported." It does work without the cl_amd_fp64 pragma (but only produces stats for GPUs, and no x86 assembly).
Originally posted by: genaganna Originally posted by: Curious cat Originally posted by: Jawed Does
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
work for you?
Are you facing any problem with cl_amd_fp64 extension?
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
Weird thing: the PDF specs say to use enable / disable,
while the online man pages say to use require instead of enable.
MatrixMulDouble_Kernels.cl uses "enable", but oddly, the SKA does not complain about "require". Now I'm totally confused.
Judging by the stats though, "require" is simply ignored.
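For what it's worth, the enable form that the PDF specs describe looks like this in a kernel (a minimal sketch, not from the thread; cl_khr_fp64 is the Khronos-standardized counterpart of AMD's cl_amd_fp64, and the kernel body is made up purely for illustration):

```c
/* Device-side OpenCL C. Per the spec PDFs, the directive takes
   "enable" or "disable": */
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
/* #pragma OPENCL EXTENSION cl_khr_fp64 : enable  -- Khronos equivalent */

__kernel void scale(__global double *out,
                    __global const double *in,
                    double k)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * k;   /* double math requires the fp64 extension */
}
```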
I've played some more with the SKA. An example of perplexing behaviour:
Start with three kernels, call them kernel_A, kernel_B and kernel_C, which all take the same arguments and perform similar computations. Individually, they use no scratch registers and have similar throughputs; call those thru_A, thru_B and thru_C (MThreads/s).
Now combine them to a single kernel which takes the same arguments, by simply turning their bodies into blocks of the new kernel. Since there are no shared variables between the blocks, I would expect the compiler to treat each block as it treated the original kernel body. I would still expect to see no scratch register usage and throughput given by 1/(1/thru_A + 1/thru_B + 1/thru_C).
Instead, I now get plenty of scratch register usage and significantly lower throughput than expected.
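The merge described above amounts to something like this (a hypothetical sketch, since the actual kernels aren't posted here; the argument list and block bodies are invented for illustration):

```c
/* Hypothetical merged kernel: the bodies of kernel_A, kernel_B and
   kernel_C pasted in as independent blocks with no shared variables. */
__kernel void kernel_ABC(__global float *out, __global const float *in)
{
    size_t i = get_global_id(0);

    {   /* body of kernel_A, verbatim */
        out[i] = in[i] * 2.0f;
    }
    {   /* body of kernel_B, verbatim */
        out[i] += in[i] + 1.0f;
    }
    {   /* body of kernel_C, verbatim */
        out[i] -= in[i] * 0.5f;
    }
    /* Since nothing is live across the blocks, the register allocator
       could in principle reuse the same registers for each block, giving
       max(regs_A, regs_B, regs_C) usage and a combined throughput of
       1/(1/thru_A + 1/thru_B + 1/thru_C). */
}
```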
Originally posted by: Curious cat I've played some more with the SKA. An example of perplexing behaviour:
Start with three kernels, call them kernel_A, kernel_B and kernel_C, which all take the same arguments and perform similar computations. Individually, they use no scratch registers and have similar throughputs; call those thru_A, thru_B and thru_C (MThreads/s).
Now combine them to a single kernel which takes the same arguments, by simply turning their bodies into blocks of the new kernel. Since there are no shared variables between the blocks, I would expect the compiler to treat each block as it treated the original kernel body. I would still expect to see no scratch register usage and throughput given by 1/(1/thru_A + 1/thru_B + 1/thru_C).
Instead, I now get plenty of scratch register usage and significantly lower throughput than expected.
Have you looked at the ISA and played with moving instructions around?
It turns out that simply cascading kernels is not the best way to get results. I won't go into this much, but it's not too hard to get the merged kernel's register usage down to max(kernA, kernB, kernC); you will need to look at, and possibly move, the code, though.
Originally posted by: Jawed Are you using the updated version of 10.7?
Jawed,
If you are talking to me then yes, per my original post.
Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?
No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.
Originally posted by: Curious cat Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?
No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.
Curious cat,
Could you please post your three kernels here so we can see what is going wrong?
There are two releases of Catalyst 10.7, so I can't tell which you are using.
For this reason the SKA cannot be relied upon: when 10.7 is selected internally for compilation, you don't know which release of 10.7 is actually being used.
(For the record: I've got no experience of any of this, as I haven't installed SDK 2.2, nor Catalyst 10.7, nor SKA 1.6).