
Controlling calclCompile() behavior

u32SCOptions

Looking at ISA output I've found the following:

u32SCOptions[0] = 0x01A00000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC
u32SCOptions[1] = 0x00202000 SCOption_R600_ERROR_ON_DOUBLE_MEMEXP SCOption_SET_VPM_FOR_SCATTER
u32SCOptions[2] = 0x00000040 SCOption_R800_UAV_NONUAV_SYNC_WORKAROUND_BUG216513_1
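A minimal sketch of how the three words above break down into individual flag bits. It only decodes the hex values quoted in the post; which bit corresponds to which SCOption name is not stated in the ISA output, so no name-to-bit table is assumed here. The helper name `set_bits` is made up for illustration.

```python
def set_bits(word):
    """Return the positions of the bits set in a 32-bit word, lowest first."""
    return [bit for bit in range(32) if (word >> bit) & 1]

# Values copied from the ISA output quoted above.
u32_sc_options = [0x01A00000, 0x00202000, 0x00000040]

for index, word in enumerate(u32_sc_options):
    bits = set_bits(word)
    masks = [f"0x{1 << bit:08X}" for bit in bits]
    print(f"u32SCOptions[{index}] = 0x{word:08X} -> bits {bits}, masks {masks}")
```

The number of set bits in each word (three, two, one) matches the number of option names printed next to it, which suggests each name corresponds to exactly one bit.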

Looking further at aticaldd.dll itself, I've found many more SCOptions available. The question is: is it possible to turn these options on for a calclCompile() invocation? If it is possible, how? If not, why not?

I'm personally very interested in SCOption_KEEP_CALLS, hoping it will prevent calls from being inlined (as was the case long ago with 9.9). Any ideas?

0 Likes
15 Replies

empty_knapsack,
there is currently no interface to provide options to the shader compiler. It is something that we are working on and I am hoping to add it to an upcoming release.
0 Likes

Originally posted by: MicahVillmow empty_knapsack, there is currently no interface to provide options to the shader compiler. It is something that we are working on and I am hoping to add it to an upcoming release.


Any news on this subject ?

0 Likes
jch
Journeyman III

Are subroutines (calls) really kept with 9.9?

In Kernel Analyzer 1.6.721 with the CAL 9.9 option selected, the subroutine is inlined.

0 Likes

The ATI compiler is simply unpredictable. For my kernels, when compiling for PS mode with Catalyst 9.9, it _sometimes_ doesn't inline functions. The same kernel compiled for CS mode (only the vObjIndex calculation changed from PS) got everything inlined, and performance dropped several times over.

Since 9.9 it is no longer possible to forbid inlining, no matter whether you use PS or CS mode.

The best possible solution is to write your own assembler. Obviously it will take far too much time, but it's the only way to get the code you want; it doesn't look like ATI will change anything in the next few years.

 

0 Likes

Thanks a lot, empty_knapsack!!

I'll try to reverse-engineer aticalcl.dll.

If I fail to inject SCOption_KEEP_CALL, then I'll try writing an assembler.

0 Likes

I doubt it's possible to hack aticaldd.dll to make it produce non-inlined code. It looks like the SCOptions are selected when the DLL is built, so the required code is simply absent from the DLL.

Anyway, I was being somewhat ironic about writing my own assembler: it would take a tremendous amount of time to make one, and without being paid for it, it looks absolutely pointless to me.

It's much easier to switch to NVIDIA GPUs. Yes, they have lower peak performance, but it takes minutes to get your code working instead of the hours or days needed for ATI kernels.

0 Likes

It is possible. I've already solved it.

One function in aticaldd.dll sets the SCOptions.

SCOption_KEEP_CALL is SCOption[1]=0x00000400.

I patched about 10 bytes of aticaldd.dll:

--------------------------

or dword ptr [esi+4], 400h

jmp short 0x....

-----------------------------
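A rough sketch of what the patched instruction does to the option words, modelled in Python rather than x86. It assumes that `esi` points at the start of an array of three little-endian 32-bit option words, so that `[esi+4]` is word 1; that layout is inferred from the `SCOption[1]` index in the post and is not confirmed. The function name `force_keep_call` is hypothetical.

```python
import struct

KEEP_CALL_MASK = 0x00000400  # value reported in the post for SCOption[1]

def force_keep_call(options_blob):
    """Return a copy of the packed option words with the KEEP_CALL bit set in word 1."""
    words = list(struct.unpack("<3I", options_blob))
    # Assumed equivalent of: or dword ptr [esi+4], 400h
    words[1] |= KEEP_CALL_MASK
    return struct.pack("<3I", *words)

# Default words as seen in the ISA output earlier in the thread.
original = struct.pack("<3I", 0x01A00000, 0x00202000, 0x00000040)
patched = force_keep_call(original)
print([f"0x{w:08X}" for w in struct.unpack("<3I", patched)])
```

Because the operation is a bitwise OR, the other flags in the word are left untouched and applying it twice has no further effect.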

With that, I got assembly containing call/ret.

I'll publish the DLL file soon.

P.S. Thanks, empty_knapsack!!

Your idea (SCOptions_KEEP_CALL) was helpful to me!!

0 Likes

Posting a hacked .dll probably isn't a good idea from a legal point of view.

0 Likes

Well, I've run quick tests with the 10.8 DLLs and SCOption_KEEP_CALL forced on. The kernel size is indeed small and call/ret instructions are present; however, compilation takes just milliseconds for my kernels, while it takes minutes with 9.9. So either the ATI compiler has become that good a year later (although this functionality still isn't available without hacks), or the generated code differs from 9.9, of course for the worse, with compilation time prioritized over the efficiency of the generated code.

Tests and benchmarks are needed to say for sure.

0 Likes

I've run more tests and benchmarks. For my specific kernels, the results with SCOption_KEEP_CALL forced on look quite good. Now I'm rather disappointed that I didn't take the final step (i.e. actually hack the DLL) after discovering that SCOption_KEEP_CALL exists and can be tweaked. I was probably too distracted by the "upcoming release" refrain and decided that this functionality was totally disabled within the DLL... I was wrong.

Anyway, results from 10.8 with SCOption_KEEP_CALL enabled look far better than 10.8 without it, while 9.9 still produces the fastest code. However, 9.9 takes much longer to compile (but uses fewer GPRs). A small table with results (two kernels, one with bitalign for 5XXX, one without):

Catalyst version   Exec time, 5770   Exec time, 4770   GPRs/code length, 5770   GPRs/code length, 4770
9.9                5.9s              10.7s             48/17600                 48/21824
10.8/patched       6.2s              11.2s             55/17472                 54/21696
10.8               25.7s             24.9s             113/217408               79/312512

So unpatched 10.8 on the 5770 runs about 4x slower than it could. I'm really curious why the compiler switches aren't yet exposed at the CAL/OpenCL level, especially now that we know they have been available at least since 10.2 and that compile time has become reasonable enough to compile even at run time.
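The "4x" figure can be checked directly against the execution times in the table. This is only a small arithmetic sketch using the posted numbers; the dictionary layout and the helper name `slowdown` are made up for illustration.

```python
# Execution times in seconds, copied from the table in the post.
results = {
    "9.9":          {"5770": 5.9,  "4770": 10.7},
    "10.8/patched": {"5770": 6.2,  "4770": 11.2},
    "10.8":         {"5770": 25.7, "4770": 24.9},
}

def slowdown(slow, fast, gpu):
    """How many times longer the `slow` configuration runs than `fast` on `gpu`."""
    return results[slow][gpu] / results[fast][gpu]

for gpu in ("5770", "4770"):
    print(f"{gpu}: unpatched 10.8 vs patched = {slowdown('10.8', '10.8/patched', gpu):.2f}x, "
          f"patched 10.8 vs 9.9 = {slowdown('10.8/patched', '9.9', gpu):.2f}x")
```

On these numbers the unpatched 10.8 build is a little over four times slower than the patched one on the 5770 and a little over two times slower on the 4770, while the patched build stays within roughly 5% of 9.9 on both cards.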

 

0 Likes

Did you compile with SCOption_BIAS_SCHEDULE_TO_MINIMIZE_INSTS or SCOption_BIAS_SCHEDULE_TO_MINIMIZE_REGS?

Today, I'll test speeds.

 

0 Likes

I've tried these options but didn't notice any visible difference in code size or speed.

0 Likes

Other items are higher priority at this time, so it is not likely to happen.
0 Likes

Originally posted by: MicahVillmow Other items are higher priority at this time, so it is not likely to happen.


So is any bigger or more advanced OpenCL/IL kernel simply supposed to be much slower, or are there at least plans to improve the IL compiler so that it doesn't remove function calls?

 

0 Likes

hazeman,
We are working on that, but the design of IL makes allowing function calls extremely difficult to get correct.
For example the following kernel:
float4 image_read_and_compute(image2d_t type) {
    // read image, do something and return result
}

kernel void image_compute(image2d_t type1, image2d_t type2, global float4 *val) {
    val[get_global_id(0)] = image_read_and_compute(type1) * image_read_and_compute(type2);
}


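Since the resource ID/sampler ID of the image is encoded in the instruction itself, generating correct code for image_read_and_compute is difficult: a single non-inlined copy of the function would have to read from type1 in one call and from type2 in the other.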

We are moving closer to a solution, but it isn't easy to get correct.
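A toy model of the problem described above, not real compiler behaviour. It assumes, as the post states, that the image resource ID is baked into the sampling instruction as an immediate. The instruction text in `BODY` is an illustrative pseudo-instruction and `inline_call_sites` is a hypothetical helper; neither is taken from the actual IL toolchain.

```python
# Illustrative pseudo-instruction with the resource ID as an immediate.
BODY = "sample_resource({resource_id}) r0, coord"

def inline_call_sites(body_template, resource_ids):
    """Emit one specialized copy of the function body per call site."""
    return [body_template.format(resource_id=rid) for rid in resource_ids]

# image_read_and_compute(type1) and image_read_and_compute(type2),
# assuming type1 is bound to resource 0 and type2 to resource 1.
copies = inline_call_sites(BODY, resource_ids=[0, 1])
for copy in copies:
    print(copy)

# A shared, non-inlined body has room for only one immediate,
# so it cannot serve both call sites as written.
print("distinct bodies needed:", len(set(copies)))
```

With inlining, each call site gets its own copy with the right ID, which is why inlining sidesteps the issue at the cost of code size. Keeping a real call would need some other way to select the resource at run time, which the post says is hard to get correct.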
0 Likes