Just a timing error? It sounds like you're not measuring the kernel execution at all and only measuring the enqueue. Have you definitely wrapped timing code around everything to the point of waiting for the kernel event to be marked complete (or, better for testing, the readback)?
Relevant CAL code looks something like the attached code...
Hmmm, no more attach code button...wonder if ib tag code works...will try it
I guess code tags don't work...Is there still a way to post pretty printed code?
LastResult = calCtxRunProgramGrid(&KernelFinishedEvent, DeviceContext, &pg);
if (LastResult != CAL_RESULT_OK)
    printf("Ugh, Failed to run, error was %d, %s\n", LastResult, calGetErrorString());
while (calCtxIsEventDone(DeviceContext, KernelFinishedEvent) == CAL_RESULT_PENDING)
{
    // Nah, let's grab 100% CPU for this spin loop just to see if it makes a difference in timing...
    // Sleep(0); // Give up our time slice...
}
// Start/Finish: LARGE_INTEGER timestamps (capture calls not shown here)
unsigned long long time = Finish.QuadPart - Start.QuadPart;
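To make the ordering explicit, here is a minimal sketch in plain C of what "time until the event is done" means. It does not use CAL at all: FakeEvent, fake_enqueue, fake_is_event_done, wait_for_event and timed_run are made-up stand-ins for the event, the launch and the polling call, and the "kernel" is just a counter that reports pending for a fixed number of polls. The only point is that the finish timestamp is taken after the wait loop, not right after the enqueue.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a device event (NOT a CAL type):
   it reports "pending" for a fixed number of polls, then "done". */
typedef struct {
    int polls_remaining;
} FakeEvent;

enum { FAKE_DONE = 0, FAKE_PENDING = 1 };

/* Hypothetical asynchronous launch: returns immediately and leaves
   the work outstanding, like an enqueue. */
void fake_enqueue(FakeEvent *ev, int polls_until_done)
{
    assert(ev != NULL && polls_until_done >= 0);
    ev->polls_remaining = polls_until_done;
}

/* Hypothetical poll: consumes one unit of outstanding work per call. */
int fake_is_event_done(FakeEvent *ev)
{
    if (ev->polls_remaining > 0) {
        ev->polls_remaining--;
        return FAKE_PENDING;
    }
    return FAKE_DONE;
}

/* Spin until the event is done; returns how many polls saw "pending". */
int wait_for_event(FakeEvent *ev)
{
    int pending_polls = 0;
    while (fake_is_event_done(ev) == FAKE_PENDING) {
        pending_polls++;
    }
    return pending_polls;
}

/* Times enqueue plus wait. The finish timestamp is taken only after
   the event reports done, so the interval covers the whole run. */
double timed_run(int polls_until_done, int *pending_polls)
{
    FakeEvent ev;
    clock_t start = clock();
    fake_enqueue(&ev, polls_until_done);
    *pending_polls = wait_for_event(&ev);
    clock_t finish = clock();
    return (double)(finish - start) / CLOCKS_PER_SEC;
}
```

If the finish timestamp were taken straight after fake_enqueue, polls_remaining would still be at its starting value, which is the "only measuring the enqueue" situation described above.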
The result it shows is 100% illogical as well. The ret_logicalnz is a performance enhancement, so I can take it out and still get correct results, albeit somewhat slower than I would if the ret was working as it should. Everything works until I try to return from this call to the loop.
I even removed the wacky thing I was doing before, abusing the break command. I had it in a function outside of main to break the loop in main, which, to my great satisfaction, worked. However, I thought something might be getting confused, so I moved that up into main inside the loop. Tried the ret_logicalnz, and there it goes again: illogical results, done in .02 seconds.
You want even weirder? I can get the ISA if I use the Dump env variable, but kernel analyzer using 11.12 refuses to compile the code. It doesn't give any error, it just doesn't show the ISA, nor any statistics. Yeah, I'm scratching my head on this one....if there isn't something completely obvious, I may have to try to write up a test case and see if that doesn't let you guys see what I'm doing wrong!
OK, so here's some more behavior characterization. Normally, I expect the ret to be taken, hence the performance optimization. ret_logicalnz seems to ret the entire program. No amount of ret_dyn changes that behavior. If I switch the ret_logicalnz to ret_logicalz, it runs to the maximum value each thread is allowed to run to, as expected.
I guess I need this to be an actual function somehow, or get ret to act as ret, not as break. I'm a fan of inlining, obviously. However, in this case it seems to be hindering proper flow control. Any example of ret_dyn and ret_logicalz/ret_logicalnz should demonstrate the problem. I'm running driver 12.1, but was previously running 11.12 with the same issue. Not really sure where to go from here short of nested ifs, dropping the optimization, or finding out there is a ret_dyn_logicalnz.
Either way, ret_dyn doesn't work, and ret_logicalnz anywhere in my program exits the program as soon as the condition is hit.
So, in case anyone else has this particular issue....
The workaround I used was to make my main function loop more of a do-while loop. Basically, the last thing it did in the loop was call the function that performed the check of whether to terminate early. Thus, a continue (yes, inside a function with no loops) would do the same thing as ret should do. Prior to that, I had my re-initialization code after the call to the check, which, with continue, would never get called and would botch everything up. Just added it to my list of quirks to work around.
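For anyone trying to picture the control flow, here is a rough analogy in plain C, not IL. The names (Counts, run_loop, check_last) and the "return early on even passes" condition are invented for illustration. The inlined check sits inside the loop body, and its early exit is modelled with continue, which in a do-while jumps straight to the loop condition. With the check last, nothing is skipped. With re-initialization placed after the check, it is skipped every time the early exit fires.

```c
#include <assert.h>

/* Hypothetical counters used to observe what the loop actually ran. */
typedef struct {
    int passes;   /* number of loop passes executed */
    int reinits;  /* number of times re-initialization ran */
} Counts;

/* check_last != 0: re-initialization runs before the check (the workaround).
   check_last == 0: re-initialization runs after the check (the broken order). */
Counts run_loop(int max_passes, int check_last)
{
    Counts c = {0, 0};
    int i = 0;
    assert(max_passes > 0);
    do {
        c.passes++;
        if (check_last) {
            c.reinits++;        /* re-init before the check: never skipped */
        }
        /* Inlined "check" function starts here. Its early return is
           modelled by continue (hypothetical condition: even passes). */
        if (i % 2 == 0) {
            continue;           /* jumps to the while condition below */
        }
        /* ...rest of the check would go here... */
        if (!check_last) {
            c.reinits++;        /* re-init after the check: skipped on continue */
        }
    } while (++i < max_passes);
    return c;
}
```

With four passes, the check-last ordering re-initializes on every pass, while the other ordering only re-initializes on the passes where the early exit was not taken.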
So let me also add that not only has this *NOT* been fixed, but callnz/callz also do *NOT* work.
The call is completely missing, and the function optimized out. So here's another request for FSAIL: fix it, or remove it, and/or let us control the optimizer and prevent things from being inlined...
If the problem does not show up in OpenCL, it won't be fixed. CAL has been deprecated.
Well, you tell me, I'm not the OpenCL compiler author. Do you try to use callnz/callz/ret_logicalz/ret_logicalnz when you compile OpenCL code? Is FSAIL going to have these instructions? Seems to me this one is a little too basic for your boilerplate answer...Or is OpenCL no longer compiling to IL?
I cannot speak about FSAIL, but we do have some internal apps that use ret_* instructions. So they are known to work in OpenCL.
As for CALL instruction, there are some situations where the CALL can be dropped/ignored. This is documented in the IL spec. If you have a sufficiently complex program, this might be a problem. These limits are guaranteed to never be hit by OpenCL.
Let me amend my statement. ANY function call, regardless of how it's called or how you return from it, *WILL NOT WORK* if it cannot be inlined. I.e., if it's called based on a condition, it's broken, because that doesn't inline....
so something like
must be replaced with
mov r6, r7
blah blah blah
Why? Because the ret in function 4 kills the whole program.
This really smacks of a simple optimization bug...something that should be pretty simple to fix...and I imagine even for OpenCL it's putting some serious constraints on how you generate your code...hell, I haven't looked at people's OpenCL problems, maybe you might find this is one of your OpenCL bugs...
This type of code will never be generated by OpenCL because of other constraints that AMDIL has in relation to OpenCL. That being said, are you using 'ret' or 'ret_dyn'? 'ret' is a dx9 instruction, 'ret_dyn' is a dx10/11 instruction, so is more likely to be correct.