Archives Discussions

corry · ‎09-01-2011

I'm attempting to just write a non-optimized, straight C version of a popular algorithm which unfortunatly requires byte level access. Some accesses to the byte addressable store will cause the compiler to bail out with no error message other than Build Failed. Other times it works just fine. The addition of the pragma helped me get a few more, but the lack of it did not deny every byte level access. (unaligned ones I might add.) It looks to me like CAL was deprecated long before its time, as fighting with the OpenCL compiler shouldn't be necessary...

MicahVillmow · ‎09-01-2011

Please supply examples of programs that aren't working along with the devices you are attempting to compile them on so that we can fix the issue on our side. We don't see any problems with byte addressable stores.

corry · ‎09-01-2011

I wrote up basically a standard C block processor with some fun variable names. As expected, a few of the byte addressable lines of code cause compilation to fail miserably, with no errors, just a few warnings saying that variables used by a commented out loop are never used. I'm using the AMD Kernelyzer program so I can compile it without running my test code, originally targetting Cayman (6970 in the drop down, but in reality it will likely be 6990's :D) But I switched to cypress IL, it has fewer problems than the cayman ASM did.

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable typedef struct _MY_C_FUNCTIONS_CONTEXT { uint somecontextval; uint morevals; uint somethingelse; uint onlyafewmore; uint needaname; uint howaboutthis; uint dontlikeit; union { char buffer[72]; unsigned long long llbuffer[9]; }; uint junkbuffersize; uint somemorejunk; uint williteverend; uint actuallyYes; } MY_C_FUNCTIONS_CONTEXT; void InitContext(MY_C_FUNCTIONS_CONTEXT* ctx) { ctx->somecontextval=0; ctx->morevals=1; ctx->somethingelse=2; ctx->onlyafewmore=3; ctx->needaname=4; ctx->howaboutthis=5; ctx->dontlikeit=6; for (uint i=0;i<9;i++) { ctx->llbuffer=0; //I assume if xoring on ATI cards is faster it will translate? Hope?! } ctx->junkbuffersize=0; ctx->somemorejunk=8; ctx->williteverend=9; ctx->actuallyYes=10; } void HandleLargeInput(MY_C_FUNCTIONS_CONTEXT* ctx, __global char* LargeData, uint size) { //No problem here... LargeData[24]=0xFF; } void HandleSmallInput(MY_C_FUNCTIONS_CONTEXT* ctx) { //Think its the union? Guess again, this works fine... ctx->buffer[24]=0xFF; } void TestFunction(MY_C_FUNCTIONS_CONTEXT* ctx, __global char* JunkByteAddressable, uint JunkSize) { if (JunkSize > 72) { if (ctx->junkbuffersize) { uint tocopy=72-ctx->junkbuffersize; //Funny, it still works here! for (uint i=ctx->junkbuffersize, j=0;j<tocopy;i++) ctx->buffer=JunkByteAddressable; HandleSmallInput(ctx); JunkSize-=tocopy; JunkByteAddressable+=tocopy; if (JunkSize) { if (JunkSize > 72) { HandleLargeInput(ctx, JunkByteAddressable, JunkSize); //Store anything not processed... uint remainder=JunkSize%72; uint d=JunkSize/72; d*=72; //Still no problems, since its copy and paste from above //you wouldn't expect it to be, but just wait a little //longer... for (uint i=d, j=0;j<remainder;i++,j++) ctx->buffer=JunkByteAddressable; return; } else { uint remainder=JunkSize%72; uint d=JunkSize/72; d*=72; //Same thing....still working as expected... for (uint i=d, j=0;j<remainder;i++,j++) ctx->buffer=JunkByteAddressable; return; } } } else { HandleLargeInput(ctx, JunkByteAddressable, JunkSize); //Store anything not processed... uint remainder=JunkSize%72; uint d=JunkSize/72; d*=72; //In the immortal words of Peter Griffen...."Uh-Oh!" //Comment this out, and all remaining byte addressing function //and everythings fine! //for (uint i=d, j=0;j<remainder;i++,j++) // ctx->buffer=JunkByteAddressable; return; } } else { uint tocopy=72-ctx->junkbuffersize; //Like I said in the previous instance of this, standalone //this for loop will cause the compiler to bomb without a single //error message (A few warnings since a varaible declared, normally //used by this for loop isn't referenced...) //for (uint i=ctx->junkbuffersize, j=0;j<tocopy;i++) // ctx->buffer=JunkByteAddressable; HandleSmallInput(ctx); JunkSize-=tocopy; JunkByteAddressable+=tocopy; if (JunkSize) { uint remainder=JunkSize%72; uint d=JunkSize/72; d*=72; //Same thing right here. Broken, on its own, independant of //the others..what a rip... //for (uint i=d, j=0;j<remainder;i++,j++) // ctx->buffer=JunkByteAddressable; return; } } } __kernel void Test(__global char* InputByteAddressable, __global uint* size, uint OutputBuffer) { uint myId=get_global_id(1); uint startpos=myId*(*size); __global char* myData=InputByteAddressable; myData+=startpos; MY_C_FUNCTIONS_CONTEXT ctx; InitContext(&ctx); TestFunction(&ctx, myData, *size); }

corry · ‎09-02-2011

Well, another snag here. Decided to replace the byte addressable stuff with 32 bit address code using logic operations to hit the desired bytes. Not ideal per se, but should've gotten me working. Well, running the kernel crashes...not really sure why, so I think, I'll build it for the CPU, than maybe? the system will let me step into the openCL kernel? Well, the compiler fails to compile it for x86, confirmed in the Kernelyzer program, no specific error messages, just "as failed". For giggles, I wonder if...a CPU competitor has an OpenCL compiler released, maybe theres something wrong with my kernel, and maybe that will give me an error message! Well, it did...

First off, it told me long long is invalid in OpenCL. Confirmed it on the OpenCL spec page, and found they just make long 64 bits....basics I know...But the AMD OpenCL compiler is allowing it no problem, so of course when I tried it, I figured I was golden! Not the case.

So I made the change, and suddenly I can see X86 assembly for my kernel. Great, go back to Kernelyzer, sill fails to generate as...

As an FYI, same thing happens in the code I posted here. Again, the other compiler likes it just fine.

Out of curiousity, since I'm a bit stuck, and could probably hack the other compiled kernel in, will a CPU based kernel be able to be debugged from Visual Studio? Would certainly be nice 🙂

MicahVillmow · ‎09-02-2011

So, with our upcoming catalyst this compiles fine for all devices that support byte addressable stores. I'll file a bug on the long long issue as we should error on that one.
Thanks for the example.

As for the CPU debugging, I'll let the debugger team know about your request.

Archives Discussions

Byte Addressable Stores seem very buggy