1. Where do the docs talk about compute shader mode?
2. Why do I get the "No Error" string for a kernel that doesn't return CAL_RESULT_OK? There is obviously an error, but the string tells me "no error" even though the kernel still won't run. EDIT: The kernel compiles fine in SKA.
3. Do you have to sample from an offset? If vaTid is the global thread id, why can't I just use it directly (either for sampling or for reading from/writing to the global buffer), like you would with vWinCoord0?
4. If I'm sampling the inputs as streams in compute shader mode, do I have to allocate the resource as global?
5. Can you burst write in CS?
6. Where are the docs that talk about CS?
Thank you, AGAIN!!
The kernel is below, pretty straightforward actually; however, I was using vaTid to sample, and apparently I cannot. What should I be using instead to access the thread id? All the "...Tid" registers seem to be 1-component.
const char HILKernel[] =
"il_cs_2_0\n"
"dcl_num_thread_per_group 64\n"
"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"mov r2.x, vaTid.x\n"
"sample_resource(0)_sampler(0) r0, r2\n"
"sample_resource(1)_sampler(0) r1, r2\n"
"add r2, r1, r0\n"
"add r3, r2, r1\n"
"add r4, r3, r2\n"
"add r5, r4, r3\n"
"add r6, r5, r4\n"
"add r7, r6, r5\n"
"add r8, r7, r6\n"
"add r9, r8, r7\n"
"add r10, r9, r8\n"
"add r11, r10, r9\n"
"add r12, r11, r10\n"
"add r13, r12, r11\n"
"add r14, r13, r12\n"
"add r15, r14, r13\n"
"add r16, r15, r14\n"
"add r17, r16, r15\n"
"mov g[vaTid0.x], r17\n"
"ret_dyn\n"
"end\n"
Ok, thanks, will try that.
So when I output, if I have 8 outputs going to the global buffer for that kernel (each of the same element) then I should burst write by doing:
g[r0]
g[r0+1]
g[r0+2]
g[r0+3]
g[r0+4]... etc, etc.. like in the burst_write_cs example, correct?
Is this the case for both float4 and float data types, or does the data type not matter?
What about inputs from the global buffer, are they handled the same way (with that same stride)?
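To make the indexing question concrete, here is a minimal sketch in C of one way several outputs per thread can be laid out in a linear buffer without threads overwriting each other. The layout (each thread owning N consecutive slots starting at tid * N) and the helper name `slot_for_output` are assumptions for illustration; they are not taken from the burst_write_cs sample.

```c
/* Hypothetical layout: thread `tid` owns `outputs_per_thread` consecutive
 * slots of a linear buffer, starting at tid * outputs_per_thread.
 * `k` selects one of that thread's outputs (0 .. outputs_per_thread - 1). */
static unsigned int slot_for_output(unsigned int tid,
                                    unsigned int outputs_per_thread,
                                    unsigned int k)
{
    return tid * outputs_per_thread + k;
}
```

With 8 outputs per thread, thread 0 covers slots 0 to 7 and thread 1 starts at slot 8. If the base were the raw thread id instead, thread 1 would start at slot 1 and overlap thread 0.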
Also, can you re-explain how to sample in compute shader mode? I'm not sure I understand why you need to div/mod (you guys mul/mod in your examples).
What is cb0[0] in this example?
"itof r0.z, vaTid0.x\n"
"mul r0.y, r0.z, cb0[0].y\n"
"mod r0.x, r0.z, cb0[0].x\n"
"flr r0.xy, r0.xy\n"
Essentially, if you are using compute shader mode you need to pass in some constants; there is no way around this, I suppose?
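For readers following along, the itof / mul / mod / flr sequence can be read as turning a 1D thread id into a 2D fetch coordinate. Below is a minimal sketch in C of that arithmetic. It assumes cb0[0].x holds the row width and cb0[0].y holds 1/width, as described in later posts of this thread; the names `Coord2D`, `floor_nonneg` and `flat_to_2d` are hypothetical.

```c
/* 2D fetch coordinate derived from a 1D thread id. */
typedef struct { float x; float y; } Coord2D;

/* Floor for non-negative values, done with a cast so no math library is needed. */
static float floor_nonneg(float v)
{
    return (float)(unsigned int)v;
}

/* Hypothetical helper mirroring the IL sequence itof / mul / mod / flr.
 * Assumption: `width` plays the role of cb0[0].x and 1/width that of cb0[0].y. */
static Coord2D flat_to_2d(unsigned int flat_tid, float width)
{
    float t = (float)flat_tid;          /* itof: thread id as a float      */
    float inv_width = 1.0f / width;     /* the constant passed in cb0[0].y */
    Coord2D c;
    c.y = floor_nonneg(t * inv_width);  /* mul, then flr: the row          */
    c.x = t - width * c.y;              /* the remainder (mod): the column */
    return c;
}
```

For a width of 64, thread 63 maps to (63, 0) and thread 64 wraps to (0, 1). The row and the column are computed independently from the thread id, so the order of the mul and the mod does not matter.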
il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
itof r2.z, vaTid0.x
mul r2.y, r2.z, cb0[0].y
mod r2.x, r2.z, cb0[0].x
flr r3, r2
sample_resource(0)_sampler(0) r0, r3
sample_resource(1)_sampler(0) r1, r3
add r2, r1, r0
add r3, r2, r1
add r4, r3, r2
add r5, r4, r3
add r6, r5, r4
add r7, r6, r5
add r8, r7, r6
add r9, r8, r7
add r10, r9, r8
add r11, r10, r9
add r12, r11, r10
add r13, r12, r11
mov r14.x, vaTid0.x
mov g[r14.x], r13
ret_dyn
end
This is the kernel I have so far. It does not work, and I get the same errors. I have my const (float4) declared as: c[0]=1/width, c[1]=width, c[2] and c[3] = 0.
I also have allocation errors for the output outLocal, error getting module name o0 and error setting context memory (null)
The mul and mod are in the same sequence as in the "inputspeed_CS" sample, but I swapped them anyway; this did not make a difference.
I also swapped cb.y and cb.x; this did not make a difference either.
Micah,
I get "error occured allocating resource outLocal 0" and "Error compiling, string is 'No Error'".
Maybe it will help if I post some of my code:
Here is the kernel:
const char HILKernel[] =
"il_cs_2_0\n"
"dcl_num_thread_per_group 64,1,1\n"
"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"itof r2.z, vAbsTidFlat.x\n"
"mod r2.x, r2.z, cb0[0].y\n"
"mul r2.y, r2.z, cb0[0].y\n"
"flr r3, r2\n"
"sample_resource(0)_sampler(0) r0, r3\n"
"sample_resource(1)_sampler(0) r1, r3\n"
"add r2, r1, r0\n"
"add r3, r2, r1\n"
"add r4, r3, r2\n"
"add r5, r4, r3\n"
"mov r6.x, vAbsTidFlat.x\n"
"mov g[r6.x], r5\n"
"ret_dyn\n"
"end\n"
;
My constants:
for (i = 0; i < num_const; i++)
{
    constPtr[i][0] = (float)curNum.num_domain;
    constPtr[i][1] = 1.0f / (float)curNum.num_domain;
    constPtr[i][2] = 0.0f;
    constPtr[i][3] = 0.0f;
}
And my output resource allocation:
for (i = 0; i < num_outputs; i++)
    if (calResAllocLocal2D(&outLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK)
        fprintf(stderr, "error occured allocating resource outLocal %d", i);
With a domain larger than 32, the "error occured allocating resource outLocal" message goes away.
Ryta,
AFAIK, one GPU context cannot load both CS and PS kernels, perhaps due to a bug in CAL.
Are you using only CS, or mixing up CS and PS in a single context in your application? If the latter, you will get weird results.
I only have one context and one kernel; it is the CS kernel posted above.
Also, apart from using RunProgramGrid and setting the global buffer flag on the output, my CAL code hasn't changed from the code that ran my PS kernel. What else should I be modifying?
I noticed that in your latest posted code you do not declare cb0. Maybe that is the problem, just maybe.
Also, personally I do not use ret_dyn at the end of the main IL program, since it is meant for functions, though I do not think it causes any problem in the case you posted. According to the documents, however, you should use endmain to end the main procedure and begin the declaration of functions, if there are any.
the729,
That just happened to get left out of the post, sorry; it is not the problem. I also changed ret_dyn to ret and end to endmain, and that didn't help. Any other ideas?
I'm still getting an error when running the program, but the error string is "No Error", which contradicts itself.
Sorry, forgot to repost my kernel:
il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
itof r6.z, vaTid0.x
mul r6.y, r6.z, cb0[0].y
mod r6.x, r6.z, cb0[0].x
flr r7, r6
sample_resource(0)_sampler(0) r0, r7
sample_resource(1)_sampler(0) r1, r7
sample_resource(2)_sampler(0) r2, r7
sample_resource(3)_sampler(0) r3, r7
sample_resource(4)_sampler(0) r4, r7
sample_resource(5)_sampler(0) r5, r7
add r6, r1, r0
add r7, r6, r2
add r8, r7, r3
add r9, r8, r4
add r10, r9, r5
add r11, r10, r9
add r12, r11, r10
add r13, r12, r11
add r14, r13, r12
add r15, r14, r13
add r16, r15, r14
add r17, r16, r15
add r18, r17, r16
add r19, r18, r17
add r20, r19, r18
add r21, r20, r19
add r22, r21, r20
add r23, r22, r21
add r24, r23, r22
add r25, r24, r23
add r26, r25, r24
mov r27.x, vaTid0.x
mov g[r27.x], r26
ret_dyn
end
BTW, the SKA compiles this code just fine, which makes me think that something in the host code is wrong. I have:
const[0]=width and const[1]=1/width
RunProgramGrid instead of RunProgram (though I've tried both)
and the output flag is CAL_RESALLOC_GLOBAL_BUFFER.
Any other ideas?
Micah,
I tried vAbsTidFlat.x and that didn't help.
The shader COMPILES fine on my machine too; that is not where I get the error. It's not a compile error, it's a runtime error.
I think that the problem is on the host side code:
void callCalIL()
{
    CALuint cal_size = curNum.num_domain;
    unsigned int size = curNum.num_domain;
    unsigned int num_inputs = curNum.num_inputs;
    unsigned int num_outputs = curNum.num_outputs;
    unsigned int num_const = curNum.num_const;
    unsigned int i = 0;
    double duration = 0.0f;
    clock_t start, stop;

    // Initialize CAL system for computation
    if (calInit() != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // Query and print the runtime version that is loaded
    CALuint version[3];
    calGetVersion(&version[0], &version[1], &version[2]);
    fprintf(stderr, "CAL Runtime version %d.%d.%d\n", version[0], version[1], version[2]);

    // Query the compiler version that is loaded
    calclGetVersion(&version[0], &version[1], &version[2]);
    fprintf(stderr, "CAL Compiler version %d.%d.%d\n", version[0], version[1], version[2]);

    // Query the number of devices on the system
    CALuint numDevices = 0;
    if (calDeviceGetCount(&numDevices) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
    printf("Number of Devices: %d\n", numDevices);

    // Get the information on the 0th device
    CALdeviceinfo info;
    if (calDeviceGetInfo(&info, 0) != CAL_RESULT_OK)
        fprintf(stderr, "error occured getting info\n");
    switch (info.target)
    {
    case CAL_TARGET_600:
        {
            fprintf(stdout, "Device Type = GPU R600\n");
            break;
        }
    case CAL_TARGET_670:
        {
            fprintf(stdout, "Device Type = GPU RV670\n");
            break;
        }
    case CAL_TARGET_770:
        {
            fprintf(stdout, "Device Type = GPU RV770\n");
            break;
        }
    default:
        {
            fprintf(stdout, "Unknown Device\n");
        }
    }

    // Opening the 0th device
    CALdevice device = 0;
    if (calDeviceOpen(&device, 0) != CAL_RESULT_OK)
        fprintf(stderr, "error occured opening device\n");

    // Create context on the device
    CALcontext ctx = 0;
    if (calCtxCreate(&ctx, device) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // allocate local resource
    CALresource inLocal[MAX_INPUTS], outLocal[MAX_OUTPUTS], constLocal[MAX_CONST];
    for (i = 0; i < num_inputs; i++)
    {
        inLocal[i] = 0;
        if (calResAllocLocal2D(&inLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating resource inLocal %d", i);
    }
    for (i = 0; i < num_outputs; i++)
    {
        if (calResAllocLocal2D(&outLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating resource outLocal %d", i);
    }
    for (i = 0; i < num_const; i++)
    {
        if (calResAllocRemote1D(&constLocal[i], &device, 1, 1, CAL_FORMAT_FLOAT_4, 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating remote constLocal %d", i);
    }

    CALfloat *inPtr[MAX_INPUTS];
    /*CALfloat **inPtr=(CALfloat**)malloc(sizeof(CALfloat));
    for (i=0;i<MAX_INPUTS;i++)
    {
        inPtr[i] = (CALfloat*)malloc(sizeof(CALfloat));
    }*/
    CALfloat *outPtr[MAX_OUTPUTS];
    CALfloat *constPtr[MAX_CONST];
    CALuint pitch = 0;
    CALuint constPitch = 0;

    //map the resource for input
    for (i = 0; i < num_inputs; i++)
    {
        inPtr[i] = NULL;
        if (calResMap((CALvoid**)&inPtr[i], &pitch, inLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource inPtr %d", i);
    }

    //init the memory
    //float *verify=(float*)malloc(curNum.num_domain*curNum.num_domain*sizeof(float));
    CALfloat *tmp[MAX_INPUTS];
    for (i = 0; i < num_inputs; i++)
    {
        for (unsigned int k = 0; k < size; k++)
        {
            tmp[i] = &inPtr[i][k*pitch];
            for (unsigned int j = 0; j < size; j++)
            {
                //verify[j+size*k] = (float)(j+k);
                tmp[i][4*j] = (CALfloat)(j+k);
                tmp[i][4*j+1] = (CALfloat)(j+k+1);
                tmp[i][4*j+2] = (CALfloat)(j+k+2);
                tmp[i][4*j+3] = (CALfloat)(j+k+3);
                //printf("input %d: [%d] = %f\n", i, j+size*k, tmp[j+size*k]);
                //printf("verify[%d]: %f\n", i, verify[j+size*k]);
            }
        }
    }

    //unmap the resource for input
    for (i = 0; i < num_inputs; i++)
    {
        if (calResUnmap(inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping resource inLocal %d\n", i);
    }

    for (i = 0; i < num_const; i++)
    {
        constPtr[i] = NULL;
        if (calResMap((CALvoid**)&constPtr[i], &constPitch, constLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource constPtr %d", 0);
    }
    for (i = 0; i < num_const; i++)
    {
        constPtr[i][1] = 1.0f/(float)curNum.num_domain;
        constPtr[i][0] = (float)curNum.num_domain;
        constPtr[i][2] = 0.0f;
        constPtr[i][3] = 0.0f;
    }
    for (i = 0; i < num_const; i++)
    {
        if (calResUnmap(constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping resource constLocal %d\n", 0);
    }

    CALmem inmem[MAX_INPUTS], outmem[MAX_OUTPUTS], constmem[MAX_CONST];
    for (i = 0; i < num_inputs; i++)
    {
        inmem[i] = 0;
        if (calCtxGetMem(&inmem[i], ctx, inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding resource %d to context\n", i);
    }
    for (i = 0; i < num_const; i++)
    {
        constmem[i] = 0;
        if (calCtxGetMem(&constmem[i], ctx, constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding resource %d to context\n", 0);
    }
    for (i = 0; i < num_outputs; i++)
    {
        outmem[i] = 0;
        if (calCtxGetMem(&outmem[i], ctx, outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding out resource %d to context\n", i);
    }

    //compile the kernel
    //link object to image
    CALdeviceattribs attribs;
    attribs.struct_size = sizeof(CALdeviceattribs);
    if (calDeviceGetAttribs(&attribs, 0) != CAL_RESULT_OK)
    {
        fprintf(stderr, "There was an error getting device attribs.\n");
        fprintf(stderr, "Error string is %s\n", calGetErrorString());
    }
    CALobject obj = NULL;
    CALimage img = NULL;
    if (calclCompile(&obj, CAL_LANGUAGE_IL, ILKernel.c_str(), info.target) != CAL_RESULT_OK)
    {
        fprintf(stderr, "Error compiling, string is %s\n", calclGetErrorString());
        getchar();
        exit(1);
    }
    if (calclLink(&img, &obj, 1) != CAL_RESULT_OK)
        fprintf(stderr, "error linking object\n");

    // load and run the kernel HERE
    CALmodule module = 0;
    if (calModuleLoad(&module, ctx, img) != CAL_RESULT_OK)
        fprintf(stdout, "error loading module\n");

    // Query the entry point in the module for the function "main"
    CALfunc func = 0;
    if (calModuleGetEntry(&func, ctx, module, "main") != CAL_RESULT_OK)
        fprintf(stdout, "error getting module entry point\n");

    // Query the variable names for inName 0 and outName 0
    CALname inName[MAX_INPUTS], outName[MAX_OUTPUTS], constName[MAX_CONST];
    CALchar paramName[10];
    for (i = 0; i < num_inputs; i++)
    {
        sprintf_s(paramName, "i%d", i);
        inName[i] = 0;
        if (calModuleGetName(&inName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout, "error getting module name %s\n", paramName);
    }
    for (i = 0; i < num_const; i++)
    {
        sprintf_s(paramName, "cb0");
        constName[i] = 0;
        if (calModuleGetName(&constName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout, "error getting module name %s\n", paramName);
    }
    for (i = 0; i < num_outputs; i++)
    {
        sprintf_s(paramName, "o%d", i);
        outName[i] = 0;
        if (calModuleGetName(&outName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout, "error getting module name %s\n", paramName);
    }

    // Bind resources to memory handles for this context
    // ……………
    for (i = 0; i < num_inputs; i++)
    {
        if (calCtxSetMem(ctx, inName[i], inmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", inName[i]);
    }
    for (i = 0; i < num_const; i++)
    {
        if (calCtxSetMem(ctx, constName[i], constmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", constName[i]);
    }
    for (i = 0; i < num_outputs; i++)
    {
        if (calCtxSetMem(ctx, outName[i], outmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", outName[i]);
    }

    // Setup the domain for execution
    CALdomain domain = {0, 0, size, size};

    // Event ID corresponding to the kernel invocation
    CALevent event = 0;

    // Launch the CAL kernel on the given domain
    CALresult calCtxError;
    double total_time = 0.0f, total_idle = 0.0f, total_cache = 0.0f;
    int j;
    counter_func_init();
    CALcounter cacheCounter;
    CALcounter idleCounter;
    calCtxCreateCounterExt(&cacheCounter, ctx, CAL_COUNTER_INPUT_CACHE_HIT_RATE);
    calCtxCreateCounterExt(&idleCounter, ctx, CAL_COUNTER_IDLE);
    CALfloat idlePercentage = 0.0f;
    CALfloat cachePercentage = 0.0f;
    fdata<<setw(10)<<curNum.alu_fetch;
    fdata<<setw(7)<<curNum.num_inputs;
    fdata<<setw(8)<<curNum.num_outputs;
    fdata<<setw(7)<<curNum.num_const;
    fdata<<setw(8)<<curNum.num_alu_ops;

    CALprogramGrid pg;
    static PFNCALCTXRUNPROGRAMGRID calCtxRunProgramGrid = 0;
    if (calCtxRunProgramGrid == 0)
    {
        calExtGetProc((CALextproc*)&calCtxRunProgramGrid, CAL_EXT_COMPUTE_SHADER, "calCtxRunProgramGrid");
        if (calCtxRunProgramGrid == 0)
        {
            fprintf(stderr, "Error: Compute shader extension not found\n");
        }
    }

    for (j = 0; j < OUTER_LOOP+1; j++)
    {
        calCtxFlush(ctx);
        calCtxBeginCounterExt(ctx, idleCounter);
        calCtxBeginCounterExt(ctx, cacheCounter);
        CALdomain3D rect;
        rect.width = curNum.num_domain;
        rect.height = curNum.num_domain;
        rect.depth = 1;
        pg.func = func;
        pg.flags = 0;
        pg.gridBlock.width = 64; //needs to be same value as what is in the kernel for thread group size.
        pg.gridBlock.height = 1;
        pg.gridBlock.depth = 1;
        pg.gridSize.width = (rect.width*rect.height + pg.gridBlock.width - 1) / pg.gridBlock.width;
        pg.gridSize.height = 1;
        pg.gridSize.depth = 1;
        start = clock();
        calCtxError = calCtxRunProgramGrid(&event, ctx, &pg);
        //calCtxError = calCtxRunProgram(&event, ctx, func, &domain);
        //fprintf(stdout, "%s\n", calGetErrorString());
        if (calCtxError == CAL_RESULT_BAD_HANDLE)
            fprintf(stdout, "bad handle error running program\n");
        if (calCtxError == CAL_RESULT_ERROR)
        {
            fprintf(stdout, "symbol error running context program\n");
            fprintf(stderr, "Error running, string is %s\n", calclGetErrorString());
            printf("%s", ILKernel.c_str());
            //getchar();
        }

        // Wait on the event for kernel completion
        while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING);
        stop = clock();
        calCtxEndCounterExt(ctx, idleCounter);
        calCtxEndCounterExt(ctx, cacheCounter);
        duration = (stop-start);
        calCtxGetCounterExt(&idlePercentage, ctx, idleCounter);
        calCtxGetCounterExt(&cachePercentage, ctx, cacheCounter);
        idlePercentage *= 100.0f;
        cachePercentage *= 100.0f;
        //fdata<<"Idle percentage: "<<idlePercentage<<endl;
        //fdata<<"Cache hit rate: "<<cachePercentage<<endl;
        duration = duration/(double)CLOCKS_PER_SEC;
        if (j != 0)
            total_time += duration;
        total_idle += idlePercentage;
        total_cache += cachePercentage;
        //fdata<<"Kernel "<<j<<" Time: "<<duration<<endl;
    }
    getchar();

    string bottleneck;
    float core_time = 0.0f;
    float fetch_time = 0.0f;
    float mem_time = 0.0f;
    float exp_time = 0.0f;
    cout<<"ALU Ops: "<<curNum.num_alu_ops<<endl;
    core_time = ((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_alu_ops))/((160.0f)*((float)attribs.engineClock*1000000.0f));
    fetch_time = ((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_inputs))/((40.0f)*((float)attribs.engineClock*1000000.0f));
    mem_time = ((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_outputs*128.0f))/((256.0f)*((float)attribs.memoryClock*1000000.0f*2.0f));
    cout<<"Core Time: "<<core_time<<endl;
    cout<<"Fetch Time: "<<fetch_time<<endl;
    cout<<"Mem Time: "<<mem_time<<endl;
    if (core_time >= fetch_time)
    {
        if (core_time >= mem_time)
        {
            exp_time = core_time;
            bottleneck = "ALU";
        }
        else
        {
            exp_time = mem_time;
            bottleneck = "MEMORY";
        }
    }
    else
    {
        if (fetch_time >= mem_time)
        {
            exp_time = fetch_time;
            bottleneck = "FETCH";
        }
        else
        {
            exp_time = mem_time;
            bottleneck = "MEMORY";
        }
    }
    calCtxDestroyCounterExt(ctx, idleCounter);
    calCtxDestroyCounterExt(ctx, cacheCounter);
    fdata<<setw(6)<<OUTER_LOOP*INNER_LOOP;
    fdata<<setw(13)<<total_cache/(OUTER_LOOP*INNER_LOOP);
    fdata<<setw(13)<<total_idle/(OUTER_LOOP*INNER_LOOP);
    fdata<<setw(13)<<total_time;
    fdata<<setw(7)<<curNum.num_domain;
    fdata<<setw(13)<<(exp_time*OUTER_LOOP*INNER_LOOP);
    fdata<<setw(11)<<bottleneck;
    fdata<<setw(5)<<curNum.num_GPR;
    fdata<<setw(5)<<curNum.num_wf;
    fdata<<endl;

    //remap the resource for output
    for (i = 0; i < num_outputs; i++)
    {
        outPtr[i] = NULL;
        if (calResMap((CALvoid**)&outPtr[i], &pitch, outLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource outLocal %d", i);
    }

    //print the memory
    CALfloat *out1[MAX_OUTPUTS];
    for (i = 0; i < num_outputs; i++)
    {
        for (unsigned int k = 0; k < size; k++)
        {
            out1[i] = &outPtr[i][k*pitch];
            for (unsigned int j = 0; j < size; j++)
            {
                //printf("out1[%d][%d]: %f\n", i, j+k*size, out1[i]);
            }
        }
    }

    // verify using CPU resource and function
    /*float *verify_out = (float*)malloc(curNum.num_domain*curNum.num_domain*sizeof(float));
    float *tmpf = (float*)malloc(curNum.num_alu_ops*sizeof(float));
    for(i=0;i<curNum.num_domain;i++)
    {
        for(j=0;j<curNum.num_domain;j++)
        {
            tmpf[0]=verify[j+size*i]+verify[j+size*i];
            tmpf[1]=tmpf[0]+verify[i+size*j];
            tmpf[2]=tmpf[1]+tmpf[0];
            tmpf[3]=tmpf[2]+tmpf[1];
            tmpf[4]=tmpf[3]+tmpf[2];
            tmpf[5]=tmpf[4]+tmpf[3];
            verify_out[j+size*i]=tmpf[5]+tmpf[4];
        }
    }
    bool confirm=false;
    for(i=0;i<curNum.num_domain;i++)
    {
        out1[0]=&outPtr[0][i*pitch];
        for (j=0;j<curNum.num_domain;j++)
        {
            if (out1[0] == verify_out[j+size*i])
            {
                confirm = true;
            }
            else
            {
                confirm=false;
                printf("%d: %f = %f\n", j+size*i, out1[0] , verify_out[j+size*i]);
                printf("ERROR, output does not compute!\n");
                getchar();
            }
            if (confirm == false)
            {
                exit(1);
            }
        }
    }*/

    //unmap the resource for output
    for (i = 0; i < num_outputs; i++)
    {
        if (calResUnmap(outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping outLocal %d", i);
    }

    //unload module
    calModuleUnload(ctx, module);
    //free the image
    calclFreeImage(img);
    //free the object
    calclFreeObject(obj);

    //release the resource from the context
    for (i = 0; i < num_inputs; i++)
    {
        if (calCtxReleaseMem(ctx, inmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource inmem %d from context", i);
    }
    for (i = 0; i < num_const; i++)
    {
        if (calCtxReleaseMem(ctx, constmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource constmem %d from context", 0);
    }
    for (i = 0; i < num_outputs; i++)
    {
        if (calCtxReleaseMem(ctx, outmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource from context");
    }

    // deallocate local resource
    for (i = 0; i < num_inputs; i++)
    {
        if (calResFree(inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing inLocal %d", i);
    }
    for (i = 0; i < num_const; i++)
    {
        if (calResFree(constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing constLocal %d", 0);
    }
    for (i = 0; i < num_outputs; i++)
    {
        if (calResFree(outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing outLocal\n");
    }

    // Destroy the context
    if (calCtxDestroy(ctx) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // Closing the device
    calDeviceClose(device);

    // Shutting down CAL
    if (calShutdown() != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
}
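The grid size in the host code above is a ceiling division of the total thread count by the group size. A minimal sketch in C of that arithmetic follows; the helper names `groups_needed` and `surplus_threads` are hypothetical. It also shows how many extra threads get launched when the total is not a multiple of the group size, which may matter if each thread writes to g[] at its own index.

```c
/* Number of thread groups needed to cover `total` threads
 * (same rounding-up formula as pg.gridSize.width in the host code). */
static unsigned int groups_needed(unsigned int total, unsigned int group_size)
{
    return (total + group_size - 1) / group_size;
}

/* Threads launched beyond `total` because whole groups are launched. */
static unsigned int surplus_threads(unsigned int total, unsigned int group_size)
{
    return groups_needed(total, group_size) * group_size - total;
}
```

For a 128x128 domain with 64 threads per group this gives 256 groups and no surplus. For a total of 100 threads it gives 2 groups, so 28 threads have ids beyond the intended range.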
No, I get an error: "error getting module name o0" and "error setting context output memory (null)"
Then when the kernel runs (when I call calCtxRunProgramGrid(..)) I get "symbol error running context program" and "Error running, string is No Error".
It's definitely something in the host-side code, but I'm really having a problem because there is such a shortage of documentation on this. Any help would be great, thanks.
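One thing in the posted host code may be part of the reason the run-time message reads "No Error": the failure of calCtxRunProgramGrid is reported with calclGetErrorString(), which belongs to the compiler library, while the same code uses calGetErrorString() for a runtime call (after calDeviceGetAttribs). If the two libraries keep separate error state, which is an assumption here, asking the compiler for its last error after a runtime failure would report nothing. The sketch below illustrates the idea with hypothetical stand-ins; it does not call CAL, and the error text is made up.

```c
#include <string.h>

/* Hypothetical stand-ins for two libraries that each remember their own last error. */
static const char *runtime_last_error  = "No Error";
static const char *compiler_last_error = "No Error";

static const char *runtime_error_string(void)  { return runtime_last_error; }
static const char *compiler_error_string(void) { return compiler_last_error; }

/* Simulate a launch that fails inside the runtime, not the compiler. */
static int run_kernel_fails(void)
{
    runtime_last_error = "hypothetical runtime failure";  /* made-up text */
    return -1;
}
```

After `run_kernel_fails()` returns an error, `compiler_error_string()` still reports "No Error", while `runtime_error_string()` carries the message.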
Micah,
So even if I have many outputs they all have the same name: 'g[]', just like that?
Micah,
Ok, I have tried that and it seems that the errors are gone. Once again, thank you for your time; trust me, I understand how valuable it is.
Hopefully, this will all become much clearer in the new documentation. This is not very clear in the docs now, even under the "how to use the global buffer" section.
I'm getting incorrect results with this kernel:
"il_cs_2_0\n"
"dcl_num_thread_per_group 64\n"
"dcl_cb cb0[1]\n"
"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"itof r7.z, vAbsTidFlat.x\n"
"mul r7.y, r7.z, cb0[0].y\n"
"mod r7.x, r7.z, cb0[0].x\n"
"flr r8, r7\n"
"sample_resource(0)_sampler(0) r0, r8\n"
"sample_resource(1)_sampler(0) r1, r8\n"
"sample_resource(2)_sampler(0) r2, r8\n"
"add r3, r1, r0\n"
"add r4, r3, r2\n"
"add r5, r4, r3\n"
"add r6, r5, r4\n"
"add r7, r6, r5\n"
"add r8, r7, r6\n"
"add r9, r8, r7\n"
"mov g[vAbsTidFlat.x], r9\n"
"ret_dyn\n"
"end\n"
;
The "results" are actually "correct" but they are in the wrong place (and some just show 0, meaning they are not being computed at all). I did this the same way the inputspeed_cs example does, so I'm a bit confused.
cb0[0].x = domain and cb0[0].y = 1/domain (it's a square domain)
Actually, I'm still fairly confused when it comes to getting the right 2D index to use in texture fetching for compute shader mode.
What's wrong with the above? Any ideas?
The output is linear, not the input.
I didn't think it was possible to declare linear input and still use texture fetches to get the input.
Why is 2,2 the 4th element? What element is 0, 0? Should be 0, yes? 0, 1 is 1? 1, 0 is 2? 1, 1 is 3 (the 4th element)?
So for example, if I use 2,2 as the access for the kernel then I get, by index:
0: 0
1: 0
2: 0
3: Real value
4: 0
5: 0
6: 0
7: Real value
Actually, I get no change in output whether I use CAL_RESALLOC_GLOBAL_BUFFER or just 0 for the flag; the result is the same.
Micah,
Ok, I understand that one is tiled and the other is linear, though I can't get either to work for compute mode. Sadly, this doesn't tell me anything about the tiled arrangement.
Maybe some better documentation with diagrams would go far to help people (or at least me) understand this.
If I try to sample off a literal I get the same result regardless of the literal values... 0,0 returns same result as 4, 0 or 127, 35, etc..
So my question is this: how are the groups arranged off of the absolute thread index? Using a 64x1 block, if I want to access absolute index 63 then it should just be 0, 63 correct?
Maybe I have the output wrong, I'm just using vAbsTidFlat.x... this is for a 64x1 block.
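As a reference point for the question about absolute indices, here is a minimal sketch in C of how a flat absolute thread id is commonly composed from a group id and an id within the group. That vAbsTidFlat follows exactly this relation is an assumption here, and the helper names are hypothetical.

```c
/* Assumed relation: flat absolute id = group id * group size + id within the group. */
static unsigned int abs_tid_flat(unsigned int group_id,
                                 unsigned int group_size,
                                 unsigned int tid_in_group)
{
    return group_id * group_size + tid_in_group;
}

/* Inverse mapping: which group an absolute id belongs to, and its position in it. */
static unsigned int group_of(unsigned int abs_tid, unsigned int group_size)
{
    return abs_tid / group_size;
}

static unsigned int lane_of(unsigned int abs_tid, unsigned int group_size)
{
    return abs_tid % group_size;
}
```

Under that assumption, with a 64x1 block absolute index 63 is the last thread of group 0 and index 64 is the first thread of group 1. Which 2D fetch coordinate index 63 corresponds to depends on the row width of the input, not on the block shape.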
Or is it possible that I'm reading the output back incorrectly? (I have verified that my method works fine in pixel shader mode.)
Also, in tiled layout are 1,0 and 0,1 the same element?
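One thing worth checking on the read-back side: the host code in this thread walks the mapped output row by row using the pitch returned by calResMap, while the kernel writes g[] with a flat index. If the pitch is larger than the domain width and g[] indexes the allocation linearly, padding included (both are assumptions here), then values land in different rows and columns than the host expects, and padded or unwritten positions read as 0. The sketch below shows the mismatch; the names are hypothetical.

```c
typedef struct { unsigned int row; unsigned int col; } RowCol;

/* Where linear element `index` lands when memory is read back as rows
 * that are `pitch` elements apart. */
static RowCol linear_to_pitched(unsigned int index, unsigned int pitch)
{
    RowCol rc;
    rc.row = index / pitch;
    rc.col = index % pitch;
    return rc;
}

/* The linear index the host has in mind for (row, col) when it assumes
 * rows are exactly `width` elements apart. */
static unsigned int expected_index(unsigned int row, unsigned int col,
                                   unsigned int width)
{
    return row * width + col;
}
```

With a width of 32 and a pitch of 64, the host expects element 32 at row 1, column 0, but a flat write to index 32 lands at row 0, column 32, which is in the padding. When pitch equals width, the two views agree.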