Archives Discussions

ryta1203 · ‎07-28-2009

1. Where do the docs talk about compute shader mode?

2. Why do I get the "No Error" string for a kernel that doesn't return CAL_RESULT_OK??? There is obviously an error but it tells me "don't worry about it.. no error... but your kernel still won't run, so sorry"!? EDIT: Kernel compiles fine in SKA.

3. Do you have to sample from an offset? If vaTid is the global thread id, why can't I just sample (either sampling or reading from/writing to the global buffer) from that (like you would with vWinCoord0)?

4. If I'm sampling the inputs as streams in compute shader mode, do I have to allocate the resource as global?

5. Can you burst write in CS?

6. Where are the docs that talk about CS?

MicahVillmow · ‎07-29-2009

1&6) If they were not in 1.4, they should be in the next release.
2) Most likely a PS instruction is being used as a CS instruction, but without the kernel I cannot be sure.
3) The sample instruction expects the address to be two-dimensional and in floating-point format; vaTid is a one-dimensional integer.
4) No, only the write must be allocated as global
5) Yes, just do global buffer writes with address offsets of + 0, +1, +2, +3, etc...

ryta1203 · ‎07-29-2009

Thank you, AGAIN!!

The kernel is below, pretty straightforward actually; however, I was using vaTid to sample, apparently I cannot. What should I be using instead to access the thread id? All the "...Tid" registers are 1 component it seems.

const char HILKernel[] =
"il_cs_2_0\n"
"dcl_num_thread_per_group 64\n"
"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"mov r2.x, vaTid.x\n"
"sample_resource(0)_sampler(0) r0, r2\n"
"sample_resource(1)_sampler(0) r1, r2\n"
"add r2, r1, r0\n"
"add r3, r2, r1\n"
"add r4, r3, r2\n"
"add r5, r4, r3\n"
"add r6, r5, r4\n"
"add r7, r6, r5\n"
"add r8, r7, r6\n"
"add r9, r8, r7\n"
"add r10, r9, r8\n"
"add r11, r10, r9\n"
"add r12, r11, r10\n"
"add r13, r12, r11\n"
"add r14, r13, r12\n"
"add r15, r14, r13\n"
"add r16, r15, r14\n"
"add r17, r16, r15\n"
"mov g[vaTid0.x], r17\n"
"ret_dyn\n"
"end\n"

MicahVillmow · ‎07-29-2009

So, after your copy to r2.x, you need to convert it into x and y coordinates via either a shl/and or a mod/div, and then convert the results to floating point using itof; then you can index correctly into the samplers. Also, it might be the vaTid.x versus vaTid0.x mismatch that is causing the problems. Please use the same name in both locations and make sure that you are using the one specified in the docs. I know that they were updated recently, but I am not sure if it was a 1.3 change or a 1.4 change.
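A minimal C sketch of the mod/div conversion described above, run on the host purely for illustration. The names `Coord2D` and `flat_to_2d` are hypothetical and not part of CAL or IL; the sketch assumes a linear, row-major layout where `width` is the number of elements per row.

```c
#include <assert.h>

/* Hypothetical type: the 2D floating-point address a sample instruction expects. */
typedef struct {
    float x; /* column */
    float y; /* row */
} Coord2D;

/* Convert a flat (one-dimensional) thread id into 2D coordinates.
 * Integer mod gives the column, integer div gives the row, and the
 * casts to float play the role of itof. */
Coord2D flat_to_2d(unsigned int tid, unsigned int width)
{
    Coord2D c;
    assert(width > 0);
    c.x = (float)(tid % width);
    c.y = (float)(tid / width);
    return c;
}
```

With a width of 64, thread 63 lands at (63, 0) and thread 64 wraps to (0, 1).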

ryta1203 · ‎07-29-2009

Ok, thanks, will try that.

So when I output, if I have 8 outputs going to the global buffer for that kernel (each of the same element) then I should burst write by doing:

g[r0]

g[r0+1]

g[r0+2]

g[r0+3]

g[r0+4]... etc, etc.. like in the burst_write_cs example, correct?

Is this the case for float4 AND float data types? Does it not matter the data type?

What about inputs from the global buffer, are they handled the same way (with that same stride)?

MicahVillmow · ‎07-29-2009

Yes, that is the correct way of doing it. In 1.4 the global buffer inputs had some performance issues that we have since fixed, but if you set up your inputs and outputs in that manner, then you can possibly get the best performance.
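A small C sketch of the addressing pattern behind such a burst write. The function name `burst_address` is hypothetical, and the choice of base address (thread id times outputs per thread) is an assumption for illustration; the thread only establishes that consecutive writes go to base + 0, + 1, + 2, and so on.

```c
#include <assert.h>

/* Global buffer element index for the k-th output of a given thread.
 * Assumption: each thread owns a contiguous run of outputs_per_thread
 * elements, so its writes go to base + 0, base + 1, ... in order. */
unsigned int burst_address(unsigned int tid,
                           unsigned int outputs_per_thread,
                           unsigned int k)
{
    unsigned int base;
    assert(k < outputs_per_thread);
    base = tid * outputs_per_thread;
    return base + k;
}
```

For 8 outputs per thread, thread 0 writes elements 0 through 7 and thread 1 starts at element 8.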

ryta1203 · ‎07-29-2009

Also, can you re-explain how to sample in compute shader mode? I'm not sure I understand why you need to div/mod (you guys mul/mod in your examples).

What is cb0[0] in this example?

"itof r0.z, vaTid0.x\n"

"mul r0.y, r0.z, cb0[0].y\n"

"mod r0.x, r0.z, cb0[0].x\n"

"flr r0.xy, r0.xy\n"

MicahVillmow · ‎07-29-2009

cb0[0].y is actually 1 / width, so we do the division on the host side (as it is done once instead of once per thread), and cb0[0].x is width.
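A C sketch of that split: the host fills the constants once, and each thread then needs only a multiply, a mod and a floor. The names `WidthConstants`, `SampleCoord`, `make_width_constants` and `kernel_coords` are hypothetical, and the floor and mod are emulated here for non-negative values only, so this shows the arithmetic rather than the exact IL semantics.

```c
#include <assert.h>

/* Mirrors the constant layout from the post: .x = width, .y = 1/width. */
typedef struct {
    float width;     /* cb0[0].x */
    float inv_width; /* cb0[0].y */
} WidthConstants;

typedef struct {
    float x;
    float y;
} SampleCoord;

/* Host side: the division happens once, not once per thread. */
WidthConstants make_width_constants(unsigned int width)
{
    WidthConstants cb;
    assert(width > 0);
    cb.width = (float)width;
    cb.inv_width = 1.0f / (float)width;
    return cb;
}

/* Floor for non-negative values, done by truncation. */
static float floor_nonneg(float v)
{
    assert(v >= 0.0f);
    return (float)(unsigned int)v;
}

/* Kernel side, emulated: itof, then mul + flr for y, mod + flr for x. */
SampleCoord kernel_coords(unsigned int tid, WidthConstants cb)
{
    SampleCoord c;
    float t = (float)tid;                            /* itof */
    c.y = floor_nonneg(t * cb.inv_width);            /* mul, flr */
    c.x = t - cb.width * floor_nonneg(t / cb.width); /* mod */
    return c;
}
```

With power-of-two widths such as 1024 the reciprocal is exact in floating point; for other widths the multiply by 1/width can round, so the row value near row boundaries deserves a check.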

ryta1203 · ‎07-29-2009

Essentially if you are using compute shader mode you need to pass in some constants, there is no way around this I suppose?

MicahVillmow · ‎07-29-2009

if you want to base your computation on a dynamic width, then yes. However, you can hardcode your width to say 1024 and then just vary the height of the data domain. It requires a little bit of translation on the host side, but would simplify the kernels.
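A sketch of the host-side translation this implies, assuming the width is fixed at 1024 and only the height varies with the data size. `FIXED_WIDTH` and `rows_for_elements` are hypothetical names; any padding elements in the last row would still need to be handled by the application.

```c
#include <assert.h>

/* Assumed fixed row width baked into the kernel. */
enum { FIXED_WIDTH = 1024 };

/* Number of rows needed to hold num_elements at the fixed width.
 * The last row may be only partly used. */
unsigned int rows_for_elements(unsigned int num_elements)
{
    assert(num_elements > 0);
    return num_elements / FIXED_WIDTH + (num_elements % FIXED_WIDTH != 0 ? 1u : 0u);
}
```

For example, 1025 elements need two rows, with the second row almost entirely padding.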

ryta1203 · ‎07-29-2009

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
itof r2.z, vaTid0.x
mul r2.y, r2.z, cb0[0].y
mod r2.x, r2.z, cb0[0].x
flr r3, r2
sample_resource(0)_sampler(0) r0, r3
sample_resource(1)_sampler(0) r1, r3
add r2, r1, r0
add r3, r2, r1
add r4, r3, r2
add r5, r4, r3
add r6, r5, r4
add r7, r6, r5
add r8, r7, r6
add r9, r8, r7
add r10, r9, r8
add r11, r10, r9
add r12, r11, r10
add r13, r12, r11
mov r14.x, vaTid0.x
mov g[r14.x], r13
ret_dyn
end

This is the kernel I have so far that does not work, I get the same errors. I have my const (float4) declared as: c[0]=1/width, c[1]=width, c[2] and c[3] = 0.

I also have allocation errors for the output outLocal, error getting module name o0 and error setting context memory (null)

MicahVillmow · ‎07-29-2009

you have the mul and mod backwards. c[0] = width and c[1] = 1/width.

ryta1203 · ‎07-29-2009

The mul and mod are in the same sequence as "inputspeed_CS" sample, but I changed them anyway, this did not make a difference.

I changed the cb.y and cb.x, this did not make a difference.

MicahVillmow · ‎07-29-2009

Well, then the next problem is to find out what is going wrong by finding out line by line what is causing the error. Also, can you try 64, 1, 1 as the thread_per_group and vAbsTidFlat.x?

ryta1203 · ‎07-29-2009

Micah,

I get "error occured allocating resource outLocal 0" and "Error compiling, string is 'No Error'".

Maybe it will help if I post some of my code:

Here is the kernel:

const char HILKernel[] =
"il_cs_2_0\n"
"dcl_num_thread_per_group 64,1,1\n"
"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"itof r2.z, vAbsTidFlat.x\n"
"mod r2.x, r2.z, cb0[0].y\n"
"mul r2.y, r2.z, cb0[0].y\n"
"flr r3, r2\n"
"sample_resource(0)_sampler(0) r0, r3\n"
"sample_resource(1)_sampler(0) r1, r3\n"
"add r2, r1, r0\n"
"add r3, r2, r1\n"
"add r4, r3, r2\n"
"add r5, r4, r3\n"
"mov r6.x, vAbsTidFlat.x\n"
"mov g[r6.x], r5\n"
"ret_dyn\n"
"end\n"
;

My constants:

for (i = 0; i < num_const; i++)
{
    constPtr[i][0] = (float)curNum.num_domain;        // width
    constPtr[i][1] = 1.0f / (float)curNum.num_domain; // 1/width
    constPtr[i][2] = 0.0f;
    constPtr[i][3] = 0.0f;
}

And my output resource allocation:

for (i = 0; i < num_outputs; i++)
    if (calResAllocLocal2D(&outLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK)
        fprintf(stderr, "error occured allocating resource outLocal %d", i);

With a domain of > 32 the "error occured allocating resource outLocal" goes away.

It doesn't seem like the compiler likes 64,1,1 because when I take that away the kernel compiles fine, though I still get errors.

the729 · ‎07-30-2009

Ryta,

AFAIK, one GPU context cannot load both CS and PS kernels, perhaps due to a bug in CAL.

Are you using only CS, or mixing up CS and PS in a single context in your application? If the latter, you will get weird results.

ryta1203 · ‎07-31-2009

I only have one context and one kernel, it is a cs kernel that is posted above.

Also, outside of using RunProgramGrid and using the global buffer flag for the output my CAL code hasn't changed from my ps kernel running CAL code. What else should I be modifying?

the729 · ‎08-01-2009

I see that in your latest posted code you do not declare cb0. Maybe that is the problem, just maybe.

Also, personally I do not use ret_dyn at the end of the main IL program, since it is for functions. I do not think this will cause any problem in the case you posted. However, according to the documents, you should use endmain to end the main procedure and begin the declaration of functions, if there are any.

ryta1203 · ‎08-02-2009

the729,

That just happened to get left out, sorry, this is not the problem. I also changed ret_dyn to ret and end to endmain, that didn't help. Any other ideas?

I'm still getting an error when running the program but the stringError is No Error, sort of contradicts itself.

ryta1203 · ‎08-02-2009

Sorry, forgot to repost my kernel:

il_cs_2_0
dcl_num_thread_per_group 64
dcl_cb cb0[1]
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
itof r6.z, vaTid0.x
mul r6.y, r6.z, cb0[0].y
mod r6.x, r6.z, cb0[0].x
flr r7, r6
sample_resource(0)_sampler(0) r0, r7
sample_resource(1)_sampler(0) r1, r7
sample_resource(2)_sampler(0) r2, r7
sample_resource(3)_sampler(0) r3, r7
sample_resource(4)_sampler(0) r4, r7
sample_resource(5)_sampler(0) r5, r7
add r6, r1, r0
add r7, r6, r2
add r8, r7, r3
add r9, r8, r4
add r10, r9, r5
add r11, r10, r9
add r12, r11, r10
add r13, r12, r11
add r14, r13, r12
add r15, r14, r13
add r16, r15, r14
add r17, r16, r15
add r18, r17, r16
add r19, r18, r17
add r20, r19, r18
add r21, r20, r19
add r22, r21, r20
add r23, r22, r21
add r24, r23, r22
add r25, r24, r23
add r26, r25, r24
mov r27.x, vaTid0.x
mov g[r27.x], r26
ret_dyn
end

BTW, the SKA compiles this code just fine, it makes me think that there is something on the host code that is wrong, I have:

const[0]=width and const[1]=1/width

RunProgramGrid instead of RunProgram (though I've tried both)

and the output flag is RES_ALLOC_GLOBAL_BUFFER..

any other ideas?

MicahVillmow · ‎08-03-2009

Ryta,
Can you try with vAbsTidFlat.x instead of vaTid0.x? Also, this shader compiles fine for me on my machine.

ryta1203 · ‎08-03-2009

Micah,

I tried vAbsTidFlat.x and that didn't help.

The shader COMPILES fine on my machine too, this is not where I get the error, it's not a compile error it's a runtime error I am getting.

I think that the problem is on the host side code:

void callCalIL()
{
    CALuint cal_size = curNum.num_domain;
    unsigned int size = curNum.num_domain;
    unsigned int num_inputs = curNum.num_inputs;
    unsigned int num_outputs = curNum.num_outputs;
    unsigned int num_const = curNum.num_const;
    unsigned int i = 0;
    double duration = 0.0f;
    clock_t start, stop;

    // Initialize CAL system for computation
    if (calInit() != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // Query and print the runtime version that is loaded
    CALuint version[3];
    calGetVersion(&version[0], &version[1], &version[2]);
    fprintf(stderr, "CAL Runtime version %d.%d.%d\n", version[0], version[1], version[2]);

    // Query the compiler version that is loaded
    calclGetVersion(&version[0], &version[1], &version[2]);
    fprintf(stderr, "CAL Compiler version %d.%d.%d\n", version[0], version[1], version[2]);

    // Query the number of devices on the system
    CALuint numDevices = 0;
    if (calDeviceGetCount(&numDevices) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
    printf("Number of Devices: %d\n", numDevices);

    // Get the information on the 0th device
    CALdeviceinfo info;
    if (calDeviceGetInfo(&info, 0) != CAL_RESULT_OK)
        fprintf(stderr, "error occured getting info\n");
    switch (info.target)
    {
    case CAL_TARGET_600: { fprintf(stdout, "Device Type = GPU R600\n"); break; }
    case CAL_TARGET_670: { fprintf(stdout, "Device Type = GPU RV670\n"); break; }
    case CAL_TARGET_770: { fprintf(stdout, "Device Type = GPU RV770\n"); break; }
    default: { fprintf(stdout, "Unknown Device\n"); }
    }

    // Opening the 0th device
    CALdevice device = 0;
    if (calDeviceOpen(&device, 0) != CAL_RESULT_OK)
        fprintf(stderr, "error occured opening device\n");

    // Create context on the device
    CALcontext ctx = 0;
    if (calCtxCreate(&ctx, device) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // allocate local resource
    CALresource inLocal[MAX_INPUTS], outLocal[MAX_OUTPUTS], constLocal[MAX_CONST];
    for (i = 0; i < num_inputs; i++) {
        inLocal[i] = 0;
        if (calResAllocLocal2D(&inLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating resource inLocal %d", i);
    }
    for (i = 0; i < num_outputs; i++) {
        if (calResAllocLocal2D(&outLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating resource outLocal %d", i);
    }
    for (i = 0; i < num_const; i++) {
        if (calResAllocRemote1D(&constLocal[i], &device, 1, 1, CAL_FORMAT_FLOAT_4, 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating remote constLocal %d", i);
    }

    CALfloat *inPtr[MAX_INPUTS];
    /*CALfloat **inPtr=(CALfloat**)malloc(sizeof(CALfloat));
    for (i=0;i<MAX_INPUTS;i++) { inPtr = (CALfloat*)malloc(sizeof(CALfloat)); }*/
    CALfloat *outPtr[MAX_OUTPUTS];
    CALfloat *constPtr[MAX_CONST];
    CALuint pitch = 0;
    CALuint constPitch = 0;

    //map the resource for input
    for (i = 0; i < num_inputs; i++) {
        inPtr[i] = NULL;
        if (calResMap((CALvoid**)&inPtr[i], &pitch, inLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource inPtr %d", i);
    }

    //init the memory
    //float *verify=(float*)malloc(curNum.num_domain*curNum.num_domain*sizeof(float));
    CALfloat *tmp[MAX_INPUTS];
    for (i = 0; i < num_inputs; i++) {
        for (unsigned int k = 0; k < size; k++) {
            tmp[i] = &inPtr[i][k*pitch];
            for (unsigned int j = 0; j < size; j++) {
                //verify[j+size*k] = (float)(j+k);
                tmp[i][4*j]   = (CALfloat)(j+k);
                tmp[i][4*j+1] = (CALfloat)(j+k+1);
                tmp[i][4*j+2] = (CALfloat)(j+k+2);
                tmp[i][4*j+3] = (CALfloat)(j+k+3);
                //printf("input %d: [%d] = %f\n", i, j+size*k, tmp[j+size*k]);
                //printf("verify[%d]: %f\n", i, verify[j+size*k]);
            }
        }
    }

    //unmap the resource for input
    for (i = 0; i < num_inputs; i++) {
        if (calResUnmap(inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping resource inLocal %d\n", i);
    }

    for (i = 0; i < num_const; i++) {
        constPtr[i] = NULL;
        if (calResMap((CALvoid**)&constPtr[i], &constPitch, constLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource constPtr %d", 0);
    }
    for (i = 0; i < num_const; i++) {
        constPtr[i][1] = 1.0f/(float)curNum.num_domain;
        constPtr[i][0] = (float)curNum.num_domain;
        constPtr[i][2] = 0.0f;
        constPtr[i][3] = 0.0f;
    }
    for (i = 0; i < num_const; i++) {
        if (calResUnmap(constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping resource constLocal %d\n", 0);
    }

    CALmem inmem[MAX_INPUTS], outmem[MAX_OUTPUTS], constmem[MAX_CONST];
    for (i = 0; i < num_inputs; i++) {
        inmem[i] = 0;
        if (calCtxGetMem(&inmem[i], ctx, inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding resource %d to context\n", i);
    }
    for (i = 0; i < num_const; i++) {
        constmem[i] = 0;
        if (calCtxGetMem(&constmem[i], ctx, constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding resource %d to context\n", 0);
    }
    for (i = 0; i < num_outputs; i++) {
        outmem[i] = 0;
        if (calCtxGetMem(&outmem[i], ctx, outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding out resource %d to context\n", i);
    }

    //compile the kernel
    //link object to image
    CALdeviceattribs attribs;
    attribs.struct_size = sizeof(CALdeviceattribs);
    if (calDeviceGetAttribs(&attribs, 0) != CAL_RESULT_OK) {
        fprintf(stderr, "There was an error getting device attribs.\n");
        fprintf(stderr, "Error string is %s\n", calGetErrorString());
    }
    CALobject obj = NULL;
    CALimage img = NULL;
    if (calclCompile(&obj, CAL_LANGUAGE_IL, ILKernel.c_str(), info.target) != CAL_RESULT_OK) {
        fprintf(stderr, "Error compiling, string is %s\n", calclGetErrorString());
        getchar();
        exit(1);
    }
    if (calclLink(&img, &obj, 1) != CAL_RESULT_OK)
        fprintf(stderr, "error linking object\n");

    // load and run the kernel HERE
    CALmodule module = 0;
    if (calModuleLoad(&module, ctx, img) != CAL_RESULT_OK)
        fprintf(stdout, "error loading module\n");

    // Query the entry point in the module for the function "main"
    CALfunc func = 0;
    if (calModuleGetEntry(&func, ctx, module, "main") != CAL_RESULT_OK)
        fprintf(stdout, "error getting module entry point\n");

    // Query the variable names for inName 0 and outName 0
    CALname inName[MAX_INPUTS], outName[MAX_OUTPUTS], constName[MAX_CONST];
    CALchar paramName[10];
    for (i = 0; i < num_inputs; i++) {
        sprintf_s(paramName, "i%d", i);
        inName[i] = 0;
        if (calModuleGetName(&inName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout, "error getting module name %s\n", paramName);
    }
    for (i = 0; i < num_const; i++) {
        sprintf_s(paramName, "cb0");
        constName[i] = 0;
        if (calModuleGetName(&constName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout, "error getting module name %s\n", paramName);
    }
    for (i = 0; i < num_outputs; i++) {
        sprintf_s(paramName, "o%d", i);
        outName[i] = 0;
        if (calModuleGetName(&outName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout, "error getting module name %s\n", paramName);
    }

    // Bind resources to memory handles for this context
    // ...
    for (i = 0; i < num_inputs; i++) {
        if (calCtxSetMem(ctx, inName[i], inmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", inName[i]);
    }
    for (i = 0; i < num_const; i++) {
        if (calCtxSetMem(ctx, constName[i], constmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", constName[i]);
    }
    for (i = 0; i < num_outputs; i++) {
        if (calCtxSetMem(ctx, outName[i], outmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", outName[i]);
    }

    // Setup the domain for execution
    CALdomain domain = {0, 0, size, size};
    // Event ID corresponding to the kernel invocation
    CALevent event = 0;
    // Launch the CAL kernel on the given domain
    CALresult calCtxError;
    double total_time = 0.0f, total_idle = 0.0f, total_cache = 0.0f;
    int j;
    counter_func_init();
    CALcounter cacheCounter;
    CALcounter idleCounter;
    calCtxCreateCounterExt(&cacheCounter, ctx, CAL_COUNTER_INPUT_CACHE_HIT_RATE);
    calCtxCreateCounterExt(&idleCounter, ctx, CAL_COUNTER_IDLE);
    CALfloat idlePercentage = 0.0f;
    CALfloat cachePercentage = 0.0f;
    fdata<<setw(10)<<curNum.alu_fetch;
    fdata<<setw(7)<<curNum.num_inputs;
    fdata<<setw(8)<<curNum.num_outputs;
    fdata<<setw(7)<<curNum.num_const;
    fdata<<setw(8)<<curNum.num_alu_ops;

    CALprogramGrid pg;
    static PFNCALCTXRUNPROGRAMGRID calCtxRunProgramGrid = 0;
    if (calCtxRunProgramGrid == 0) {
        calExtGetProc((CALextproc*)&calCtxRunProgramGrid, CAL_EXT_COMPUTE_SHADER, "calCtxRunProgramGrid");
        if (calCtxRunProgramGrid == 0) {
            fprintf(stderr, "Error: Compute shader extension not found\n");
        }
    }

    for (j = 0; j < OUTER_LOOP+1; j++) {
        calCtxFlush(ctx);
        calCtxBeginCounterExt(ctx, idleCounter);
        calCtxBeginCounterExt(ctx, cacheCounter);
        CALdomain3D rect;
        rect.width = curNum.num_domain;
        rect.height = curNum.num_domain;
        rect.depth = 1;
        pg.func = func;
        pg.flags = 0;
        pg.gridBlock.width = 64; //needs to be same value as what is in the kernel for thread group size.
        pg.gridBlock.height = 1;
        pg.gridBlock.depth = 1;
        pg.gridSize.width = (rect.width*rect.height + pg.gridBlock.width - 1) / pg.gridBlock.width;
        pg.gridSize.height = 1;
        pg.gridSize.depth = 1;
        start = clock();
        calCtxError = calCtxRunProgramGrid(&event, ctx, &pg);
        //calCtxError = calCtxRunProgram(&event, ctx, func, &domain);
        //fprintf(stdout, "%s\n", calGetErrorString());
        if (calCtxError == CAL_RESULT_BAD_HANDLE)
            fprintf(stdout, "bad handle error running program\n");
        if (calCtxError == CAL_RESULT_ERROR) {
            fprintf(stdout, "symbol error running context program\n");
            fprintf(stderr, "Error running, string is %s\n", calclGetErrorString());
            printf("%s", ILKernel.c_str());
            //getchar();
        }
        // Wait on the event for kernel completion
        while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING);
        stop = clock();
        calCtxEndCounterExt(ctx, idleCounter);
        calCtxEndCounterExt(ctx, cacheCounter);
        duration = (stop-start);
        calCtxGetCounterExt(&idlePercentage, ctx, idleCounter);
        calCtxGetCounterExt(&cachePercentage, ctx, cacheCounter);
        idlePercentage *= 100.0f;
        cachePercentage *= 100.0f;
        //fdata<<"Idle percentage: "<<idlePercentage<<endl;
        //fdata<<"Cache hit rate: "<<cachePercentage<<endl;
        duration = duration/(double)CLOCKS_PER_SEC;
        if (j != 0)
            total_time += duration;
        total_idle += idlePercentage;
        total_cache += cachePercentage;
        //fdata<<"Kernel "<<j<<" Time: "<<duration<<endl;
    }
    getchar();

    string bottleneck;
    float core_time = 0.0f;
    float fetch_time = 0.0f;
    float mem_time = 0.0f;
    float exp_time = 0.0f;
    cout<<"ALU Ops: "<<curNum.num_alu_ops<<endl;
    core_time = ((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_alu_ops))/((160.0f)*((float)attribs.engineClock*1000000.0f));
    fetch_time = ((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_inputs))/((40.0f)*((float)attribs.engineClock*1000000.0f));
    mem_time = ((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_outputs*128.0f))/((256.0f)*((float)attribs.memoryClock*1000000.0f*2.0f));
    cout<<"Core Time: "<<core_time<<endl;
    cout<<"Fetch Time: "<<fetch_time<<endl;
    cout<<"Mem Time: "<<mem_time<<endl;
    if (core_time >= fetch_time) {
        if (core_time >= mem_time) { exp_time = core_time; bottleneck = "ALU"; }
        else { exp_time = mem_time; bottleneck = "MEMORY"; }
    } else {
        if (fetch_time >= mem_time) { exp_time = fetch_time; bottleneck = "FETCH"; }
        else { exp_time = mem_time; bottleneck = "MEMORY"; }
    }
    calCtxDestroyCounterExt(ctx, idleCounter);
    calCtxDestroyCounterExt(ctx, cacheCounter);
    fdata<<setw(6)<<OUTER_LOOP*INNER_LOOP;
    fdata<<setw(13)<<total_cache/(OUTER_LOOP*INNER_LOOP);
    fdata<<setw(13)<<total_idle/(OUTER_LOOP*INNER_LOOP);
    fdata<<setw(13)<<total_time;
    fdata<<setw(7)<<curNum.num_domain;
    fdata<<setw(13)<<(exp_time*OUTER_LOOP*INNER_LOOP);
    fdata<<setw(11)<<bottleneck;
    fdata<<setw(5)<<curNum.num_GPR;
    fdata<<setw(5)<<curNum.num_wf;
    fdata<<endl;

    //remap the resource for output
    for (i = 0; i < num_outputs; i++) {
        outPtr[i] = NULL;
        if (calResMap((CALvoid**)&outPtr[i], &pitch, outLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource outLocal %d", i);
    }

    //print the memory
    CALfloat *out1[MAX_OUTPUTS];
    for (i = 0; i < num_outputs; i++) {
        for (unsigned int k = 0; k < size; k++) {
            out1[i] = &outPtr[i][k*pitch];
            for (unsigned int j = 0; j < size; j++) {
                //printf("out1[%d][%d]: %f\n", i, j+k*size, out1);
            }
        }
    }

    // verify using CPU resource and function
    /*float *verify_out = (float*)malloc(curNum.num_domain*curNum.num_domain*sizeof(float));
    float *tmpf = (float*)malloc(curNum.num_alu_ops*sizeof(float));
    for(i=0;i<curNum.num_domain;i++) {
        for(j=0;j<curNum.num_domain;j++) {
            tmpf[0]=verify[j+size*i]+verify[j+size*i];
            tmpf[1]=tmpf[0]+verify[i+size*j];
            tmpf[2]=tmpf[1]+tmpf[0];
            tmpf[3]=tmpf[2]+tmpf[1];
            tmpf[4]=tmpf[3]+tmpf[2];
            tmpf[5]=tmpf[4]+tmpf[3];
            verify_out[j+size*i]=tmpf[5]+tmpf[4];
        }
    }
    bool confirm=false;
    for(i=0;i<curNum.num_domain;i++) {
        out1[0]=&outPtr[0][i*pitch];
        for (j=0;j<curNum.num_domain;j++) {
            if (out1[0] == verify_out[j+size*i]) { confirm = true; }
            else {
                confirm=false;
                printf("%d: %f = %f\n", j+size*i, out1[0], verify_out[j+size*i]);
                printf("ERROR, output does not compute!\n");
                getchar();
            }
            if (confirm == false) { exit(1); }
        }
    }*/

    //unmap the resource for output
    for (i = 0; i < num_outputs; i++) {
        if (calResUnmap(outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping outLocal %d", i);
    }

    //unload module
    calModuleUnload(ctx, module);
    //free the image
    calclFreeImage(img);
    //free the object
    calclFreeObject(obj);

    //release the resource from the context
    for (i = 0; i < num_inputs; i++) {
        if (calCtxReleaseMem(ctx, inmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource inmem %d from context", i);
    }
    for (i = 0; i < num_const; i++) {
        if (calCtxReleaseMem(ctx, constmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource constmem %d from context", 0);
    }
    for (i = 0; i < num_outputs; i++) {
        if (calCtxReleaseMem(ctx, outmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource from context");
    }

    // deallocate local resource
    for (i = 0; i < num_inputs; i++) {
        if (calResFree(inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing inLocal %d", i);
    }
    for (i = 0; i < num_const; i++) {
        if (calResFree(constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing constLocal %d", 0);
    }
    for (i = 0; i < num_outputs; i++) {
        if (calResFree(outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing outLocal\n");
    }

    // Destroy the context
    if (calCtxDestroy(ctx) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
    // Closing the device
    calDeviceClose(device);
    // Shutting down CAL
    if (calShutdown() != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
}
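The grid-size arithmetic in the host code above can be isolated as a small C helper. This is only a sketch: `grid_groups` is a hypothetical name, and it assumes a one-dimensional launch where the group size matches the value given to `dcl_num_thread_per_group` in the kernel.

```c
#include <assert.h>

/* Number of thread groups needed to cover a width x height domain
 * with a 1D launch, rounding up so a partial last group is included. */
unsigned int grid_groups(unsigned int width,
                         unsigned int height,
                         unsigned int group_size)
{
    unsigned int threads;
    assert(group_size > 0);
    threads = width * height;
    return (threads + group_size - 1) / group_size;
}
```

Because of the rounding up, the last group can contain threads past the end of the domain (a 10x10 domain with groups of 64 launches 128 threads), so a kernel that writes `g[tid]` unconditionally may touch elements beyond the intended range.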

MicahVillmow · ‎08-03-2009

Is the error still with the outputLocal allocation or somewhere else?

ryta1203 · ‎08-03-2009

No, I get an error: "error getting module name o0" and "error setting context output memory (null)"

Then when the kernel runs (when I call calCtxRunProgramGrid(..)) I get "symbol error running context program" and "Error running, string is No Error".

It's definitely something on the host side code but I'm really having a problem because there is such a shortage of documentation on this. Any help would be great, thanks.

MicahVillmow · ‎08-03-2009

Ok, I just wanted to make sure. I've made this error myself many times. The problem is you are trying to map a memory buffer to the module 'o0', however, the output buffers ONLY exist in pixel shader code and not compute shader. The correct name to map the global buffer is 'g[]'. This should fix this issue for you.

ryta1203 · ‎08-03-2009

Micah,

So even if I have many outputs they all have the same name: 'g[]', just like that?

ryta1203 · ‎08-03-2009

Micah,

Ok, I have tried that and it seems that the errors are gone. Once again, thank you for your time, trust me I understand how valuable it is (your time that is).

Hopefully, this will all become much clearer in the new documentation. This is not very clear in the docs now, even under the "how to use the global buffer" section.

MicahVillmow · ‎08-03-2009

Yeah, just 'g[]', as there is only one memory buffer, which is very similar to a C++ style array. Unlike in pixel shader with the color buffers, you can write to it as many times as you want but only need to initialize it once.
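A small C sketch of the naming rule described here: pixel shader outputs are looked up as 'o0', 'o1', and so on, while compute shader has the single name 'g[]'. The helper `output_symbol_name` and the `KernelMode` enum are hypothetical; the returned string would then be passed to calModuleGetName in place of the "o%d" name used in the host code earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { KERNEL_PIXEL, KERNEL_COMPUTE } KernelMode;

/* Returns the module symbol name to look up for an output.
 * Uses a static buffer, so the result is overwritten by the next call
 * and the function is not thread-safe. */
const char *output_symbol_name(KernelMode mode, unsigned int index)
{
    static char name[16];
    assert(mode == KERNEL_PIXEL || mode == KERNEL_COMPUTE);
    if (mode == KERNEL_COMPUTE) {
        /* One global buffer, regardless of how many values are written. */
        strcpy(name, "g[]");
    } else {
        /* One named color buffer per output. */
        snprintf(name, sizeof name, "o%u", index);
    }
    return name;
}
```

In compute mode the index is ignored, which matches the point that the global buffer is bound once however many elements the kernel writes.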

ryta1203 · ‎09-02-2009

I'm getting incorrect results with this kernel:

"il_cs_2_0\n"
"dcl_num_thread_per_group 64\n"
"dcl_cb cb0[1]\n"
"dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
"itof r7.z, vAbsTidFlat.x\n"
"mul r7.y, r7.z, cb0[0].y\n"
"mod r7.x, r7.z, cb0[0].x\n"
"flr r8, r7\n"
"sample_resource(0)_sampler(0) r0, r8\n"
"sample_resource(1)_sampler(0) r1, r8\n"
"sample_resource(2)_sampler(0) r2, r8\n"
"add r3, r1, r0\n"
"add r4, r3, r2\n"
"add r5, r4, r3\n"
"add r6, r5, r4\n"
"add r7, r6, r5\n"
"add r8, r7, r6\n"
"add r9, r8, r7\n"
"mov g[vAbsTidFlat.x], r9\n"
"ret_dyn\n"
"end\n"
;

The "results" are actually "correct" but they are in the wrong place (and some just show 0, meaning they are not being computed on at all)... how I've done this is how they do it in the inputspeed_cs example, so I'm a bit confused.

cb0[0].x = domain and cb0[0].y = 1/domain (it's a squared domain)

Actually, I'm still fairly confused when it comes to getting the right 2D index to use in texture fetching for compute shader mode.

What's wrong with the above? Any ideas?

MicahVillmow · ‎09-02-2009

Ryta,
Are your textures allocated as linear or tiled formats? i.e. are you passing to all your calResAlloc the RES_ALLOC_GLOBAL_BUFFER flag? You are indexing into the sampler with a linear address converted into a 2D address from a tiled surface, so the data you think you are grabbing is actually in a different location.
If your resources are tiled location 2,2 in the texture is the 4th data element and not the (width + 2)th, and location 3,1 is the 5th data element and not the 3rd.

ryta1203 · ‎09-02-2009

The output is linear, not the input.

I didn't think it was possible to declare linear input and still use texture fetches to get the input.

ryta1203 · ‎09-02-2009

Why is 2,2 the 4th element? What element is 0, 0? Should be 0, yes? 0, 1 is 1? 1, 0 is 2? 1, 1 is 3 (the 4th element)?

So for example, if I use 2,2 as the access for the kernel then I get, by index:

0: 0
1: 0
2: 0
3: Real value
4: 0
5: 0
6: 0
7: Real value

ryta1203 · ‎09-02-2009

Actually, I get no change in output whether using the CAL_RESALLOC_GLOBAL_BUFFER or just using 0 for the flag, the result is the same.

MicahVillmow · ‎09-02-2009

The tiling mode is just a method of optimizing for the rasterization pattern. In compute shader, since your rasterization pattern is linear, you want your textures to be linear so that they hit the cache in a more friendly manner. You still want to do blocking for cache locality however. In pixel shader, the rasterization pattern is hierarchical-z, so the tiling pattern matches this pattern, resulting in good cache/access behaviour. However, as you are finding out, when using linear addressing on a tiled surface, the data you think you are getting is not the data you are actually getting. This also was a problem with using vObjIndex in pixel shader and is one of the quirks of our hardware.

MicahVillmow · ‎09-02-2009

0,0 is the first element in the memory.

ryta1203 · ‎09-02-2009

Micah,

Ok, I understand that one is tiled and the other is linear, though I can't get either to work for compute mode. Sadly, this doesn't tell me anything about the tiled arrangement.

Maybe some better documentation with graphs would go far to help people (or at least me) understand this.

If I try to sample off a literal I get the same result regardless of the literal values... 0,0 returns same result as 4, 0 or 127, 35, etc..

So my question is this: how are the groups arranged off of the absolute thread index? Using a 64x1 block, if I want to access absolute index 63 then it should just be 0, 63 correct?

MicahVillmow · ‎09-02-2009

Ryta,
If your texture is linear, then absolute index 63 will be at 63, 0(x, y) and in a tiled texture, it will be at location 8,8(x,y).

ryta1203 · ‎09-02-2009

Maybe I have the output wrong, I'm just using vAbsTidFlat.x... this is for a 64x1 block.

Or is it possible that I'm reading the output incorrectly back? (I have verified that my method works fine in pixel shader mode)

ryta1203 · ‎09-02-2009

Also, in tiled layout are 1,0 and 0,1 the same element?

MicahVillmow · ‎09-02-2009

Ryta,
If you look at 1.2.5.6 of the Stream Computing User Guide, it shows you the tiled memory format. 1,0 is B and 0,1 is C. Also, is your format a float4? The global buffer only works on 128 bits with a straight move; you can do conditional moves to various components to get 32-bit writes.

Compute Mode Questions