46 Replies Latest reply on Sep 2, 2009 9:27 PM by MicahVillmow

    Compute Mode Questions

    ryta1203

      1. Where do the docs talk about compute shader mode?

      2. Why do I get the "No Error" string for a kernel that doesn't return CAL_RESULT_OK??? There is obviously an error but it tells me "don't worry about it.. no error... but your kernel still won't run, so sorry"!? EDIT: Kernel compiles fine in SKA.

      3. Do you have to sample from an offset? If vaTid is the global thread id, why can't I just sample (either sampling or reading from/writing to the global buffer) from that (like you would with vWinCoord0)?

      4. If I'm sampling the inputs as streams in compute shader mode, do I have to allocate the resource as global?

      5. Can you burst write in CS?

      6. Where are the docs that talk about CS?

        • Compute Mode Questions
          MicahVillmow
          1&6) If they were not in 1.4, they should be in the next release.
          2) Most likely a PS instruction is being used as a CS instruction, but without the kernel I cannot be sure.
          3) The sample instruction expects the address to be two-dimensional and in floating-point format; vaTid is a one-dimensional integer.
          4) No, only the output (the resource you write to) must be allocated as global.
          5) Yes, just do global buffer writes with address offsets of + 0, +1, +2, +3, etc...
            • Compute Mode Questions
              ryta1203

              Thank you, AGAIN!!

              The kernel is below and is pretty straightforward; however, I was using vaTid to sample, which apparently I cannot do. What should I be using instead to access the thread id? All the "...Tid" registers seem to be single-component.

               

               

              const char HILKernel[] =
              "il_cs_2_0\n"
              "dcl_num_thread_per_group 64\n"
              "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
              "dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
              "mov r2.x, vaTid.x\n"
              "sample_resource(0)_sampler(0) r0, r2\n"
              "sample_resource(1)_sampler(0) r1, r2\n"
              "add r2, r1, r0\n"
              "add r3, r2, r1\n"
              "add r4, r3, r2\n"
              "add r5, r4, r3\n"
              "add r6, r5, r4\n"
              "add r7, r6, r5\n"
              "add r8, r7, r6\n"
              "add r9, r8, r7\n"
              "add r10, r9, r8\n"
              "add r11, r10, r9\n"
              "add r12, r11, r10\n"
              "add r13, r12, r11\n"
              "add r14, r13, r12\n"
              "add r15, r14, r13\n"
              "add r16, r15, r14\n"
              "add r17, r16, r15\n"
              "mov g[vaTid0.x], r17\n"
              "ret_dyn\n"
              "end\n"



            • Compute Mode Questions
              MicahVillmow
              So, after your copy to r2.x, you need to convert it into x and y coordinates via either a shift/and or a mod/div, and then convert the results to floating point using itof; then you can index correctly into the samplers. Also, it might be the mix of vaTid.x and vaTid0.x that is causing the problems. Please use the same name in both locations and make sure that it is the one specified in the docs. I know that they were updated recently, but I am not sure whether it was a 1.3 change or a 1.4 change.
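As a rough host-side illustration of the conversion described in this reply, the sketch below turns a flat thread id into a 2D floating-point address in two ways: a general div/mod, and a mask/shift that only works for power-of-two widths. The type and function names are hypothetical and are not part of IL or CAL; the final cast to float stands in for what itof does in the kernel.

```cpp
#include <cstdint>

// Hypothetical helper type: a 2D sampler address in floating point.
struct Coord2D {
    float x;
    float y;
};

// General case: integer mod/div, then convert to float (the role itof plays in IL).
Coord2D coordFromFlatId(std::uint32_t tid, std::uint32_t width) {
    std::uint32_t x = tid % width;  // column within the row
    std::uint32_t y = tid / width;  // row index
    return {static_cast<float>(x), static_cast<float>(y)};
}

// Power-of-two widths only: mask for x, right shift for y.
Coord2D coordFromFlatIdPow2(std::uint32_t tid, std::uint32_t log2Width) {
    std::uint32_t x = tid & ((1u << log2Width) - 1u);  // low bits select the column
    std::uint32_t y = tid >> log2Width;                // remaining bits select the row
    return {static_cast<float>(x), static_cast<float>(y)};
}
```

For a width of 64, thread 130 maps to column 2 of row 2 with either function.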
                • Compute Mode Questions
                  ryta1203

                  Ok, thanks, will try that.

                  So when I output, if I have 8 outputs going to the global buffer for that kernel (each of the same element) then I should burst write by doing:

                  g[r0]

                  g[r0+1]

                  g[r0+2]

                  g[r0+3]

                  g[r0+4]... etc, etc.. like in the burst_write_cs example, correct?

                  Is this the case for float4 AND float data types? Does it not matter the data type?

                  What about inputs from the global buffer, are they handled the same way (with that same stride)?

                • Compute Mode Questions
                  MicahVillmow
                  Yes, that is the correct way of doing it. In 1.4 the global buffer inputs had some performance issues that we have since fixed, but if you set up your inputs and outputs in that manner, then you can possibly get the best performance.
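To make the addressing pattern concrete, here is a small host-side emulation of one thread writing consecutive elements g[base + 0], g[base + 1], and so on. It is only a sketch: the names are hypothetical, the global buffer is modeled as a vector of float4 elements, and the rule that each thread owns a contiguous slot starting at tid * outputsPerThread is an assumption made for the example, not something stated in the thread.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Float4 = std::array<float, 4>;

// Emulates one thread writing values.size() consecutive elements starting at
// `base`, i.e. g[base + 0], g[base + 1], ..., in that order.
void burstWrite(std::vector<Float4>& g, std::size_t base,
                const std::vector<Float4>& values) {
    for (std::size_t k = 0; k < values.size(); ++k) {
        g.at(base + k) = values[k];  // at() throws if the write would go out of range
    }
}

// Assumption for this sketch: each thread owns a contiguous slot of
// `outputsPerThread` elements in the global buffer.
std::size_t baseForThread(std::size_t tid, std::size_t outputsPerThread) {
    return tid * outputsPerThread;
}
```

With two outputs per thread, thread 3 would write elements 6 and 7 and leave its neighbours untouched.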
                    • Compute Mode Questions
                      ryta1203

                      Also, can you re-explain how to sample in compute shader mode? I'm not sure I understand why you need to div/mod (you guys mul/mod in your examples).

                      What is cb0[0] in this example?

                      "itof r0.z, vaTid0.x\n"
                      "mul r0.y, r0.z, cb0[0].y\n"
                      "mod r0.x, r0.z, cb0[0].x\n"
                      "flr r0.xy, r0.xy\n"

                       





                    • Compute Mode Questions
                      MicahVillmow
                      cb0[0].y is actually 1 / width, so the division is done on the host side (once, instead of once per thread), and cb0[0].x is width.
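The four IL instructions quoted in the question can be mirrored on the host to check the arithmetic. This is a sketch under assumptions: the struct and function names are hypothetical, and std::fmod and std::floor are used as stand-ins for the IL mod and flr instructions, whose exact rounding behavior is not described in this thread.

```cpp
#include <cmath>

// Hypothetical helper type for the resulting 2D sample address.
struct SampleCoord {
    float x;
    float y;
};

// width plays the role of cb0[0].x and invWidth the role of cb0[0].y (1 / width).
SampleCoord sampleCoordFromTid(unsigned tid, float width, float invWidth) {
    float t = static_cast<float>(tid);      // itof r0.z, vaTid0.x
    float y = t * invWidth;                 // mul  r0.y, r0.z, cb0[0].y
    float x = std::fmod(t, width);          // mod  r0.x, r0.z, cb0[0].x
    return {std::floor(x), std::floor(y)};  // flr  r0.xy, r0.xy
}
```

Multiplying by a precomputed reciprocal avoids a per-thread division, but for widths that are not powers of two the product can round slightly off, so it may be worth checking the row index near row boundaries.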
                      • Compute Mode Questions
                        MicahVillmow
                        if you want to base your computation on a dynamic width, then yes. However, you can hardcode your width to say 1024 and then just vary the height of the data domain. It requires a little bit of translation on the host side, but would simplify the kernels.
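The host-side translation mentioned here could look roughly like the following: with the width fixed, the host only has to work out how many rows are needed and how much padding the last row carries. The value 1024 comes from the reply; the function names are hypothetical.

```cpp
#include <cstddef>

// Fixed row width, taken from the example value in the reply.
constexpr std::size_t kFixedWidth = 1024;

// Rows needed to hold `count` elements when every row is kFixedWidth wide.
constexpr std::size_t heightFor(std::size_t count) {
    return (count + kFixedWidth - 1) / kFixedWidth;  // ceiling division
}

// Unused elements at the end of the last row.
constexpr std::size_t paddingFor(std::size_t count) {
    return heightFor(count) * kFixedWidth - count;
}
```

With a fixed power-of-two width, the kernel can also split the thread id with a mask and a shift instead of reading the width from a constant buffer.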

                          • Compute Mode Questions
                            ryta1203

                            il_cs_2_0
                            dcl_num_thread_per_group 64
                            dcl_cb cb0[1]
                            dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                            dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                            itof r2.z, vaTid0.x
                            mul r2.y, r2.z, cb0[0].y
                            mod r2.x, r2.z, cb0[0].x
                            flr r3, r2
                            sample_resource(0)_sampler(0) r0, r3
                            sample_resource(1)_sampler(0) r1, r3
                            add r2, r1, r0
                            add r3, r2, r1
                            add r4, r3, r2
                            add r5, r4, r3
                            add r6, r5, r4
                            add r7, r6, r5
                            add r8, r7, r6
                            add r9, r8, r7
                            add r10, r9, r8
                            add r11, r10, r9
                            add r12, r11, r10
                            add r13, r12, r11
                            mov r14.x, vaTid0.x
                            mov g[r14.x], r13
                            ret_dyn
                            end

                            This is the kernel I have so far that does not work, I get the same errors. I have my const (float4) declared as: c[0]=1/width, c[1]=width, c[2] and c[3] = 0.

                            I also have allocation errors for the output outLocal, error getting module name o0 and error setting context memory (null)

                          • Compute Mode Questions
                            MicahVillmow
                            You have the mul and mod constants backwards: c[0] should be width and c[1] should be 1/width.
                            • Compute Mode Questions
                              MicahVillmow
                              Well, then the next problem is to find out what is going wrong by finding out line by line what is causing the error. Also, can you try 64, 1, 1 as the thread_per_group and vAbsTidFlat.x?
                                • Compute Mode Questions
                                  ryta1203

                                  Micah,

                                     I get "error occured allocating resource outLocal 0" and "Error compiling, string is 'No Error'".

                                  Maybe it will help if I post some of my code:

                                  Here is the kernel:

                                   

                                  const char HILKernel[] =
                                  "il_cs_2_0\n"
                                  "dcl_num_thread_per_group 64,1,1\n"
                                  "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
                                  "dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"
                                  "itof r2.z, vAbsTidFlat.x\n"
                                  "mod r2.x, r2.z, cb0[0].y\n"
                                  "mul r2.y, r2.z, cb0[0].y\n"
                                  "flr r3, r2\n"
                                  "sample_resource(0)_sampler(0) r0, r3\n"
                                  "sample_resource(1)_sampler(0) r1, r3\n"
                                  "add r2, r1, r0\n"
                                  "add r3, r2, r1\n"
                                  "add r4, r3, r2\n"
                                  "add r5, r4, r3\n"
                                  "mov r6.x, vAbsTidFlat.x\n"
                                  "mov g[r6.x], r5\n"
                                  "ret_dyn\n"
                                  "end\n";

                                  My constants:



                                  for (i = 0; i < num_const; i++)
                                  {
                                      constPtr[i][1] = 1.0f / (float)curNum.num_domain;
                                      constPtr[i][0] = (float)curNum.num_domain;
                                      constPtr[i][2] = 0.0f;
                                      constPtr[i][3] = 0.0f;
                                  }

                                  And my output resource allocation:

                                  for (i = 0; i < num_outputs; i++)
                                      if (calResAllocLocal2D(&outLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK)
                                          fprintf(stderr, "error occured allocating resource outLocal %d", i);

                                  With a domain of > 32 the "error occured allocating resource outLocal" goes away.



                                  It doesn't seem like the compiler likes 64,1,1 because when I take that away the kernel compiles fine, though I still get errors.






                                   

                                    • Compute Mode Questions
                                      the729

                                      Ryta,

                                      AFAIK, one GPU context cannot load both CS and PS kernels, due to a bug that is perhaps in CAL.

                                      Are you using only CS, or mixing up CS and PS in a single context in your application? If the latter, you will get weird results.

                                       

                                        • Compute Mode Questions
                                          ryta1203

                                          I only have one context and one kernel, it is a cs kernel that is posted above.

                                          Also, outside of using RunProgramGrid and using the global buffer flag for the output my CAL code hasn't changed from my ps kernel running CAL code. What else should I be modifying?

                                            • Compute Mode Questions
                                              the729

                                              I found that, in your latest posted code, you do not declare cb0. Maybe that is the problem, just maybe.

                                              Also, personally I do not use ret_dyn at the end of main IL programs, since it is meant for functions. I do not think this causes any problem in the case you posted. However, according to the documents, you should use endmain to end the main procedure and begin the declaration of functions, if there are any.

                                                • Compute Mode Questions
                                                  ryta1203

                                                  the729,

                                                    That just happened to get left out, sorry, this is not the problem. I also changed ret_dyn to ret and end to endmain, that didn't help. Any other ideas?

                                                  I'm still getting an error when running the program but the stringError is No Error, sort of contradicts itself.

                                                    • Compute Mode Questions
                                                      ryta1203

                                                      Sorry, forgot to repost my kernel:

                                                      il_cs_2_0
                                                      dcl_num_thread_per_group 64
                                                      dcl_cb cb0[1]
                                                      dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                      dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                      dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                      dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                      dcl_resource_id(4)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                      dcl_resource_id(5)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                      itof r6.z, vaTid0.x
                                                      mul r6.y, r6.z, cb0[0].y
                                                      mod r6.x, r6.z, cb0[0].x
                                                      flr r7, r6
                                                      sample_resource(0)_sampler(0) r0, r7
                                                      sample_resource(1)_sampler(0) r1, r7
                                                      sample_resource(2)_sampler(0) r2, r7
                                                      sample_resource(3)_sampler(0) r3, r7
                                                      sample_resource(4)_sampler(0) r4, r7
                                                      sample_resource(5)_sampler(0) r5, r7
                                                      add r6, r1, r0
                                                      add r7, r6, r2
                                                      add r8, r7, r3
                                                      add r9, r8, r4
                                                      add r10, r9, r5
                                                      add r11, r10, r9
                                                      add r12, r11, r10
                                                      add r13, r12, r11
                                                      add r14, r13, r12
                                                      add r15, r14, r13
                                                      add r16, r15, r14
                                                      add r17, r16, r15
                                                      add r18, r17, r16
                                                      add r19, r18, r17
                                                      add r20, r19, r18
                                                      add r21, r20, r19
                                                      add r22, r21, r20
                                                      add r23, r22, r21
                                                      add r24, r23, r22
                                                      add r25, r24, r23
                                                      add r26, r25, r24
                                                      mov r27.x, vaTid0.x
                                                      mov g[r27.x], r26
                                                      ret_dyn
                                                      end

                                                       

                                                      BTW, the SKA compiles this code just fine, it makes me think that there is something on the host code that is wrong, I have:

                                                      const[0]=width and const[1]=1/width

                                                      RunProgramGrid instead of RunProgram (though I've tried both)

                                                      and the output flag is CAL_RESALLOC_GLOBAL_BUFFER..

                                                      any other ideas?
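For reference when moving from RunProgram to RunProgramGrid: a grid launch is sized in thread groups rather than by a 2D domain. The sketch below shows the usual ceiling division for covering every element, assuming a group size of 64 to match dcl_num_thread_per_group in the kernel. The struct and function names are illustrative only and are not CAL types.

```cpp
#include <cstdint>

// Illustrative stand-in for the block and grid sizes a grid launch needs.
struct GridDims {
    std::uint32_t blockWidth;  // threads per group; must match the kernel declaration
    std::uint32_t groupCount;  // number of groups to launch
};

// Number of groups needed so that width * height elements each get a thread.
GridDims gridFor(std::uint32_t width, std::uint32_t height,
                 std::uint32_t threadsPerGroup) {
    std::uint32_t total = width * height;
    std::uint32_t groups = (total + threadsPerGroup - 1) / threadsPerGroup;  // round up
    return {threadsPerGroup, groups};
}
```

When the element count is not a multiple of the group size, the last group contains extra threads, so the kernel may need to guard against reading or writing past the end of the data.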

                                          • Compute Mode Questions
                                            MicahVillmow
                                            Ryta,
                                            Can you try with vAbsTidFlat.x instead of vaTid0.x? Also, this shader compiles fine for me on my machine.
                                              • Compute Mode Questions
                                                ryta1203

                                                Micah,

                                                  I tried vAbsTidFlat.x and that didn't help.

                                                  The shader COMPILES fine on my machine too, this is not where I get the error, it's not a compile error it's a runtime error I am getting.

                                                  I think that the problem is on the host side code:

                                                 

void callCalIL()
{
    CALuint cal_size = curNum.num_domain;
    unsigned int size = curNum.num_domain;
    unsigned int num_inputs=curNum.num_inputs;
    unsigned int num_outputs=curNum.num_outputs;
    unsigned int num_const=curNum.num_const;
    unsigned int i=0;
    double duration=0.0f;
    clock_t start, stop;

    // Initialize CAL system for computation
    if(calInit() != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // Query and print the runtime version that is loaded
    CALuint version[3];
    calGetVersion(&version[0], &version[1], &version[2]);
    fprintf(stderr, "CAL Runtime version %d.%d.%d\n", version[0], version[1], version[2]);

    // Query the compiler version that is loaded
    calclGetVersion(&version[0], &version[1], &version[2]);
    fprintf(stderr, "CAL Compiler version %d.%d.%d\n", version[0], version[1], version[2]);

    // Query the number of devices on the system
    CALuint numDevices = 0;
    if(calDeviceGetCount(&numDevices) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
    printf("Number of Devices: %d\n", numDevices);

    // Get the information on the 0th device
    CALdeviceinfo info;
    if(calDeviceGetInfo(&info, 0) != CAL_RESULT_OK)
        fprintf(stderr, "error occured getting info\n");
    switch(info.target)
    {
    case CAL_TARGET_600:
        {
            fprintf(stdout, "Device Type = GPU R600\n");
            break;
        }
    case CAL_TARGET_670:
        {
            fprintf(stdout, "Device Type = GPU RV670\n");
            break;
        }
    case CAL_TARGET_770:
        {
            fprintf(stdout, "Device Type = GPU RV770\n");
            break;
        }
    default:
        {
            fprintf(stdout, "Unknown Device\n");
        }
    }

    // Opening the 0th device
    CALdevice device = 0;
    if(calDeviceOpen(&device, 0) != CAL_RESULT_OK)
        fprintf(stderr, "error occured opening device\n");

    // Create context on the device
    CALcontext ctx=0;
    if(calCtxCreate(&ctx, device) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // allocate local resource
    CALresource inLocal[MAX_INPUTS], outLocal[MAX_OUTPUTS], constLocal[MAX_CONST];
    for (i=0;i<num_inputs;i++)
    {
        inLocal[i]=0;
        if(calResAllocLocal2D(&inLocal[i], device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating resource inLocal %d", i);
    }
    for (i=0;i<num_outputs;i++)
    {
        if(calResAllocLocal2D(&outLocal[i] ,device, cal_size, cal_size, CAL_FORMAT_FLOAT_4, CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating resource outLocal %d", i);
    }
    for (i=0;i<num_const;i++)
    {
        if(calResAllocRemote1D(&constLocal[i], &device, 1, 1, CAL_FORMAT_FLOAT_4, 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured allocating remote constLocal %d", i);
    }

    CALfloat *inPtr[MAX_INPUTS];
    /*CALfloat **inPtr=(CALfloat**)malloc(sizeof(CALfloat));
    for (i=0;i<MAX_INPUTS;i++)
    {
        inPtr[i] = (CALfloat*)malloc(sizeof(CALfloat));
    }*/
    CALfloat *outPtr[MAX_OUTPUTS];
    CALfloat *constPtr[MAX_CONST];
    CALuint pitch = 0;
    CALuint constPitch=0;

    //map the resource for input
    for (i=0;i<num_inputs;i++)
    {
        inPtr[i] = NULL;
        if (calResMap((CALvoid**)&inPtr[i], &pitch, inLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource inPtr %d", i);
    }

    //init the memory
    //float *verify=(float*)malloc(curNum.num_domain*curNum.num_domain*sizeof(float));
    CALfloat *tmp[MAX_INPUTS];
    for (i=0;i<num_inputs;i++)
    {
        for (unsigned int k=0;k < size; k++)
        {
            tmp[i] = &inPtr[i][k*pitch];
            for (unsigned int j=0;j<size;j++)
            {
                //verify[j+size*k] = (float)(j+k);
                tmp[i][4*j] = (CALfloat)(j+k);
                tmp[i][4*j+1] = (CALfloat)(j+k+1);
                tmp[i][4*j+2] = (CALfloat)(j+k+2);
                tmp[i][4*j+3] = (CALfloat)(j+k+3);
                //printf("input %d: [%d] = %f\n", i, j+size*k, tmp[i][j+size*k]);
                //printf("verify[%d]: %f\n", i, verify[j+size*k]);
            }
        }
    }

    //unmap the resource for input
    for (i=0;i<num_inputs;i++)
    {
        if (calResUnmap(inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping resource inLocal %d\n",i);
    }

    for (i=0;i<num_const;i++)
    {
        constPtr[i]=NULL;
        if (calResMap((CALvoid**)&constPtr[i], &constPitch, constLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource constPtr %d", 0);
    }
    for (i=0;i<num_const;i++)
    {
        constPtr[i][1]=1.0f/(float)curNum.num_domain;
        constPtr[i][0]=(float)curNum.num_domain;
        constPtr[i][2]=0.0f;
        constPtr[i][3]=0.0f;
    }
    for (i=0;i<num_const;i++)
    {
        if (calResUnmap(constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping resource constLocal %d\n",0);
    }

    CALmem inmem[MAX_INPUTS], outmem[MAX_OUTPUTS], constmem[MAX_CONST];
    for (i=0;i<num_inputs;i++)
    {
        inmem[i]=0;
        if (calCtxGetMem(&inmem[i], ctx, inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding resource %d to context\n", i);
    }
    for (i=0;i<num_const;i++)
    {
        constmem[i]=0;
        if (calCtxGetMem(&constmem[i], ctx, constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding resource %d to context\n", 0);
    }
    for (i=0;i<num_outputs;i++)
    {
        outmem[i]=0;
        if (calCtxGetMem(&outmem[i], ctx, outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error binding out resource %d to context\n",i);
    }

    //compile the kernel
    //link object to image
    CALdeviceattribs attribs;
    attribs.struct_size = sizeof(CALdeviceattribs);
    if (calDeviceGetAttribs(&attribs, 0) != CAL_RESULT_OK)
    {
        fprintf(stderr, "There was an error getting device attribs.\n");
        fprintf(stderr, "Error string is %s\n", calGetErrorString());
    }
    CALobject obj=NULL;
    CALimage img=NULL;
    if(calclCompile(&obj, CAL_LANGUAGE_IL, ILKernel.c_str(), info.target) != CAL_RESULT_OK)
    {
        fprintf(stderr, "Error compiling, string is %s\n", calclGetErrorString());
        getchar();
        exit(1);
    }
    if(calclLink(&img, &obj, 1) != CAL_RESULT_OK)
        fprintf(stderr, "error linking object\n");

    // load and run the kernel HERE
    CALmodule module=0;
    if(calModuleLoad(&module, ctx, img) != CAL_RESULT_OK)
        fprintf(stdout, "error loading module\n");

    // Query the entry point in the module for the function “main”
    CALfunc func = 0;
    if(calModuleGetEntry(&func, ctx, module, "main") != CAL_RESULT_OK)
        fprintf(stdout, "error getting module entry point\n");

    // Query the variable names for inName 0 and outName 0
    CALname inName[MAX_INPUTS], outName[MAX_OUTPUTS], constName[MAX_CONST];
    CALchar paramName[10];
    for (i=0;i<num_inputs;i++)
    {
        sprintf_s(paramName, "i%d", i);
        inName[i] = 0;
        if(calModuleGetName(&inName[i], ctx, module, paramName ) != CAL_RESULT_OK)
            fprintf(stdout,"error getting module name %s\n", paramName);
    }
    for (i=0;i<num_const;i++)
    {
        sprintf_s(paramName, "cb0");
        constName[i] = 0;
        if(calModuleGetName(&constName[i], ctx, module, paramName ) != CAL_RESULT_OK)
            fprintf(stdout,"error getting module name %s\n", paramName);
    }
    for (i=0;i<num_outputs;i++)
    {
        sprintf_s(paramName, "o%d", i);
        outName[i]=0;
        if(calModuleGetName(&outName[i], ctx, module, paramName) != CAL_RESULT_OK)
            fprintf(stdout,"error getting module name %s\n", paramName);
    }

    // Bind resources to memory handles for this context
    // ……………
    for (i=0;i<num_inputs;i++)
    {
        if(calCtxSetMem(ctx, inName[i], inmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", inName[i]);
    }
    for (i=0;i<num_const;i++)
    {
        if(calCtxSetMem(ctx, constName[i], constmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", constName[i]);
    }
    for(i=0;i<num_outputs;i++)
    {
        if(calCtxSetMem(ctx, outName[i], outmem[i]) != CAL_RESULT_OK)
            fprintf(stdout, "error setting context memory %s\n", outName[i]);
    }

    // Setup the domain for execution
    CALdomain domain = {0, 0, size, size};

    // Event ID corresponding to the kernel invocation
    CALevent event = 0;

    // Launch the CAL kernel on the given domain
    CALresult calCtxError;
    double total_time=0.0f, total_idle=0.0f, total_cache=0.0f;
    int j;
    counter_func_init();
    CALcounter cacheCounter;
    CALcounter idleCounter;
    calCtxCreateCounterExt(&cacheCounter, ctx, CAL_COUNTER_INPUT_CACHE_HIT_RATE);
    calCtxCreateCounterExt(&idleCounter, ctx, CAL_COUNTER_IDLE);
    CALfloat idlePercentage = 0.0f;
    CALfloat cachePercentage = 0.0f;
    fdata<<setw(10)<<curNum.alu_fetch;
    fdata<<setw(7)<<curNum.num_inputs;
    fdata<<setw(8)<<curNum.num_outputs;
    fdata<<setw(7)<<curNum.num_const;
    fdata<<setw(8)<<curNum.num_alu_ops;

    CALprogramGrid pg;
    static PFNCALCTXRUNPROGRAMGRID calCtxRunProgramGrid = 0;
    if (calCtxRunProgramGrid == 0)
    {
        calExtGetProc((CALextproc*)&calCtxRunProgramGrid, CAL_EXT_COMPUTE_SHADER, "calCtxRunProgramGrid");
        if (calCtxRunProgramGrid == 0)
        {
            fprintf(stderr, "Error: Compute shader extension not found\n");
        }
    }

    for (j=0;j<OUTER_LOOP+1;j++)
    {
        calCtxFlush(ctx);
        calCtxBeginCounterExt(ctx, idleCounter);
        calCtxBeginCounterExt(ctx, cacheCounter);
        CALdomain3D rect;
        rect.width = curNum.num_domain;
        rect.height = curNum.num_domain;
        rect.depth = 1;
        pg.func = func;
        pg.flags = 0;
        pg.gridBlock.width = 64; //needs to be same value as what is in the kernal for thread group size.
        pg.gridBlock.height = 1;
        pg.gridBlock.depth = 1;
        pg.gridSize.width = (rect.width*rect.height + pg.gridBlock.width - 1) / pg.gridBlock.width;
        pg.gridSize.height = 1;
        pg.gridSize.depth = 1;
        start = clock();
        calCtxError = calCtxRunProgramGrid(&event, ctx, &pg);
        //calCtxError = calCtxRunProgram(&event, ctx, func, &domain);
        //fprintf(stdout, "%s\n", calGetErrorString());
        if (calCtxError == CAL_RESULT_BAD_HANDLE)
            fprintf(stdout, "bad handle error running program\n");
        if (calCtxError == CAL_RESULT_ERROR)
        {
            fprintf(stdout, "symbol error running context program\n");
            fprintf(stderr, "Error running, string is %s\n", calclGetErrorString());
            printf("%s", ILKernel.c_str());
            //getchar();
        }

        // Wait on the event for kernel completion
        while(calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING);
        stop=clock();
        calCtxEndCounterExt(ctx, idleCounter);
        calCtxEndCounterExt(ctx, cacheCounter);
        duration =(stop-start);
        calCtxGetCounterExt(&idlePercentage, ctx, idleCounter);
        calCtxGetCounterExt(&cachePercentage, ctx, cacheCounter);
        idlePercentage *= 100.0f;
        cachePercentage *= 100.0f;
        //fdata<<"Idle percentage: "<<idlePercentage<<endl;
        //fdata<<"Cache hit rate: "<<cachePercentage<<endl;
        duration = duration/(double)CLOCKS_PER_SEC;
        if (j!=0)
            total_time+=duration;
        total_idle+=idlePercentage;
        total_cache+=cachePercentage;
        //fdata<<"Kernel "<<j<<" Time: "<<duration<<endl;
    }
    getchar();

    string bottleneck;
    float core_time=0.0f;
    float fetch_time=0.0f;
    float mem_time=0.0f;
    float exp_time=0.0f;
    cout<<"ALU Ops: "<<curNum.num_alu_ops<<endl;
    core_time=((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_alu_ops))/((160.0f)*((float)attribs.engineClock*1000000.0f));
    fetch_time=((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_inputs))/((40.0f)*((float)attribs.engineClock*1000000.0f));
    mem_time=((float)(curNum.num_domain*curNum.num_domain)*((float)curNum.num_outputs*128.0f))/((256.0f)*((float)attribs.memoryClock*1000000.0f*2.0f));
    cout<<"Core Time: "<<core_time<<endl;
    cout<<"Fetch Time: "<<fetch_time<<endl;
    cout<<"Mem Time: "<<mem_time<<endl;
    if (core_time >= fetch_time)
    {
        if (core_time >= mem_time)
        {
            exp_time = core_time;
            bottleneck="ALU";
        }
        else
        {
            exp_time = mem_time;
            bottleneck="MEMORY";
        }
    }
    else
    {
        if (fetch_time >= mem_time)
        {
            exp_time=fetch_time;
            bottleneck="FETCH";
        }
        else
        {
            exp_time=mem_time;
            bottleneck="MEMORY";
        }
    }
    calCtxDestroyCounterExt(ctx, idleCounter);
    calCtxDestroyCounterExt(ctx, cacheCounter);
    fdata<<setw(6)<<OUTER_LOOP*INNER_LOOP;
    fdata<<setw(13)<<total_cache/(OUTER_LOOP*INNER_LOOP);
    fdata<<setw(13)<<total_idle/(OUTER_LOOP*INNER_LOOP);
    fdata<<setw(13)<<total_time;
    fdata<<setw(7)<<curNum.num_domain;
    fdata<<setw(13)<<(exp_time*OUTER_LOOP*INNER_LOOP);
    fdata<<setw(11)<<bottleneck;
    fdata<<setw(5)<<curNum.num_GPR;
    fdata<<setw(5)<<curNum.num_wf;
    fdata<<endl;

    //remap the resource for output
    for (i=0;i<num_outputs;i++)
    {
        outPtr[i] = NULL;
        if (calResMap((CALvoid**)&outPtr[i], &pitch, outLocal[i], 0) != CAL_RESULT_OK)
            fprintf(stderr, "error occured mapping resource outLocal %d", i);
    }

    //print the memory
    CALfloat *out1[MAX_OUTPUTS];
    for (i=0;i<num_outputs;i++)
    {
        for (unsigned int k=0;k < size; k++)
        {
            out1[i] = &outPtr[i][k*pitch];
            for (unsigned int j=0;j<size;j++)
            {
                //printf("out1[%d][%d]: %f\n", i, j+k*size, out1[i][j]);
            }
        }
    }

    // verify using CPU resource and function
    /*float *verify_out = (float*)malloc(curNum.num_domain*curNum.num_domain*sizeof(float));
    float *tmpf = (float*)malloc(curNum.num_alu_ops*sizeof(float));
    for(i=0;i<curNum.num_domain;i++)
    {
        for(j=0;j<curNum.num_domain;j++)
        {
            tmpf[0]=verify[j+size*i]+verify[j+size*i];
            tmpf[1]=tmpf[0]+verify[i+size*j];
            tmpf[2]=tmpf[1]+tmpf[0];
            tmpf[3]=tmpf[2]+tmpf[1];
            tmpf[4]=tmpf[3]+tmpf[2];
            tmpf[5]=tmpf[4]+tmpf[3];
            verify_out[j+size*i]=tmpf[5]+tmpf[4];
        }
    }
    bool confirm=false;
    for(i=0;i<curNum.num_domain;i++)
    {
        out1[0]=&outPtr[0][i*pitch];
        for (j=0;j<curNum.num_domain;j++)
        {
            if (out1[0][j] == verify_out[j+size*i])
            {
                confirm = true;
            }
            else
            {
                confirm=false;
                printf("%d: %f = %f\n", j+size*i, out1[0][j], verify_out[j+size*i]);
                printf("ERROR, output does not compute!\n");
                getchar();
            }
            if (confirm == false)
            {
                exit(1);
            }
        }
    }*/

    //unmap the resource for output
    for (i=0;i<num_outputs;i++)
    {
        if (calResUnmap(outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured unmapping outLocal %d", i);
    }

    //unload module
    calModuleUnload(ctx, module);
    //free the image
    calclFreeImage(img);
    //free the object
    calclFreeObject(obj);

    //release the resource from the context
    for (i=0;i<num_inputs;i++)
    {
        if (calCtxReleaseMem(ctx, inmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource inmem %d from context", i);
    }
    for (i=0;i<num_const;i++)
    {
        if (calCtxReleaseMem(ctx, constmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource constmem %d from context", 0);
    }
    for (i=0;i<num_outputs;i++)
    {
        if (calCtxReleaseMem(ctx, outmem[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured releasing resource from context");
    }

    // deallocate local resource
    for (i=0;i<num_inputs;i++)
    {
        if (calResFree(inLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing inLocal %d", i);
    }
    for (i=0;i<num_const;i++)
    {
        if (calResFree(constLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing constLocal %d", 0);
    }
    for (i=0;i<num_outputs;i++)
    {
        if (calResFree(outLocal[i]) != CAL_RESULT_OK)
            fprintf(stderr, "error occured freeing outLocal\n");
    }

    // Destroy the context
    if(calCtxDestroy(ctx) != CAL_RESULT_OK)
        fprintf(stderr, "error occured");

    // Closing the device
    calDeviceClose(device);

    // Shutting down CAL
    if(calShutdown() != CAL_RESULT_OK)
        fprintf(stderr, "error occured");
}

                                              • Compute Mode Questions
                                                MicahVillmow
                                                Is the error still with the outputLocal allocation or somewhere else?
                                                  • Compute Mode Questions
                                                    ryta1203

                                                    No, I get an error: "error getting module name o0" and "error setting context output memory (null)"

                                                    Then when the kernel runs (when I call calRunProgramGrid(..)) I get "symbol error running context program" and "Error running, string is No Error".

                                                    It's definitely something in the host-side code, but I'm really having a problem because there is such a shortage of documentation on this. Any help would be great, thanks.

                                                  • Compute Mode Questions
                                                    MicahVillmow
                                                    Ok, I just wanted to make sure. I've made this error myself many times. The problem is that you are trying to map a memory buffer to the module name 'o0'; however, output buffers exist ONLY in pixel shader code, not in compute shader. The correct name for mapping the global buffer is 'g[]'. This should fix the issue for you.
                                                    • Compute Mode Questions
                                                      MicahVillmow
                                                      Yeah, just 'g[]', as there is only one memory buffer, which is very similar to a C++-style array. Unlike the color buffers in pixel shader, you can write to it as many times as you want, but you only need to initialize it once.
                                                        • Compute Mode Questions
                                                          ryta1203

                                                          I'm getting incorrect results with this kernel:

                                                           

                                                           

                                                          "il_cs_2_0\n"

                                                          "dcl_num_thread_per_group 64\n"

                                                          "dcl_cb cb0[1]\n"

                                                          "dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"

                                                          "dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"

                                                          "dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)\n"

                                                          "itof r7.z, vAbsTidFlat.x\n"

                                                          "mul r7.y, r7.z, cb0[0].y\n"

                                                          "mod r7.x, r7.z, cb0[0].x\n"

                                                          "flr r8, r7\n"

                                                          "sample_resource(0)_sampler(0) r0, r8\n"

                                                          "sample_resource(1)_sampler(0) r1, r8\n"

                                                          "sample_resource(2)_sampler(0) r2, r8\n"

                                                          "add r3, r1, r0\n"

                                                          "add r4, r3, r2\n"

                                                          "add r5, r4, r3\n"

                                                          "add r6, r5, r4\n"

                                                          "add r7, r6, r5\n"

                                                          "add r8, r7, r6\n"

                                                          "add r9, r8, r7\n"

                                                          "mov g[vAbsTidFlat.x], r9\n"

                                                          "ret_dyn\n"

                                                          "end\n"

                                                           

                                                          ;

                                                           

                                                          The "results" are actually "correct" but they are in the wrong place (and some just show 0, meaning they are not being computed on at all)... how I've done this is how they do it in the inputspeed_cs example, so I'm a bit confused.

                                                          cb0[0].x = domain and cb0[0].y = 1/domain (it's a square domain)

                                                          Actually, I'm still fairly confused when it comes to getting the right 2D index to use in texture fetching for compute shader mode. 

                                                          What's wrong with the above? Any ideas?
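The index arithmetic the kernel above is aiming for can be checked on the host. What follows is only a minimal C++ sketch, assuming a square, linear (untiled) surface whose width is `domain` and is greater than zero. `Coord2D` and `flatToCoord` are hypothetical names made up for illustration; they are not part of CAL or IL. The sketch uses integer arithmetic, whereas the kernel performs the same steps in floating point with `mod`, a multiply by 1/domain, and `flr`.

```cpp
#include <cstdint>

// Hypothetical helper type: a zero-based 2D texel coordinate.
struct Coord2D {
    std::uint32_t x;
    std::uint32_t y;
};

// Convert a flat thread id into (x, y) for a linear surface of the given width.
// Mirrors the kernel's intent: x = id mod width, y = floor(id / width).
// Assumes width > 0.
inline Coord2D flatToCoord(std::uint32_t flatId, std::uint32_t width) {
    return Coord2D{flatId % width, flatId / width};
}
```

With a width of 64, flat id 63 maps to (63, 0) and flat id 130 maps to (2, 2).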



                                                        • Compute Mode Questions
                                                          MicahVillmow
                                                          Ryta,
                                                          Are your textures allocated as linear or tiled formats? i.e. are you passing to all your calResAlloc the RES_ALLOC_GLOBAL_BUFFER flag? You are indexing into the sampler with a linear address converted into a 2D address from a tiled surface, so the data you think you are grabbing is actually in a different location.
                                                          If your resources are tiled, location 2,2 in the texture is the 4th data element, not the (width + 2)th, and location 3,1 is the 5th data element, not the 3rd.
                                                          • Compute Mode Questions
                                                            MicahVillmow
                                                            The tiling mode is just a method of optimizing for the rasterization pattern. In compute shader, since your rasterization pattern is linear, you want your textures to be linear so that they hit the cache in a more friendly manner. You still want to do blocking for cache locality however. In pixel shader, the rasterization pattern is hierarchical-z, so the tiling pattern matches this pattern, resulting in good cache/access behaviour. However, as you are finding out, when using linear addressing on a tiled surface, the data you think you are getting is not the data you are actually getting. This also was a problem with using vObjIndex in pixel shader and is one of the quirks of our hardware.
                                                            • Compute Mode Questions
                                                              MicahVillmow
                                                              0,0 is the first element in the memory.
                                                                • Compute Mode Questions
                                                                  ryta1203

                                                                  Micah,

                                                                    Ok, I understand that one is tiled and the other is linear, though I can't get either to work for compute mode. Sadly, this doesn't tell me anything about the tiled arrangement.

                                                                    Maybe some better documentation with graphs would go far to help people (or at least me) understand this.

                                                                    If I try to sample off a literal I get the same result regardless of the literal values... 0,0 returns same result as 4, 0 or 127, 35, etc..

                                                                    So my question is this: how are the groups arranged off of the absolute thread index? Using a 64x1 block, if I want to access absolute index 63 then it should just be 0, 63 correct?

                                                                • Compute Mode Questions
                                                                  MicahVillmow
                                                                  Ryta,
                                                                  If your texture is linear, then absolute index 63 will be at 63,0 (x,y); in a tiled texture, it will be at location 8,8 (x,y).
                                                                  • Compute Mode Questions
                                                                    MicahVillmow
                                                                    Ryta,
                                                                    If you look at section 1.2.5.6 of the Stream Computing User Guide, it shows you the tiled memory format: 1,0 is B and 0,1 is C. Also, is your format float4? The global buffer only works on 128 bits with a straight move; you can do conditional moves to various components to get 32-bit writes.
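The sample points mentioned in this thread (B at 1,0; C at 0,1; the last element of a 64-element tile in the far corner) are all consistent with a Z-order (Morton) pattern inside a tile. The sketch below decodes a zero-based element index under that assumption. It is only an illustration: the Z-order layout is inferred from these few points, the User Guide is the authority on the real tiling, and `TileCoord`, `compactBits` and `zOrderToCoord` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper type: a zero-based coordinate inside a tile.
struct TileCoord {
    std::uint32_t x;
    std::uint32_t y;
};

// Gather the even-numbered bits of v (bits 0, 2, 4, ...) into the low bits.
inline std::uint32_t compactBits(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// ASSUMPTION: elements inside a tile follow Z-order, with x taken from the
// even bits of the index and y from the odd bits.
inline TileCoord zOrderToCoord(std::uint32_t index) {
    return TileCoord{compactBits(index), compactBits(index >> 1)};
}
```

Under this assumption, indices 0 to 3 land at (0,0), (1,0), (0,1) and (1,1), and index 63 lands at (7,7) when counting from zero.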
                                                                      • Compute Mode Questions
                                                                        ryta1203

                                                                        Micah,

                                                                          Yes, I was using float, not float4, must have been my problem.

                                                                          Again though, I didn't notice any difference between my input being tiled or linear (using the GLOBAL flag for my inputs), the results were the same using the same exact kernel.

                                                                        • Compute Mode Questions
                                                                          ryta1203

                                                                           

                                                                          Originally posted by: MicahVillmow Ryta, If you look at 1.2.5.6 of the Stream Computing User Guide,


                                                                          Ok, thanks, I thought that was just "one example".

                                                                          • Compute Mode Questions
                                                                            ryta1203

                                                                             

                                                                            Originally posted by: MicahVillmow Ryta, If you look at section 1.2.5.6 of the Stream Computing User Guide, it shows you the tiled memory format: 1,0 is B and 0,1 is C. Also, is your format float4? The global buffer only works on 128 bits with a straight move; you can do conditional moves to various components to get 32-bit writes.


                                                                            Micah,

                                                                              Going off that pattern then the fourth element is 1, 1, not 2,2 correct? This is assuming that the 4th element is D and that the term "element" is generic, so D is in 1, 1.

                                                                              If each letter were a float then the fourth element would be 2,2 assuming float4 usage, yes?

                                                                          • Compute Mode Questions
                                                                            MicahVillmow
                                                                            Yeah, you're right, I was basing all my calculations on a 1,1 offset.

                                                                            As for each letter, each letter is an element, or vector, not a component and is determined by your data format of the resource. So, if you have a float format, then D would be the 4th float value, but if your format is float4, then the D element is floats 12-15, or the fourth float4 vector.
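The element-versus-component distinction above comes down to a small piece of arithmetic. This is a minimal C++ sketch, assuming zero-based indices and a tightly packed resource with a fixed number of components per element (1 for float, 4 for float4). `ComponentRange` and `componentsOfElement` are hypothetical names, not CAL types.

```cpp
#include <cstddef>

// Hypothetical helper type: inclusive range of scalar components
// covered by one element.
struct ComponentRange {
    std::size_t first;
    std::size_t last;
};

// Element n of a k-component format covers components n*k .. n*k + k - 1.
// Assumes componentsPerElement >= 1.
inline ComponentRange componentsOfElement(std::size_t elementIndex,
                                          std::size_t componentsPerElement) {
    const std::size_t first = elementIndex * componentsPerElement;
    return ComponentRange{first, first + componentsPerElement - 1};
}
```

For element D (zero-based index 3), a float4 format gives components 12 through 15, while a float format gives just component 3, the fourth float.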
                                                                              • Compute Mode Questions
                                                                                ryta1203

                                                                                Ok, Micah, thank you.

                                                                                • Compute Mode Questions
                                                                                  ryta1203

                                                                                   

                                                                                  Originally posted by: MicahVillmow Yeah, you're right, I was basing all my calculations on a 1,1 offset. As for each letter, each letter is an element, or vector, not a component and is determined by your data format of the resource. So, if you have a float format, then D would be the 4th float value, but if your format is float4, then the D element is floats 12-15, or the fourth float4 vector.


                                                                                  Is there a significant reason why you would use a 1,1 offset that we should know about?

                                                                                • Compute Mode Questions
                                                                                  MicahVillmow
                                                                                  Nope,
                                                                                  Just a mistake on my part.