
What's the "code cache" size on the RV770/RV870?

I've noticed that once a kernel becomes big enough, performance drops significantly. My suspicion is that the GPU is constantly reloading the kernel's instructions into its "code cache" (or whatever it's called on a GPU), evicting old instructions to make room for new ones -- hence the performance problem.

Does anyone know the exact size of this cache? My guess is 64 KB for the RV770, but maybe this information is already documented somewhere?


Apparently it's only 48 KB. Once you exceed this value, performance starts to drop: a bit (10-15%) on the RV870 and a lot (2x-3x) on the RV770. I guess that's another reason why the RV770 looks so bad at OpenCL.


Where did you get this info?

48 KB? How big a kernel are you running?


After noticing a weird performance drop in my kernels (which are heavily ALU-bound and do integer calculations only) once their compiled binary size exceeded some value, I decided to write a test application, which I'm attaching to this message.

 

It simply runs an array of 2048*2048 threads, each thread doing some number of MAD operations. As a side effect, the program computes how close we can get to peak performance. Increasing the number of MADs obviously increases the kernel size (and, as I wrote earlier, the CAL compiler no longer tries to optimize functions -- it just unrolls everything, so the resulting code can be very large). The breakpoint for code size is 48 KB: once the kernel exceeds this value, performance starts to decrease.
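A quick back-of-the-envelope check (not part of the test program, just a sketch over two rows from the 5770 table, assuming code size grows linearly with the number of MADs) shows why the peak lands where it does: each generated MAD costs 8 bytes of ISA, so a 48 KB cache holds almost exactly the #MAD count at which throughput peaks.

```python
# Estimate bytes of ISA per MAD and how many MADs fit in a 48 KB code cache,
# using two (#MAD, code size) pairs taken from the 5770 table.
rows = [(5764, 47120), (6020, 49168)]

(m0, s0), (m1, s1) = rows
bytes_per_mad = (s1 - s0) / (m1 - m0)        # -> 8.0 bytes per MAD

# Fixed prologue/epilogue bytes not attributable to the MADs themselves.
overhead = s0 - m0 * bytes_per_mad           # -> 1008 bytes

# MAD count that exactly fills 48 KB of code cache.
mads_at_48k = (48 * 1024 - overhead) / bytes_per_mad
print(bytes_per_mad, mads_at_48k)            # 8.0 6018.0 -- peak is at #MAD = 6020
```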

 

For the 5770 it looks like:

2 func calls, #MAD = 3204, code size = 26640 : GFLOPS: 1311.230 96.41%
2 func calls, #MAD = 3716, code size = 30736 : GFLOPS: 1317.749 96.89%
2 func calls, #MAD = 4228, code size = 34832 : GFLOPS: 1321.035 97.13%
2 func calls, #MAD = 4740, code size = 38928 : GFLOPS: 1323.159 97.29%
2 func calls, #MAD = 5252, code size = 43024 : GFLOPS: 1325.925 97.49%
2 func calls, #MAD = 5508, code size = 45072 : GFLOPS: 1328.048 97.65%
2 func calls, #MAD = 5764, code size = 47120 : GFLOPS: 1328.294 97.67%
2 func calls, #MAD = 6020, code size = 49168 : GFLOPS: 1328.918 97.71% <- peak
2 func calls, #MAD = 6276, code size = 51216 : GFLOPS: 1316.312 96.79%
2 func calls, #MAD = 6532, code size = 53264 : GFLOPS: 1313.839 96.61%
2 func calls, #MAD = 6788, code size = 55312 : GFLOPS: 1310.235 96.34%
2 func calls, #MAD = 7044, code size = 57616 : GFLOPS: 1305.470 95.99%
2 func calls, #MAD = 7300, code size = 59664 : GFLOPS: 1301.602 95.71%
2 func calls, #MAD = 7556, code size = 61712 : GFLOPS: 1297.751 95.42%
2 func calls, #MAD = 8324, code size = 67856 : GFLOPS: 1287.495 94.67%
2 func calls, #MAD = 9092, code size = 74000 : GFLOPS: 1277.708 93.95%
2 func calls, #MAD = 9860, code size = 80144 : GFLOPS: 1269.952 93.38%
2 func calls, #MAD = 10628, code size = 86288 : GFLOPS: 1262.136 92.80%
2 func calls, #MAD = 11396, code size = 92688 : GFLOPS: 1257.568 92.47%
2 func calls, #MAD = 12164, code size = 98832 : GFLOPS: 1248.593 91.81%
2 func calls, #MAD = 12932, code size = 104976 : GFLOPS: 1240.815 91.24%
2 func calls, #MAD = 13700, code size = 111120 : GFLOPS: 1233.772 90.72%
2 func calls, #MAD = 14468, code size = 117264 : GFLOPS: 1230.431 90.47%
2 func calls, #MAD = 15236, code size = 123664 : GFLOPS: 1224.480 90.04%
2 func calls, #MAD = 16004, code size = 129808 : GFLOPS: 1222.136 89.86%
2 func calls, #MAD = 16772, code size = 135952 : GFLOPS: 1219.080 89.64%
2 func calls, #MAD = 17540, code size = 142096 : GFLOPS: 1218.529 89.60%
2 func calls, #MAD = 18820, code size = 152592 : GFLOPS: 1215.669 89.39%
2 func calls, #MAD = 20100, code size = 162832 : GFLOPS: 1211.555 89.08%
2 func calls, #MAD = 21380, code size = 173072 : GFLOPS: 1212.656 89.17%
3 func calls, #MAD = 21316, code size = 172560 : GFLOPS: 1212.450 89.15%
3 func calls, #MAD = 24004, code size = 194320 : GFLOPS: 1210.408 89.00%
3 func calls, #MAD = 27076, code size = 219152 : GFLOPS: 1211.378 89.07%
3 func calls, #MAD = 30148, code size = 243728 : GFLOPS: 1204.240 88.55%
3 func calls, #MAD = 32836, code size = 265488 : GFLOPS: 1204.444 88.56%
3 func calls, #MAD = 33220, code size = 268560 : GFLOPS: 1203.704 88.51%

The drop is noticeable but not that big. For the 4770 it's way worse:

 2 func calls, #MAD = 3204, code size = 26392 : GFLOPS: 940.430 97.96%
2 func calls, #MAD = 3716, code size = 30488 : GFLOPS: 941.825 98.11%
2 func calls, #MAD = 4228, code size = 34584 : GFLOPS: 943.068 98.24%
2 func calls, #MAD = 4740, code size = 38680 : GFLOPS: 943.758 98.31%
2 func calls, #MAD = 5252, code size = 42776 : GFLOPS: 943.911 98.32%
2 func calls, #MAD = 5508, code size = 44824 : GFLOPS: 945.497 98.49%
2 func calls, #MAD = 5764, code size = 46872 : GFLOPS: 945.335 98.47%
2 func calls, #MAD = 6020, code size = 48920 : GFLOPS: 945.607 98.50% <- peak
2 func calls, #MAD = 6276, code size = 50968 : GFLOPS: 937.109 97.62%
2 func calls, #MAD = 6532, code size = 53016 : GFLOPS: 935.275 97.42%
2 func calls, #MAD = 6788, code size = 55064 : GFLOPS: 933.538 97.24%
2 func calls, #MAD = 7300, code size = 59416 : GFLOPS: 925.587 96.42%
2 func calls, #MAD = 7812, code size = 63512 : GFLOPS: 919.742 95.81%
2 func calls, #MAD = 8324, code size = 67608 : GFLOPS: 913.723 95.18%
2 func calls, #MAD = 8836, code size = 71704 : GFLOPS: 908.007 94.58%
2 func calls, #MAD = 9348, code size = 75800 : GFLOPS: 902.376 94.00%
2 func calls, #MAD = 9860, code size = 79896 : GFLOPS: 895.945 93.33%
2 func calls, #MAD = 10372, code size = 83992 : GFLOPS: 891.972 92.91%
2 func calls, #MAD = 11140, code size = 90392 : GFLOPS: 884.653 92.15%
2 func calls, #MAD = 11908, code size = 96536 : GFLOPS: 875.763 91.23%
2 func calls, #MAD = 12676, code size = 102680 : GFLOPS: 867.565 90.37%
2 func calls, #MAD = 13444, code size = 108824 : GFLOPS: 859.136 89.49%
2 func calls, #MAD = 14212, code size = 114968 : GFLOPS: 825.985 86.04%
2 func calls, #MAD = 14980, code size = 121112 : GFLOPS: 832.218 86.69%
2 func calls, #MAD = 15748, code size = 127512 : GFLOPS: 793.418 82.65%
2 func calls, #MAD = 16516, code size = 133656 : GFLOPS: 785.046 81.78%
2 func calls, #MAD = 17796, code size = 143896 : GFLOPS: 734.009 76.46%
2 func calls, #MAD = 19076, code size = 154392 : GFLOPS: 714.798 74.46%
2 func calls, #MAD = 20356, code size = 164632 : GFLOPS: 713.319 74.30%
2 func calls, #MAD = 21636, code size = 174872 : GFLOPS: 653.055 68.03%
3 func calls, #MAD = 22084, code size = 178456 : GFLOPS: 665.056 69.28%
3 func calls, #MAD = 24004, code size = 194072 : GFLOPS: 688.289 71.70%
3 func calls, #MAD = 26308, code size = 212504 : GFLOPS: 630.478 65.67%
3 func calls, #MAD = 28228, code size = 228120 : GFLOPS: 619.829 64.57%
3 func calls, #MAD = 30148, code size = 243480 : GFLOPS: 612.869 63.84%
3 func calls, #MAD = 32068, code size = 259096 : GFLOPS: 605.504 63.07%

 

Since the kernel does nothing but a huge amount of ALU operations, my guess is that the reason for the slowdown is in fact the code cache size, which seems to be 48 KB.
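The breakpoint can also be read off mechanically. Here's a minimal sketch (mine, not part of the attached program) that finds the last code size before throughput falls, using a few (code size, GFLOPS) pairs copied from the 4770 table above:

```python
# Locate the knee: the largest code size that still achieves peak throughput.
# Data points are (code size in bytes, GFLOPS) from the 4770 run.
data = [
    (44824, 945.497), (46872, 945.335), (48920, 945.607),
    (50968, 937.109), (53016, 935.275), (55064, 933.538),
]

best = max(gflops for _, gflops in data)
knee = max(size for size, gflops in data if gflops == best)
print(knee)  # 48920 -- just under 48 * 1024 = 49152 bytes
```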

/*****************************************************************************
 ATI GPU benchmarker/tester
 (c) 2010 Ivan Golubev, http://www.golubev.com
*****************************************************************************/

// <windows.h> included only for
// QueryPerformanceCounter() & QueryPerformanceFrequency() to have
// a good resolution timer
#include <windows.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <conio.h>
#include <stdlib.h>
#include <cal.h>
#include <calcl.h>

// if there are N GPUs in the system -- define from 0 to N-1
#define DEVICENO 1

// grid size
#define DIM_X 2048
#define DIM_Y DIM_X

// no of function calls
#define NC_STARTS 2
#define NC_ENDS (NC_STARTS + 1)

// no of MADs inside functions
#define NMADS_STARTS 200
#define NMADS_ENDS (NMADS_STARTS + 1200)
#define NMADS_STEP 16

int madcounter;
int codelen;

// disassembly callback: counts MULADD instructions and grabs the code size
static void __cdecl __logger(const CALchar *msg)
{
    if (strstr(msg, ": MULADD"))
        madcounter++;
    if (strncmp(msg, "CodeLen", 7) == 0) {
        if (sscanf(msg, "CodeLen\t\t\t=%d;", &codelen) != 1)
            printf("Unknown Code Len: %s", msg);
    }
    // fprintf(stdout, msg);
}

// append string s to a growable buffer
void addline(char **p, int *npos, int *nmax, char *s)
{
    int len = strlen(s);
    if ((*npos + len) > *nmax) {
        *nmax += 65536 + len;
        *p = (char *)realloc(*p, *nmax);
    }
    memcpy(*p + *npos, s, len);
    *npos += len;
}

// generate kernel with (<ncalls> * (<nmads>*2 + 1) * 4) MADs each
char *genkernel(int ncalls, int nmads)
{
    char *pKernel = NULL;
    int npos = 0;
    int nmax = 0;
    addline(&pKernel, &npos, &nmax, "il_cs_2_0\n");
    addline(&pKernel, &npos, &nmax, "dcl_num_thread_per_group 64\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "dcl_literal l1, 1.0, 2.0, 3.0, 4.0\n");
    addline(&pKernel, &npos, &nmax, "dcl_literal l2, 4.0, 2.0, 3.7, 4.7\n");
    addline(&pKernel, &npos, &nmax, "dcl_literal l3, 1.1, 7.0, 8.0, 9.0\n");
    addline(&pKernel, &npos, &nmax, "dcl_literal l4, 1.2, 2.0, 3.4, 4.2\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "mov r10.x,vaTid0.x\n");
    addline(&pKernel, &npos, &nmax, "itof r0.x,r10.x\n");
    addline(&pKernel, &npos, &nmax, "add r0.y,r0.x,l1.y\n");
    addline(&pKernel, &npos, &nmax, "add r0.z,r0.x,l1.z\n");
    addline(&pKernel, &npos, &nmax, "add r0.w,r0.x,l1.w\n");
    addline(&pKernel, &npos, &nmax, "add r1,r0,l2\n");
    addline(&pKernel, &npos, &nmax, "add r2,r1,l3\n");
    addline(&pKernel, &npos, &nmax, "add r3,r1,l4\n");
    addline(&pKernel, &npos, &nmax, "\n");
    for (int i = 0; i < ncalls; i++)
        addline(&pKernel, &npos, &nmax, "call 10\n");
    addline(&pKernel, &npos, &nmax, "mad r0,r0,r2,r3\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "mov r5,cb0[0]\n");
    addline(&pKernel, &npos, &nmax, "ieq r4,r0,r5\n");
    addline(&pKernel, &npos, &nmax, "ior r6.x,r4.x,r4.y\n");
    addline(&pKernel, &npos, &nmax, "ior r6.z,r4.z,r4.w\n");
    addline(&pKernel, &npos, &nmax, "ior r6.x,r6.x,r6.z\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "if_logicalnz r4.x\n");
    addline(&pKernel, &npos, &nmax, " mov g[0].x___,r0.x\n");
    addline(&pKernel, &npos, &nmax, " mov g[0]._y__,r10.x\n");
    addline(&pKernel, &npos, &nmax, " mov g[1],r4\n");
    addline(&pKernel, &npos, &nmax, "endif\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "endmain\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "func 10\n");
    for (int i = 0; i < nmads; i++) {
        addline(&pKernel, &npos, &nmax, "mad r0,r0,r0,r1\n");
        addline(&pKernel, &npos, &nmax, "mad r2,r2,r2,r3\n");
    }
    addline(&pKernel, &npos, &nmax, "ret\n");
    addline(&pKernel, &npos, &nmax, "\n");
    addline(&pKernel, &npos, &nmax, "end\n");
    if ((npos + 1) > nmax) {
        nmax += 16;
        pKernel = (char *)realloc(pKernel, nmax);
    }
    pKernel[npos] = 0;
    return pKernel;
}

int main(int argc, char **argv)
{
    if (calInit() != CAL_RESULT_OK)
        return 1;

    {
        CALuint major, minor, imp;
        calGetVersion(&major, &minor, &imp);
        printf("CAL v%d.%d.%d\n", major, minor, imp);
        calclGetVersion(&major, &minor, &imp);
        printf("Compiler v%d.%d.%d\n", major, minor, imp);
    }

    int deviceno = DEVICENO;
    CALuint numDevices = 0;
    calDeviceGetCount(&numDevices);
    CALdevice device = 0;
    calDeviceOpen(&device, deviceno);
    CALdeviceinfo info;
    calDeviceGetInfo(&info, deviceno);
    CALcontext ctx = 0;
    calCtxCreate(&ctx, device);

    CALdeviceattribs attr;
    attr.struct_size = sizeof(attr);
    if (calDeviceGetAttribs(&attr, deviceno) != CAL_RESULT_OK) {
        attr.engineClock = 0;
        attr.numberOfSIMD = 0;
    }
    printf("%d SIMD %d clock\n", attr.numberOfSIMD, attr.engineClock);

    // 2 ops * # of SIMD * # TP per SIMD * # ALUs per TP * engine clock in GHz
    double peakgflops = 2 * attr.numberOfSIMD * 16 * 5 * attr.engineClock / 1000.0;
    if (info.target == CAL_TARGET_710 || info.target == CAL_TARGET_730)
        peakgflops /= 2; // they have only 8 thread processors per SIMD
    printf("Peak GFLOPS = %.3lf\n", peakgflops);

    CALobject obj = NULL;
    CALimage image = NULL;
    CALlanguage lang = CAL_LANGUAGE_IL;
    int ncalls;
    int nmads;
    char *pKernel = NULL;

    CALresource localRes = 0;
    CALresource constRes = 0;
    CALmem localMem = 0;
    CALmem constMem = 0;
    if (calResAllocLocal2D(&localRes, device, DIM_X, DIM_Y, CAL_FORMAT_UINT_4,
                           CAL_RESALLOC_GLOBAL_BUFFER) != CAL_RESULT_OK) {
        printf("Error Local2D [%s]\n", calGetErrorString());
    }
    if (calResAllocLocal1D(&constRes, device, 4, CAL_FORMAT_UINT_4, 0) != CAL_RESULT_OK) {
        printf("Error Local1D [%s]\n", calGetErrorString());
        return 1;
    }

    unsigned int *constPtr = NULL;
    CALuint constPitch = 0;
    calResMap((CALvoid **)&constPtr, &constPitch, constRes, 0);
    constPtr[0] = constPtr[1] = constPtr[2] = constPtr[3] = -12345789.123f;
    calResUnmap(constRes);
    calCtxGetMem(&localMem, ctx, localRes);
    calCtxGetMem(&constMem, ctx, constRes);

    // main cycle
    for (ncalls = NC_STARTS; ncalls <= NC_ENDS; ncalls++)
        for (nmads = NMADS_STARTS; nmads < NMADS_ENDS; nmads += NMADS_STEP) {
            pKernel = genkernel(ncalls, nmads);
            if (calclCompile(&obj, lang, pKernel, info.target) != CAL_RESULT_OK) {
                fprintf(stdout, "Kernel compilation failed. Exiting.\n");
                return 1;
            }
            if (calclLink(&image, &obj, 1) != CAL_RESULT_OK) {
                fprintf(stdout, "Kernel linking failed. Exiting.\n");
                return 1;
            }
            free(pKernel);

            madcounter = 0;
            calclDisassembleImage(image, (CALLogFunction)__logger);
            printf("%d func calls, #MAD = %d, code size = %d : ", ncalls, madcounter, codelen);

            CALmodule module = 0;
            calModuleLoad(&module, ctx, image);
            CALfunc func = 0;
            CALname constName = 0;
            CALname localName = 0;
            calModuleGetEntry(&func, ctx, module, "main");
            calModuleGetName(&constName, ctx, module, "cb0");
            if (calModuleGetName(&localName, ctx, module, "g[]") != CAL_RESULT_OK) {
                printf("Error in getname [%s]\n", calGetErrorString());
            }
            calCtxSetMem(ctx, localName, localMem);
            calCtxSetMem(ctx, constName, constMem);

            // run kernel 10x and get the average flops value
            int counter = 0;
            int countermax = 10;
            double avflops = 0;
            do {
                CALprogramGrid pg;
                pg.func = func;
                pg.flags = 0;
                pg.gridBlock.width = 64;
                pg.gridBlock.height = 1;
                pg.gridBlock.depth = 1;
                pg.gridSize.width = (DIM_X * DIM_Y + pg.gridBlock.width - 1) / pg.gridBlock.width;
                pg.gridSize.height = 1;
                pg.gridSize.depth = 1;

                LARGE_INTEGER qFrequency, qStart, qEnd;
                QueryPerformanceFrequency(&qFrequency);
                QueryPerformanceCounter(&qStart);
                CALevent e = 0;
                if (calCtxRunProgramGrid(&e, ctx, &pg) != CAL_RESULT_OK) {
                    printf("error in run [%s]\n", calGetErrorString());
                    return 1;
                }
                while (calCtxIsEventDone(ctx, e) == CAL_RESULT_PENDING);
                QueryPerformanceCounter(&qEnd);

                double OpsDone = ((double)(madcounter * 2)) * DIM_X * DIM_Y;
                double ElapsedTime = double(qEnd.QuadPart - qStart.QuadPart) / qFrequency.QuadPart;
                double GFlops = OpsDone / ElapsedTime / 1e9;
                // exclude first execution as warm-up run
                if (counter)
                    avflops += GFlops;
                if (++counter > countermax)
                    break;
            } while (1);
            printf("GFLOPS: %.3lf %.2lf%%\n", avflops / (counter - 1),
                   avflops * 100.0 / (peakgflops * (counter - 1)));

            calModuleUnload(ctx, module);
            calclFreeImage(image);
            calclFreeObject(obj);
        }

    calCtxReleaseMem(ctx, constMem);
    calCtxReleaseMem(ctx, localMem);
    calResFree(constRes);
    calResFree(localRes);
    calCtxDestroy(ctx);
    calDeviceClose(device);
    calShutdown();
    return 0;
}
