0 Replies Latest reply on May 17, 2010 10:07 AM by godsic

    calMemCopy ISSUE

    godsic
      Brief report on calMemCopy routine performance

       

      I’m writing this letter to report  an issue in AMD CAL memory routines implementation. The issue was found by experimental measurements and corresponding C++ code is attached. It has been found that calMemCopy routine  reports extraordinary low results when copying data either from Remote resource to Local (but not vice-versa ) or from Remote to Remote. Please see results (code generated by MSVS2010 with maximum optimizations enabled) for my testing system which includes:

      CPU: AMD Phenom II 940 running at 3600MHz

      RAM: 8Gb DDR2-800 running at 900MHz with 4-4-4-15 common timings

      GPU: ATI Radeon HD4890 running at 1GHz with 1Gb onboard memory running at 4GHz (effective) powered by Catalyst 10.4 driver

      MB: ASUS M4A79-T Deluxe based on RD790FX NB.

      Results:

      CAL memory models implementation test

      by (c) 2010 Mykola Dvornik aka godsic

      -------------------------------------

      Trying CAL initialization...                    OK

      Querying CAL API version...                     1.4.635

      Querying CAL devices count...                   1

      Quering CAL devices info...                     OK

       

      Brief info on device:

      GPU core type:                                  RV770 family

      ASIC platform revision:                         10

      Local memory size:                              1024 Mbytes

      Uncached remote memory size:                    2047 Mbytes

      Cached remote memory size:                      2047 Mbytes

      Local memory frequency:                         1000 MHz

      Pitch aligment:                                 256 elements

      Surfae aligment:                                4096 bytes

      Number of shader cores:                         1 units

      Number of SIMD cores:                           10 units

      Wavefront size:                                 64 units

      Shader's core frequency:                        1000 MHz

       

      Opening the most powerfull device...            OK

      Trying CAL context creation...                  OK

       

      Compiling kernel...                             OK

      Linking kernel...                               OK

       

       

      Allocation local buffer...                      OK

      Allocation local buffer...                      OK

      Allocation remote buffer...                     OK

      Allocation remote buffer...                     OK

       

       

      Map CAL resource to CPU...                      OK

      Unmap CAL resource...                           OK

      Map CAL resource to CPU...                      OK

      Unmap CAL resource...                           OK

      Map CAL resource to CPU...                      OK

      Unmap CAL resource...                           OK

      Map CAL resource to CPU...                      OK

      Unmap CAL resource...                           OK

      CPU mem bandwidth (including CAL API overhead):         6370.138722 MBytes

      Getting memory objects...                       OK

      Getting CAL memory objects...                   OK

      Getting CAL memory objects...                   OK

      Getting CAL memory objects...                   OK

       

       

      Loading DC kernel image...                      OK

      Get kernel entry point...                       OK

      Get kernel variable name...                     OK

      Get kernel variable name...                     OK

       

      Performing several DMA memory transfers tests...

      -------------------------------------

       

      DMA Remote->Local copying...                    OK

      DMA Remote->Local transfer bandwidth:           425.249002 MBytes/s

       

       

      DMA Local->Remote copying...                    OK

      DMA Local->Remote transfer bandwidth:           2560.002441 MBytes/s

       

       

      DMA Remote->Remote copying...                   OK

      DMA Remote->Remote transfer bandwidth:          292.237467 MBytes/s

       

       

      DMA Remote->Remote copying...                   OK

      DMA Local->Local transfer bandwidth:            25599.414076 MBytes/s

       

       

      Performing several DC memory transfers tests...

      -------------------------------------

       

      Binding buffer to kernel...                     OK

      Binding buffer to kernel...                     OK

      Running kernel...                               OK

      DC Remote->Local transfer bandwidth:            2612.243576 MBytes/s

       

       

      Binding buffer to kernel...                     OK

      Binding buffer to kernel...                     OK

      Running kernel...                               OK

      DC Local->Remote transfer bandwidth:            2560.002441 MBytes/s

       

       

      Binding buffer to kernel...                     OK

      Binding buffer to kernel...                     OK

      Running kernel...                               OK

      DC Remote->Remote transfer bandwidth:           1580.245165 MBytes/s

       

       

      Binding buffer to kernel...                     OK

      Binding buffer to kernel...                     OK

      Running kernel...                               OK

      DC Local->Local transfer bandwidth:             32000.411993 MBytes/s

       

       

      Unbinding CAL memory object...                  OK

      Unbinding CAL memory object...                  OK

      Unbinding CAL memory object...                  OK

      Unbinding CAL memory object...                  OK

      Destroing CAL resource...                       OK

      Destroing CAL resource...                       OK

      Destroing CAL resource...                       OK

      Destroing CAL resource...                       OK

      Destroing CAL context...                        OK

      Closing CAL device...                           OK

       

      From that point of view, one can suggest that calMemCopy routines  less efficient  than simple data copy kernel. Therefore, I can suggest AMD to add cache-friendly SSEx support into calMemCopy routines for Remote->Local and Remote->Remote transfers, since they are far from experimentally measured values for SSEx stream data transfers. 

       

      Please point me to my errors, if I wrong or suggest that AGP memory locking/unlocking (without transfers) routines consume more time than transfers itself.

      P.S. I can send a MSVS2010 project on request. Please query mad211@ex.ac.uk



      //CALBench.h #ifndef _CAL_BENCH_H #define _CAL_BENCH_H #include <stdlib.h> #include <stdio.h> #include <malloc.h> #include <cal.h> #include <calcl.h> #include <string> #ifdef __linux__ #include <emmintrin.h> #include <xmmintrin.h> #include <sys/time.h> #endif #ifdef WIN32 #include <pmmintrin.h> #include <time.h> #endif #include <sys/types.h> #include <sys/timeb.h> #include "DataCopyKernel.h" #endif //DataCopyKernel.h #ifndef _DATACOPYSHADER_H #define _DATACOPYSHADER_H const char str_dc_kernel[] = "il_ps_2_0 \n" "dcl_input_position_interp(linear_noperspective) vWinCoord0.xy__ \n" "dcl_output_generic o0 \n" "dcl_resource_id(0)_type(2d,unnorm)_fmtx(unknown)_fmty(unknown)_fmtz(unknown)_fmtw(unknown) \n" "sample_resource(0)_sampler(0) r0, vWinCoord0 \n" "mov o0, r0 \n" "end \n"; #endif // _DATACOPYSHADER_H //CALBench.cpp #include "CALBench.h" #define divider "-------------------------------------\n" #define CALERROR "CAL won't work properly! Shutdown will be initiated...\n" #define error_handler if (error!=CAL_RESULT_OK) return 1 #define BUFF_SIZE_BYTES 64*1024*1024 //64 MBytes double buff_size_mb = (double)BUFF_SIZE_BYTES/(1024.0*1024.0); CALuint num_devices = 0; CALdevice* devices; CALdeviceattribs* attr_devices; CALdevicestatus* stat_devices; CALdeviceinfo* info_devices; CALuint current_device; CALresource rs_inLocal = 0; CALmem mem_inLocal = 0; CALname name_inLocal = 0; CALresource rs_outLocal = 0; CALmem mem_outLocal = 0; CALname name_outLocal =0; CALresource rs_inRemote = 0; CALmem mem_inRemote = 0; CALname name_inRemote = 0; CALresource rs_outRemote = 0; CALmem mem_outRemote = 0; CALname name_outRemote = 0; CALcontext ctx = 0; CALobject obj_dc_kernel = 0; CALimage img_dc_kernel = 0; CALlanguage cal_lng = CAL_LANGUAGE_IL; CALmodule mdl_dc_kernel = 0; CALfunc entr_dc_kernel = 0; CALname inBuffer = 0; CALname outBuffer = 0; CALuint sz_X = 0; CALuint sz_Y = 0; CALboolean isCALInit = CAL_FALSE; CALboolean isDeviceOpened = CAL_FALSE; CALboolean isCtxCreated = CAL_FALSE; CALboolean is_rs_inLocal_alloc = CAL_FALSE; CALboolean is_rs_outLocal_alloc = CAL_FALSE; CALboolean is_rs_inRemote_alloc = CAL_FALSE; CALboolean is_rs_outRemote_alloc = CAL_FALSE; CALboolean is_mem_inLocal_get = CAL_FALSE; CALboolean is_mem_outLocal_get = CAL_FALSE; CALboolean is_mem_inRemote_get = CAL_FALSE; CALboolean is_mem_outRemote_get = CAL_FALSE; #ifdef __linux__ double get_sys_time() { struct timeval tp; struct timezone tzp; int i; i = gettimeofday(&tp,&tzp); return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 ); } #endif #ifdef WIN32 double get_sys_time() { time_t cur_time; struct _timeb tstruct; _tzset(); time(&cur_time); _ftime(&tstruct); double d_time = (double)cur_time + (double)tstruct.millitm * 1e-3; return d_time; } #endif void printwelcome () { printf("\n\n\nCAL memory models implementation test\n"); printf("by (c) 2010 Mykola Dvornik aka godsic\n"); printf(divider); } void deviceInfoPrint(CALuint curr_dev) { printf("\nBrief info on device:\n"); printf("GPU core type:\t\t"); switch(attr_devices[curr_dev].target) { case CAL_TARGET_600: printf("\t\t\tRV600 family\n"); break; case CAL_TARGET_610: printf("\t\t\tRV610 family\n"); break; case CAL_TARGET_630: printf("\t\t\tRV630 family\n"); break; case CAL_TARGET_670: printf("\t\t\tRV670 family\n"); break; case CAL_TARGET_7XX: printf("\t\t\tRV7XX family\n"); break; case CAL_TARGET_770: printf("\t\t\tRV770 family\n"); break; case CAL_TARGET_710: printf("\t\t\tRV710 family\n"); break; case CAL_TARGET_730: printf("\t\t\tRV730 family\n"); break; case CAL_TARGET_CYPRESS: printf("\t\t\tCYPRESS family\n"); break; case CAL_TARGET_JUNIPER: printf("\t\t\tJUNIPER family\n"); break; } printf ("ASIC platform revision:\t\t\t\t%u\n",attr_devices[curr_dev].targetRevision); printf("Local memory size:\t\t\t\t%u Mbytes\n",attr_devices[curr_dev].localRAM); printf("Uncached remote memory size:\t\t\t%u Mbytes\n",attr_devices[curr_dev].uncachedRemoteRAM); printf("Cached remote memory size:\t\t\t%u Mbytes\n",attr_devices[curr_dev].cachedRemoteRAM); printf("Local memory frequency:\t\t\t\t%u MHz\n",attr_devices[curr_dev].memoryClock); printf("Pitch aligment:\t\t\t\t\t%u elements\n",attr_devices[curr_dev].pitch_alignment); printf("Surfae aligment:\t\t\t\t%u bytes\n",attr_devices[curr_dev].surface_alignment); printf("Number of shader cores:\t\t\t\t%u units\n",attr_devices[curr_dev].numberOfShaderEngines); printf("Number of SIMD cores:\t\t\t\t%u units\n",attr_devices[curr_dev].numberOfSIMD); printf("Wavefront size:\t\t\t\t\t%u units\n",attr_devices[curr_dev].wavefrontSize); printf("Shader's core frequency:\t\t\t%u MHz\n", attr_devices[curr_dev].engineClock); printf("\n"); } int CompileAndLinkKernel(CALuint cur_dev) { printf("\n"); printf("Compiling kernel..."); CALresult error; error = calclCompile(&obj_dc_kernel,cal_lng,str_dc_kernel,attr_devices[cur_dev].target); error_handler; printf("\t\t\t\tOK\n"); printf("Linking kernel..."); error = calclLink(&img_dc_kernel,&obj_dc_kernel,1); error_handler; printf("\t\t\t\tOK\n"); printf("\n"); return 0; } int initCAL() { printf("Trying CAL initialization..."); CALresult error; error = calInit(); error_handler; isCALInit = CAL_TRUE; printf("\t\t\tOK\n"); CALuint major=0, minor=0, imp=0; printf("Querying CAL API version..."); error = calGetVersion(&major,&minor,&imp); error_handler; printf("\t\t\t%u.%u.%u\n",major,minor,imp); printf("Querying CAL devices count..."); error = calDeviceGetCount(&num_devices); error_handler; printf("\t\t\t%u\n",num_devices); devices = new CALdevice[num_devices]; attr_devices = (CALdeviceattribs*)calloc(num_devices,sizeof(CALdeviceattribs));//new CALdeviceattribs[num_devices]; info_devices = new CALdeviceinfo[num_devices]; stat_devices = (CALdevicestatus*)calloc(num_devices,sizeof(CALdevicestatus));//new CALdevicestatus[num_devices]; //Open devices and query all information about them printf("Quering CAL devices info..."); CALuint ordinal = num_devices - 1; for (CALuint i=0;i<num_devices;i++) { attr_devices[i].struct_size = sizeof(CALdeviceattribs); error = calDeviceGetAttribs(&attr_devices[i], i); error_handler; error = calDeviceGetInfo(&info_devices[i], i); error_handler; } printf("\t\t\tOK\n"); if (num_devices == 1) { current_device = 0; } else { //sort devices by higher performance and return apporiate current_device } deviceInfoPrint(current_device); printf("Opening the most powerfull device..."); error = calDeviceOpen(&devices[current_device],current_device); error_handler; isDeviceOpened = CAL_TRUE; printf("\t\tOK\n"); printf("Trying CAL context creation..."); error = calCtxCreate(&ctx,devices[current_device]); error_handler; isCtxCreated = CAL_TRUE; printf("\t\t\tOK\n"); return 0; } int AllocateBuffers(CALuint cur_dev) { CALresult error; CALuint sz_f4 = 4 * sizeof(float); sz_X = info_devices[cur_dev].maxResource2DWidth; sz_Y = BUFF_SIZE_BYTES/(sz_X * sizeof(sz_f4)); printf("\n"); printf("Allocation local buffer..."); error = calResAllocLocal2D(&rs_inLocal,devices[cur_dev],sz_X,sz_Y,CAL_FORMAT_FLOAT32_4,0); error_handler; is_rs_inLocal_alloc = CAL_TRUE; printf("\t\t\tOK\n"); printf("Allocation local buffer..."); error = calResAllocLocal2D(&rs_outLocal,devices[cur_dev],sz_X,sz_Y,CAL_FORMAT_FLOAT32_4,0); error_handler; is_rs_outLocal_alloc = CAL_TRUE; printf("\t\t\tOK\n"); printf("Allocation remote buffer..."); error = calResAllocRemote2D(&rs_inRemote,&devices[cur_dev],num_devices,sz_X,sz_Y,CAL_FORMAT_FLOAT32_4,0); error_handler; is_rs_inRemote_alloc = CAL_TRUE; printf("\t\t\tOK\n"); printf("Allocation remote buffer..."); error = calResAllocRemote2D(&rs_outRemote,&devices[cur_dev],num_devices,sz_X,sz_Y,CAL_FORMAT_FLOAT32_4,0); error_handler; is_rs_outRemote_alloc = CAL_TRUE; printf("\t\t\tOK\n"); printf("\n"); return 0; } int initBuffers(CALuint cur_dev) { printf("\n"); CALresult error; CALuint sz_X_d4 = sz_X/4; // 4 floats __m128 XMM0,XMM1,XMM2; __declspec(align(16)) float float_one[2] = {1.0f,1.0f}; __declspec(align(16)) float float_zero[2] = {0.0f,0.0f}; __declspec(align(16)) float float_two[2] = {2.0f,2.0f}; XMM0 = _mm_load_ps(&float_zero[0]); XMM1 = _mm_load_ps(&float_one[0]); XMM2 = _mm_load_ps(&float_two[0]); double s_time = get_sys_time(); float *buffer_a = NULL; CALuint x_offset = 0; printf("Map CAL resource to CPU..."); error = calResMap((CALvoid**)&buffer_a, &x_offset, rs_inLocal,0); // is buffer will be 16 byte aligned? error_handler; printf("\t\t\tOK\n"); #pragma omp parallel for //Generaly multithreaded memory copy/initialization is useless on single processor systems, since we have only 1 128 bit channel or 2 64 channels. for (int j=0; j<sz_Y; j++) { int ix_j = j * x_offset; for (int i=0; i<sz_X_d4; i+=4) { int ix_i = ix_j + i; _mm_stream_ps(&buffer_a[ix_i],XMM1); } } _mm_mfence(); printf("Unmap CAL resource..."); error = calResUnmap(rs_inLocal); error_handler; printf("\t\t\t\tOK\n"); float *buffer_b = NULL; x_offset = 0; printf("Map CAL resource to CPU..."); error = calResMap((CALvoid**)&buffer_b, &x_offset, rs_outLocal,0); // is buffer will be 16 byte aligned? error_handler; printf("\t\t\tOK\n"); #pragma omp parallel for for (int j=0; j<sz_Y; j++) { int ix_j = j * x_offset; for (int i=0; i<sz_X_d4; i+=4) { int ix_i = ix_j + i; _mm_stream_ps(&buffer_b[ix_i],XMM0); } } _mm_mfence(); printf("Unmap CAL resource..."); error = calResUnmap(rs_outLocal); error_handler; printf("\t\t\t\tOK\n"); float *buffer_c = NULL; x_offset = 0; printf("Map CAL resource to CPU..."); error = calResMap((CALvoid**)&buffer_c, &x_offset, rs_inRemote,0); // is buffer will be 16 byte aligned? error_handler; printf("\t\t\tOK\n"); #pragma omp parallel for for (int j=0; j<sz_Y; j++) { int ix_j = j * x_offset; for (int i=0; i<sz_X_d4; i+=4) { int ix_i = ix_j + i; _mm_stream_ps(&buffer_c[ix_i],XMM2); } } _mm_mfence(); printf("Unmap CAL resource..."); error = calResUnmap(rs_inRemote); error_handler; printf("\t\t\t\tOK\n"); float *buffer_d = NULL; x_offset = 0; printf("Map CAL resource to CPU..."); error = calResMap((CALvoid**)&buffer_d, &x_offset, rs_outRemote,0); // is buffer will be 16 byte aligned? error_handler; printf("\t\t\tOK\n"); #pragma omp parallel for for (int j=0; j<sz_Y; j++) { int ix_j = j * x_offset; for (int i=0; i<sz_X_d4; i+=4) { int ix_i = ix_j + i; _mm_stream_ps(&buffer_d[ix_i],XMM0); } } _mm_mfence(); printf("Unmap CAL resource..."); error = calResUnmap(rs_outRemote); error_handler; printf("\t\t\t\tOK\n"); s_time = get_sys_time() - s_time; double bandwidth = 16.0 * 4.0 * buff_size_mb/s_time;//Assuming that Map/Unmap is 2 memory operations routine printf("CPU mem bandwidth (including CAL API overhead):\t\t%f MBytes/s\n",bandwidth); printf("Getting memory objects..."); error = calCtxGetMem(&mem_inLocal,ctx,rs_inLocal); error_handler; is_mem_inLocal_get = CAL_TRUE; printf("\t\t\tOK\n"); printf("Getting CAL memory objects..."); error = calCtxGetMem(&mem_outLocal,ctx,rs_outLocal); error_handler; is_mem_outLocal_get = CAL_TRUE; printf("\t\t\tOK\n"); printf("Getting CAL memory objects..."); error = calCtxGetMem(&mem_inRemote,ctx,rs_inRemote); error_handler; is_mem_inRemote_get = CAL_TRUE; printf("\t\t\tOK\n"); printf("Getting CAL memory objects..."); error = calCtxGetMem(&mem_outRemote,ctx,rs_outRemote); error_handler; is_mem_outRemote_get = CAL_TRUE; printf("\t\t\tOK\n"); printf("\n"); return 0; } int shutDown(CALuint cur_dev) { CALresult error; printf("\n"); if (is_mem_inLocal_get) { printf("Unbinding CAL memory object..."); error = calCtxReleaseMem(ctx,mem_inLocal); error_handler; printf("\t\t\tOK\n"); } if (is_mem_outLocal_get) { printf("Unbinding CAL memory object..."); error = calCtxReleaseMem(ctx,mem_outLocal); error_handler; printf("\t\t\tOK\n"); } if (is_mem_inRemote_get) { printf("Unbinding CAL memory object..."); error = calCtxReleaseMem(ctx,mem_inRemote); error_handler; printf("\t\t\tOK\n"); } if (is_mem_outRemote_get) { printf("Unbinding CAL memory object..."); error = calCtxReleaseMem(ctx,mem_outRemote); error_handler; printf("\t\t\tOK\n"); } if (is_rs_inLocal_alloc) { printf("Destroing CAL resource..."); error = calResFree(rs_inLocal); error_handler; printf("\t\t\tOK\n"); } if (is_rs_outLocal_alloc) { printf("Destroing CAL resource..."); error = calResFree(rs_outLocal); error_handler; printf("\t\t\tOK\n"); } if (is_rs_inRemote_alloc) { printf("Destroing CAL resource..."); error = calResFree(rs_inRemote); error_handler; printf("\t\t\tOK\n"); } if (is_rs_outRemote_alloc) { printf("Destroing CAL resource..."); error = calResFree(rs_outRemote); error_handler; printf("\t\t\tOK\n"); } if (isCtxCreated) { printf("Destroing CAL context..."); error = calCtxDestroy(ctx); error_handler; printf("\t\t\tOK\n"); } if (isDeviceOpened) { printf("Closing CAL device..."); error = calDeviceClose(devices[cur_dev]); error_handler; printf("\t\t\t\tOK\n"); } return 0; } int SetupKernel(CALuint curr_dev) { CALresult error; printf("\n"); printf("Loading DC kernel image..."); error = calModuleLoad(&mdl_dc_kernel,ctx,img_dc_kernel); error_handler; printf("\t\t\tOK\n"); printf("Get kernel entry point..."); error = calModuleGetEntry(&entr_dc_kernel,ctx,mdl_dc_kernel,"main"); error_handler; printf("\t\t\tOK\n"); printf("Get kernel variable name..."); error = calModuleGetName(&inBuffer,ctx,mdl_dc_kernel,"i0"); error_handler; printf("\t\t\tOK\n"); printf("Get kernel variable name..."); error = calModuleGetName(&outBuffer,ctx,mdl_dc_kernel,"o0"); error_handler; printf("\t\t\tOK\n"); return 0; } int DMAMemCpyRemoteLocalTest(CALuint cur_dev) { printf("\n"); printf("DMA Remote->Local copying..."); double time = get_sys_time(); CALevent e; CALresult error; error = calMemCopy(&e,ctx,mem_inRemote,mem_inLocal,0); error_handler; printf("\t\t\tOK\n"); if (error == CAL_RESULT_OK) { while(calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); } time = get_sys_time() - time; double bandwidth = 2.0 * buff_size_mb / time; printf("DMA Remote->Local transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DMAMemCpyLocalRemoteTest(CALuint cur_dev) { printf("\n"); printf("DMA Local->Remote copying..."); double time = get_sys_time(); CALevent e; CALresult error; error = calMemCopy(&e,ctx,mem_inLocal,mem_inRemote,0); error_handler; printf("\t\t\tOK\n"); if (error == CAL_RESULT_OK) { while(calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); } time = get_sys_time() - time; double bandwidth = 2.0 * buff_size_mb / time; printf("DMA Local->Remote transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DMAMemCpyRemoteRemoteTest(CALuint cur_dev) { printf("\n"); printf("DMA Remote->Remote copying..."); double time = get_sys_time(); CALevent e; CALresult error; error = calMemCopy(&e,ctx,mem_inRemote,mem_outRemote,0); error_handler; printf("\t\t\tOK\n"); if (error == CAL_RESULT_OK) { while(calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); } time = get_sys_time() - time; double bandwidth = 2.0 * buff_size_mb / time; printf("DMA Remote->Remote transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DMAMemCpyLocalLocalTest(CALuint cur_dev) { printf("\n"); printf("DMA Remote->Remote copying..."); double time = get_sys_time(); CALevent e; CALresult error; error = calMemCopy(&e,ctx,mem_inLocal,mem_outLocal,0); error_handler; printf("\t\t\tOK\n"); if (error == CAL_RESULT_OK) { while(calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); } time = get_sys_time() - time; double bandwidth = 2.0 * buff_size_mb / time; printf("DMA Local->Local transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DCMemCpyRemoteLocalTest(CALuint cur_dev) { CALresult error; printf("\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,inBuffer,mem_inRemote); error_handler; printf("\t\t\tOK\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,outBuffer,mem_inLocal); error_handler; printf("\t\t\tOK\n"); CALdomain dmn = {0, 0, sz_X, sz_Y}; CALevent e; printf("Running kernel..."); double time = get_sys_time(); error = calCtxRunProgram(&e,ctx,entr_dc_kernel,&dmn); error_handler; printf("\t\t\t\tOK\n"); while (calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); time = get_sys_time() - time; double bandwidth = 2 * buff_size_mb/time; printf("DC Remote->Local transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DCMemCpyLocalRemoteTest(CALuint cur_dev) { CALresult error; printf("\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,inBuffer,mem_inLocal); error_handler; printf("\t\t\tOK\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,outBuffer,mem_inRemote); error_handler; printf("\t\t\tOK\n"); CALdomain dmn = {0, 0, sz_X, sz_Y}; CALevent e; printf("Running kernel..."); double time = get_sys_time(); error = calCtxRunProgram(&e,ctx,entr_dc_kernel,&dmn); error_handler; printf("\t\t\t\tOK\n"); while (calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); time = get_sys_time() - time; double bandwidth = 2 * buff_size_mb/time; printf("DC Local->Remote transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DCMemCpyRemoteRemoteTest(CALuint cur_dev) { CALresult error; printf("\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,inBuffer,mem_inRemote); error_handler; printf("\t\t\tOK\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,outBuffer,mem_outRemote); error_handler; printf("\t\t\tOK\n"); CALdomain dmn = {0, 0, sz_X, sz_Y}; CALevent e; printf("Running kernel..."); double time = get_sys_time(); error = calCtxRunProgram(&e,ctx,entr_dc_kernel,&dmn); error_handler; printf("\t\t\t\tOK\n"); while (calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); time = get_sys_time() - time; double bandwidth = 2 * buff_size_mb/time; printf("DC Remote->Remote transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int DCMemCpyLocalLocalTest(CALuint cur_dev) { CALresult error; printf("\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,inBuffer,mem_inLocal); error_handler; printf("\t\t\tOK\n"); printf("Binding buffer to kernel..."); error = calCtxSetMem(ctx,outBuffer,mem_outLocal); error_handler; printf("\t\t\tOK\n"); CALdomain dmn = {0, 0, sz_X, sz_Y}; CALevent e; printf("Running kernel..."); double time = get_sys_time(); error = calCtxRunProgram(&e,ctx,entr_dc_kernel,&dmn); error_handler; printf("\t\t\t\tOK\n"); while (calCtxIsEventDone(ctx,e)==CAL_RESULT_PENDING); time = get_sys_time() - time; double bandwidth = 2 * buff_size_mb/time; printf("DC Local->Local transfer bandwidth:\t\t%f MBytes/s\n",bandwidth); printf("\n"); return 0; } int doDMAtests(int cur_dev) { printf("\n"); printf("Performing several DMA memory transfers tests...\n"); printf(divider); if (DMAMemCpyRemoteLocalTest(cur_dev)!=0) { printf("DMA transfer test failed!\n"); } if (DMAMemCpyLocalRemoteTest(cur_dev)!=0) { printf("DMA transfer test failed!\n"); } if (DMAMemCpyRemoteRemoteTest(cur_dev)!=0) { printf("DMA transfer test failed!\n"); } if (DMAMemCpyLocalLocalTest(cur_dev)!=0) { printf("DMA transfer test failed!\n"); } return 0; } int doDCtests(int cur_dev) { printf("\n"); printf("Performing several DC memory transfers tests...\n"); printf(divider); if (DCMemCpyRemoteLocalTest(cur_dev)!=0) { printf("DC transfer test failed!\n"); } if (DCMemCpyLocalRemoteTest(cur_dev)!=0) { printf("DC transfer test failed!\n"); } if (DCMemCpyRemoteRemoteTest(cur_dev)!=0) { printf("DC transfer test failed!\n"); } if (DCMemCpyLocalLocalTest(cur_dev)!=0) { printf("DC transfer test failed!\n"); } return 0; } void main(int argc, char** argv) { printwelcome(); if (initCAL() !=0) { printf(CALERROR); return; } if (CompileAndLinkKernel(current_device) !=0) { printf(CALERROR); return; } if (AllocateBuffers(current_device) !=0) { printf(CALERROR); return; } if (initBuffers(current_device) !=0) { printf(CALERROR); return; } if (SetupKernel(current_device) !=0) { printf(CALERROR); return; } if (doDMAtests(current_device)!=0) { printf(CALERROR); return; } if (doDCtests(current_device)!=0) { printf(CALERROR); return; } if (shutDown(current_device) !=0) { printf(CALERROR); return; } return; }