Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

ATI stream profiler vs. custom profiler problem

ATI stream profiler vs. custom profiler problem


I'm having some problems with the ATI stream profiler and my own custom profiler.

The code I am working on is the Matrix Multiplication sample from the ATI stream SDK. I have inserted my own profiler into this code (one that uses clGetEventProfilingInfo() to retreive the start and end time of kernel execution, the ATI documentation says that the ATI stream profiler works in the same manner), and compared it's results with those of the ATI stream profiler. The results of this comparison are somewhat unexpected. As it turns out, when executing the code through the ATI stream profiler the results are almost the same (difference is at most 0,1 ms). However while executing the code normally in Release mode (not though the ATI profiler) the output of my profiler has much hihger values (e.g. 79 ms total executuion time of all kernels through the ATI stream profiler, and 375 ms total execution time in Release mode). 

What can be the cause of this difference? Why is it so big?

I've checked that both executions are carried out on the same device (GPU), the CL_DEVICE_PROFILING_TIMER_RESOLUTION parameter is also unchanged.

The profiler code is given below.

// HEADER #ifndef MY_PROFILER #define MY_PROFILER #include <cstdio> #include <string> #include <CL/cl.h> namespace profiler { class Profiler { public: Profiler(bool console = true, bool file = true); ~Profiler(); bool initialize(const char* filename); void deinitialize(); bool saveEventProfileInfo(cl_event event); bool writeOutput(std::string str); private: cl_double totalTime; FILE* file; bool consoleOutput; bool fileOutput; }; } #endif // .CPP FILE #include <iostream> #include <ctime> #include "Profiler.h" namespace profiler { using namespace std; Profiler::Profiler(bool console, bool file) :consoleOutput(console), fileOutput(file), file(NULL), totalTime(0) { } Profiler::~Profiler() { } bool Profiler::initialize(const char* filename) { if (fileOutput) { if (file) { fclose(file); file = NULL; } this->file = fopen(filename, "a"); if (!file) { cout << "Profiler::initialize Failed to open output file" << endl; return false; } time_t rawtime; struct tm * timeinfo; time ( &rawtime ); timeinfo = localtime ( &rawtime ); fprintf ( file, "Initialize at: %s", asctime (timeinfo) ); delete timeinfo; } totalTime = 0; return true; } void Profiler::deinitialize() { if (fileOutput) { if (file) { fprintf(file, "total execution time [miliSecs]:\t%f\n\n", totalTime); fclose(file); file = NULL; } } if (consoleOutput) { cout << "total execution time: " << totalTime << endl; } } bool Profiler::saveEventProfileInfo(cl_event event) { cl_int status; cl_ulong queuedNanoSecs; cl_ulong submitNanoSecs; cl_ulong startNanoSecs; cl_ulong endNanoSecs; status = clGetEventProfilingInfo( event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queuedNanoSecs, NULL); if(status != CL_SUCCESS) { cout << "Profiler::initialize Failed to get event queued time" << endl; return false; } status = clGetEventProfilingInfo( event, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitNanoSecs, NULL); if(status != CL_SUCCESS) { cout << "Profiler::initialize Failed to get event submit time" << endl; return false; } status = clGetEventProfilingInfo( event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startNanoSecs, NULL); if(status != CL_SUCCESS) { cout << "Profiler::initialize Failed to get event start time" << endl; return false; } status = clGetEventProfilingInfo( event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &endNanoSecs, NULL); if(status != CL_SUCCESS) { cout << "Profiler::initialize Failed to get event end time" << endl; return false; } cl_double submitTimeMilis = ((cl_double)(submitNanoSecs - queuedNanoSecs))/1000000.0; cl_double startTimeMilis = ((cl_double)(startNanoSecs - submitNanoSecs))/1000000.0; cl_double execTimeMilis = ((cl_double)(endNanoSecs - startNanoSecs))/1000000.0; if (consoleOutput) { //cout << "event start [nanoSecs]: " << startNanoSecs << endl; //cout << "event end [nanoSecs]: " << endNanoSecs << endl; cout << "queued -> submit [miliSecs]: " << submitTimeMilis << endl; cout << "submit -> start [miliSecs]: " << startTimeMilis << endl; cout << "event execTime (start -> end) [miliSecs]: " << execTimeMilis << endl; } if (fileOutput) { //fprintf(file, "event start [nanoSecs]:\t%u\n", startNanoSecs); //fprintf(file, "event end [nanoSecs]:\t%u\n", endNanoSecs); fprintf(file, "queued -> submit [miliSecs]:\t%f\n", submitTimeMilis); fprintf(file, "submit -> start [miliSecs]:\t%f\n", startTimeMilis); fprintf(file, "event execTime [miliSecs]:\t%f\n", execTimeMilis); } totalTime += execTimeMilis; return true; } bool Profiler::writeOutput(string str) { if (fileOutput) { if (file) { fprintf(file, str.c_str()); } else { cout << "Profiler::writeOutput Failed to write to file" << endl; return false; } } if (consoleOutput) { cout << str; } return true; } }

5 Replies

hi crazybee,

The profiling info is calculated in Matrix Multiplication using clGetEventProfilingInfo() function.I ran the code on vista 64

sdk 2.2 and driver 10.8.I do not find any discrepencies.

The result for a 64*64 matrix was:

release mode:.391349ms

debug mode:0401486ms

Profiler mode:.39113ms.

Please send in your system config and GPU config.I can try to reproduce the problem on same GPU,if possible.


System config:

Inter core 2 duo e7400 2.8 ghz

4gb ddr3 ram

Windows xp professional sp 3


- Asus Radeon 4850 top (512 mb ddr3)

Sample configuration:

128*128 matrices, 100 iterations.


 Have you tried running the kernel for smaller matrices.It might be that your memory requirements are more then available on GPU.

If i am not wrong you are using ATI RAdeon chip


Yeah, I'm using an ATI Radeon chip. However, reducing the matrix size didn't help. The results of my own profiler ran without the ATI Stream profiler (in Release mode under Visual C++ 2008) are still significantly greater. When I use the ATI stream profiler the results of the two profilers are still almost identical. Any other ideas?



Please send in your complete code if possible.

How is the performance of your code on CPU & GPU with profiler in release mode and without profiler in release mode?

Which SDK and Driver are you using?

You can also refer to the code of Matrix multiplication in openCL Samples.There also the same clGetEventProfilingInfo is used for profiling.And i do not find any discrepencies in timing.

I hope it helps.