2 Replies Latest reply on Dec 1, 2010 6:02 PM by himanshu.gautam

    Question about SKA result interpretation

    edwin.cini

      I've just started using OpenCL for my university thesis, an attempt to implement university course timetabling on a GPU using a genetic algorithm.

      I'm now working on the initial parts of the system and testing exclusively in the Stream KernelAnalyzer (SKA), just to get a feel for the syntax and code performance.  However, I don't know how to interpret the overall performance results.  For example, is the attached report a good or a bad one?  Can it be improved?

      In my code I tried to balance execution paths as much as possible, but this has at times resulted in a drop in the Throughput figure.  I'm attaching my code for any comments or criticism.  It is far from complete and only the tip of the iceberg...

      Edwin

      //----------------------- SKA results -----------------------
      //Name,GPR,Scratch Reg,Min,Max,Avg,ALU,Fetch,Write,Est Cycles,ALU:Fetch,BottleNeck,Threads\Clock,Throughput,CF
      //Radeon HD 5770,17,0,1.00,182070.00,1548.89,43,5,3,1548.89,1.19,Global Write,0.01,9 M Threads\Sec,37

      //----------------------- OpenCL code -----------------------

      /* data structure for timetable configuration */
      typedef struct {
          int TimeslotCount;
          int ExtendedSlotCount;
      } TimetableConfig;

      /* data structure for problem specific configuration */
      typedef struct {
          int RoomCount;
          int RoomFeatureCount;
          int StudentCount;
          int EventCount;
      } ProblemConfig;

      /* data structure for room description */
      typedef struct {
          int Id;                // room id [0..(room count - 1)]
          int Capacity;          // seating capacity
          int AvailableFeatures; // 32-bit vector indicating available facilities
          long ValidTimeSlots;   // 64-bit vector indicating usable timeslots
      } Room;

      /* data structure for event description */
      typedef struct {
          int Id;                // event id [0..(event count - 1)]
          int RequiredFeatures;  // 32-bit vector indicating required facilities
          int AttendanceCount;   // number of attendees
          long ValidTimeslots;   // 64-bit vector indicating usable timeslots
      } Event;

      /* data structure for student description */
      typedef struct {
          int Id;                // student id [0..(student count - 1)]
      } Student;                 // typedef name was missing in the original

      // Create matrix of possible placements and placement count for event
      // (__read_only/__write_only dropped: access qualifiers apply only to image types)
      kernel void computeEventPlacements(
          __constant TimetableConfig* timetableConfig, // specifies number of timeslots
          __constant ProblemConfig* problemConfig,     // specifies problem instance
          __global Event* eventList,                   // list of event structures
          __global Room* roomList,                     // list of room structures
          __global int* eventPlaceMatrix,
          __global int2* eventPlaceCount
      )
      {
          int id = get_global_id(0);
          __global Event* pe = &eventList[id];
          __global Room* pr;

          // get local copy of config for faster access
          __private int roomCount = problemConfig->RoomCount;
          __private int timeslotCount = timetableConfig->TimeslotCount;
          __private int extendedSlotCount = timetableConfig->ExtendedSlotCount;
          __private ulong timeslotMask = 0x01; // 64-bit mask used for timeslot testing

          eventPlaceCount[id].x = id;
          eventPlaceCount[id].y = 0;

          // scan timeslots
          for (int t = 0; t < timeslotCount; t++) {
              // scan rooms
              for (int r = 0; r < roomCount; r++) {
                  pr = &roomList[r];
                  // check if room is suitable
                  // (parentheses added: & binds more loosely than != and ==)
                  bool roomSuitable = (((pe->ValidTimeslots & timeslotMask) != 0) &&
                                       (pr->Capacity >= pe->AttendanceCount) &&
                                       ((pr->AvailableFeatures & pe->RequiredFeatures) == pe->RequiredFeatures));
                  uint i = (roomSuitable) ? 1 : 0;
                  eventPlaceCount[id].y += i;
                  eventPlaceMatrix[t + (r * extendedSlotCount)] = i;
              }
              timeslotMask <<= 1;
          }

          // at this stage we have:
          // * a matrix showing which places we can use
          // * a vector showing the number of places we can use (sum of matrix elements)
      }

        • Question about SKA result interpretation
          edwin.cini

          Following some changes to my kernel, SKA is reporting vastly better performance, but I'm not convinced the change is actually for the better.  Maybe someone with more insight can offer an explanation?

          I see a roughly 475x improvement in reported throughput, but isn't the amount of work carried out the same?  The old version was one-dimensional and did the work serially in a loop, whereas the new version splits the work across three dimensions and does it in parallel.

          Sizes for the problem are 200 to 400 for the first dimension (event id), 10 to 30 for the second dimension (variable r) and 45 or 64 for the third dimension (variable t).  The new code is included with this post.
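          To see why the per-thread numbers changed so drastically even though the total work did not, here is a small host-side sketch (plain C, with hypothetical sizes taken from the upper ends of the ranges above — not Edwin's actual launch code):

          ```c
          #include <assert.h>
          #include <stddef.h>

          /* Hypothetical sizes from the post (upper ends of the ranges). */
          enum { EVENTS = 400, ROOMS = 30, SLOTS = 64 };

          /* Old 1-D kernel: one work-item per event, each looping over all
           * (room, slot) pairs serially.
           * New 3-D kernel: one work-item per (event, room, slot) triple. */
          int main(void) {
              size_t testsOld = (size_t)EVENTS * (ROOMS * SLOTS); /* per-item loop */
              size_t testsNew = (size_t)EVENTS * ROOMS * SLOTS;   /* one per item  */

              /* The total number of suitability tests is identical; only the
               * decomposition changed. */
              assert(testsOld == testsNew);

              /* SKA reports cycles and throughput per thread: each new thread
               * does ~1/(ROOMS*SLOTS) of the old per-thread work, and there are
               * ROOMS*SLOTS times more threads, so Avg cycles collapses
               * (1548.89 -> 3.17) while Threads/Sec explodes. */
              return 0;
          }
          ```

          In other words, the huge throughput jump reflects finer-grained threads, not less total work.
          
          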


          -----------------------------------------------------
          Old Results
          Name,GPR,Scratch Reg,Min,Max,Avg,ALU,Fetch,Write,Est Cycles,ALU:Fetch,BottleNeck,Threads\Clock,Throughput,CF
          Radeon HD 5770,17,0,1.00,182070.00,1548.89,43,5,3,1548.89,1.19,Global Write,0.01,9 M Threads\Sec,37

          -----------------------------------------------------
          New Results
          Name,GPR,Scratch Reg,Min,Max,Avg,ALU,Fetch,Write,Est Cycles,ALU:Fetch,BottleNeck,Threads\Clock,Throughput,CF
          Radeon HD 5770,11,0,3.00,4.10,3.17,41,6,3,3.17,1.76,ALU Ops,5.04,4283 M Threads\Sec,22

          -----------------------------------------------------

          /* create matrix of possible placements and placement count for a single event */
          // (__read_only/__write_only dropped: access qualifiers apply only to image types)
          kernel void computeUsablePlaces(
              __constant TimetableConfig* timetableConfig, // specifies number of timeslots
              __constant ProblemConfig* problemConfig,     // specifies problem instance
              __constant float* entropy,                   // was a scalar but indexed below; made a pointer
              __global Event* eventList,                   // list of event structures
              __global Room* roomList,                     // list of room structures
              __global int* timetable,
              __constant int* attendanceMatrix,            // matrix of event attendances
              __constant int* clashMatrix,                 // matrix of event clashes
              __global int* eventPlaceMatrix,
              __global float2* eventPlaceCount
          )
          {
              int id = get_global_id(0); // event id
              int r  = get_global_id(1); // room id
              int t  = get_global_id(2); // timeslot id

              __global Event* pe = &eventList[id];
              __global Room* pr = &roomList[r];

              // get local copy of config for faster access
              __private int extendedSlotCount = timetableConfig->ExtendedSlotCount;
              __private int eventCount = problemConfig->EventCount; // was undeclared in the original
              __private int placedEventId;
              __private ulong timeslotMask = (ulong)1 << t; // cast added: plain 1 << t overflows int for t >= 31

              // NOTE: many (r, t) work-items share the same id, so these
              // unsynchronized writes and += on eventPlaceCount[id] race with
              // each other; the count needs a reduction or atomics to be correct
              eventPlaceCount[id].x = (float)id;
              eventPlaceCount[id].y = 0.0f;

              placedEventId = timetable[t + (r * extendedSlotCount)];

              // check if room is suitable
              // - room capacity is higher than event attendance
              // - room features meet event requirement
              // (parentheses added: & binds more loosely than == and !=)
              bool roomSuitable = ((pr->Capacity >= pe->AttendanceCount) &&
                                   ((pr->AvailableFeatures & pe->RequiredFeatures) == pe->RequiredFeatures));

              // check if slot is usable
              // - there are no clashes with existing timetable events
              // - timeslot is valid for event
              bool timeSuitable = (((pe->ValidTimeslots & timeslotMask) != 0) &&
                                   (placedEventId != -1) &&
                                   (clashMatrix[id + (eventCount * placedEventId)] != 0));

              uint i = (roomSuitable && timeSuitable) ? 1 : 0;
              eventPlaceCount[id].y += i;
              eventPlaceMatrix[t + (r * extendedSlotCount)] = i;

              // add some randomness to place count therefore randomizing selection of minimum value
              // this is useful for situations where we have more than one minimum integer value
              eventPlaceCount[id].y += entropy[i];

              // on completion of this kernel, for the processed event we have
              // * a matrix showing which places we can use
              // * a vector showing the number of places we can use (sum of matrix elements)
          }

            • Question about SKA result interpretation
              himanshu.gautam

              The ATI Radeon HD 5770 has a peak throughput of 1.36 TFLOPS.  Throughput is measured as the number of instructions executed per unit time, so it can increase if you use the GPU's ALUs more efficiently.
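              That 1.36 TFLOPS figure is just the usual back-of-envelope product of the card's published specs; a quick check in plain C (the constants are the HD 5770's spec-sheet values, not measured numbers):

              ```c
              #include <assert.h>
              #include <math.h>
              #include <stdio.h>

              /* Radeon HD 5770 (Juniper) spec-sheet peak:
               * 800 stream processors x 850 MHz x 2 FLOPs/cycle (multiply-add). */
              int main(void) {
                  double streamProcessors = 800.0;
                  double clockGHz        = 0.850;
                  double flopsPerCycle   = 2.0;   /* one fused multiply-add per SP per clock */

                  double peakGFLOPS = streamProcessors * clockGHz * flopsPerCycle;
                  printf("peak = %.0f GFLOPS\n", peakGFLOPS); /* ~1360 GFLOPS = 1.36 TFLOPS */
                  assert(fabs(peakGFLOPS - 1360.0) < 1e-6);
                  return 0;
              }
              ```

              Real kernels only approach that ceiling when every slot of the VLIW ALUs is kept busy every cycle.
              
              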

              High throughput depends on whether all the stream processing units get enough vectorized work to do, and on whether the ALUs are stalled by memory and instruction-level latencies.
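              As a sketch of what "vectorized work" means on this VLIW hardware, compare the two hypothetical OpenCL kernels below (these are illustrative examples, not code from this thread): the float4 version hands the compiler four independent lanes it can pack into one VLIW instruction slot, where the scalar version may leave lanes idle.

              ```c
              /* Hypothetical sketch: scalar vs. vectorized scaling of a buffer. */
              kernel void scale_scalar(global const float* in, global float* out, float k)
              {
                  int i = get_global_id(0);
                  out[i] = in[i] * k;   /* one ALU lane of useful work per item */
              }

              kernel void scale_vec4(global const float4* in, global float4* out, float k)
              {
                  int i = get_global_id(0);
                  out[i] = in[i] * k;   /* four independent lanes for the VLIW packer */
              }
              ```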

              I suggest you rely more on the ATI Stream Profiler, as it is a dynamic tool that takes kernel launch overhead and memory latencies into account, which SKA does not.