Help needed to parallelize this program (prime numbers finder) to benefit from GPU

Discussion created by Mk4ever on Apr 29, 2010
Latest reply on May 5, 2010 by ryta1203
I already wrote a scalar version; I need help with parallelization for my graduation project.




Sorry in advance for the long post, but I am trying to be thorough and cover everything you might need to know.


Note: I am attaching my code.


I am developing an OpenCL program that calculates prime numbers. The purpose is to illustrate the speed and benefits of using a GPU instead of a CPU, as research for my graduation project (I am an undergraduate ITC student).


I know nearly nothing about programming. So any help would be appreciated.


My environment (2 machines):

Laptop (mainly used; I often use Remote Desktop to compile and execute code on the Desktop):

- AMD Turion Ultra X2 ZM-80 2.1GHz (K8 architecture, OpenCL support)

- AMD/ATI 3200 (IGP on 780G northbridge), no OpenCL support

- Windows 7 Pro x64

- Visual Studio 2008 Pro, with AMD SDK 2.01, Catalyst 10.2



Desktop:

- AMD Athlon 64 X2 4000+ 2.1GHz (K8 architecture, OpenCL support)

- ATI HD4670, beta OpenCL support (Made by Sapphire, VEN_1002&DEV_9490, sometimes causes problems as this specific number isn't included in ATI's driver's .inf when installing - might be relevant)

- Windows Server 2008 R2 x64

- Visual Studio 2008 Pro, with AMD SDK 2.01, Catalyst 10.3


** My Desktop has a weird problem, reported by other users as well. I am mentioning it only to check whether it’s relevant in some way:

- Installing Catalyst 10.2 or 10.3 completes successfully, and the driver works fine, but catalyst menus never show up, and whenever I reboot my machine, or access it remotely, Catalyst software doesn’t open, giving an error message saying that my card is not supported.

- OpenCL SDK installer has no issues with installing though, no errors about the graphics card not being supported, and OpenCL samples compile and work fine

- Running OpenCL code for GPU through RDP (Remote Desktop) never works, as if GPU support were gone



Since I am not a programmer (yet), nearly all my code is copy/paste of the Introductory Tutorial to OpenCL, on AMD’s website (I tried to remove error checking and reporting as much as I could, as it’s not my target at this point). Only the code that calculates the prime numbers is mine.


The problem is, I don’t know how to parallelize it to take advantage of the GPU. I studied all the available examples and read the relevant parts of the OpenCL specification, but I understood only a few things and am still lost about what to do. This material is meant for programmers, not newbies like me. By the way, I don’t care about error checking and similar details at this stage, as they are beyond my research scope and might embarrass me during the discussion of my project, since I know very little about them.
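To make the question concrete, here is a rough picture of what a parallel version might look like, emulated on the CPU in plain C++ rather than written as real OpenCL. It assumes one work-item per odd candidate, with index gid mapped to X = 3 + 2 * gid, and it replaces the shared Ar counter with a flags array that is compacted afterwards. The names testCandidate and findPrimes are hypothetical and do not come from the tutorial or the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one GPU work-item: decides, independently of all
// other candidates, whether the odd number mapped to index gid is prime.
// Index 0 maps to 3, index 1 to 5, and so on (X = 3 + 2 * gid).
int testCandidate(std::size_t gid)
{
    const long long x = 3 + 2 * static_cast<long long>(gid);
    for (long long y = 3; y * y <= x; y += 2) {
        if (x % y == 0) return 0;   // found a divisor, so x is not prime
    }
    return 1;                       // no odd divisor up to sqrt(x)
}

// Emulates the whole launch on the CPU. Phase 1 could run in any order (or
// all at once on a GPU) because each iteration writes only flags[gid].
// Phase 2 is the serial part that stands in for the shared Ar counter.
std::vector<int> findPrimes(int limit)
{
    assert(limit >= 2);
    std::vector<int> primes(1, 2);                 // 2 is the only even prime
    if (limit < 3) return primes;
    const std::size_t count = static_cast<std::size_t>((limit - 3) / 2 + 1);
    std::vector<int> flags(count);
    for (std::size_t gid = 0; gid < count; ++gid)  // phase 1: independent tests
        flags[gid] = testCandidate(gid);
    for (std::size_t gid = 0; gid < count; ++gid)  // phase 2: compaction
        if (flags[gid]) primes.push_back(static_cast<int>(3 + 2 * gid));
    return primes;
}
```

Calling findPrimes(27222) would cover the same range as my scalar loop. Only phase 1 is a candidate for the GPU in this sketch; whether it would actually run faster on the HD4670 is exactly what I am unsure about.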


The only thing I could try (with my knowledge) was switching between CL_DEVICE_TYPE_CPU and CL_DEVICE_TYPE_GPU when running my code on the Desktop (directly, no Remote Desktop), and even that gave odd results: whether I specify CPU or GPU, I always get 50-70% CPU utilization and 20-30% GPU utilization (measured by Task Manager and GPU-Z, respectively).


All I need is a minimal OpenCL program that executes my code, with a kernel parallelized to run only on my GPU, the HD4670.
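As a starting point for discussion, this is my rough guess at what the kernel and its launch size could look like. It is a sketch, not something I have run on the HD4670. The kernel name mark_primes, the constant kMarkPrimesSource, and the helpers candidateCount and paddedGlobalSize are all hypothetical names I made up; only __kernel, __global and get_global_id are standard OpenCL C.

```cpp
#include <cassert>
#include <cstddef>

// OpenCL C source for a hypothetical kernel named mark_primes (not from the
// AMD tutorial). Each work-item tests one odd candidate and writes one flag.
const char* const kMarkPrimesSource = R"CLC(
__kernel void mark_primes(__global int* flags, const int count)
{
    int gid = get_global_id(0);
    if (gid >= count) return;          /* padding work-items do nothing */
    int x = 3 + 2 * gid;               /* candidate handled by this work-item */
    int prime = 1;
    for (int y = 3; y * y <= x; y += 2) {
        if (x % y == 0) { prime = 0; break; }
    }
    flags[gid] = prime;
}
)CLC";

// Number of odd candidates in [3, limit].
std::size_t candidateCount(int limit)
{
    return limit < 3 ? 0 : static_cast<std::size_t>((limit - 3) / 2 + 1);
}

// Rounds the global work size up to a multiple of the work-group size,
// because the global size must be evenly divisible by the local size.
std::size_t paddedGlobalSize(std::size_t count, std::size_t localSize)
{
    assert(localSize > 0);
    return ((count + localSize - 1) / localSize) * localSize;
}
```

The source string would presumably go into cl::Program::Sources the same way the tutorial passes its hello kernel, and the padded size would replace the fixed cl::NDRange(20) in the enqueueNDRangeKernel call. The work-group size to pass as localSize is something I would still need advice on.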


A step-by-step guide on how to do this, or even hints, would be highly appreciated. I try to depend on myself as much as possible, but I guess current tutorials and documentation target a higher level of experience and knowledge than mine. I still couldn’t locate any easy reference to OpenCL for newbies like me.


Thank you very much in advance.


My code is attached.





#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <iterator>
#include <time.h>
#include <math.h>
#include <utility>

#define __NO_STD_VECTOR // Use cl::vector and cl::string and
#define __NO_STD_STRING // not STL versions, more on this later
#include <cl.h>
#include <cl.hpp>
#include <cl_platform.h>

const std::string hw("Hello World\n");

inline void checkErr(cl_int err, const char * name)
{
    if (err != CL_SUCCESS) {
        std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl;
        exit(EXIT_FAILURE);
    }
}

int main()
{
    cl_int err;
    cl::vector< cl::Platform > platformList;
    cl::Platform::get(&platformList);
    checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get");
    std::cerr << "Platform number is: " << platformList.size() << std::endl;

    cl::STRING_CLASS platformVendor;
    platformList[0].getInfo(CL_PLATFORM_VENDOR, &platformVendor);
    //std::cerr << "Platform is by: " << &platformVendor << "\n";

    cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};
    cl::Context context( CL_DEVICE_TYPE_CPU, cprops, NULL, NULL, &err);
    checkErr(err, "Context::Context()");

    cl::vector<cl::Device> devices;
    devices = context.getInfo<CL_CONTEXT_DEVICES>();

    cl::Program::Sources source;
    cl::Program program(context, source);
    err = program.build(devices,"");
    cl::Kernel kernel(program, "hello", &err);
    cl::CommandQueue queue(context, devices[0], 0, &err);
    cl::Event event;
    err = queue.enqueueNDRangeKernel( kernel, cl::NullRange, cl::NDRange(20), cl::NDRange(1, 1), NULL, &event);

    // The real code starts here. It tests numbers to determine and display
    // prime numbers (2, 3, 5, 7, 11, 13, 17, etc.)
    cl_int Limit, R, X, Y, Ar, Ar2;
    cl_bool Check = true;
    cl_int Array[4000];
    Array[0] = 2;
    Ar = 1;
    Limit = 27222;
    R = Limit;
    /*
      Limit is an option; I can modify the code later so the user inputs the limit manually
      R     is the actual limit that the function will use
      X     holds the number that will be tested
      Y     holds the values we test X against, using the result of X % Y
      Check is a boolean that is set to false if the number tested is not a prime.
            If Check remains true for all Y values, we record the number in Array
      Array holds the X values that were found to be prime. I still set its size
            manually, as I don't know a way to predict/estimate it computationally
      Ar    is the Array counter
      Ar2   is also an Array counter, used only for printing Array values
    */

    clock_t start, finish;
    start = clock(); // timer to check how much time the actual function takes

    for (X = 3; X <= R; X += 2) // testing all numbers, incrementing by 2 to test only odd numbers
    {
        Check = true;
        // Test X against Y (from 2 up to the square root of X). If interested, read
        // about prime numbers on Wikipedia to see how primes can be calculated efficiently
        for (Y = 2; Y <= (sqrt(double(X))); Y++)
        {
            if (X % Y == 0) Check = false; // if not a prime then false
        }
        if (Check == true) // only when the number is a prime, do the following
        {
            Array[Ar] = X; // store the prime in an empty place of the array
            Ar++;
        }
    }

    finish = clock();

    for (Ar2 = 0; Ar2 < Ar; Ar2++)
    {
        std::cout << Array[Ar2] << "\t";
    }
    // clock() counts ticks, so convert to milliseconds before printing
    std::cout << "\nTime needed for completion in milliseconds is:\t" << (double(finish - start) * 1000.0 / CLOCKS_PER_SEC);
    // The code ends here

    cl_int xoxo;
    std::cin>>xoxo;
    return 0;
}
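
On the Array size, which the comment in the code says is still set by hand: one possible way to estimate it, as far as I can tell, is the known bound that the number of primes up to n is less than 1.25506 * n / ln(n) for n > 1. The helper name primeCountUpperBound below is hypothetical, and this is only a sketch of how the capacity could be computed instead of hard-coded.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Upper bound on the number of primes <= limit, using the known inequality
// pi(n) < 1.25506 * n / ln(n), valid for n > 1. The result is a safe
// capacity for the array of primes, not an exact count.
std::size_t primeCountUpperBound(int limit)
{
    assert(limit >= 2);
    const double n = static_cast<double>(limit);
    return static_cast<std::size_t>(std::ceil(1.25506 * n / std::log(n)));
}
```

For Limit = 27222 this bound stays under the hard-coded 4000, so the current array is large enough. For a user-supplied limit, the array could be allocated with this size rather than a fixed constant.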