7 Replies Latest reply on May 5, 2010 2:27 PM by ryta1203

    Help needed to parallelize this program (prime numbers finder) to benifit from GPU

    Mk4ever
      I already wrote a scalar version, I need help with parallelization for my graduation project

      <!-- /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-parent:""; margin:0cm; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:"Times New Roman"; mso-fareast-font-family:"Times New Roman";} @page Section1 {size:612.0pt 792.0pt; margin:72.0pt 90.0pt 72.0pt 90.0pt; mso-header-margin:36.0pt; mso-footer-margin:36.0pt; mso-paper-source:0;} div.Section1 {page:Section1;} -->

      Hello,

       

      Sorry in advance for the long post, but I am trying to be thorough to cover everything you might need to know

       

      Note: I am attaching my code.

       

      I am developing an OpenCL program that calculates prime numbers. The purpose is to illustrate the speed and benefits of utilizing GPU instead of CPU, for a research as a graduation project (I am an undergraduate ITC student).

       

      I know nearly nothing about programming. So any help would be appreciated.

       

      My environment (2 machines):

      Laptop ( Mainly used,I use remote desktop often to compile and execute code on Desktop)

      - AMD Turion Ultra X2 ZM-80 2.1GHz (K8 architecture, OpenCL support)

      - AMD/ATI 3200 (IGP on 780G northbridge), no OpenCL support

      - Windows 7 Pro x64

      - Visual Studio 2008 Pro, with AMD SDK 2.01, Catalyst 10.2

       

      Desktop

      - AMD Athlon 64 X2 4000+ 2.1GHz (K8 architecture, OpenCL support)

      - ATI HD4670, beta OpenCL support (Made by Sapphire, VEN_1002&DEV_9490, sometimes causes problems as this specific number isn't included in ATI's driver's .inf when installing - might be relevant)

      - Windows Server 2008 R2 x64

      - Visual Studio 2008 Pro, with AMD SDK 2.01, Catalyst 10.3

      ------------

      ** My Desktop has a weird problem – reported by other users as well. I am only mentioning it just to check if it’s relevant in some way:

      - Installing Catalyst 10.2 or 10.3 completes successfully, and the driver works fine, but catalyst menus never show up, and whenever I reboot my machine, or access it remotely, Catalyst software doesn’t open, giving an error message saying that my card is not supported.

      - OpenCL SDK installer has no issues with installing though, no errors about the graphics card not being supported, and OpenCL samples compile and work fine

      - Running OpenCL code for GPU through RDP (Remote Desktop) never works, as if GPU support has gone

      ------------

       

      Since I am not a programmer (yet), nearly all my code is copy/paste of the Introductory Tutorial to OpenCL, on AMD’s website (I tried to remove error checking and reporting as much as I could, as it’s not my target at this point). Only the code that calculates the prime numbers is mine.

       

      The problem is, I don’t know how to parallelize it, to take advantage of the GPU. I tried to study all examples available, and read relevant parts of the OpenCL specification. I am afraid I understood only a few stuff, but in general still lost and don’t know what to do. This stuff is meant for programmers, not newbies like me. By the way, I don’t care at this stage about all error checking and other stuff at this stage, as it is beyond my research scope, and might embarrass me during discussion of my project, as I know very little about them.

       

      The only thing I could try (with my knowledge) is switch between CL_DEVICE_TYPE_CPU and CL_DEVICE_TYPE_GPU when running my code on Desktop (directly, no remote desktop), and even that resulted in funny results, whether specifying CPU or GPU always results in 50-70% CPU utilization and 20-30% GPU utilization (measured by task manager and GPU-Z, respectively),

       

      All I need is a minimal OpenCL program that executes my code, and the kernel to be optimized for parallelization only on my GPU, HD4670.

       

      A step-by-step guide on how to do this, or even hints, would be highly appreciated. I try to depend on myself as much as possible, but I guess current turorials and documentation are targeting a higher level of experience and knowledge than mine. I still couldn’t locate any easy reference to OpenCL, for newbies like me.

       

      Thank you very much in advance.

       

      My code is attached.

       

       

       

       



      #include <cstdio> #include <cstdlib> #include <fstream> #include <iostream> #include <string> #include <iterator> #include <time.h> #include <math.h> #include <utility> #define __NO_STD_VECTOR // Use cl::vector and cl::string and #define __NO_STD_STRING // not STL versions, more on this later #include <cl.h> #include <cl.hpp> #include <cl_platform.h> const std::string hw("Hello World\n"); inline void checkErr(cl_int err, const char * name) { if (err != CL_SUCCESS) { std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl; exit(EXIT_FAILURE); } } int main() { cl_int err; cl::vector< cl::Platform > platformList; cl::Platform::get(&platformList); checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get"); std::cerr << "Platform number is: " << platformList.size() << std::endl; cl::STRING_CLASS platformVendor; platformList[0].getInfo(CL_PLATFORM_VENDOR, &platformVendor); //std::cerr << "Platform is by: " << &platformVendor << "\n"; cl_context_properties cprops[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0}; cl::Context context( CL_DEVICE_TYPE_CPU, cprops, NULL, NULL, &err); checkErr(err, "Conext::Context()"); cl::vector<cl::Device> devices; devices = context.getInfo<CL_CONTEXT_DEVICES>(); cl::Program::Sources source; cl::Program program(context, source); err = program.build(devices,""); cl::Kernel kernel(program, "hello", &err); cl::CommandQueue queue(context, devices[0], 0, &err); cl::Event event; err = queue.enqueueNDRangeKernel( kernel, cl::NullRange, cl::NDRange(20), cl::NDRange(1, 1), NULL, &event); // The real code starts here, it is used to test numbers to determine and display prime numbers (2, 3, 5, 7, 11, 13, 17, etc) cl_int Limit, R, X, Y, Ar, Ar2; cl_bool Check = true; cl_int Array[4000]; Array[0] = 2; Ar = 1; Limit = 27222; R = Limit; /* Limit is an option, where I can modify the code later to make the user input the limit manually, R is the actual limit that the function will use X holds the number that will be tested Y is the values we test X against, using the result of X % Y Check is a boolean that is set to flase if the number tested is not a prime. If Check remains true for all Y values, we record the number in Array Array is an array that will hold the X values that are checked to be prime numbers, I still set its value manually as I still don't know a way to computationally predict/estimate it Ar is the Array counter Ar2 is also an Array counter, used only for printing Array values */ clock_t start, finish; start = clock(); // implemented a timer to check how much time the actual function takes for (X = 3; X <= R; X += 2) // testing all numbers, incrementing by 2 to test only odd numbers { Check = true; for (Y = 2; Y <= (sqrt(double(X))); Y++) //testing X against Y (from 2 until the square root of X). If intersted, read { //prime numbers in wikipedia to see how primes can be calculated effeceintly if (X % Y == 0) Check = false; //if not a prime then flase } if (Check == true) //only when the number is a prime, do the following { Array[Ar] = X; //store the prime in an empty place of the array Ar++; } } finish = clock(); for (Ar2 = 0; Ar2 <Ar; Ar2++) { std::cout << Array[Ar2] << "\t"; } std::cout << "\nTime needed for completion in milliseconds is:\t" << (double(finish - start)); // The code ends here cl_int xoxo; std::cin>>xoxo; return 0; }

        • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
          Mk4ever

          I'm really desperate for help. Any help or even hints/recommended easy tutorials would help me a lot.

          To make things simpler, this is the exact code I need to parallelize on my HD 4670. I need this for my graduation project.

          Thank you in advance.

          int Limit, X, Y, Ar, Ar2; bool Check = true; int Prime_List[3250]; Prime_List[0] = 2; Ar = 1; Limit = 27222; for (X = 3; X <= Limit; X++) { Check = true; for (Y = 2; Y <= (sqrt(double(X))); Y++) { if (X % Y == 0) { Check = false; break; } } if (Check == true) { Prime_List[Ar] = X; Ar++; } }

            • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
              hazeman

              Sorry to say but there is nothing to optimize - probably that's why no one bothered to answer.

              The program you published is naive implementation of prime number searching. Sieve of Erathosthenes (http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) is much more efficient on CPU ( it should be orders of magnitude faster than your code ).

              But for gpu sieve of Erathosthenes isn't good choice. GPU's can do a lot of computations but don't like too much memory transfer and they really don't like thread synchronization.

              So first step in solving your problem on gpu is finding proper algorithm - and this is vast and not easy subject. Maybe you could try to modify some parallel ( on cpu ) algorithm for prime searching.

               

                • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
                  Mk4ever

                  Thank you very much for your reply.

                  I know my code isn't the fastest, or even the most optimized. In fact, I already developed another version of this code (significantly faster on cpu, but not parallelizable).

                  The aim of my graduation project is to demonstrate that even less effeceint code can be faster through parallelization, or at least test that hypothesis  and reach conclusions. My intention is to provide 3 programs:

                  • Program on CPU, non-effeceint (the code I posted)
                  • Program on CPU, more effeceint and faster, but not parallelizable
                  • Program on GPU with OpenCL, which I need help with (based on the code I posted, but with parallelization, theoritically should be the fastest of the 3)

                  As for Sieve of Erathosthenes's code, I dont' need to use that, as it is just a graduation project. I don't have to use the most effecient code. I am only aiming to test the effect of parallelization on some code and reach conclusions, no matter how the code is retarded in the first place.

                  Also, I designed my program so that each iteration for X does not depend on the result of the iterations before (which is the non-effecient part), so there's is really no need for synchronization at all.

                  Now that I explained my reasoning, I am still hoping someone will help me with making this code OpenCL compliant, targeting my HD 4670, or give me any useful hints.

                  I read all examples from AMD, as well as Apple's "Hello world". I still don't know how to achieve my program yet though (I spent a lot of time modifying kernels of these samples/tutorials to suit my project, but nothing worked so far, as I'm still a newbie).

                  Any help would be highly appreciated.

                    • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
                      galmok

                      How about letting your outer for-loop:

                      for (X = 3; X <= Limit; X++)

                      Be the loop that OpenCL manages? Store the result for each number tested in its own place in an array that you copy back to main memory when it is full. Iterate the kernel call every time the array is full until the limit is reached. This way each kernel tests 1 number to see if it is a prime number.

                        • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
                          ryta1203

                           

                          Originally posted by: galmok How about letting your outer for-loop:

                          for (X = 3; X <= Limit; X++)

                          Be the loop that OpenCL manages? Store the result for each number tested in its own place in an array that you copy back to main memory when it is full. Iterate the kernel call every time the array is full until the limit is reached. This way each kernel tests 1 number to see if it is a prime number.

                          Yes, this is what I would go with too, unless you can find another algorithm like hazeman suggests, then I would go with that.

                          You are not going to be able to easily port the code you have now and expect good performance, you will need to modify it and use some "tricks".

                           

                        • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
                          jcpalmer

                           

                          Originally posted by: Mk4ever

                           

                          The aim of my graduation project is to demonstrate that even less effeceint code can be faster through parallelization, or at least test that hypothesis  and reach conclusions. My intention is to provide 3 programs:

                           

                           

                          • Program on CPU, non-effeceint (the code I posted)
                          • Program on CPU, more effeceint and faster, but not parallelizable
                          • Program on GPU with OpenCL, which I need help with (based on the code I posted, but with parallelization, theoritically should be the fastest of the 3)

                          As for Sieve of Erathosthenes's code, I dont' need to use that, as it is just a graduation project. I don't have to use the most effecient code. I am only aiming to test the effect of parallelization on some code and reach conclusions, no matter how the code is retarded in the first place.



                          Sounds a little like test/application of amdahl's law.  Maybe it would be good bring that up in your final "paper", or whatever is it called.

                            • Help needed to parallelize this program (prime numbers finder) to benifit from GPU
                              Mk4ever

                              @ galmok

                              Thank you very much for your reply. I had a guess that this is how I should do it, but I wasn't so sure. Thank you for confirming this.

                              Still, I am still struggling with how to implement that in code. So far nothing worked on my side. Some help here would be truly appreciated.

                              @ jcpalmer

                              Thank you for your reply. I didn't know someone else came up with that! So unfair that if someone beats you to the idea you get no credit and they get everything. But since AMD is involved somehow (AMDahl's law), I guess I have to mention it!

                              Just kidding. A very useful idea that I'll include in my final report.

                              ___

                              Finally, I am currently working on an implementation like what galmok suggested, but I am still lost as to how to implement the kernel, how to specify memory management (At this point I think I don't need that, not to mention that I don't know how to do it anyways), and how to separate the kernel in a separate *.cl file (like every tutorial I read suggested) and how to call it form the main c/cpp file.

                              While I will keep trying, any help from you guys will be truly appreciated.