cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

dravisher
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL (updated)

Easy to use (copy-paste) code for anyone interested

Edit march 2011: I've updated the code quite a bit. The generator now works correctly with SDK 2.3 and SDK 2.4 rc1, and it should be much simpler to use. Newest version can be found on bitbucket, just grab the newest zip-file. As before there's also a program that can verify that the generator is producing correct sequences, measures performance and shows how to use the generator.

 

Hello. This is a piece of code I've developed for my own use, because I couldn't really find any good pseudorandom number generators for OpenCL that were simple to implement and understand. I'm sharing it so that others who just want a good PRNG can use it, without having to spend time porting a generator themselves. The generator is meant to be used with one copy per work-item. State data is kept in 7 float4 vectors.

The algorithm was developed by Martin Lüscher, while the Fortran 77 implementation this code is based on was developed by Fred James. The two relevant papers are:

 Martin Lüscher, A portable high-quality random number generator for lattice field theory simulations, Computer Physics Communications 79 (1994) 100-110


F. James, RANLUX: A Fortran implementation of the high-quality pseudorandom number generator of Lüscher, Computer Physics Communications 79 (1994) 111-114

The code is licensed with an MIT license. If you really do find it usefull or interesting I'd love to hear about it :-). Also, if anyone thinks there is a problem with it please let me know, as I'm planning to make use of it myself for Monte Carlo simulations.

Usage of the generator is described as comments in the code, and I think the example code should show pretty clearly how to use the generator. The example code can also check the correctness of the implementation agains values generated by the original Fortran 77 code.

To see a list of the arguments just run prngtest.exe. To check performance and correctness of a GPU at luxury level 4 you'd run "prngtest.exe 4 1".

About RANLUX
RANLUX is an "old" (proposed in 1994) PRNG. It has been used extensively, and to the best of my knowledge no defects have been found at luxury setting 2 and above (which perfectly lines up with the inventors predictions).

 The generator has a few notable features:


All calculations are in 32-bit floating point. The algorithm generates a number between 0 and 1 directly.

It is possible to select a "luxury level", where higher luxuries mean better but slower numbers. Luxury levels 0-4 are presented here, 4 being best.

RANLUX is also one of very few (the only one I know of actually) PRNGs that has a underlying theory explaining why it generates "random" numbers, and why they are good. Indeed, if the theory is correct (and I don't know of anyone who has disputed it), RANLUX at the highest luxury level produces completely decorrelated numbers down to the last bit, with no long-range correlations as long as we stay well below the period (10^171). Most other generators can say very little about their quality (like Mersenne Twister, KISS etc.) They must rely on passing statistical tests.

Performance
At luxury level 0, more than 20*10^9 numbers per second on a Cypress. At luxury level 4, more than 4*10^9 numbers per second.

 

0 Likes
34 Replies
drstrip
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

While this is a useful and generous contribution, a word of caution: Using ostensibly independent RNGs for each element (work item) in a massively parallel computation is frowned on in the MP community, as it creates the opportunity for very hard to detect correlations between streams of RNGs. Using fewer generators is much preferred.

 

If one makes some assumptions about the architecture of the target GPU and how threads are assigned and processed, one can reduce the number of RNGs. (Ideally, we should have an OpenCL call which tells us which SIMD we're on.) If the code surrounding utilizing the RNG has constant execution time across all threads and you assume round-robin scheduling (which I've been told is currently true for AMD OpenCL), then you can determine which SIMD you're on. In this case you can reduce the number of RNGs to wavefront size * number of SIMDs and be assured of no read/write inconsistencies. For problems with very large numbers of work items (my case was millions), this is a huge reduction in the size of the state space, as well as better practice in terms of using shared RNGs.

0 Likes
dravisher
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

Hello and thanks for the comment drstrip 🙂

I do know that PRNGs in parallel still remains a bit of a problem even today. Indeed as far as I know there are no PRNGs that can actually guarantee independent sequences in parallel. For instance the usual method for Mersenne Twister, called Dynamic Creator for Mersenne Twister (DCMT) can not be proven to result in uncorrelated parallel sequences, it is only considered likely that it does.

I haven't really been able to find any solid documents expaining why RANLUX is safe in parallel, however I believe it stems from the underlying theory (that the generator is connected to a classical chaotic system). As such RANLUX is probably the generator which has the stronges evidence for being safe in parallel applications. If I understand it correctly, there should be no correlations unless we have overlapping sequences (which is a practical impossibility considering the period of 10^171). The other usual suspects (like Mersenne Twister) has no similar theory, so it is much more of a black art in those cases.

While the approach you mention would certainly be more elegant if performance is maintaned, it really is too bad that it can't be done in a portable manner.

0 Likes
drstrip
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

While OpenCL is in principle portable, if you're writing code for something that really matters (and presumably if you're going to the trouble of porting to a GPU it does matter), then the target architecture makes a big difference. Older AMD/ATI chips don't have fast shared memory and if I understand correctly, NVidia chips don't have fast vector processing like the AMD chips do. Some  code I've written recently doesn't have a natural vector nature, but by using int4 vectors at the expense of readability, I can get much better throughput. There would be no reason to do this on chipsets that don't have fast vector processors. So, portability is largely an illusion when speed matters. At least that's how it appears to me.

0 Likes
spectral
Adept II

The RANLUX pseudorandom number generator implemented in OpenCL

I have see this too : http://inst.cs.berkeley.edu/~ianh/proj1.html

MonteCarloCURAND:

and in the following page it sounds that the new CUDA has a "MonteCarloCURAND" generator example ! Maybe it will be interesting !

Mersenne twister :

http://git.jcornwall.me.uk/MersenneTwisterOCL/tree/MersenneTwisterOCL.py?id=00ea4652c0cf0d2ebcf91540...

http://www.pgroup.com/lit/articles/insider/v2n2a4.htm

0 Likes
dravisher
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

The CURAND library seems very cool, just wish it was for OpenCL instead. But thanks for pointing it out, it's interesting nonetheless!

Mersenne Twister has for some reason been very popular on the GPU, but I don't really understand why. It uses a huge state array, so unless several work-items share one array (which necessitates frequent synchronization) it's not really very fast. I think sometimes they reduce the size of the state array, but then we're suddenly dealing with a generator of unknown quality. Statistical tests only tell so much, it's also important that a generator has survived use in a wide range of actual applications (at least it is for me).

 That's why I like RANLUX. It's fast enough (at least for my uses), has a small state array (at least compared with Mersenne Twister), it's old and well tested, and unlike most other generators it has an underlying theory for why it's a good generator.

0 Likes
spectral
Adept II

The RANLUX pseudorandom number generator implemented in OpenCL

There is also this one :

http://src.luxrender.net/luxrays/file/c79fc1c82c15/samples/smallluxgpu/pathgpu/kernels/pathgpu_kerne...

I don't know how it works and if it is of good quality, but it tell this " maximally equidistributed combined Tausworthe generator".

First, it generate a set of 'random values', pass the array to the kernel, then generate numbers and update the 'array' with the new values.

0 Likes
joohongyee
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

I have tried to compile this file using below command

g++ -framework opencl -o prog.exe PRNGTest_1.9.cpp 

However, prog.exe 1 1 1 1 1 showed me an error message as below.

Can you tell me how can i compile this code without error?

=============================================

 

s1f1:test 1294566$ ./prog.exe 0 1 1 1 1 

PRNGTest 1.9

Using GPU

RANLUX illegal initialization seed: 0. RANLUX initialized using default seed instead: 314159265

Num platforms:       1

Platform Vendor:     Apple

Platform profile:    FULL_PROFILE

Platform version:    OpenCL 1.0 (Apr  7 2010 19:04:28)

Platform name:       Apple

Platform extensions: 

Device name:         GeForce 9400

Device type:         GPU 

Device vendor ID:    16918016

Max compute units:   2

Max work item dim:   3

Max work group size: 512

Max mem alloc size:  128 MiB

Global memory size:  256 MiB

Global memory cache: None

Local memory type:   Dedicated

Prof Timer Res:      1000 ns

OpenCL Driver ver:   CLH 1.0

Building OpenCL program: Done

 

Build log: 

:1:10: fatal error: 'ranlux_unified_1.0.0.cl' file not found

 #include "ranlux_unified_1.0.0.cl"

          ^

ERROR (-45): Kernel::Kernel(): Invalid program executable

==============================================

 

0 Likes
dravisher
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

Seems it's not finding ranlux_unified_1.0.0.cl. The line:

BuildOptions += "-I . "; //Search for include files in current directory

In the .cpp file should tell the OpenCL implementation to look for it in the working directory, are you sure it's there?

You could also just paste the entire contents of the ranlux_unified_1.0.0.cl file into the PRNGTest_kernels_1.9.cl file, since that's all the include statement is supposed to be doing anyway.

0 Likes
joohongyee
Journeyman III

The RANLUX pseudorandom number generator implemented in OpenCL

I have downloaded cl.hpp file from "http://www.khronos.org/registry/cl/api/1.1/cl.hpp"

then stored in the same directory with 3 downloaded files.

 

First I have compile with -I. option like below.

 g++ -framework opencl -I. -o test PRNGTest_1.9.cpp

However, it showed the same error as above. 

 

Second, as you have indicated I have included head file in PRNGTest_1.9.cpp.

#include "cl.hpp"

Actually, it was already included before getting above error sign.

 

Third, I didn't include "#include "ranlux_unified_1.0.0.cl"" then copied in PRNGTest_kernels._1.9.cl then compiled like above " g++ -framework opencl -I. -o test PRNGTest_1.9.cpp"

Then i got an error message as below.

============================================

 

ipswitch:test innovationspace$ g++ -framework opencl -I. -o test PRNGTest_1.9.cpp

ipswitch:test innovationspace$ ./test 0 1 1 1 1 

PRNGTest 1.9

Using GPU

RANLUX illegal initialization seed: 0. RANLUX initialized using default seed instead: 314159265

Num platforms:       1

Platform Vendor:     Apple

Platform profile:    FULL_PROFILE

Platform version:    OpenCL 1.0 (Apr  7 2010 19:04:28)

Platform name:       Apple

Platform extensions: 

Device name:         GeForce 9400

Device type:         GPU 

Device vendor ID:    16918016

Max compute units:   2

Max work item dim:   3

Max work group size: 512

Max mem alloc size:  128 MiB

Global memory size:  256 MiB

Global memory cache: None

Local memory type:   Dedicated

Prof Timer Res:      1000 ns

OpenCL Driver ver:   CLH 1.0

Building OpenCL program: Done

 

Build log: 

:124:6: error: conflicting types for 'ranlux_download_seed'

 void ranlux_download_seed(      float4 *s01to04, 

      ^

:14:2: note: previous implicit declaration is here

         ranlux_download_seed(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry, PRNGTab, gid, global_size);

         ^

:153:6: error: conflicting types for 'ranlux_upload_seed'

 void ranlux_upload_seed(float4 *s01to04, 

      ^

:44:2: note: previous implicit declaration is here

         ranlux_upload_seed(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry, PRNGTab, gid, global_size);

         ^

:239:8: error: conflicting types for 'ranlux'

 float4 ranlux(float4 *s01to04, 

        ^

:22:14: note: previous implicit declaration is here

                 randomnr = ranlux(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry);

                            ^

ERROR (-45): Kernel::Kernel(): Invalid program executable

ipswitch:test innovationspace$ 

==============================================
I am running this program on Mac computer. That's is why i am using command like compiling method. Thanks for your quick reply for the posting. 
0 Likes