Edit March 2011: I've updated the code quite a bit. The generator now works correctly with SDK 2.3 and SDK 2.4 RC1, and it should be much simpler to use. The newest version can be found on Bitbucket; just grab the newest zip file. As before, there's also a program that verifies that the generator produces correct sequences, measures performance, and shows how to use the generator.
Hello. This is a piece of code I've developed for my own use, because I couldn't really find any good pseudorandom number generators for OpenCL that were simple to implement and understand. I'm sharing it so that others who just want a good PRNG can use it, without having to spend time porting a generator themselves. The generator is meant to be used with one copy per work-item. State data is kept in 7 float4 vectors.
The algorithm was developed by Martin Lüscher, while the Fortran 77 implementation this code is based on was developed by Fred James. The two relevant papers are:
Martin Lüscher, A portable high-quality random number generator for lattice field theory simulations, Computer Physics Communications 79 (1994) 100-110
F. James, RANLUX: A Fortran implementation of the high-quality pseudorandom number generator of Lüscher, Computer Physics Communications 79 (1994) 111-114
The code is licensed under the MIT license. If you really do find it useful or interesting I'd love to hear about it :-). Also, if anyone thinks there is a problem with it please let me know, as I'm planning to use it myself for Monte Carlo simulations.
Usage of the generator is described in comments in the code, and I think the example code should show pretty clearly how to use it. The example code can also check the correctness of the implementation against values generated by the original Fortran 77 code.
To see a list of the arguments just run prngtest.exe. To check performance and correctness of a GPU at luxury level 4 you'd run "prngtest.exe 4 1".
About RANLUX
RANLUX is an "old" (proposed in 1994) PRNG. It has been used extensively, and to the best of my knowledge no defects have been found at luxury setting 2 and above (which lines up perfectly with the inventor's predictions).
The generator has a few notable features:
All calculations are in 32-bit floating point. The algorithm generates a number between 0 and 1 directly.
It is possible to select a "luxury level", where higher luxuries mean better but slower numbers. Luxury levels 0-4 are presented here, 4 being best.
RANLUX is also one of very few PRNGs (the only one I know of, actually) that has an underlying theory explaining why it generates "random" numbers, and why they are good. Indeed, if the theory is correct (and I don't know of anyone who has disputed it), RANLUX at the highest luxury level produces completely decorrelated numbers down to the last bit, with no long-range correlations, as long as we stay well below the period (10^171). Most other generators (like Mersenne Twister, KISS, etc.) can say very little about their quality; they must rely on passing statistical tests.
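To make the above concrete, here is a minimal scalar sketch of the subtract-with-borrow recipe RANLUX is built on, including the luxury-level skipping. The lags and p values follow James's Fortran 77 RANLUX, but this is an illustration, not the vectorized OpenCL code from the download; in particular, the seed filler in ranlux_init is a simple stand-in, not James's initialization, and the small-value fix from his code is omitted.

```c
#include <assert.h>

#define TWOM24 (1.0f / 16777216.0f)   /* 2^-24: values are 24-bit floats */

typedef struct {
    float seeds[24];   /* circular buffer of the last 24 values          */
    int   i24, j24;    /* positions 24 and 10 steps back                 */
    float carry;
    int   in24;        /* how many of the current batch of 24 were used  */
    int   nskip;       /* values discarded per batch (the luxury level)  */
} ranlux_state;

/* p values for luxury levels 0..4, from James's paper; nskip = p - 24 */
static const int ranlux_p[5] = { 24, 48, 97, 223, 389 };

void ranlux_init(ranlux_state *s, unsigned seed, int lux /* 0..4 */)
{
    unsigned j = seed ? seed : 314159265u;   /* default seed, as in the post */
    for (int i = 0; i < 24; ++i) {
        j = j * 69069u + 1u;   /* simple LCG filler; NOT James's init */
        s->seeds[i] = (float)(j >> 8) * TWOM24;
    }
    s->i24 = 23; s->j24 = 9;
    s->carry = 0.0f; s->in24 = 0;
    s->nskip = ranlux_p[lux] - 24;
}

/* One subtract-with-borrow step: x_n = x_(n-10) - x_(n-24) - carry */
static float ranlux_step(ranlux_state *s)
{
    float uni = s->seeds[s->j24] - s->seeds[s->i24] - s->carry;
    if (uni < 0.0f) { uni += 1.0f; s->carry = TWOM24; }
    else            { s->carry = 0.0f; }
    s->seeds[s->i24] = uni;
    s->i24 = (s->i24 + 23) % 24;   /* step both lags backwards, wrapping */
    s->j24 = (s->j24 + 23) % 24;
    return uni;
}

/* Deliver 24 numbers, then silently discard nskip more; higher luxury
   means more discarded steps and hence better decorrelation. */
float ranlux_next(ranlux_state *s)
{
    float uni = ranlux_step(s);
    if (++s->in24 == 24) {
        s->in24 = 0;
        for (int k = 0; k < s->nskip; ++k)
            ranlux_step(s);
    }
    return uni;
}
```

Luxury level 0 delivers every number the base generator produces, while level 4 discards 365 of every 389 steps, which is where the performance difference in the next section comes from.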
Performance
At luxury level 0, the generator produces more than 20*10^9 numbers per second on a Cypress; at luxury level 4, more than 4*10^9 numbers per second.
While this is a useful and generous contribution, a word of caution: Using ostensibly independent RNGs for each element (work item) in a massively parallel computation is frowned on in the MP community, as it creates the opportunity for very hard to detect correlations between streams of RNGs. Using fewer generators is much preferred.
If one makes some assumptions about the architecture of the target GPU and how threads are assigned and processed, one can reduce the number of RNGs. (Ideally, we should have an OpenCL call which tells us which SIMD we're on.) If the code using the RNG has constant execution time across all threads and you assume round-robin scheduling (which I've been told is currently true for AMD OpenCL), then you can determine which SIMD you're on. In this case you can reduce the number of RNGs to wavefront size * number of SIMDs and be assured of no read/write inconsistencies. For problems with very large numbers of work items (my case was millions), this is a huge reduction in the size of the state space, as well as better practice in terms of using shared RNGs.
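A sketch of the index arithmetic drstrip describes; the function name is hypothetical and the round-robin assignment is exactly the assumption he states, since OpenCL provides no portable call that reports which SIMD a work-item runs on:

```c
/* Hypothetical mapping for drstrip's scheme: if wavefronts are dispatched
   round-robin over the SIMDs, work-items that can run concurrently never
   map to the same slot, so wavefront_size * num_simds generators suffice. */
unsigned rng_slot(unsigned global_id, unsigned wavefront_size, unsigned num_simds)
{
    unsigned wavefront = global_id / wavefront_size;  /* which wavefront        */
    unsigned lane      = global_id % wavefront_size;  /* position inside it     */
    unsigned simd      = wavefront % num_simds;       /* round-robin assumption */
    return simd * wavefront_size + lane;   /* always < wavefront_size*num_simds */
}
```

For a Cypress-class chip (wavefront size 64, 20 SIMDs) this caps the state at 1280 generators, no matter how many millions of work items the problem has.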
Hello, and thanks for the comment, drstrip 🙂
I do know that PRNGs in parallel remain a bit of a problem even today. Indeed, as far as I know there are no PRNGs that can actually guarantee independent sequences in parallel. For instance, the usual method for Mersenne Twister, Dynamic Creator for Mersenne Twister (DCMT), cannot be proven to result in uncorrelated parallel sequences; it is only considered likely that it does.
I haven't really been able to find any solid documents explaining why RANLUX is safe in parallel; however, I believe it stems from the underlying theory (that the generator is connected to a classical chaotic system). As such, RANLUX is probably the generator with the strongest evidence for being safe in parallel applications. If I understand it correctly, there should be no correlations unless we have overlapping sequences (which is a practical impossibility considering the period of 10^171). The other usual suspects (like Mersenne Twister) have no similar theory, so it is much more of a black art in those cases.
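As a back-of-the-envelope check of that "practical impossibility": the n^2*L/P bound below is a standard birthday-style estimate for streams started at random points on one cycle, not something specific to RANLUX theory.

```c
/* Rough overlap bound: n streams, each drawing L numbers from random
   starting points on a single cycle of period P, collide with
   probability on the order of n*n*L/P (birthday-style estimate). */
double overlap_probability(double n_streams, double draws_each, double period)
{
    return n_streams * n_streams * draws_each / period;
}
```

Even with a million work-items each consuming 10^12 numbers, the bound against a period of 10^171 comes out around 10^-147, which is why overlap is a non-issue in practice.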
While the approach you mention would certainly be more elegant if performance is maintained, it really is too bad that it can't be done in a portable manner.
While OpenCL is in principle portable, if you're writing code for something that really matters (and presumably if you're going to the trouble of porting to a GPU it does matter), then the target architecture makes a big difference. Older AMD/ATI chips don't have fast shared memory and if I understand correctly, NVidia chips don't have fast vector processing like the AMD chips do. Some code I've written recently doesn't have a natural vector nature, but by using int4 vectors at the expense of readability, I can get much better throughput. There would be no reason to do this on chipsets that don't have fast vector processors. So, portability is largely an illusion when speed matters. At least that's how it appears to me.
I have seen this too: http://inst.cs.berkeley.edu/~ianh/proj1.html
MonteCarloCURAND:
And on the following page it sounds like the new CUDA has a "MonteCarloCURAND" generator example! Maybe it will be interesting!
Mersenne twister :
The CURAND library seems very cool, just wish it was for OpenCL instead. But thanks for pointing it out, it's interesting nonetheless!
Mersenne Twister has for some reason been very popular on the GPU, but I don't really understand why. It uses a huge state array, so unless several work-items share one array (which necessitates frequent synchronization) it's not really very fast. I think sometimes they reduce the size of the state array, but then we're suddenly dealing with a generator of unknown quality. Statistical tests only tell so much; it's also important that a generator has survived use in a wide range of actual applications (at least it is for me).
That's why I like RANLUX. It's fast enough (at least for my uses), has a small state array (at least compared with Mersenne Twister), it's old and well tested, and unlike most other generators it has an underlying theory for why it's a good generator.
There is also this one :
I don't know how it works or whether it is of good quality, but it says it is a "maximally equidistributed combined Tausworthe generator".
First it generates a set of 'random values' and passes the array to the kernel, which then generates numbers and updates the array with the new values.
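For reference, the best-known generator in that family is L'Ecuyer's taus88, sketched below. I haven't checked which variant the linked code actually implements, so take this as an illustration of the family rather than a description of that code.

```c
#include <stdint.h>

/* One step of L'Ecuyer's taus88, a maximally equidistributed combined
   Tausworthe generator: three small LFSR components XORed together.
   Seeds must satisfy s1 > 1, s2 > 7, s3 > 15. */
typedef struct { uint32_t s1, s2, s3; } taus88_state;

uint32_t taus88_next(taus88_state *s)
{
    uint32_t b;
    b     = ((s->s1 << 13) ^ s->s1) >> 19;
    s->s1 = ((s->s1 & 4294967294u) << 12) ^ b;
    b     = ((s->s2 << 2)  ^ s->s2) >> 25;
    s->s2 = ((s->s2 & 4294967288u) << 4)  ^ b;
    b     = ((s->s3 << 3)  ^ s->s3) >> 11;
    s->s3 = ((s->s3 & 4294967280u) << 17) ^ b;
    return s->s1 ^ s->s2 ^ s->s3;   /* combine the three components */
}
```

The 12 bytes of state per stream is what makes this family attractive on GPUs, though its period (about 2^88) is far shorter than RANLUX's.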
I have tried to compile this file using the command below:
g++ -framework opencl -o prog.exe PRNGTest_1.9.cpp
However, prog.exe 1 1 1 1 1 showed me the error message below.
Can you tell me how I can compile this code without errors?
=============================================
s1f1:test 1294566$ ./prog.exe 0 1 1 1 1
PRNGTest 1.9
Using GPU
RANLUX illegal initialization seed: 0. RANLUX initialized using default seed instead: 314159265
Num platforms: 1
Platform Vendor: Apple
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.0 (Apr 7 2010 19:04:28)
Platform name: Apple
Platform extensions:
Device name: GeForce 9400
Device type: GPU
Device vendor ID: 16918016
Max compute units: 2
Max work item dim: 3
Max work group size: 512
Max mem alloc size: 128 MiB
Global memory size: 256 MiB
Global memory cache: None
Local memory type: Dedicated
Prof Timer Res: 1000 ns
OpenCL Driver ver: CLH 1.0
Building OpenCL program: Done
Build log:
:1:10: fatal error: 'ranlux_unified_1.0.0.cl' file not found
#include "ranlux_unified_1.0.0.cl"
^
ERROR (-45): Kernel::Kernel(): Invalid program executable
Seems it's not finding ranlux_unified_1.0.0.cl. The line:
BuildOptions += "-I . "; //Search for include files in current directory
in the .cpp file should tell the OpenCL implementation to look for it in the working directory. Are you sure it's there?
You could also just paste the entire contents of the ranlux_unified_1.0.0.cl file into the PRNGTest_kernels_1.9.cl file, since that's all the include statement is supposed to be doing anyway.
I have downloaded cl.hpp file from "http://www.khronos.org/registry/cl/api/1.1/cl.hpp"
then stored it in the same directory as the 3 downloaded files.
First, I compiled with the -I. option like below.
g++ -framework opencl -I. -o test PRNGTest_1.9.cpp
However, it showed the same error as above.
Second, as you indicated, I included the header file in PRNGTest_1.9.cpp.
#include "cl.hpp"
Actually, it was already included before I got the error above.
Third, I removed the "#include "ranlux_unified_1.0.0.cl"" line, copied its contents into PRNGTest_kernels_1.9.cl, and compiled as above: "g++ -framework opencl -I. -o test PRNGTest_1.9.cpp".
Then I got the error message below.
============================================
ipswitch:test innovationspace$ g++ -framework opencl -I. -o test PRNGTest_1.9.cpp
ipswitch:test innovationspace$ ./test 0 1 1 1 1
PRNGTest 1.9
Using GPU
RANLUX illegal initialization seed: 0. RANLUX initialized using default seed instead: 314159265
Num platforms: 1
Platform Vendor: Apple
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.0 (Apr 7 2010 19:04:28)
Platform name: Apple
Platform extensions:
Device name: GeForce 9400
Device type: GPU
Device vendor ID: 16918016
Max compute units: 2
Max work item dim: 3
Max work group size: 512
Max mem alloc size: 128 MiB
Global memory size: 256 MiB
Global memory cache: None
Local memory type: Dedicated
Prof Timer Res: 1000 ns
OpenCL Driver ver: CLH 1.0
Building OpenCL program: Done
Build log:
:124:6: error: conflicting types for 'ranlux_download_seed'
void ranlux_download_seed( float4 *s01to04,
^
:14:2: note: previous implicit declaration is here
ranlux_download_seed(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry, PRNGTab, gid, global_size);
^
:153:6: error: conflicting types for 'ranlux_upload_seed'
void ranlux_upload_seed(float4 *s01to04,
^
:44:2: note: previous implicit declaration is here
ranlux_upload_seed(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry, PRNGTab, gid, global_size);
^
:239:8: error: conflicting types for 'ranlux'
float4 ranlux(float4 *s01to04,
^
:22:14: note: previous implicit declaration is here
randomnr = ranlux(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry);
^
ERROR (-45): Kernel::Kernel(): Invalid program executable
ipswitch:test innovationspace$
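For what it's worth, the "previous implicit declaration" notes in this log suggest the pasted-in functions ended up below the kernel that calls them, so the OpenCL compiler guessed their types at the call site. Forward declarations at the top of the kernel file should resolve that (parameter lists abbreviated here; copy the full ones from ranlux_unified_1.0.0.cl), or simply paste the file's contents above the kernel rather than below it:

```c
/* Forward declarations so the kernel can call these before their
   definitions appear (signatures abbreviated for illustration). */
void ranlux_download_seed(float4 *s01to04, /* ...remaining parameters... */);
void ranlux_upload_seed(float4 *s01to04, /* ...remaining parameters... */);
float4 ranlux(float4 *s01to04, /* ...remaining parameters... */);
```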