Edit March 2011: I've updated the code quite a bit. The generator now works correctly with SDK 2.3 and SDK 2.4 RC1, and it should be much simpler to use. The newest version can be found on Bitbucket; just grab the newest zip file. As before, there's also a program that verifies that the generator produces correct sequences, measures performance, and shows how to use the generator.
Hello. This is a piece of code I've developed for my own use, because I couldn't really find any good pseudorandom number generators for OpenCL that were simple to implement and understand. I'm sharing it so that others who just want a good PRNG can use it, without having to spend time porting a generator themselves. The generator is meant to be used with one copy per work-item. State data is kept in 7 float4 vectors.
The algorithm was developed by Martin Lüscher, while the Fortran 77 implementation this code is based on was developed by Fred James. The two relevant papers are:
Martin Lüscher, A portable high-quality random number generator for lattice field theory simulations, Computer Physics Communications 79 (1994) 100-110
F. James, RANLUX: A Fortran implementation of the high-quality pseudorandom number generator of Lüscher, Computer Physics Communications 79 (1994) 111-114
The code is licensed under the MIT license. If you really do find it useful or interesting I'd love to hear about it :-). Also, if anyone thinks there is a problem with it please let me know, as I'm planning to use it myself for Monte Carlo simulations.
Usage of the generator is described in comments in the code, and I think the example code should show pretty clearly how to use it. The example code can also check the correctness of the implementation against values generated by the original Fortran 77 code.
To see a list of the arguments just run prngtest.exe. To check performance and correctness of a GPU at luxury level 4 you'd run "prngtest.exe 4 1".
About RANLUX
RANLUX is an "old" (proposed in 1994) PRNG. It has been used extensively, and to the best of my knowledge no defects have been found at luxury setting 2 and above (which lines up perfectly with the inventor's predictions).
The generator has a few notable features:
All calculations are in 32-bit floating point. The algorithm generates a number between 0 and 1 directly.
It is possible to select a "luxury level", where higher luxuries mean better but slower numbers. Luxury levels 0-4 are presented here, 4 being best.
RANLUX is also one of very few PRNGs (the only one I know of, actually) that has an underlying theory explaining why it generates "random" numbers, and why they are good. Indeed, if the theory is correct (and I don't know of anyone who has disputed it), RANLUX at the highest luxury level produces completely decorrelated numbers down to the last bit, with no long-range correlations, as long as we stay well below the period (10^171). Most other generators (like Mersenne Twister, KISS, etc.) can say very little about their quality; they must rely on passing statistical tests.
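To make the above concrete, here is a minimal scalar sketch of the subtract-with-borrow recipe RANLUX is built on, including the luxury-level skipping. The lags and p values follow James's Fortran 77 RANLUX, but this is an illustration, not the vectorized OpenCL code from the download; in particular, the seed filler in ranlux_init is a simple stand-in, not James's initialization, and the small-value fix from his code is omitted.

```c
#include <assert.h>

#define TWOM24 (1.0f / 16777216.0f)   /* 2^-24: values are 24-bit floats */

typedef struct {
    float seeds[24];   /* circular buffer of the last 24 values          */
    int   i24, j24;    /* positions 24 and 10 steps back                 */
    float carry;
    int   in24;        /* how many of the current batch of 24 were used  */
    int   nskip;       /* values discarded per batch (the luxury level)  */
} ranlux_state;

/* p values for luxury levels 0..4, from James's paper; nskip = p - 24 */
static const int ranlux_p[5] = { 24, 48, 97, 223, 389 };

void ranlux_init(ranlux_state *s, unsigned seed, int lux /* 0..4 */)
{
    unsigned j = seed ? seed : 314159265u;   /* default seed, as in the post */
    for (int i = 0; i < 24; ++i) {
        j = j * 69069u + 1u;   /* simple LCG filler; NOT James's init */
        s->seeds[i] = (float)(j >> 8) * TWOM24;
    }
    s->i24 = 23; s->j24 = 9;
    s->carry = 0.0f; s->in24 = 0;
    s->nskip = ranlux_p[lux] - 24;
}

/* One subtract-with-borrow step: x_n = x_(n-10) - x_(n-24) - carry */
static float ranlux_step(ranlux_state *s)
{
    float uni = s->seeds[s->j24] - s->seeds[s->i24] - s->carry;
    if (uni < 0.0f) { uni += 1.0f; s->carry = TWOM24; }
    else            { s->carry = 0.0f; }
    s->seeds[s->i24] = uni;
    s->i24 = (s->i24 + 23) % 24;   /* step both lags backwards, wrapping */
    s->j24 = (s->j24 + 23) % 24;
    return uni;
}

/* Deliver 24 numbers, then silently discard nskip more; higher luxury
   means more discarded steps and hence better decorrelation. */
float ranlux_next(ranlux_state *s)
{
    float uni = ranlux_step(s);
    if (++s->in24 == 24) {
        s->in24 = 0;
        for (int k = 0; k < s->nskip; ++k)
            ranlux_step(s);
    }
    return uni;
}
```

Luxury level 0 delivers every number the base generator produces, while level 4 discards 365 of every 389 steps, which is where the performance difference in the next section comes from.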
Performance
At luxury level 0, the generator produces more than 20*10^9 numbers per second on a Cypress; at luxury level 4, more than 4*10^9 numbers per second.
While this is a useful and generous contribution, a word of caution: Using ostensibly independent RNGs for each element (work item) in a massively parallel computation is frowned on in the MP community, as it creates the opportunity for very hard to detect correlations between streams of RNGs. Using fewer generators is much preferred.
If one makes some assumptions about the architecture of the target GPU and how threads are assigned and processed, one can reduce the number of RNGs. (Ideally, we should have an OpenCL call which tells us which SIMD we're on.) If the code using the RNG has constant execution time across all threads and you assume round-robin scheduling (which I've been told is currently true for AMD OpenCL), then you can determine which SIMD you're on. In this case you can reduce the number of RNGs to wavefront size * number of SIMDs and be assured of no read/write inconsistencies. For problems with very large numbers of work items (my case was millions), this is a huge reduction in the size of the state space, as well as better practice in terms of using shared RNGs.
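A sketch of the index arithmetic drstrip describes; the function name is hypothetical and the round-robin assignment is exactly the assumption he states, since OpenCL provides no portable call that reports which SIMD a work-item runs on:

```c
/* Hypothetical mapping for drstrip's scheme: if wavefronts are dispatched
   round-robin over the SIMDs, work-items that can run concurrently never
   map to the same slot, so wavefront_size * num_simds generators suffice. */
unsigned rng_slot(unsigned global_id, unsigned wavefront_size, unsigned num_simds)
{
    unsigned wavefront = global_id / wavefront_size;  /* which wavefront        */
    unsigned lane      = global_id % wavefront_size;  /* position inside it     */
    unsigned simd      = wavefront % num_simds;       /* round-robin assumption */
    return simd * wavefront_size + lane;   /* always < wavefront_size*num_simds */
}
```

For a Cypress-class chip (wavefront size 64, 20 SIMDs) this caps the state at 1280 generators, no matter how many millions of work items the problem has.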
Hello, and thanks for the comment, drstrip 🙂
I do know that PRNGs in parallel remain a bit of a problem even today. Indeed, as far as I know there are no PRNGs that can actually guarantee independent sequences in parallel. For instance, the usual method for Mersenne Twister, Dynamic Creator for Mersenne Twister (DCMT), cannot be proven to result in uncorrelated parallel sequences; it is only considered likely that it does.
I haven't really been able to find any solid documents explaining why RANLUX is safe in parallel; however, I believe it stems from the underlying theory (that the generator is connected to a classical chaotic system). As such, RANLUX is probably the generator with the strongest evidence for being safe in parallel applications. If I understand it correctly, there should be no correlations unless we have overlapping sequences (which is a practical impossibility considering the period of 10^171). The other usual suspects (like Mersenne Twister) have no similar theory, so it is much more of a black art in those cases.
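As a back-of-the-envelope check of that "practical impossibility": the n^2*L/P bound below is a standard birthday-style estimate for streams started at random points on one cycle, not something specific to RANLUX theory.

```c
/* Rough overlap bound: n streams, each drawing L numbers from random
   starting points on a single cycle of period P, collide with
   probability on the order of n*n*L/P (birthday-style estimate). */
double overlap_probability(double n_streams, double draws_each, double period)
{
    return n_streams * n_streams * draws_each / period;
}
```

Even with a million work-items each consuming 10^12 numbers, the bound against a period of 10^171 comes out around 10^-147, which is why overlap is a non-issue in practice.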
While the approach you mention would certainly be more elegant if performance is maintained, it really is too bad that it can't be done in a portable manner.
While OpenCL is in principle portable, if you're writing code for something that really matters (and presumably if you're going to the trouble of porting to a GPU it does matter), then the target architecture makes a big difference. Older AMD/ATI chips don't have fast shared memory and if I understand correctly, NVidia chips don't have fast vector processing like the AMD chips do. Some code I've written recently doesn't have a natural vector nature, but by using int4 vectors at the expense of readability, I can get much better throughput. There would be no reason to do this on chipsets that don't have fast vector processors. So, portability is largely an illusion when speed matters. At least that's how it appears to me.
I have seen this too: http://inst.cs.berkeley.edu/~ianh/proj1.html
MonteCarloCURAND:
And on the following page it sounds like the new CUDA has a "MonteCarloCURAND" generator example! Maybe it will be interesting!
Mersenne twister :
The CURAND library seems very cool, just wish it was for OpenCL instead. But thanks for pointing it out, it's interesting nonetheless!
Mersenne Twister has for some reason been very popular on the GPU, but I don't really understand why. It uses a huge state array, so unless several work-items share one array (which necessitates frequent synchronization) it's not really very fast. I think sometimes they reduce the size of the state array, but then we're suddenly dealing with a generator of unknown quality. Statistical tests only tell so much; it's also important that a generator has survived use in a wide range of actual applications (at least it is for me).
That's why I like RANLUX. It's fast enough (at least for my uses), has a small state array (at least compared with Mersenne Twister), it's old and well tested, and unlike most other generators it has an underlying theory for why it's a good generator.
There is also this one :
I don't know how it works or whether it is of good quality, but it says it is a "maximally equidistributed combined Tausworthe generator".
First it generates a set of 'random values' and passes the array to the kernel, which then generates numbers and updates the array with the new values.
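For reference, the best-known generator in that family is L'Ecuyer's taus88, sketched below. I haven't checked which variant the linked code actually implements, so take this as an illustration of the family rather than a description of that code.

```c
#include <stdint.h>

/* One step of L'Ecuyer's taus88, a maximally equidistributed combined
   Tausworthe generator: three small LFSR components XORed together.
   Seeds must satisfy s1 > 1, s2 > 7, s3 > 15. */
typedef struct { uint32_t s1, s2, s3; } taus88_state;

uint32_t taus88_next(taus88_state *s)
{
    uint32_t b;
    b     = ((s->s1 << 13) ^ s->s1) >> 19;
    s->s1 = ((s->s1 & 4294967294u) << 12) ^ b;
    b     = ((s->s2 << 2)  ^ s->s2) >> 25;
    s->s2 = ((s->s2 & 4294967288u) << 4)  ^ b;
    b     = ((s->s3 << 3)  ^ s->s3) >> 11;
    s->s3 = ((s->s3 & 4294967280u) << 17) ^ b;
    return s->s1 ^ s->s2 ^ s->s3;   /* combine the three components */
}
```

The 12 bytes of state per stream is what makes this family attractive on GPUs, though its period (about 2^88) is far shorter than RANLUX's.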
I have tried to compile this file using the command below:
g++ -framework opencl -o prog.exe PRNGTest_1.9.cpp
However, prog.exe 1 1 1 1 1 showed me the error message below.
Can you tell me how I can compile this code without errors?
=============================================
s1f1:test 1294566$ ./prog.exe 0 1 1 1 1
PRNGTest 1.9
Using GPU
RANLUX illegal initialization seed: 0. RANLUX initialized using default seed instead: 314159265
Num platforms: 1
Platform Vendor: Apple
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.0 (Apr 7 2010 19:04:28)
Platform name: Apple
Platform extensions:
Device name: GeForce 9400
Device type: GPU
Device vendor ID: 16918016
Max compute units: 2
Max work item dim: 3
Max work group size: 512
Max mem alloc size: 128 MiB
Global memory size: 256 MiB
Global memory cache: None
Local memory type: Dedicated
Prof Timer Res: 1000 ns
OpenCL Driver ver: CLH 1.0
Building OpenCL program: Done
Build log:
:1:10: fatal error: 'ranlux_unified_1.0.0.cl' file not found
#include "ranlux_unified_1.0.0.cl"
^
ERROR (-45): Kernel::Kernel(): Invalid program executable
Seems it's not finding ranlux_unified_1.0.0.cl. The line:
BuildOptions += "-I . "; //Search for include files in current directory
in the .cpp file should tell the OpenCL implementation to look for it in the working directory. Are you sure it's there?
You could also just paste the entire contents of the ranlux_unified_1.0.0.cl file into the PRNGTest_kernels_1.9.cl file, since that's all the include statement is supposed to be doing anyway.
I have downloaded cl.hpp file from "http://www.khronos.org/registry/cl/api/1.1/cl.hpp"
then stored it in the same directory as the 3 downloaded files.
First, I compiled with the -I. option like below.
g++ -framework opencl -I. -o test PRNGTest_1.9.cpp
However, it showed the same error as above.
Second, as you indicated, I included the header file in PRNGTest_1.9.cpp.
#include "cl.hpp"
Actually, it was already included before I got the error above.
Third, I removed the "#include "ranlux_unified_1.0.0.cl"" line, copied its contents into PRNGTest_kernels_1.9.cl, and compiled as above: "g++ -framework opencl -I. -o test PRNGTest_1.9.cpp".
Then I got the error message below.
============================================
ipswitch:test innovationspace$ g++ -framework opencl -I. -o test PRNGTest_1.9.cpp
ipswitch:test innovationspace$ ./test 0 1 1 1 1
PRNGTest 1.9
Using GPU
RANLUX illegal initialization seed: 0. RANLUX initialized using default seed instead: 314159265
Num platforms: 1
Platform Vendor: Apple
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.0 (Apr 7 2010 19:04:28)
Platform name: Apple
Platform extensions:
Device name: GeForce 9400
Device type: GPU
Device vendor ID: 16918016
Max compute units: 2
Max work item dim: 3
Max work group size: 512
Max mem alloc size: 128 MiB
Global memory size: 256 MiB
Global memory cache: None
Local memory type: Dedicated
Prof Timer Res: 1000 ns
OpenCL Driver ver: CLH 1.0
Building OpenCL program: Done
Build log:
:124:6: error: conflicting types for 'ranlux_download_seed'
void ranlux_download_seed( float4 *s01to04,
^
:14:2: note: previous implicit declaration is here
ranlux_download_seed(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry, PRNGTab, gid, global_size);
^
:153:6: error: conflicting types for 'ranlux_upload_seed'
void ranlux_upload_seed(float4 *s01to04,
^
:44:2: note: previous implicit declaration is here
ranlux_upload_seed(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry, PRNGTab, gid, global_size);
^
:239:8: error: conflicting types for 'ranlux'
float4 ranlux(float4 *s01to04,
^
:22:14: note: previous implicit declaration is here
randomnr = ranlux(&s01to04, &s05to08, &s09to12, &s13to16, &s17to20, &s21to24, &in24, &stepnr, &carry);
^
ERROR (-45): Kernel::Kernel(): Invalid program executable
ipswitch:test innovationspace$
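For what it's worth, the "previous implicit declaration" notes in this log suggest the pasted-in functions ended up below the kernel that calls them, so the OpenCL compiler guessed their types at the call site. Forward declarations at the top of the kernel file should resolve that (parameter lists abbreviated here; copy the full ones from ranlux_unified_1.0.0.cl), or simply paste the file's contents above the kernel rather than below it:

```c
/* Forward declarations so the kernel can call these before their
   definitions appear (signatures abbreviated for illustration). */
void ranlux_download_seed(float4 *s01to04, /* ...remaining parameters... */);
void ranlux_upload_seed(float4 *s01to04, /* ...remaining parameters... */);
float4 ranlux(float4 *s01to04, /* ...remaining parameters... */);
```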