Wrapping OpenCL host code in C++ class problem

Does all the OpenCL host code need to be within the same scope? I am having trouble with host code wrapped in a C++ class that initializes all buffers and builds the kernels in its constructor, and that sets the kernel arguments and enqueues each kernel in a separate member function.

I create one instance of this class outside a loop and, inside the loop, call the member functions that enqueue the kernels. I do this to avoid duplication and to hide all the ugly host code from the client.

However, when I do this I no longer get the correct answer, and I get totally different answers on a 9800GT (close to the correct answer each time) and a GTX460 (random, totally wrong answers each time). (Sorry for using nVidia. I ask here since they don't seem so dedicated to OpenCL, and their forum is down...)

Before, I had all the host code inside one single function, which works and gives the same answer on both cards. But I want to clean it up a bit. I also use QtOpenCL, but that should not matter since it is just a fancy wrapper.

10 Replies

See A.2 in the OpenCL specification, version 1.1.

You may have multiple objects, each of which sets its own kernel arguments. If all the objects happen to share the same kernel object, and each one sets arguments and enqueues independently, then each will overwrite the arguments the others set for the kernel and you will get chaos. That's because the kernel object (in the OpenCL runtime) is common to all of your objects that are setting arguments, and that kernel object is not thread-safe.

To get round this, simply make each of your own objects create a private kernel object with clCreateKernel. Each of your objects will create the kernel object from the same program object and kernel name string. But they will now act independently.

Notes 71 and 72 on page 363 might be preferable to my explanation!

Another way around this problem is to have the method that sets the arguments also enqueue execution of the kernel.
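The two workarounds above can be illustrated without any OpenCL calls. This is only a rough model: MockKernel and Stage are hypothetical names invented for this sketch, and MockKernel merely imitates the fact that arguments are stored on the kernel object and used by whatever launch comes next. With the real API, the private-kernel variant would correspond to one clCreateKernel call per host object.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a cl_kernel: arguments are state stored on the
// kernel object, and a launch uses whatever values are stored at that moment.
struct MockKernel {
    int arg0 = 0;
    void setArg(int value) { arg0 = value; }
    int enqueue() const { return arg0; }  // the value the launch would see
};

// Hypothetical host-side wrapper that sets arguments and enqueues separately.
class Stage {
public:
    Stage(std::shared_ptr<MockKernel> kernel, int value)
        : kernel_(std::move(kernel)), value_(value) {}

    void setArgs() { kernel_->setArg(value_); }
    int enqueue() const { return kernel_->enqueue(); }

    // Second workaround: set the arguments and enqueue in one step.
    int run() { setArgs(); return enqueue(); }

private:
    std::shared_ptr<MockKernel> kernel_;
    int value_;
};

// Two stages share one kernel object: the second setArgs() overwrites the first.
int sharedKernelResult() {
    auto shared = std::make_shared<MockKernel>();
    Stage a(shared, 1), b(shared, 2);
    a.setArgs();
    b.setArgs();
    return a.enqueue();  // sees b's argument
}

// First workaround: each stage owns a private kernel object.
int privateKernelResult() {
    Stage a(std::make_shared<MockKernel>(), 1);
    Stage b(std::make_shared<MockKernel>(), 2);
    a.setArgs();
    b.setArgs();
    return a.enqueue();  // sees a's own argument
}

// Shared kernel object, but arguments are set immediately before the enqueue.
int setAndEnqueueResult() {
    auto shared = std::make_shared<MockKernel>();
    Stage a(shared, 1), b(shared, 2);
    b.setArgs();
    return a.run();  // a's argument replaces b's just before the launch
}
```

In this model the set-and-enqueue variant is only safe when everything runs in a single thread; if several threads touch the same kernel object, private kernel objects are the more robust choice.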



Please post some code and your system details (CPU, GPU, SDK, Catalyst, OS).

Most of the SDK samples are written using classes. That might be helpful to you.



I am using nVidia cards at the moment, a 9800GT and a GTX460, but this forum is better, I think. I'm running on CentOS 5 with OpenCL 1.0, and I am also using QtOpenCL. Everything runs in the same thread, but I want to run on multiple GPUs as well, so I guess I will have to use different threads. (It all works if I have all the host code inside "runSimulation" in one huge function.)


Here is some code:



runSimulation()
{
    AbsPermHost absPermHost(absPermVariables_, grid_, device_);
    Timer timer;

    // Main loop
    while (...) {
        for (0...100) {
            QCLEvent collideEvent = absPermHost.collideAndSwap();
            collideEvent.waitForFinished();
            QCLEvent streamEvent = absPermHost.streamBySwapping();
            streamEvent.waitForFinished();
            ++nIterations;
        }
        (...check convergence...)
    } // End main loop

    (...print results...)
}

class AbsPermHost
{
public:
    /**
     * Constructor
     */
    AbsPermHost(const AbsPermVariables& absPermVariables, const Grid& grid, QCLDevice& device);

    QCLEvent collideAndSwap();
    QCLEvent streamBySwapping();
    (...)

private:
    void createKernels();
    (...)

    // OpenCL objects
    QCLContext context_;
    QCLDevice device_;
    QCLKernel collideAndSwapKernel_;
    QCLKernel streamBySwappingKernel_;
    QCLKernel computeAverageVelocitiesKernel_;

    // Work sizes
    QCLWorkSize simulationSize_;
    QCLWorkSize nCollisionThreads_;
    QCLWorkSize localSizeCollision_;
    QCLWorkSize localSizeStream_;
    unsigned int localSizeXCompAverage_;

    // Kernel data
    Lattices latticesCl_;
    QCLBuffer bodyForcesCl_;
    QCLVector<int> obstaclesCl_;

    // Shared memory sizes
    unsigned int sharedMemoryByteSizeCollision_;
    unsigned int sharedMemoryByteSizeCompAvg_;
};

AbsPermHost::AbsPermHost(const AbsPermVariables& absPermVariables, const Grid& grid, QCLDevice& device)
    : absPermVariables_(absPermVariables),
      nInlets_(AbsPermCL::computeInlets(grid)),
      nPoreCells_(grid.getNPoreCells()),
      device_(device)
{
    // Create a context
    QList<QCLDevice> devices;
    devices.push_back(device_);
    context_.create(devices);

    createKernels()
    (...init buffers and stuff...)
}

QCLEvent AbsPermHost::collideAndSwap()
{
    collideAndSwapKernel_.setGlobalWorkSize(nCollisionThreads_);
    collideAndSwapKernel_.setLocalWorkSize(localSizeCollision_);

    // Collide and swap on GPU
    collideAndSwapKernel_.setArg(0, latticesCl_.f0);
    collideAndSwapKernel_.setArg(1, latticesCl_.f1);
    collideAndSwapKernel_.setArg(2, latticesCl_.f2);
    collideAndSwapKernel_.setArg(3, latticesCl_.f3);
    collideAndSwapKernel_.setArg(4, latticesCl_.f4);
    collideAndSwapKernel_.setArg(5, latticesCl_.f5);
    collideAndSwapKernel_.setArg(6, latticesCl_.f6);
    collideAndSwapKernel_.setArg(7, latticesCl_.f7);
    collideAndSwapKernel_.setArg(8, latticesCl_.f8);
    collideAndSwapKernel_.setArg(9, latticesCl_.f9);
    collideAndSwapKernel_.setArg(10, latticesCl_.f10);
    collideAndSwapKernel_.setArg(11, latticesCl_.f11);
    collideAndSwapKernel_.setArg(12, latticesCl_.f12);
    collideAndSwapKernel_.setArg(13, latticesCl_.f13);
    collideAndSwapKernel_.setArg(14, latticesCl_.f14);
    collideAndSwapKernel_.setArg(15, latticesCl_.f15);
    collideAndSwapKernel_.setArg(16, latticesCl_.f16);
    collideAndSwapKernel_.setArg(17, latticesCl_.f17);
    collideAndSwapKernel_.setArg(18, latticesCl_.f18);
    collideAndSwapKernel_.setArg(19, bodyForcesCl_);
    collideAndSwapKernel_.setArg(20, 0, sharedMemoryByteSizeCollision_);
    collideAndSwapKernel_.setArg(21, absPermVariables_.getOmega());

    return;
}


The obvious question, then: what happens if you remove the loop, so that the simulation runs for a single step? Is that result correct?



Correct answer: 1739.24

9800gt:  1425

gtx460:  0


I don't get it...




Well, there's nothing unusual about what you're doing if you were using OpenCL's API directly or the C++ bindings. Maybe you should try the QtOpenCL forum. I can't think of anything else, but I can't see how this is an OpenCL problem.


Thanks for the help. I'll try with the C++ API instead.


I think I found the problem:

QCLBuffer bodyForcesCl_ =
 context_.createBufferHost(&(bodyForcesVector[0]), N_VECTORS * sizeof(float),

The bodyForcesVector went out of scope in the class constructor, since I thought I didn't need it anymore. Then I guess the two machines with the different cards picked up random stuff from that address when the kernels were called with bodyForcesCl_.
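For anyone hitting the same thing, here is a minimal sketch of one way to avoid it, assuming (as the symptoms suggest) that createBufferHost wraps the host memory rather than copying it. HostBackedBuffer and BodyForceHost are hypothetical stand-ins written for this sketch, not QtOpenCL types. The idea is to make the host vector a class member so that it lives as long as the buffer that points into it.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a buffer that wraps host memory without copying
// it (the behaviour assumed here for createBufferHost).
struct HostBackedBuffer {
    const float* data;
    std::size_t count;
};

class BodyForceHost {
public:
    explicit BodyForceHost(std::size_t n)
        : bodyForces_(n, 0.5f),  // a member, so it outlives the constructor
          buffer_{bodyForces_.data(), bodyForces_.size()} {}

    // Non-copyable: a copy's buffer_ would still point at the original's storage.
    BodyForceHost(const BodyForceHost&) = delete;
    BodyForceHost& operator=(const BodyForceHost&) = delete;

    // Stand-in for a kernel reading the buffer long after construction.
    float sum() const {
        return std::accumulate(buffer_.data, buffer_.data + buffer_.count, 0.0f);
    }

private:
    std::vector<float> bodyForces_;  // declared before buffer_, so initialized first
    HostBackedBuffer buffer_;        // points into bodyForces_
};
```

The other option is to have the runtime copy the data when the buffer is created (CL_MEM_COPY_HOST_PTR in the plain OpenCL API; QtOpenCL appears to offer createBufferCopy for this), in which case the host vector can safely go out of scope.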


Glad it worked out. Dealing with your own bugs is much easier than someone else's.

