• Error code -2 (Device not available) when running clCreateContextFromType

    Hello everyone, I'm currently retesting some OpenCL code and I recently ran into a problem with my code. When I try to get the device list on my computer with the C++ wrapper function ... I get an error...
    last modified by fyfy
  • Running OpenCL Work Groups with >256 Elements

    Hi all, I am currently rewriting some OpenCL code of mine and would like to split the work of the group across more waves in order to have more waves in flight. The code is OpenCL 1.2 code (because it needs to ...
    last modified by lolliedieb
  • OpenCL: Delay in inter-kernel execution when requesting callbacks

    Hi, I have a problem with delays in kernel execution when I request callbacks from OpenCL. In my application, I need to execute kernels at a "very" high rate (around 300Hz), and I need a callback to my host applicati...
    last modified by nfogh
  • Kernel runs slower for local workgroup size greater than 64

    Hi bros, I'm a CS undergraduate student and I recently wrote a GPU path tracer using OpenCL. If you don't know what path tracing is, it's basically a method to generate photorealistic images by shooting rays through every...
    last modified by gallickgunner
  • OpenCL: repeat kernel execution?

    I'm queuing kernels that modify a buffer over and over again and am wondering if there's a more efficient way to do what I'm doing.   Here's pseudocode:   for (int q = 0; q < iterations; q++) {  ...
    last modified by ivanisavich
  • Wavefront and kernel occupancy

    I reduced the number of VGPRs from 88 to 84. The number of wavefronts per compute unit increased from 8 to 12. However, I cannot see any performance gain. The VGPR reduction should not slow down the performance of each work it...
    last modified by fancyix
  • S_WAKEUP instruction

    The Vega Shader ISA doc (https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf) describes the S_WAKEUP instruction as follows (I quote) - Allow a wave to 'ping' all the other waves in its t...
    last modified by sp314
  • Processing two buffers using an out of order queue

    I have a PCI data acquisition card that supports P2P. It will be capturing records one after the other at a very rapid rate, and the plan is to write each record to the GPU using DirectGMA, where a kernel will process...
    last modified by andyste1
  • The values returned by clGetDeviceInfo() and clGetPlatformInfo() seem to be just a little off. Why?

    I've got Ubuntu Linux 16.04 with ROCm and AMDGPU-PRO drivers, and an R290x card, which is the only GPU I have on this computer. When I query the device name with clGetDeviceInfo(...CL_DEVICE_NAME...), for some reason,...
    last modified by sp314
  • Why does my VGPR usage increase so fast when I use this assignment statement in OpenCL?

    if (condition) {*foundFlag = 1; dst[gid] = gid * crack_cnt + num; break; } This code is used to end the kernel function when the password is found (2 AMD 7970 devices and OpenCL platform). *foundFlag is a pointer to a char v...
    last modified by yanmin950122
  • Optimizing data transfer with APU (best way to test zero-copy?)

    So finally I have got my APU test system (I paid for it!): -CPU: AMD Ryzen 5 2400G -MB: Asrock X470 Fatality Gaming mini-ITX -RAM: G.Skill 3200 C14, 16GB*2 -OS: Windows 10 Pro -IDE and compiler: Visual Studio 2017 Com...
    last modified by sandbo
  • OpenCL memory transfer / zero copy buffers on embedded GPUs

    Hi,   I am trying to understand the mechanics of OpenCL memory access and transfers (in particular on AMD Ryzen V1000 embedded systems coming with Zen cores and an embedded Vega GPU), with the motivation of want...
    last modified by exilef
  • OpenCL amdgpu-pro generated code performance - please convert 'select' to cndmask

    Hi, I don't know if this is the best place to report OpenCL compiler performance issues, but I didn't find a better one. SUMMARY: Please AMD devs, when an OpenCL dev takes the time to expl...
    last modified by mannerov
  • CL-GL Interop fastest way to synchronize?

    We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the fol...
    last modified by george72
  • Are uint2 operations faster than ulong in OpenCL on AMD GCN cards?

    Which of these "+" calculations is faster? 1) uint2 a, b, c; c = a + b; 2) ulong a, b, c; c = a + b; Specifically on RX 580 or Vega cards.
    last modified by fancyix
  • Any instruction level or line-by-line profiler?

    It would be very helpful if we could analyze the cost of each instruction or each OpenCL line. Either the ROCm or the AMDGPU driver is fine. Thanks in advance.
    last modified by fancyix
  • How can I know which GPU is used by the OS?

    If I have multiple GPUs in my system, how can I know which GPU is used by the OS? I am using OpenCL for computing, and I don't want to use this GPU for GPGPU. Thanks in advance.
    last modified by tdchen
  • A problem to solve with OpenCL and DirectGMA...

    I've been tasked with solving a problem that feels like it might be a good fit for a GPU, although I could be wrong...   We have a data acquisition card that generates nearly 8Gb/sec, typically in the form of a ...
    last modified by andyste1
  • Store array in regs?

    If I declare an array like uint[128], the driver will spill it even if there are enough registers to store the array. Is there any way to make the compiler keep a big array in registers? Maybe some compile option?
    last modified by fancyix
  • How to use pinned memory for reading from GPU?

    I'm struggling to find examples of using pinned memory, especially when it comes to reading data from the GPU. Assuming my kernel has an 'int*' argument (containing the "results" to be read back by the host), would th...
    last modified by andyste1