• Report on work group/work item utilisation

    If I call clEnqueueNDRangeKernel(...) with a local size of NULL, is there any way to find out how the hardware has decided to utilise the work groups, i.e. how many work items (kernel instances) are running in each gr...
    last modified by andyste1
  • OpenCL development documentation on AMD GPUs

    Is there a publicly available list of all AMD GPUs supporting OpenCL which includes: product name ('AMD Radeon RX Vega 64') internal name ('gfx900', can be obtained as CL_DEVICE_NAME) architecture ('GCN gen 5') ar...
    last modified by timchist
  • clGetDeviceIDs suddenly very slow

    We are currently developing an OpenCL application on Windows 10 (Visual Studio 2017) but have noticed that the OpenCL performance has recently degraded, with the call to clGetDeviceIDs now taking around 10 second...
    last modified by andyste1
  • OpenCL occupancy-performance nightmare

    These days I tried to squeeze some performance from a memory-intensive OCL kernel and went for GCN assembly. Saved a few registers here, few instructions there, got a nice occupancy and thought to have a perfect kerne...
    last modified by kbala
  • clEnqueueAcquireD3D11ObjectsKHR blocks for a long time

    In my application, I have a processing thread that enqueues an OpenCL kernel that writes to an ID3D11Texture2D object. Everything works fine in terms of correctness. I can successfully acquire the shared O...
    last modified by elad
  • Need tips to hide memory latency - 20x speed-loss when writing to memory

    I have an OpenCL code that does Monte Carlo photon transport simulations in a voxelated space (https://github.com/fangq/mcxcl). The code involves simulating a large number of random photon trajectories, each in a thre...
    last modified by FangQ
  • OpenCL: Delay in inter-kernel execution when requesting callbacks

    Hi, I have a problem with delays in kernel execution when I request callbacks from OpenCL. In my application, I need to execute kernels at a "very" high rate (around 300Hz), and I need a callback to my host applicati...
    last modified by nfogh
  • Kernel runs slower for local workgroup size greater than 64

    Hi bros, I'm a CS undergraduate student and I recently wrote a GPU path tracer using OpenCL. If you don't know what path tracing is, it's basically a method to generate photorealistic images by shooting rays through every...
    last modified by gallickgunner
  • Wavefront and kernel occupancy

    I reduced the number of VGPRs from 88 to 84. The number of wavefronts per compute unit increased from 8 to 12. However, I cannot see any performance gain. The VGPR reduction should not slow down the performance of each work it...
    last modified by fancyix
  • Why does my VGPR usage increase so fast when I use this assignment statement in OpenCL?

    if (condition) { *foundFlag = 1; dst[gid] = gid * crack_cnt + num; break; } This code is used to end the kernel function when the password is found (2 AMD 7970 devices and OpenCL platform). *foundFlag is a pointer to a char v...
    last modified by yanmin950122
  • Optimizing data transfer with APU (best way to test zero-copy?)

    So finally I have got my APU test system (I paid for it!): -CPU: AMD Ryzen 5 2400G -MB: Asrock X470 Fatality Gaming mini-ITX -RAM: G.Skill 3200 C14, 16GB*2 -OS: Windows 10 Pro -IDE and compiler: Visual Studio 2017 Com...
    last modified by sandbo
  • line-by-line profiling

    I am wondering if there is a profiler for OpenCL on the AMD devices that supports line-by-line profiling? For CUDA, nvprof already has the PC sampling profiling option that gives per-line run time info; for OpenCL, ri...
    last modified by FangQ
  • OpenCL amdgpu-pro generated code performance - please convert 'select' to cndmask

    Hi, I don't know if this place is the best place to report OpenCL compiler performance issues, but well, I didn't find a better place. SUMMARY: Please AMD devs, when an OpenCL dev takes the time to expl...
    last modified by mannerov
  • CL-GL Interop fastest way to synchronize?

    We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the fol...
    last modified by george72
  • Memory bandwidth anomaly

    Hi, Recently I did some tests about GDDR5 memory bandwidth. In a few words: It launches numberOfCUes*4 waveFronts. Each WF is reading 1024 bytes from a random aligned location in a large buffer. The whole kernel ...
    last modified by realhet
  • How to tune the performance of ROCm(llvm) compiler?

    I modified LLVM (roc-1.6.x) a bit to generate code that can run on the AMDGPU-PRO driver. It runs, but the performance is over 10% slower than AMDGPU's online compiler for the same OpenCL code. I wonder if ther...
    last modified by fancyix
  • Disappointing opencl half-precision performance on vega - any advice?

    I bought a Vega 64 recently. From the specs, it has 23 TFLOPS fp16 throughput compared to 12 TFLOPS fp32, so I converted a portion of my Monte Carlo code to half, expecting to gain some noticeable speedup. Disappointing...
    last modified by FangQ
  • AMD FirePro S9100 Performance test

    Hi, we want to measure the performance and power consumption of the S9100 GPU card; the spec says it should reach 2.11 TFLOPS peak double-precision floating point and 4.22 TFLOPS peak single-precision floating point...
    last modified by vites
  • AMDGPU OpenCL Weird Results

    Hello AMD OpenCL Gurus. I am facing a problem when building and running an opencl example. Here are details of my setup: a.) I installed amdgpu-pro-install --opencl=legacy --headless b.) I get the output from clinfo...
    last modified by skn1975
  • OpenCL issues on EPYC 7551

    Hi All, I am seeing extremely poor performance using OpenCL on an EPYC 7551. I can't imagine this is intended, as it is far slower in our testing than my old E5-2695 v3 machines. See the clinfo below. ...
    last modified by jameskap