  • OpenCL occupancy-performance nightmare

    These days I tried to squeeze some performance from a memory-intensive OCL kernel and went for GCN assembly. Saved a few registers here, few instructions there, got a nice occupancy and thought to have a perfect kerne...
    kbala
    last modified by kbala
  • clGetDeviceIDs suddenly very slow

    We are currently developing an OpenCL application on Windows 10 (Visual Studio 2017) but have noticed that the OpenCL performance has recently degraded, with the call to clGetDeviceIDs now taking around 10 second...
    andyste1
    last modified by andyste1
  • clEnqueueAcquireD3D11ObjectsKHR blocks for a long time

    In my application, I have a processing thread that enqueues an OpenCL kernel that writes to an ID3D11Texture2D object. Everything works fine in terms of correctness. I can successfully acquire the shared O...
    elad
    last modified by elad
  • Need tips to hide memory latency - 20x speed-loss when writing to memory

    I have an OpenCL code that does Monte Carlo photon transport simulations in a voxelated space (https://github.com/fangq/mcxcl). The code involves simulating a large number of random photon trajectories, each in a thre...
    FangQ
    last modified by FangQ
  • OpenCL: Delay in inter-kernel execution when requesting callbacks

    Hi I have a problem with delays in kernel execution when I request callbacks from OpenCL. In my application, I need to execute kernels at a "very" high rate (around 300Hz), and I need a callback to my host applicati...
    nfogh
    last modified by nfogh
  • Kernel runs slower for local workgroup size greater than 64

    Hi bros, I'm a CS undergraduate student and I recently wrote a GPU path tracer using OpenCL. If you don't know what path tracing is, it's basically a method to generate photorealistic images by shooting rays through every...
    gallickgunner
    last modified by gallickgunner
  • Wavefront and kernel occupancy

    I reduced the number of VGPRs from 88 to 84. The number of wavefronts per compute unit increased from 8 to 12. However, I cannot see any performance gain. The VGPR reduction should not slow down the performance of each work it...
    fancyix
    last modified by fancyix
  • Why my VGPRs Usage increases so fast when I use this assignment statement code in OpenCL?

    if (condition) {*foundFlag = 1; dst[gid] = gid * crack_cnt + num; break; } This code is used to end the kernel function when the password is found (2 AMD 7970 devices and OpenCL platform). *foundFlag is a pointer to a char v...
    yanmin950122
    last modified by yanmin950122
  • Optimizing data transfer with APU (best way to test zero-copy?)

    So finally I have got my APU test system (I paid for it!): -CPU: AMD Ryzen 5 2400G -MB: Asrock X470 Fatality Gaming mini-ITX -RAM: G.Skill 3200 C14, 16GB*2 -OS: Windows 10 Pro -IDE and compiler: Visual Studio 2017 Com...
    sandbo
    last modified by sandbo
  • line-by-line profiling

    I am wondering if there is a profiler for OpenCL on the AMD devices that supports line-by-line profiling? For CUDA, nvprof already has the PC sampling profiling option that gives per-line run time info; for OpenCL, ri...
    FangQ
    last modified by FangQ
  • OpenCL amdgpu-pro generated code performance - please convert 'select' to cndmask

    Hi, I don't know if this place is the best place to report OpenCL compiler performance issues, but well I didn't find a better place. SUMMARY: Please AMD devs, when an OpenCL dev takes the time to expl...
    mannerov
    last modified by mannerov
  • CL-GL Interop fastest way to synchronize?

    We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the fol...
    george72
    last modified by george72
  • Memory bandwidth anomaly

    Hi, Recently I did some tests about GDDR5 memory bandwidth. In a few words: It launches numberOfCUes*4 waveFronts. Each WF is reading 1024 bytes from a random aligned location in a large buffer. The whole kernel ...
    realhet
    last modified by realhet
  • How to tune the performance of ROCm(llvm) compiler?

    I modified LLVM (roc-1.6.x) a bit to generate code that can run on the AMDGPU-PRO driver. It runs, but the performance is over 10% slower than AMDGPU's online compiler for the same OpenCL code. I wonder if ther...
    fancyix
    last modified by fancyix
  • Disappointing opencl half-precision performance on vega - any advice?

    I bought a Vega 64 recently. From the specs, it has 23 TFLOPS fp16 throughput compared to 12 TFLOPS fp32, so I converted a portion of my Monte Carlo code to half, expecting to gain some noticeable speed up. Disappointing...
    FangQ
    last modified by FangQ
  • AMD FirePro S9100 Performance test

    Hi, we want to measure the performance and power consumption of the S9100 GPU card. The spec says it should reach 2.11 TFLOPS peak double-precision floating point and 4.22 TFLOPS peak single-precision floating point...
    vites
    last modified by vites
  • AMDGPU OpenCL Weird Results

    Hello AMD OpenCL Gurus. I am facing a problem when building and running an opencl example. Here are details of my setup: a.) I installed amdgpu-pro-install --opencl=legacy --headless b.) I get the output from clinfo...
    skn1975
    last modified by skn1975
  • OpenCL issues on EPYC 7551

    Hi All, I am seeing extremely poor performance using OpenCL on an EPYC 7551. I can't imagine this is intended as it is far slower in our testing than my old E5-2695 v3 machines. See the clinfo below. ...
    jameskap
    last modified by jameskap
  • Benchmarking float64 matrix multiplication performance

    My primary interest in GPUs is for "scientific computing", or more precisely speaking, float64 general matrix multiplications, also known as DGEMM. This is the speed determining factor in my applications - if DGEMM ru...
    drnil
    created by drnil
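
The occupancy figures quoted in the "Wavefront and kernel occupancy" excerpt (88 VGPRs giving 8 wavefronts per compute unit, 84 VGPRs giving 12) can be reproduced with a little arithmetic. The sketch below is only an illustration, not something taken from the thread: it assumes a GCN-style compute unit with 4 SIMDs, 256 VGPRs available per lane on each SIMD, a cap of 10 wavefronts per SIMD, and VGPR allocation in multiples of 4. The function name `waves_per_cu` is hypothetical, and the model ignores SGPR and LDS limits, which can lower occupancy further.

```python
# Rough VGPR-limited occupancy estimate for a GCN-style compute unit.
# All hardware constants below are assumptions for illustration; they are
# not stated in the thread excerpt.
SIMDS_PER_CU = 4            # assumed: SIMD units per compute unit
VGPRS_PER_SIMD_LANE = 256   # assumed: VGPRs available to each lane of a SIMD
MAX_WAVES_PER_SIMD = 10     # assumed: hardware cap on resident wavefronts
VGPR_GRANULARITY = 4        # assumed: VGPRs are allocated in blocks of 4


def waves_per_cu(vgprs_per_work_item: int) -> int:
    """Return the VGPR-limited number of wavefronts per compute unit."""
    # Round the request up to the allocation granularity.
    blocks = -(-vgprs_per_work_item // VGPR_GRANULARITY)
    allocated = blocks * VGPR_GRANULARITY
    # Each resident wavefront needs `allocated` registers per lane.
    per_simd = min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // allocated)
    return per_simd * SIMDS_PER_CU


if __name__ == "__main__":
    for vgprs in (88, 84):
        print(f"{vgprs} VGPRs -> {waves_per_cu(vgprs)} wavefronts per CU")
```

Under these assumptions, dropping from 88 to 84 VGPRs crosses the boundary from 2 to 3 wavefronts per SIMD, which matches the 8 to 12 jump described in the excerpt. Higher occupancy only helps when the kernel has latency left to hide, so a memory-bandwidth-bound kernel may show no gain.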