• OpenCL occupancy-performance nightmare

    Recently I tried to squeeze some performance out of a memory-intensive OpenCL kernel and went for GCN assembly. I saved a few registers here, a few instructions there, got nice occupancy, and thought I had a perfect kerne...
    kbala
    last modified by kbala
  • How to abort clEnqueueWaitSignalAmd?

    We're developing software that uses a PCI data acquisition card to read blocks of data (records) from an external instrument. These records are transferred to a Radeon Pro WX7100 using "DirectGMA", where a kernel proc...
    andyste1
    last modified by andyste1
  • clGetDeviceIDs suddenly very slow

    We are currently developing an OpenCL application on Windows 10 (Visual Studio 2017) but have noticed that the OpenCL performance has recently degraded, with the call to clGetDeviceIDs now taking around 10 second...
    andyste1
    last modified by andyste1
  • Optimize LC0 - Leela Chess Zero - for AMD GPUs

    Heyho AMD community, we are all aware of the neural network hype on GPUs, and most have noticed that Nvidia simply has the upper hand with its cuDNN framework. Personally, I am convinced that AMD mak...
    smato2018
    created by smato2018
  • Radeon VII and FFT

    Hello, is there by any chance a recommended OpenCL FFT package for the Radeon VII? clFFT was written for previous generations of cards.
    dns.on.gpu
    last modified by dns.on.gpu
  • GPUs: pick-n-mix

    Hello. Is it possible to use OpenCL with two or more different GPUs under Linux? I am interested in mixing two Radeon VIIs with two 280Xs and even one or two 7950s.
    dns.on.gpu
    last modified by dns.on.gpu
  • What's the best or the recommended way to copy the data from scalar registers to GDS?

    Perhaps there's something that I'm not seeing in the docs, so I apologize in advance. I've got 16 dwords in scalar registers s16-s31. I need to copy that data from the scalar registers to GDS at the GDS base ...
    sp314
    last modified by sp314
  • Getting stuck in a loop: is a local variable not visible to other work-items in a work-group?

    This is my kernel code: __kernel void test(__global int *input_vector, __global atomic_int *mem_flag) { local int d[32]; if (get_local_id(0) == 0) { ...
    avinashkrc
    last modified by avinashkrc
  • clEnqueueAcquireD3D11ObjectsKHR blocks for a long time

    In my application, I have a processing thread that enqueues an OpenCL kernel that writes to an ID3D11Texture2D object. Everything works fine in terms of correctness. I can successfully acquire the shared O...
    elad
    last modified by elad
  • I am trying to test out how well atomicity performs on an APU, but my sample program hangs the system

    I am trying to test out how well atomicity performs on an APU. But my sample program does not update the variable properly, so the whole system hangs, as I check for the updated value on either side (CPU and GPU) in while ...
    avinashkrc
    last modified by avinashkrc
  • OpenCL 64-bit atomics on Vega 8 integrated graphics on Win10?

    I am working to compile an OpenCL program which needs 64-bit atomics (atomic_xchg and atomic_add, with the long datatype). I have added "#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable" and the ...
    glupescu
    last modified by glupescu
  • Need tips to hide memory latency - 20x speed-loss when writing to memory

    I have an OpenCL code that does Monte Carlo photon transport simulations in a voxelated space (https://github.com/fangq/mcxcl). The code involves simulating a large number of random photon trajectories, each in a thre...
    FangQ
    last modified by FangQ
  • host-device latencies?

    I recently ran some benchmarks and wonder whether my host-device latencies are due to my older hardware or are similar on newer systems. OS: Ubuntu 18.04 x86-64 Device: AMD Radeon HD 7750 OpenCL gpu kerne...
    smato2018
    last modified by smato2018
  • Error code -2 (Device not available) when running clCreateContextFromType

    Hello everyone, I'm currently retesting some OpenCL code and recently ran into a problem with it. When I try to get the device list on my computer with the C++ wrapper function ... I get an error...
    fyfy
    last modified by fyfy
  • Running OpenCL Work Groups with >256 Elements

    Hi all, I am currently rewriting some OpenCL code of mine and would like to split the work of the group across more waves in order to have more waves in flight. The code is OpenCL 1.2 code (because it needs to ...
    lolliedieb
    last modified by lolliedieb
  • OpenCL: Delay in inter-kernel execution when requesting callbacks

    Hi, I have a problem with delays in kernel execution when I request callbacks from OpenCL. In my application, I need to execute kernels at a "very" high rate (around 300 Hz), and I need a callback to my host applicati...
    nfogh
    last modified by nfogh
  • Kernel runs slower for local workgroup size greater than 64

    Hi bros, I'm a CS undergraduate student and I recently wrote a GPU path tracer using OpenCL. If you don't know what path tracing is, it's basically a method to generate photorealistic images by shooting rays through every...
    gallickgunner
    last modified by gallickgunner
  • OpenCL: repeat kernel execution?

    I'm queuing kernels that modify a buffer over and over again and am wondering if there's a more efficient way to do what I'm doing. Here's pseudocode: for (int q = 0; q < iterations; q++) { ...
    ivanisavich
    last modified by ivanisavich
  • Wavefront and kernel occupancy

    I reduced the number of VGPRs from 88 to 84. The number of wavefronts per compute unit increased from 8 to 12. However, I cannot see any performance gain. The VGPR reduction should not slow down the performance of each work it...
    fancyix
    last modified by fancyix
  • S_WAKEUP instruction

    The Vega Shader ISA doc (https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf) describes the S_WAKEUP instruction as follows (I quote): Allow a wave to 'ping' all the other waves in its t...
    sp314
    last modified by sp314
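
The figures quoted in the "Wavefront and kernel occupancy" thread above (88 VGPRs giving 8 wavefronts per compute unit, 84 VGPRs giving 12) can be checked with a little arithmetic. The sketch below is only an illustration and is not taken from the thread. It assumes GCN-generation constants: 256 VGPRs per SIMD lane, 4 SIMDs per compute unit, at most 10 wavefronts per SIMD, and VGPRs allocated in blocks of 4. The function name `waves_per_cu` is hypothetical, and other limits (SGPRs, LDS, work-group size) are deliberately ignored.

```python
# Rough VGPR-limited occupancy estimate for one GCN compute unit (CU).
# All constants are assumptions about GCN-generation hardware, not values
# stated in the forum thread.
VGPRS_PER_SIMD = 256       # vector registers available to each lane of a SIMD
SIMDS_PER_CU = 4           # SIMD units per compute unit
MAX_WAVES_PER_SIMD = 10    # hardware cap on resident wavefronts per SIMD
VGPR_GRANULARITY = 4       # registers are allocated in blocks of this size


def waves_per_cu(vgprs_per_workitem: int) -> int:
    """Return how many wavefronts fit on one CU, considering VGPRs only."""
    # Round the request up to the allocation granularity.
    blocks = -(-vgprs_per_workitem // VGPR_GRANULARITY)
    allocated = blocks * VGPR_GRANULARITY
    # Each resident wavefront needs its own slice of the register file.
    per_simd = min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)
    return per_simd * SIMDS_PER_CU


if __name__ == "__main__":
    for vgprs in (88, 84):
        print(f"{vgprs} VGPRs -> {waves_per_cu(vgprs)} wavefronts per CU")
```

Under these assumptions the step from 88 to 84 registers crosses the boundary between two and three wavefronts per SIMD, which matches the jump from 8 to 12 reported in the thread. Higher occupancy only helps when the kernel has latency left to hide, so an unchanged run time is not contradictory by itself.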