OpenCL

epvbergen
Adept I

OpenCL with SVM extensions on Linux for modern APUs?

Hi,

I'm evaluating OpenCL-accelerated OpenCV on V1807B (Raven Ridge APU) and am wondering what options I have to get SVM support on Linux on APU.

It seems there are multiple approaches:

- a fully open stack: Linux 4.18+ with raven ridge kfd patches, amdgpu, mesa 18.1.6 -> works, but only OpenCL 1.1 (clover), no SVM. No solution.

- the AMDGPU-PRO stack: Linux 4.18+ with raven ridge kfd patches, amdgpu-pro 18.30 -> works, but OpenCL 1.2 without SVM extensions. Slow in OpenCV.

- the official support for V1000: Linux 4.14 with AMDGPU driver 2018.20.818 -> works, but OpenCL 1.2 without SVM extensions. Idem.

- ROCm-based OpenCL -> Raven Ridge not supported in ROCm 1.8, APUs not in roadmap for ROCm, APU support seems to have ended with Carrizo and Kaveri.
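To compare these stacks on the same footing, it helps to record what each one reports for `CL_DEVICE_VERSION` and `CL_DEVICE_SVM_CAPABILITIES` (both are standard `clGetDeviceInfo` queries; the SVM one exists from OpenCL 2.0 headers onward). The following is only a minimal sketch of how the two raw values could be decoded. The helper names and the sample version strings are hypothetical, and passing `0` when the SVM query is unavailable is an assumption of the sketch.

```python
import re

# Bit values of cl_device_svm_capabilities as defined in the OpenCL 2.0 headers.
CL_DEVICE_SVM_COARSE_GRAIN_BUFFER = 1 << 0
CL_DEVICE_SVM_FINE_GRAIN_BUFFER = 1 << 1
CL_DEVICE_SVM_FINE_GRAIN_SYSTEM = 1 << 2
CL_DEVICE_SVM_ATOMICS = 1 << 3

_SVM_FLAGS = [
    (CL_DEVICE_SVM_COARSE_GRAIN_BUFFER, "coarse-grain buffer"),
    (CL_DEVICE_SVM_FINE_GRAIN_BUFFER, "fine-grain buffer"),
    (CL_DEVICE_SVM_FINE_GRAIN_SYSTEM, "fine-grain system"),
    (CL_DEVICE_SVM_ATOMICS, "atomics"),
]


def parse_cl_version(version_string):
    """Extract (major, minor) from a CL_DEVICE_VERSION string.

    The OpenCL spec fixes the prefix as 'OpenCL <major>.<minor> <vendor info>'.
    """
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("not a CL_DEVICE_VERSION string: %r" % version_string)
    return int(match.group(1)), int(match.group(2))


def decode_svm_capabilities(bitfield):
    """Return the names of the SVM capability bits set in the bitfield."""
    return [name for bit, name in _SVM_FLAGS if bitfield & bit]


def summarize(version_string, svm_bitfield):
    """Combine both queries into one record per driver stack."""
    caps = decode_svm_capabilities(svm_bitfield)
    return {
        "version": parse_cl_version(version_string),
        "svm": caps,
        "has_svm": bool(caps),
    }


if __name__ == "__main__":
    # Hypothetical inputs, not captured from real hardware.
    print(summarize("OpenCL 1.1 (hypothetical clover string)", 0))
    print(summarize("OpenCL 2.0 (hypothetical vendor string)", 0b0011))
```

Keeping the decoding separate from the device query makes it easy to paste values from `clinfo` output or from another machine into the same comparison.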

If I understand the situation correctly then:

- support for OpenCL 2.0 on Linux has ended with the 2014 release of Catalyst 15.1, before the compiler in AMDGPU-PRO could offer OpenCL 2.0.

- support for OpenCL 1.2 with the SC compiler ended with AMDGPU-PRO 17.50, before the LLVM compiler offered the same performance and correctness (see the reports from the coin miners).

- support for packed FP16 is not planned anymore, see Disappointing opencl half-precision performance on vega - any advice?

- support for ROCm on APU ended with ROCm 1.6, before gfx902/gfx903 (Raven Ridge) was supported, the first mainstream APU in a long time with the Ryzen 2400G et al.

If I want to make a very depressing general summary for Linux: OpenCL goes from 2.0 to 1.2, to 1.2 with problems. The main attention for SVM support shifts from AMD to Intel. Heterogeneous computing with ROCm goes from APU to dGPU. Packed FP16 is dropped, despite the support in the chips and the boost it can give to DL.

(It's such a pity regarding ROCm, everything seems in place, if only someone would update the closed source libhsa-ext-finalize64.so. I even got vector_copy to work by patching ROCr to fake a gfx900 instead of gfx902 to libhsa-ext-finalize64.so, but alas, ROCm-Tensorflow still crashed half way through).

So what's the plan? I am really enthusiastic about the promise that modern APUs could hold for accelerated DL inference and computer vision, but how do I convince my colleagues to avoid the Jetson and the Movidius to satisfy their appetite for AI-at-the-edge?

(gstoner​?)

1 Solution
gstoner
Staff

There were many reasons AMD sunsetted the Catalyst Linux driver. There was a decision by the VP of Engineering and the Corporate Fellows at the time to move to a common Linux driver core foundation, all based on AMDGPU. That meant a multi-year rebuild of the foundation, since it was missing capabilities. It also meant some features were deprecated or regressed so that the team had time to rebuild it. OpenCL 2.0 was always planned to be made whole on Linux for APU and dGPU. Remember that the Catalyst driver had its challenges with the Linux community and our customers.

On SVM, Catalyst had many shortcomings in its design. It supported a maximum of 4 GB of memory, and on dGPU you sliced the 4 GB by the number of GPUs in the system, which was a bit of an issue as we built GPUs with 16 and 32 GB of memory. One of the things we were working on with ROCm was addressing this issue. It was also an issue with APU + dGPU combos. We had a few more issues to deal with that were architected into the Catalyst driver, such as it only supporting a 4 GB maximum memory allocation; it was not until the last release that they fixed the driver to support larger allocations by chaining multiple 4 GB regions into one larger virtual allocation. There are a number of other architectural challenges that impacted even what you desire on the APU, but I will leave it there.
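The arithmetic behind the 4 GB limit described above can be sketched as follows. This is purely illustrative and is not the Catalyst driver's actual implementation: the function names are hypothetical, and the even split across GPUs is an assumption based on the "slice the 4 GB by the number of GPUs" description.

```python
REGION_BYTES = 4 * 1024**3  # the 4 GB window / maximum allocation described above


def per_gpu_window(num_gpus, total_bytes=REGION_BYTES):
    """Bytes each GPU gets if the 4 GB window is sliced evenly (assumption)."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return total_bytes // num_gpus


def regions_needed(alloc_bytes, region_bytes=REGION_BYTES):
    """Number of 4 GB regions that must be chained to back one allocation."""
    return -(-alloc_bytes // region_bytes)  # ceiling division


def locate(virtual_offset, region_bytes=REGION_BYTES):
    """Map an offset in the chained allocation to (region index, offset in region)."""
    return divmod(virtual_offset, region_bytes)


if __name__ == "__main__":
    gib = 1024**3
    print(per_gpu_window(4) // gib, "GiB per GPU with four GPUs")
    print(regions_needed(16 * gib), "regions for a 16 GiB allocation")
    print(locate(5 * gib))
```

With four GPUs each one is left with a 1 GiB slice, which shows why the design became a problem once single cards shipped with 16 GB or more.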

Over the last two and a half years, we had to make some tough priority calls due to the size of the GPU compute engineering team. We looked at which OpenCL applications in the market also ran on Linux. Also, given that the market on POSIX-based OSes (Mac OS X and Linux, aka NVIDIA) never advanced beyond the common denominator of OpenCL 1.2, we worked to make sure we delivered this at minimum:

 

     - AMDGPUpro is for broad-market support, covering all CPUs with PCIe Gen1, Gen2, Gen3, etc.

          - GFX8 and older never moved to the ROCm-based driver foundation. They also never moved to the new LLVM compiler; they stayed on the LLVM/HSAIL/SC compiler, the same compiler that Catalyst used and that is in our Windows driver.

          - GFX10 aka Vega10 is the only driver that supports the ROCm base driver foundation:

               - With 17.50 it moved to the same compiler as the Windows driver, the LLVM/HSAIL/SC compiler, and

               - With 18.20 OpenCL moved onto PAL (the same foundation as Vulkan), with the LLVM/HSAIL/SC compiler that the Windows driver uses.

     - The ROCm project, whose primary focus was advanced GPU computing languages, HPC and machine/deep learning. Because of this, ROCm uses more advanced platform features such as PCIe atomics to support signals, which is why we need PCIe Gen3 lanes on the CPU PCIe root complex, which is where server-based GPUs are placed.

     - The ROCm AMDGPU LLVM compiler supports OpenCL 2.0 kernels on the OpenCL 1.2 runtime today and will support full OpenCL 2.0 with packed-math Float16 operations.

     - On ROCm we had a strong drive to get to a pure open-source solution, for which the LLVM/HSAIL/SC compiler was a big issue; plus, for this project, it had a number of shortcomings that we were trying to address for the HPC and deep learning market with the new compiler. Assembler support was critical for our library programs. You can see rocBLAS now hits 94% efficiency on Vega10 for large square matrices on SGEMM, and it was critical to get to that performance level with MIOpen on Vega10.
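For readers who want to reproduce a figure like the 94% SGEMM efficiency mentioned above, here is a minimal sketch of the arithmetic. It assumes that "efficiency" means achieved FLOP/s divided by theoretical peak FLOP/s, which the post does not state explicitly. The function names are hypothetical, and the peak value must be supplied by the reader for their own device.

```python
def sgemm_flop_count(m, n, k):
    """Floating-point operations for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds.
    """
    return 2 * m * n * k


def efficiency(m, n, k, seconds, peak_flops):
    """Achieved FLOP/s as a fraction of the device's theoretical peak."""
    if seconds <= 0 or peak_flops <= 0:
        raise ValueError("seconds and peak_flops must be positive")
    achieved = sgemm_flop_count(m, n, k) / seconds
    return achieved / peak_flops


if __name__ == "__main__":
    # Hypothetical timing and peak, chosen only to make the arithmetic obvious.
    print(efficiency(1000, 1000, 1000, seconds=0.5, peak_flops=8e9))
```

The same two functions can be used to sanity-check numbers from any GEMM benchmark, as long as the timing excludes host-to-device transfers.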

As you know, among the community of OpenCL adopters the common denominator is OpenCL 1.2, not OpenCL 2.0. Only AMD and Intel moved forward here, and we do have full support for OpenCL 2.0 on Windows.

Now, HSA/ROCm and APU: due to an early architecture issue, before I ran the team, they used a particular extension in the SBIOS that extends the SRAT with topology info into a table called the CRAT. We had a large number of issues due to OEMs/ODMs not populating this correctly. It is something the Linux and ROCr teams have been revisiting.

I am sorry this impacted your work, but please be patient as we rebuild the core stack to get to the level of capabilities that meets your expectations. We are working to bring OpenCL 2.0 to both AMDGPUpro and ROCm on Linux. Remember that it is more than SVM: we have many users who want Device Enqueue, a feature the team has been working on since it was another feature that did not work well under Catalyst.

Also, the team has been working on Raven support for ROCm; it has just taken a bit longer to get all the foundation we need in place.

A lot of this has taken longer than we wanted, but it is all coming back with a better foundation. A big thing is that the GPU Computing team and the Linux team are now one team under a new VP of Engineering, which should speed all this up. The one thing we should have done better is communicate to the community, earlier, the changes we were making and why.

I will leave you with this: full OpenCL 2.0 support will be released within the next 6 months.

Thanks

Gregory Stoner


6 Replies

Thanks, that was absolutely helpful, and it's great to hear that Raven Ridge is coming in ROCm, as well as full OpenCL 2.0!

As you say, OpenCL 2.0 is more (and more work for you guys) than SVM, and for the short term I would be delighted if I could hack together a solution that gives me:

- OpenCL 1.2 plus SVM so that I can properly evaluate the potential of accelerated OpenCV filters and other image operators on APU

- Enough of ROCm so that I can test ROCmTensorflow or one of the other DL libraries, inference only.

Regarding the last point I'm not sure if we need HIP or MIOpen if we only want to do justice to the potential, since inference of course mainly needs GEMM, no solvers, or do we? OpenCV nowadays has its own DNN inference module on an OpenCL 1.2+SVM backend which may be sufficient, but it would be great if we could compare performance to one of the other libs.
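To make the "inference mainly needs GEMM" point concrete, here is a small pure-Python sketch of the common im2col lowering, where a convolution layer becomes a single matrix multiply. It is only an illustration of the idea, not how MIOpen or OpenCV's DNN module actually implement convolution; all function names are hypothetical, and it computes a "valid" cross-correlation with stride 1 on one channel.

```python
def matmul(a, b):
    """Plain GEMM on nested lists: (m x k) @ (k x n) -> (m x n)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one column.

    Returns a (kh*kw) x (number of patches) matrix.
    """
    h, w = len(image), len(image[0])
    patches = [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]
    return [list(col) for col in zip(*patches)]


def conv2d_as_gemm(image, kernels):
    """Apply several kh x kw kernels to one image with a single GEMM.

    Returns a (number of kernels) x (number of patches) matrix.
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    # One flattened kernel per row: (number of kernels) x (kh*kw).
    weight_rows = [[v for row in k for v in row] for k in kernels]
    return matmul(weight_rows, im2col(image, kh, kw))


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
    print(conv2d_as_gemm(image, kernels))
```

A fully connected layer is already a GEMM (`matmul(weights, activations)`), so under this lowering both layer types reduce to the same primitive.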

It's OK if I need to combine components from different beta releases. But it would be terrible if I have to completely halt this effort for 6 months...?

One more thing, I hope nobody minds, but it seems so incredibly apt to quote part of Arthur C. Clarke's short Sci-Fi story 'Superiority' here. Just read 'computing' instead of warfare.

     'Norden: "What we want are new weapons - weapons totally different from any that have been employed before. Such weapons can be made: it will take time, of course, but since assuming charge I have replaced some of the older scientists with young men and have directed research into several unexplored fields which show great promise. I believe, in fact, that a revolution in warfare may soon be upon us."

We were skeptical. [...] We did not know, then, that he never promised anything that he had not already almost perfected in the laboratory. In the laboratory - that was the operative phrase.

Norden proved his case less than a month later, when he demonstrated the Sphere of Annihilation, which produced complete disintegration of matter over a radius of several hundred meters. We were intoxicated by the power of the new weapon, and were quite prepared to overlook one fundamental defect - the fact that it was a sphere and hence destroyed its rather complicated generating equipment at the instant of formation. This meant, of course, that it could not be used on warships but only on guided missiles, and a great program was started to convert all homing torpedoes to carry the new weapon. For the time being all further offensives were suspended.

We realize now that this was our first mistake. I still think that it was a natural one, for it seemed to us then that all our existing weapons had become obsolete overnight, and we already regarded them as almost primitive survivals. What we did not appreciate was the magnitude of the task we were attempting, and the length of time it would take to get the revolutionary super-weapon into battle. Nothing like this had happened for a hundred years and we had no previous experience to guide us.

The conversion problem proved far more difficult than anticipated. A new class of torpedo had to be designed, as the standard model was too small. This meant in turn that only the larger ships could launch the weapon, but we were prepared to accept this penalty. After six months, the heavy units of the Fleet were being equipped with the Sphere. Training maneuvers and tests had shown that it was operating satisfactorily and we were ready to take it into action. Norden was already being hailed as the architect of victory, and had half promised even more spectacular weapons.

Then two things happened. One of our battleships disappeared completely on a training flight, and an investigation showed that under certain conditions the ship's long-range radar could trigger the Sphere immediately after it had been launched. The modification needed to overcome this defect was trivial, but it caused a delay of another month and was the source of much bad feeling between the naval staff and the scientists. We were ready for action again - when Norden announced that the radius of effectiveness of the Sphere had now been increased by ten, thus multiplying by a thousand the chances of destroying an enemy ship.

So the modifications started all over again, but everyone agreed that the delay would be worth it. [...]'

The story ends, of course, with the superior party losing the war.

(The full text is on-line, it was allegedly required reading at MIT in 1961: Short Story - Superiority - by Arthur C. Clarke)

gstoner
Staff

What you're asking for would not be prudent at this time and would affect the delivery of OpenCL 2.0, since its release is inside the next 6-month or shorter time window. MIOpen also runs with OpenCL; I would expect it is still faster than what you see with OpenCV.


I understand. Thanks, I will take a closer look at MIOpen.

And I won't have to fear that when the new stack comes out, the lowest ISA supported will be GFX10?


Just for the record, I tried the just-released ROCm 1.9, despite its note regarding lack of support for Raven Ridge still being there. I'm delighted to find that with kernel 4.19-rc3, which includes the kfd with support for Raven Ridge, it seems to 'just work', including OpenCL 1.2 with SVM!

The remaining question is now if the OpenCL 1.2 performance (when using SVM on the interface to the host) is representative for what's achievable with the chip, or if I still have to wait with evaluating its performance? Any bottlenecks the team is still working on?
