8 Replies Latest reply on Aug 23, 2017 7:03 AM by dipak

    OpenCL issues on EPYC 7551

    jameskap

      Hi All,

       

      I am seeing extremely poor performance using OpenCL on an EPYC 7551. I can't imagine this is intended as it is far slower in our testing than my old E5-2695 v3 machines.

       

      See the clinfo below. The max work item size looks incredibly low. Likewise the max clock speed is 1200MHz.  I am running Ubuntu 17.04 with AMD-APP-SDK-v3.0.130.136-GA...

       

       

       

      Number of platforms                               1

        Platform Name                                   AMD Accelerated Parallel Processing

        Platform Vendor                                 Advanced Micro Devices, Inc.

        Platform Version                                OpenCL 2.0 AMD-APP (1800.8)

        Platform Profile                                FULL_PROFILE

        Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

        Platform Extensions function suffix             AMD

       

       

        Platform Name                                   AMD Accelerated Parallel Processing

      Number of devices                                 1

        Device Name                                     AMD EPYC 7551 32-Core Processor

        Device Vendor                                   AuthenticAMD

        Device Vendor ID                                0x1002

        Device Version                                  OpenCL 1.2 AMD-APP (1800.8)

        Driver Version                                  1800.8 (sse2,avx)

        Device OpenCL C Version                         OpenCL C 1.2

        Device Type                                     CPU

        Device Profile                                  FULL_PROFILE

        Device Board Name (AMD)

        Device Topology (AMD)                           (n/a)

        Max compute units                               128

        Max clock frequency                             1200MHz

        Device Partition                                (core, cl_ext_device_fission)

          Max number of sub-devices                     128

          Supported partition types                     equally, by counts, by affinity domain

          Supported affinity domains                    L3 cache, L2 cache, L1 cache, next partitionable

          Supported partition types (ext)               equally, by counts, by affinity domain

          Supported affinity domains (ext)              L3 cache, L2 cache, L1 cache, next fissionable

        Max work item dimensions                        3

        Max work item sizes                             1024x1024x1024

        Max work group size                             1024

        Preferred work group size multiple              1

        Preferred / native vector sizes

          char                                                16 / 16

          short                                                8 / 8

          int                                                  4 / 4

          long                                                 2 / 2

          half                                                 4 / 4        (n/a)

          float                                                8 / 8

          double                                               4 / 4        (cl_khr_fp64)

        Half-precision Floating-point support           (n/a)

        Single-precision Floating-point support         (core)

          Denormals                                     Yes

          Infinity and NANs                             Yes

          Round to nearest                              Yes

          Round to zero                                 Yes

          Round to infinity                             Yes

          IEEE754-2008 fused multiply-add               Yes

          Support is emulated in software               No

          Correctly-rounded divide and sqrt operations  Yes

        Double-precision Floating-point support         (cl_khr_fp64)

          Denormals                                     Yes

          Infinity and NANs                             Yes

          Round to nearest                              Yes

          Round to zero                                 Yes

          Round to infinity                             Yes

          IEEE754-2008 fused multiply-add               Yes

          Support is emulated in software               No

          Correctly-rounded divide and sqrt operations  No

        Address bits                                    64, Little-Endian

        Global memory size                              270432862208 (251.9GiB)

        Error Correction support                        No

        Max memory allocation                           67608215552 (62.97GiB)

        Unified memory for Host and Device              Yes

        Minimum alignment for any data type             128 bytes

        Alignment of base address                       1024 bits (128 bytes)

        Global Memory cache type                        Read/Write

        Global Memory cache size                        32768

        Global Memory cache line                        64 bytes

        Image support                                   Yes

          Max number of samplers per kernel             16

          Max size for 1D images from buffer            65536 pixels

          Max 1D or 2D image array size                 2048 images

          Max 2D image size                             8192x8192 pixels

          Max 3D image size                             2048x2048x2048 pixels

          Max number of read image args                 128

          Max number of write image args                64

        Local memory type                               Global

        Local memory size                               32768 (32KiB)

        Max constant buffer size                        65536 (64KiB)

        Max number of constant args                     8

        Max size of kernel argument                     4096 (4KiB)

        Queue properties

          Out-of-order execution                        No

          Profiling                                     Yes

        Prefer user sync for interop                    Yes

        Profiling timer resolution                      1ns

        Profiling timer offset since Epoch (AMD)        1501862602024150536ns (Fri Aug  4 09:03:22 2017)

        Execution capabilities

          Run OpenCL kernels                            Yes

          Run native kernels                            Yes

          SPIR versions                                 1.2

        printf() buffer size                            65536 (64KiB)

        Built-in kernels

        Device Available                                Yes

        Compiler Available                              Yes

        Linker Available                                Yes

        Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event

       

       

      NULL platform behavior

        clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  AMD Accelerated Parallel Processing

        clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [AMD]

        clCreateContext(NULL, ...) [default]            Success [AMD]

        clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)

          Platform Name                                 AMD Accelerated Parallel Processing

          Device Name                                   AMD EPYC 7551 32-Core Processor

        clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform

        clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform

        clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform

        clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)

          Platform Name                                 AMD Accelerated Parallel Processing

          Device Name                                   AMD EPYC 7551 32-Core Processor

       

       

      ICD loader properties

        ICD loader Name                                 OpenCL ICD Loader

        ICD loader Vendor                               OCL Icd free software

        ICD loader Version                              2.2.11

        ICD loader Profile                              OpenCL 2.1

        • Re: OpenCL issues on EPYC 7551
          whiskey-foxtrot

          I'm actually seeing the same. I'm also on Ubuntu 17.04 with the stock kernel, as this is a brand new server. I'm going to do some kernel optimization/recompiling over the weekend or early next week to see what our options are.

          • Re: OpenCL issues on EPYC 7551
            jameskap

            See below for a comparison to my old E5 machine:

             

             

            Platform: AMD Accelerated Parallel Processing

              Device: AMD EPYC 7551 32-Core Processor

                Driver version  : 1800.8 (sse2,avx) (Linux x64)

                Compute units   : 128

                Clock frequency : 1200 MHz

             

             

                Global memory bandwidth (GBPS)

                  float   : 20.91

                  float2  : 18.55

                  float4  : 21.99

                  float8  : 58.94

                  float16 : 62.95

             

             

                Single-precision compute (GFLOPS)

                  float   : 106.92

                  float2  : 213.37

                  float4  : 418.79

                  float8  : 789.19

                  float16 : 1374.38

             

             

                No half precision support! Skipped

             

             

                Double-precision compute (GFLOPS)

                  double   : 92.89

                  double2  : 183.20

                  double4  : 368.21

                  double8  : 713.34

                  double16 : 266.88

             

             

                Integer compute (GIOPS)

                  int   : 160.82

                  int2  : 156.75

                  int4  : 496.46

                  int8  : 646.58

                  int16 : 643.35

             

             

                Transfer bandwidth (GBPS)

                  enqueueWriteBuffer         : 36.88

                  enqueueReadBuffer          : 13.28

                  enqueueMapBuffer(for read) : 5735.80

                    memcpy from mapped ptr   : 6.10

                  enqueueUnmap(after write)  : 12967.90

                    memcpy to mapped ptr     : 6.12

             

             

                Kernel launch latency : 27.35 us

             

             

            Platform: Intel(R) OpenCL

              Device: Genuine Intel(R) CPU @ 2.20GHz

                Driver version  : 1.2.0.25 (Linux x64)

                Compute units   : 56

                Clock frequency : 2200 MHz

             

             

                Global memory bandwidth (GBPS)

                  float   : 58.00

                  float2  : 56.24

                  float4  : 58.77

                  float8  : 62.97

                  float16 : 61.87

             

             

                Single-precision compute (GFLOPS)

                  float   : 418.39

                  float2  : 848.75

                  float4  : 1599.03

                  float8  : 418.64

                  float16 : 841.79

             

             

                No half precision support! Skipped

             

             

                Double-precision compute (GFLOPS)

                  double   : 413.08

                  double2  : 831.32

                  double4  : 223.81

                  double8  : 404.41

                  double16 : 790.25

             

             

                Transfer bandwidth (GBPS)

                  enqueueWriteBuffer         : 2.79

                  enqueueReadBuffer          : 7.00

                  enqueueMapBuffer(for read) : 5357.99

                    memcpy from mapped ptr   : 8.82

                  enqueueUnmap(after write)  : 3275.60

                    memcpy to mapped ptr     : 8.40

             

             

                Kernel launch latency : 22.13 us

              • Re: OpenCL issues on EPYC 7551
                dipak

                Hi James,

                 

                From the above clinfo output, it looks like the reported CPU clock frequency (1200MHz) is lower than the one given in the EPYC 7551's spec. That could be the reason for the poor performance.

                By default, the OpenCL runtime fetches all of this device information as reported by the lower level or the system. Please check the clock frequency directly in your system information. If you see a significant mismatch, then the OpenCL runtime might be unable to extract the actual number.

                On the other hand, if you see a similar number in the system information as well, then it might be an issue with a BIOS setting or something else that is causing this underclocking problem.
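
One way to cross-check the clock that clinfo reports against what the system itself reports is to read the per-CPU "cpu MHz" lines from /proc/cpuinfo. The following is a minimal sketch, assuming a Linux system that exposes that field; the helper names `parse_cpu_mhz` and `summarize` are made up for illustration and are not part of any AMD tool.

```python
import re
from pathlib import Path


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every logical CPU listed in the text."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(m) for m in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def summarize(mhz_values):
    """Condense the per-CPU readings into a count, minimum and maximum."""
    if not mhz_values:
        return None  # the field is missing on this system
    return {
        "cpus": len(mhz_values),
        "min_mhz": min(mhz_values),
        "max_mhz": max(mhz_values),
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # assumption: Linux procfs is mounted
    if cpuinfo.exists():
        print(summarize(parse_cpu_mhz(cpuinfo.read_text())))
```

If the maximum here sits near the 1200MHz that clinfo shows, the low number is probably coming from the system rather than from the OpenCL runtime. Note that /proc/cpuinfo is a snapshot, so an idle machine can legitimately show low values.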

                 

                You mentioned:

                The max work item size looks incredibly low.

                However, I don't see any problem here because the clinfo shows:

                Max work item sizes                             1024x1024x1024

                Max work group size                             1024

                 

                Regards,

                  • Re: OpenCL issues on EPYC 7551
                    jameskap

                    Hi --

                     

                    I updated the kernel to 4.12, and my BIOS settings are as Supermicro recommended.

                     

                    I can force the CPU governor to performance mode, which pegs the CPU at 2GHz:

                     

                    Number of platforms                               1

                      Platform Name                                   AMD Accelerated Parallel Processing

                      Platform Vendor                                 Advanced Micro Devices, Inc.

                      Platform Version                                OpenCL 2.0 AMD-APP (1800.8)

                      Platform Profile                                FULL_PROFILE

                      Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

                      Platform Extensions function suffix             AMD

                     

                     

                      Platform Name                                   AMD Accelerated Parallel Processing

                    Number of devices                                 1

                      Device Name                                     AMD EPYC 7551 32-Core Processor

                      Device Vendor                                   AuthenticAMD

                      Device Vendor ID                                0x1002

                      Device Version                                  OpenCL 1.2 AMD-APP (1800.8)

                      Driver Version                                  1800.8 (sse2,avx)

                      Device OpenCL C Version                         OpenCL C 1.2

                      Device Type                                     CPU

                      Device Profile                                  FULL_PROFILE

                      Device Board Name (AMD)

                      Device Topology (AMD)                           (n/a)

                      Max compute units                               128

                      Max clock frequency                             2000MHz

                      Device Partition                                (core, cl_ext_device_fission)

                        Max number of sub-devices                     128

                        Supported partition types                     equally, by counts, by affinity domain

                        Supported affinity domains                    L3 cache, L2 cache, L1 cache, next partitionable

                        Supported partition types (ext)               equally, by counts, by affinity domain

                        Supported affinity domains (ext)              L3 cache, L2 cache, L1 cache, next fissionable

                      Max work item dimensions                        3

                      Max work item sizes                             1024x1024x1024

                      Max work group size                             1024

                      Preferred work group size multiple              1

                      Preferred / native vector sizes

                        char                                                16 / 16

                        short                                                8 / 8

                        int                                                  4 / 4

                        long                                                 2 / 2

                        half                                                 4 / 4        (n/a)

                        float                                                8 / 8

                        double                                               4 / 4        (cl_khr_fp64)

                      Half-precision Floating-point support           (n/a)

                      Single-precision Floating-point support         (core)

                        Denormals                                     Yes

                        Infinity and NANs                             Yes

                        Round to nearest                              Yes

                        Round to zero                                 Yes

                        Round to infinity                             Yes

                        IEEE754-2008 fused multiply-add               Yes

                        Support is emulated in software               No

                        Correctly-rounded divide and sqrt operations  Yes

                      Double-precision Floating-point support         (cl_khr_fp64)

                        Denormals                                     Yes

                        Infinity and NANs                             Yes

                        Round to nearest                              Yes

                        Round to zero                                 Yes

                        Round to infinity                             Yes

                        IEEE754-2008 fused multiply-add               Yes

                        Support is emulated in software               No

                        Correctly-rounded divide and sqrt operations  No

                      Address bits                                    64, Little-Endian

                      Global memory size                              270430113792 (251.9GiB)

                      Error Correction support                        No

                      Max memory allocation                           67607528448 (62.96GiB)

                      Unified memory for Host and Device              Yes

                      Minimum alignment for any data type             128 bytes

                      Alignment of base address                       1024 bits (128 bytes)

                      Global Memory cache type                        Read/Write

                      Global Memory cache size                        32768

                      Global Memory cache line                        64 bytes

                      Image support                                   Yes

                        Max number of samplers per kernel             16

                        Max size for 1D images from buffer            65536 pixels

                        Max 1D or 2D image array size                 2048 images

                        Max 2D image size                             8192x8192 pixels

                        Max 3D image size                             2048x2048x2048 pixels

                        Max number of read image args                 128

                        Max number of write image args                64

                      Local memory type                               Global

                      Local memory size                               32768 (32KiB)

                      Max constant buffer size                        65536 (64KiB)

                      Max number of constant args                     8

                      Max size of kernel argument                     4096 (4KiB)

                      Queue properties

                        Out-of-order execution                        No

                        Profiling                                     Yes

                      Prefer user sync for interop                    Yes

                      Profiling timer resolution                      1ns

                      Profiling timer offset since Epoch (AMD)        1502913151479942118ns (Wed Aug 16 12:52:31 2017)

                      Execution capabilities

                        Run OpenCL kernels                            Yes

                        Run native kernels                            Yes

                        SPIR versions                                 1.2

                      printf() buffer size                            65536 (64KiB)

                      Built-in kernels

                      Device Available                                Yes

                      Compiler Available                              Yes

                      Linker Available                                Yes

                      Device Extensions                               cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event

                     

                     

                    NULL platform behavior

                      clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  AMD Accelerated Parallel Processing

                      clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [AMD]

                      clCreateContext(NULL, ...) [default]            Success [AMD]

                      clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)

                        Platform Name                                 AMD Accelerated Parallel Processing

                        Device Name                                   AMD EPYC 7551 32-Core Processor

                      clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform

                      clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform

                      clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform

                      clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)

                        Platform Name                                 AMD Accelerated Parallel Processing

                        Device Name                                   AMD EPYC 7551 32-Core Processor

                     

                     

                    ICD loader properties

                      ICD loader Name                                 OpenCL ICD Loader

                      ICD loader Vendor                               OCL Icd free software

                      ICD loader Version                              2.2.11

                      ICD loader Profile                              OpenCL 2.1

                     

                    And re-running the benchmarks, I get the same bad performance:

                     

                    Platform: AMD Accelerated Parallel Processing

                      Device: AMD EPYC 7551 32-Core Processor

                        Driver version  : 1800.8 (sse2,avx) (Linux x64)

                        Compute units   : 128

                        Clock frequency : 2000 MHz

                     

                     

                        Global memory bandwidth (GBPS)

                          float   : 17.54

                          float2  : 17.38

                          float4  : 78.93

                          float8  : 85.54

                          float16 : 168.14

                     

                     

                        Single-precision compute (GFLOPS)

                          float   : 107.17

                          float2  : 213.40

                          float4  : 417.33

                          float8  : 789.70

                          float16 : 1346.09

                     

                     

                        No half precision support! Skipped

                     

                     

                        Double-precision compute (GFLOPS)

                          double   : 91.38

                          double2  : 182.52

                          double4  : 357.25

                          double8  : 674.81

                          double16 : 264.95

                     

                     

                        Integer compute (GIOPS)

                          int   : 162.23

                          int2  : 156.74

                          int4  : 518.78

                          int8  : 646.18

                          int16 : 643.28

                     

                     

                        Transfer bandwidth (GBPS)

                          enqueueWriteBuffer         : 18.89

                          enqueueReadBuffer          : 13.05

                          enqueueMapBuffer(for read) : 13626.17

                            memcpy from mapped ptr   : 12.21

                          enqueueUnmap(after write)  : 45497.54

                            memcpy to mapped ptr     : 15.74

                     

                     

                        Kernel launch latency : 16.32 us
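
To put a number on "the same bad performance", the single-precision results from the two EPYC runs in this thread can be compared directly. This is only an arithmetic sketch over the figures already posted (1200 MHz run versus 2000 MHz run); the variable and function names are made up for illustration.

```python
# Single-precision GFLOPS copied from the two EPYC 7551 runs in this thread.
before = {"float": 106.92, "float2": 213.37, "float4": 418.79,
          "float8": 789.19, "float16": 1374.38}   # clinfo reported 1200 MHz
after = {"float": 107.17, "float2": 213.40, "float4": 417.33,
         "float8": 789.70, "float16": 1346.09}    # clinfo reported 2000 MHz


def relative_change(old, new):
    """Fractional change of each entry in `new` relative to `old`."""
    return {key: (new[key] - old[key]) / old[key] for key in old}


clock_change = (2000 - 1200) / 1200
changes = relative_change(before, after)

for width, change in changes.items():
    print(f"{width:8s} {change:+.1%}")
print(f"reported clock {clock_change:+.1%}")
```

Every single-precision figure moves by less than about 3 percent while the reported clock rises by roughly two thirds, which suggests the reported clock is not what limits the compute results.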

                      • Re: OpenCL issues on EPYC 7551
                        dipak

                        Thanks for this update. Indeed, it's a little surprising to see almost identical GFLOPS numbers even though the clock frequency was set to a much higher value. It would be helpful if you could share the benchmark code that was used to get the above numbers so we can check on our end.

                        By the way, did you try some other compute-intensive kernels (maybe from the APP SDK samples) for this comparison? If not, please check. I would also suggest observing the CPU frequency in real time while the kernel runs, to see how close it gets to the maximum.
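
A rough way to watch the CPU frequency while a benchmark is running is to poll the Linux cpufreq files in sysfs from a second terminal. The sketch below assumes the kernel exposes `scaling_cur_freq` (in kHz) under /sys/devices/system/cpu/cpuN/cpufreq/; the function names, sample count, and interval are arbitrary choices for illustration.

```python
import time
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")  # assumption: Linux cpufreq sysfs


def read_cur_freqs_mhz(root=SYSFS_CPU):
    """Read the current frequency of every CPU that exposes cpufreq, in MHz."""
    freqs = []
    for path in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        try:
            freqs.append(int(path.read_text().strip()) / 1000.0)  # kHz -> MHz
        except (OSError, ValueError):
            continue  # offline or unreadable CPU: skip it
    return freqs


def sample(root=SYSFS_CPU, samples=5, interval_s=1.0):
    """Return (min, mean, max) MHz across all CPUs for each sample taken."""
    results = []
    for i in range(samples):
        freqs = read_cur_freqs_mhz(root)
        if freqs:
            results.append((min(freqs), sum(freqs) / len(freqs), max(freqs)))
        if i + 1 < samples:
            time.sleep(interval_s)
    return results


if __name__ == "__main__":
    for low, mean, high in sample(samples=3, interval_s=0.5):
        print(f"min {low:.0f} MHz  mean {mean:.0f} MHz  max {high:.0f} MHz")
```

Running this while the OpenCL kernel is active shows whether the cores actually stay near the expected frequency under load, independent of what clinfo reports.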

                         

                        Regards,

                          • Re: OpenCL issues on EPYC 7551
                            jameskap

                            Hi --

                             

                            The response I got from AMD (through Supermicro) says that it is (poorly) working as intended:

                            The low performance your customer is getting is due to the lack of Zen (EPYC) support on the SDK for OCL.  The APP SDK available doesn’t currently have knobs that are optimized for EPYC.  AMD’s software group is more focused on ROCM to support our dGPU offering.

                             

                            This doesn’t mean we will not support a new release of the APP SDK for OpenCL in the coming months, but this is not high on the priority list.

                             

                            Are there any suggestions for how I should proceed with using OpenCL on EPYC?

                    • Re: OpenCL issues on EPYC 7551
                      ray_m

                      Moving this to the OpenCL forums where it will get an official response.