Archives Discussions

KenDomino
Journeyman III

OpenCL performance on AMD GPUs

I’ve been experiencing some performance issues with OpenCL on AMD GPUs. While OpenCL implementations seemed to be better than serial CPU code, I didn't think much about the performance until the Developer Preview of Visual Studio 11 came out. The runtime of OpenCL on AMD GPUs is terrible compared to C++ AMP, and I'm trying to understand why.

I wrote a program to help me quantify the runtimes of OpenCL and C++ AMP, which is available here (requires MS Visual Studio 11 Developer Preview). It performs matrix multiplication across all platforms and devices in OpenCL, and across all accelerators in C++ AMP, using the same algorithms. CUDA is not included in the comparison because it is not hardware independent, whereas OpenCL and C++ AMP are. The problem is specific to AMD GPUs; I do not see it with NVIDIA cards (there, OpenCL and C++ AMP perform roughly equivalently).

For the comparison, I decided to use an AMD Llano machine (AMD A8-3850, Gigabyte GA-A75-UD4H motherboard, 8 GB memory, Windows 8 Developer Preview). This machine has no NVIDIA software installed, only Microsoft's tools and the AMD APP SDK 2 version 1.2 (12/19/2011).

The output from the program is here:

has disp 1
mem 504936
dev PCI\VEN_1002&DEV_9640&SUBSYS_D0001458&REV_00\3&2B8E0B4B&0&08
dev AMD Radeon HD 6550D (Engineering Sample)
Starting serial... 1093.99 ms.
Starting serial... 1101.95 ms.
Starting serial... 1094.82 ms.
Starting simple... 70.6149 ms.
Starting simple... 60.582 ms.
Starting simple... 60.9937 ms.
Starting explicit... 58.338 ms.
Starting explicit... 58.1812 ms.
Starting explicit... 57.2286 ms.
Starting tile... 43.9958 ms.
Starting tile... 33.2602 ms.
Starting tile... 33.2874 ms.
has disp 0
mem 0
dev direct3d\warp
dev Microsoft Basic Render Driver
Starting serial... 1082.92 ms.
Starting serial... 1083.16 ms.
Starting serial... 1083.2 ms.
Starting simple... 1418.89 ms.
Starting simple... 1420 ms.
Starting simple... 1412.86 ms.
Starting explicit... 1062.79 ms.
Starting explicit... 1056.32 ms.
Starting explicit... 1064.37 ms.
Starting tile... 883.104 ms.
Starting tile... 881.774 ms.
Starting tile... 863.205 ms.
has disp 1
mem 1024
dev direct3d\ref
dev Software Adapter
has disp 0
mem 3667320
dev cpu
dev CPU accelerator
Number of platforms = 1
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 AMD-APP (851.6)
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
Platform extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
devices = 2
Device [0]
type = CL_DEVICE_TYPE_GPU
name = BeaverCreek
Starting serial... 1085.57 ms.
Starting serial... 1085.6 ms.
Starting serial... 1085.19 ms.
Starting explicit simple... 602.773 ms.
Starting explicit simple... 673.661 ms.
Starting explicit simple... 832.318 ms.
Starting tile... 659.472 ms.
Starting tile... 554.391 ms.
Starting tile... 581.086 ms.
Device [1]
type = CL_DEVICE_TYPE_CPU
name = AMD A8-3850 APU with Radeon(tm) HD Graphics
Starting serial... 1085.84 ms.
Starting serial... 1085.71 ms.
Starting serial... 1085.37 ms.
Starting explicit simple... 240.423 ms.
Starting explicit simple... 241.033 ms.
Starting explicit simple... 239.877 ms.
Starting tile... 1538.41 ms.
Starting tile... 1388.03 ms.
Starting tile... 2050.87 ms.

The results indicate that C++ AMP typically computes the single-precision product of input matrices A (480 rows, 640 cols) and B (640 rows, 960 cols) in about 58 ms. In comparison, the OpenCL implementation solves the same problem in about 600 ms at best on the GPU. An OpenCL CPU device is also enumerated by the program and, surprisingly, does better than the GPU, running in about 240 ms.

This doesn’t make sense because almost all variables have been eliminated:

  • the size of the problem is the same;
  • the size of the tiling (16 by 16) is the same;
  • the allocation of device memory is the same;
  • the copies to/from the CPU memory space and GPU memory space are the same;
  • the kernel algorithms (of which there are two, “simple explicit” is a non-shared-memory implementation, “tiled” a shared-memory implementation) are the same;
  • the device is the same physical GPU, whether it is enumerated through OpenCL or C++ AMP.

On the chance that clCreateProgramWithSource / clBuildProgram compilation is the culprit, I do not include the runtime of those two steps in the overall runtime. That does not help.
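One thing I have not tried yet is timing just the kernel itself with OpenCL profiling events, to separate kernel execution from the copy and launch overhead. Roughly like this (an untested sketch; ctx, dev, kernel and the work-size arrays stand in for the objects my program creates elsewhere, and the queue has to be created with CL_QUEUE_PROFILING_ENABLE):

    // Sketch: time only the kernel execution using OpenCL event profiling.
    cl_int err;
    cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    cl_event evt;
    err = clEnqueueNDRangeKernel(q, kernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
    double kernel_ms = (t_end - t_start) * 1.0e-6;   // profiling timestamps are in nanoseconds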

On another machine, which has an NVIDIA card, there is negligible difference between OpenCL, C++ AMP, and CUDA (both the runtime and driver API implementations) for the NVIDIA GPU. However, that machine also has an AMD graphics card, which exhibits the same problem as the Llano machine: OpenCL targeting the AMD GPU performs poorly compared to C++ AMP targeting the same GPU.

I suspect that the reason for the poor performance is that C++ AMP targets DirectX 11, which the card implements directly, whereas OpenCL is translated into VLIW code. But I don't know any details of how OpenCL is implemented for AMD GPUs. Or it could be something simple that I've overlooked in the installation of the AMD APP SDK. I just don't know.

Comments would be appreciated.

Ken


Without having the app to analyze, it is hard to tell, and I am not familiar with C++ AMP, but the DX11 path and the OpenCL path are quite different. That being said, they both go through the same intermediate language before getting compiled into ISA. If you can provide either the intermediate language or the ISA for the problematic shaders, it will be easier to understand what the issue is.



The code was posted on my web site, but I guess you read the post in the few minutes before I ftp'ed the ZIP file to the site. Sorry. (It is http://codinggorilla.domemtech.com/code/Matrix_C_AMP_OpenCL.zip.) Anyhow, here is the C++ and OpenCL kernel source code.

The C++ AMP code is pretty straightforward. Data is copied to/from GPU memory through array_view<>'s; kernels are launched with parallel_for_each calls, which take the grid (and, for the tiled versions, the tile size) as a parameter. Unfortunately, I don't know how to see the HLSL source that is generated for C++ AMP.

Likewise, I don't know how to see the IL or ISA for OpenCL, presumably through clGetProgramInfo. The last time I did that, several months ago, the result appeared to be unintelligible to human eyes (unlike NVIDIA's "binary", which is actually PTX source code). Is there a way to disassemble it, another route to see a text representation of the IL/ISA, or a compiler flag to output something readable?
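In case it helps, this is roughly what I was planning to try to pull the compiled program back out (an untested sketch for the single-device case; I'm assuming, perhaps wrongly, that whatever CL_PROGRAM_BINARIES returns on the AMD platform contains something readable or disassemblable):

    // Sketch: dump whatever the runtime reports as the program "binary" to a file.
    void dump_program_binary(cl_program program, const char * path)
    {
        size_t size = 0;
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
        unsigned char * blob = (unsigned char *)malloc(size);
        unsigned char * blobs[1] = { blob };   // CL_PROGRAM_BINARIES expects one pointer per device
        clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(blobs), blobs, NULL);
        FILE * f = fopen(path, "wb");
        fwrite(blob, 1, size, f);
        fclose(f);
        free(blob);
    }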

Thanks,

Ken

==============================

// main.cpp

#include <windows.h>    // LARGE_INTEGER, QueryPerformanceCounter, CHAR/WCHAR
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>
#include <math.h>
#include <sys/timeb.h>
#include <time.h>
#include <amp.h>
#include <cl/cl.h>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <vector>

using namespace concurrency;

#define TIMES 3

// Wall-clock timer built on QueryPerformanceCounter; results are in milliseconds.
class Counter
{
private:
    static bool initialized;

    // Returns the overhead of the timer in ticks
    static LONGLONG GetOverhead()
    {
        Counter t;
        t.Start();
        t.Stop();
        return t.m_stop.QuadPart - t.m_start.QuadPart;
    }

    LARGE_INTEGER m_start;
    LARGE_INTEGER m_stop;
    static LARGE_INTEGER m_freq;
    static LONGLONG m_overhead;

public:
    Counter()
    {
        if (initialized == false)
        {
            initialized = true;
            m_freq = (QueryPerformanceFrequency(&Counter::m_freq), Counter::m_freq);
            m_overhead = GetOverhead();
        }
    }

    void Start()
    {
        QueryPerformanceCounter(&m_start);
    }

    double Stop()
    {
        QueryPerformanceCounter(&m_stop);
        return (m_stop.QuadPart - m_start.QuadPart - m_overhead) * 1000.0 / m_freq.QuadPart;
    }
};

bool Counter::initialized = false;
LARGE_INTEGER Counter::m_freq;
LONGLONG Counter::m_overhead;

// A Matrix is two ints (rows, cols) followed immediately by the element data.
typedef struct {
    cl_int rows;
    cl_int cols;
    cl_float data[];
} Matrix;

cl_float * Data(Matrix * matrix)
{
    return (cl_float*)&matrix->data;
}

Matrix * Create_Matrix(int rows, int cols)
{
    Matrix * ret = (Matrix*)malloc(2 * sizeof(cl_int) + rows * cols * sizeof(cl_float));
    ret->rows = rows;
    ret->cols = cols;
    for (int i = 0; i < rows * cols; ++i)
        Data(ret)[i] = 0;
    return ret;
}

void MultiplySerial(Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols;
    int hA = A->rows;
    int wB = B->cols;
    int hB = B->rows;
    int wC = C->cols;
    int hC = C->rows;
    for (int gr = 0; gr < hA; ++gr) // row
        for (int gc = 0; gc < wB; ++gc) { // col
            float sum = 0;
            for (int k = 0; k < hB; ++k) {
                cl_float a = Data(A)[gr * wA + k];
                cl_float b = Data(B)[k * wB + gc];
                sum += a * b;
            }
            Data(C)[gr * wC + gc] = sum;
        }
}

void withTimeMultiplySerial(Matrix * C, Matrix * A, Matrix * B)
{
    Counter counter;
    std::cout << "Starting serial... ";
    counter.Start();
    int wA = A->cols;
    int hA = A->rows;
    int wB = B->cols;
    int hB = B->rows;
    int wC = C->cols;
    int hC = C->rows;
    for (int gr = 0; gr < hA; ++gr) // row
        for (int gc = 0; gc < wB; ++gc) { // col
            cl_float sum = 0;
            for (int k = 0; k < hB; ++k) {
                cl_float a = Data(A)[gr * wA + k];
                cl_float b = Data(B)[k * wB + gc];
                sum += a * b;
            }
            Data(C)[gr * wC + gc] = sum;
        }
    std::cout << counter.Stop() << " ms.\n";
}

#define MYSIZE 2000

char buffer[MYSIZE];

// Crude wide-char to char conversion into a static buffer, for printing accelerator names.
CHAR* wtoc(const WCHAR* Source)
{
    for (int j = 0; j < MYSIZE; ++j)
        buffer[j] = 0;
    int i = 0;
    while (Source[i] != '\0')
    {
        buffer[i] = (CHAR)Source[i];
        ++i;
        if (i > 2000)
            break;
    }
    return buffer;
}

void aMultiplySimple(accelerator_view & acc, Matrix * C, Matrix * A, Matrix * B)
{
    int wA = A->cols;
    int hA = A->rows;
    int wB = B->cols;
    int hB = B->rows;
    int wC = C->cols;
    int hC = C->rows;
    Counter counter;
    std::cout << "Starting simple... ";
    counter.Start();
    {
        array_view<const float,1> a(hA * wA, Data(A));
        array_view<const float,1> b(hB * wB, Data(B));
        array_view<writeonly<float>,1> c(hC * wC, Data(C));
        extent<2> e(hC, wC);
        grid<2> g(e);
        parallel_for_each(acc, g,
            [=](index<2> idx) restrict(direct3d) {
                int gr = idx[0];
                int gc = idx[1];
                float sum = 0.0f;
                for (int k = 0; k < hB; k++)
                {
                    float aa = a[gr * wA + k];
                    float bb = b[k * wB + gc];
                    sum += aa * bb;
                }
                c[gr * wC + gc] = sum;
        });
    }
    std::cout << counter.Stop() << " ms.\n";
}

void aMultiplyExplicitSimple(accelerator_view & acc, Matrix * C, Matrix * A, Matrix * B)
{
    static const int TS = 16;
    int wA = A->cols;
    int hA = A->rows;
    int wB = B->cols;
    int hB = B->rows;
    int wC = C->cols;
    int hC = C->rows;
    Counter counter;
    std::cout << "Starting explicit... ";
    counter.Start();
    {
        array_view<const float,1> a(hA * wA, Data(A));
        array_view<const float,1> b(hB * wB, Data(B));
        array_view<writeonly<float>,1> c(hC * wC, Data(C));
        extent<2> e(hC, wC);
        grid<2> g(e);
        parallel_for_each(acc, g.tile<TS,TS>(),
            [=](tiled_index<TS,TS> idx) restrict(direct3d) {
                int gr = idx.global[0]; int gc = idx.global[1];
                float sum = 0.0f;
                for (int k = 0; k < hB; k++)
                {
                    float aa = a[gr * wA + k];
                    float bb = b[k * wB + gc];
                    sum += aa * bb;
                }
                c[gr * wC + gc] = sum;
        });
    }
    std::cout << counter.Stop() << " ms.\n";
}

void aMultiplyTile(accelerator_view & acc, Matrix * C, Matrix * A, Matrix * B)
{
    static const int TS = 16;
    int wA = A->cols;
    int hA = A->rows;
    int wB = B->cols;
    int hB = B->rows;
    int wC = C->cols;
    int hC = C->rows;
    Counter counter;
    std::cout << "Starting tile... ";
    counter.Start();
    {
        array_view<const float,1> a(hA * wA, Data(A));
        array_view<const float,1> b(hB * wB, Data(B));
        array_view<writeonly<float>,1> c(hC * wC, Data(C));
        extent<2> e(hC, wC);
        grid<2> g(e);
        parallel_for_each(acc, g.tile<TS,TS>(),
            [=](tiled_index<TS,TS> idx) restrict(direct3d) {
                int lr = idx.local[0]; int lc = idx.local[1];
                int gr = idx.global[0]; int gc = idx.global[1];
                float sum = 0.0f;
                for (int i = 0; i < hB; i += TS) {
                    tile_static float locA[TS][TS], locB[TS][TS];
                    locA[lr][lc] = a[gr * wA + lc + i];
                    locB[lr][lc] = b[(lr + i) * wB + gc];
                    idx.barrier.wait();
                    for (int k = 0; k < TS; k++)
                        sum += locA[lr][k] * locB[k][lc];
                    idx.barrier.wait();
                }
                c[gr * wC + gc] = sum;
        });
    }
    std::cout << counter.Stop() << " ms.\n";
}

// Print the symbolic name of an OpenCL error code and abort by throwing.
void CHECK(cl_int x)
{
    if (x != 0)
    {
        std::cout << "Error " << x << "\n";
        switch (x)
        {
        case CL_SUCCESS: std::cout << "CL_SUCCESS"; break;
        case CL_DEVICE_NOT_FOUND: std::cout << "CL_DEVICE_NOT_FOUND"; break;
        case CL_DEVICE_NOT_AVAILABLE: std::cout << "CL_DEVICE_NOT_AVAILABLE"; break;
        case CL_COMPILER_NOT_AVAILABLE: std::cout << "CL_COMPILER_NOT_AVAILABLE"; break;
        case CL_MEM_OBJECT_ALLOCATION_FAILURE: std::cout << "CL_MEM_OBJECT_ALLOCATION_FAILURE"; break;
        case CL_OUT_OF_RESOURCES: std::cout << "CL_OUT_OF_RESOURCES"; break;
        case CL_OUT_OF_HOST_MEMORY: std::cout << "CL_OUT_OF_HOST_MEMORY"; break;
        case CL_PROFILING_INFO_NOT_AVAILABLE: std::cout << "CL_PROFILING_INFO_NOT_AVAILABLE"; break;
        case CL_MEM_COPY_OVERLAP: std::cout << "CL_MEM_COPY_OVERLAP"; break;
        case CL_IMAGE_FORMAT_MISMATCH: std::cout << "CL_IMAGE_FORMAT_MISMATCH"; break;
        case CL_IMAGE_FORMAT_NOT_SUPPORTED: std::cout << "CL_IMAGE_FORMAT_NOT_SUPPORTED"; break;
        case CL_BUILD_PROGRAM_FAILURE: std::cout << "CL_BUILD_PROGRAM_FAILURE"; break;
        case CL_MAP_FAILURE: std::cout << "CL_MAP_FAILURE"; break;
        case CL_MISALIGNED_SUB_BUFFER_OFFSET: std::cout << "CL_MISALIGNED_SUB_BUFFER_OFFSET"; break;
        case CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST: std::cout << "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST"; break;
        case CL_INVALID_VALUE: std::cout << "CL_INVALID_VALUE"; break;
        case CL_INVALID_DEVICE_TYPE: std::cout << "CL_INVALID_DEVICE_TYPE"; break;
        case CL_INVALID_PLATFORM: std::cout << "CL_INVALID_PLATFORM"; break;
        case CL_INVALID_DEVICE: std::cout << "CL_INVALID_DEVICE"; break;
        case CL_INVALID_CONTEXT: std::cout << "CL_INVALID_CONTEXT"; break;
        case CL_INVALID_QUEUE_PROPERTIES: std::cout << "CL_INVALID_QUEUE_PROPERTIES"; break;
        case CL_INVALID_COMMAND_QUEUE: std::cout << "CL_INVALID_COMMAND_QUEUE"; break;
        case CL_INVALID_HOST_PTR: std::cout << "CL_INVALID_HOST_PTR"; break;
        case CL_INVALID_MEM_OBJECT: std::cout << "CL_INVALID_MEM_OBJECT"; break;
        case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR: std::cout << "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR"; break;
        case CL_INVALID_IMAGE_SIZE: std::cout << "CL_INVALID_IMAGE_SIZE"; break;
        case CL_INVALID_SAMPLER: std::cout << "CL_INVALID_SAMPLER"; break;
        case CL_INVALID_BINARY: std::cout << "CL_INVALID_BINARY"; break;
        case CL_INVALID_BUILD_OPTIONS: std::cout << "CL_INVALID_BUILD_OPTIONS"; break;
        case CL_INVALID_PROGRAM: std::cout << "CL_INVALID_PROGRAM"; break;
        case CL_INVALID_PROGRAM_EXECUTABLE: std::cout << "CL_INVALID_PROGRAM_EXECUTABLE"; break;
        case CL_INVALID_KERNEL_NAME: std::cout << "CL_INVALID_KERNEL_NAME"; break;
        case CL_INVALID_KERNEL_DEFINITION: std::cout << "CL_INVALID_KERNEL_DEFINITION"; break;
        case CL_INVALID_KERNEL: std::cout << "CL_INVALID_KERNEL"; break;
        case CL_INVALID_ARG_INDEX: std::cout << "CL_INVALID_ARG_INDEX"; break;
        case CL_INVALID_ARG_VALUE: std::cout << "CL_INVALID_ARG_VALUE"; break;
        case CL_INVALID_ARG_SIZE: std::cout << "CL_INVALID_ARG_SIZE"; break;
        case CL_INVALID_KERNEL_ARGS: std::cout << "CL_INVALID_KERNEL_ARGS"; break;
        case CL_INVALID_WORK_DIMENSION: std::cout << "CL_INVALID_WORK_DIMENSION"; break;
        case CL_INVALID_WORK_GROUP_SIZE: std::cout << "CL_INVALID_WORK_GROUP_SIZE"; break;
        case CL_INVALID_WORK_ITEM_SIZE: std::cout << "CL_INVALID_WORK_ITEM_SIZE"; break;
        case CL_INVALID_GLOBAL_OFFSET: std::cout << "CL_INVALID_GLOBAL_OFFSET"; break;
        case CL_INVALID_EVENT_WAIT_LIST: std::cout << "CL_INVALID_EVENT_WAIT_LIST"; break;
        case CL_INVALID_EVENT: std::cout << "CL_INVALID_EVENT"; break;
        case CL_INVALID_OPERATION: std::cout << "CL_INVALID_OPERATION"; break;
        case CL_INVALID_GL_OBJECT: std::cout << "CL_INVALID_GL_OBJECT"; break;
        case CL_INVALID_BUFFER_SIZE: std::cout << "CL_INVALID_BUFFER_SIZE"; break;
        case CL_INVALID_MIP_LEVEL: std::cout << "CL_INVALID_MIP_LEVEL"; break;
        case CL_INVALID_GLOBAL_WORK_SIZE: std::cout << "CL_INVALID_GLOBAL_WORK_SIZE"; break;
        case CL_INVALID_PROPERTY: std::cout << "CL_INVALID_PROPERTY"; break;
        default: std::cout << "Unknown"; break;
        }
        throw std::string(" Error ") + std::to_string((long long)x);
    }
}

void get_device_info(cl_device_id device_id, cl_device_info device_info, std::string* value, cl_int * err)
{
    size_t size = 0;
    //  Get all params for the given device id: first query their size, then get the actual data
    *err = clGetDeviceInfo(device_id, device_info, 0, NULL, &size);
    value->resize(size);
    *err = clGetDeviceInfo(device_id, device_info, size, &((*value)[0]), NULL);
}

void get_platform_info(cl_platform_id platform_id, cl_platform_info platform_info, std::string* value, cl_int * err)
{
    ::size_t size = 0;
    //  Get all params for the given platform id: first query their size, then get the actual data
    *err = clGetPlatformInfo(platform_id, platform_info, 0, NULL, &size);
    value->resize(size);
    *err = clGetPlatformInfo(platform_id, platform_info, size, &((*value)[0]), NULL);
}

std::string LoadProgram(char * file_name)
{
    std::ifstream input_file;
    input_file.open(file_name, std::ios::binary | std::ios::in);
    if (! input_file.is_open())
        return std::string();   // empty string on failure
    //  Read contents
    std::istreambuf_iterator<char> begin(input_file.rdbuf());
    std::istreambuf_iterator<char> end;
    //  Store in std::string object
    std::string file_content(begin, end);
    //  Return source of program.
    return file_content;
}

void clMultiplyExplicitSimple(cl_device_id device, cl_context clGPUContext, Matrix * C, Matrix * A, Matrix * B)
{
    cl_int errcode;
    cl_command_queue clCommandQue;
    cl_program clProgram;
    cl_kernel clKernel;
    cl_mem d_A;
    cl_mem d_B;
    cl_mem d_C;
    Counter counter1;
    Counter counter2;
    std::cout << "Starting explicit simple... ";
    counter1.Start();
    clCommandQue = clCreateCommandQueue(clGPUContext, device, 0, &errcode);
    CHECK(errcode);
    d_C = clCreateBuffer(clGPUContext, CL_MEM_WRITE_ONLY, 2 * sizeof(cl_int) + C->cols * C->rows * sizeof(cl_float), 0, &errcode);
    CHECK(errcode);
    {
        // write input array
        errcode = clEnqueueWriteBuffer(clCommandQue, d_C, CL_TRUE, 0, 2 * sizeof(cl_int) + C->cols * C->rows * sizeof(cl_float), C, 0, NULL, NULL);
        CHECK(errcode);
    }
    d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY, 2 * sizeof(cl_int) + A->cols * A->rows * sizeof(cl_float), 0, &errcode);
    CHECK(errcode);
    {
        // write input array
        errcode = clEnqueueWriteBuffer(clCommandQue, d_A, CL_TRUE, 0, 2 * sizeof(cl_int) + A->cols * A->rows * sizeof(cl_float), A, 0, NULL, NULL);
        CHECK(errcode);
    }
    d_B = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY, 2 * sizeof(cl_int) + B->cols * B->rows * sizeof(cl_float), 0, &errcode);
    CHECK(errcode);
    {
        // write input array
        errcode = clEnqueueWriteBuffer(clCommandQue, d_B, CL_TRUE, 0, 2 * sizeof(cl_int) + B->cols * B->rows * sizeof(cl_float), B, 0, NULL, NULL);
        CHECK(errcode);
    }
    double cv1 = counter1.Stop();
    std::string clMatrixMul = LoadProgram("kernel.cl");
    char * kernel_code = (char*)clMatrixMul.c_str();
    std::string new_str = "";
    // remove all \n's, \r's.
    //size_t len = strlen(kernel_code);
    //for (int i = 0; i < len; ++i)
    //{
    //  // Ignore carriage returns
    //  //if (kernel_code[i] == '\r') continue;
    //  //if (kernel_code[i] == '\n') continue;
    //  new_str += kernel_code[i];
    //}
    //kernel_code = (char*)new_str.c_str();
    clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char **)&kernel_code, 0, &errcode);
    CHECK(errcode);
    errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);
    //  Check for build errors
    if (errcode != CL_SUCCESS)
    {
        printf("error = %d\n", errcode);
        cl_build_status status;
        //  Retrieve build status
        CHECK(clGetProgramBuildInfo(clProgram, device, CL_PROGRAM_BUILD_STATUS, sizeof(cl_build_status), &status, NULL));
        //  On error, get the log
        if (status == CL_BUILD_ERROR)
        {
            size_t log_size = 0;
            //  Get size of log
            CHECK(clGetProgramBuildInfo(clProgram, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size));
            //  Allocate log
            std::string device_build_log;
            device_build_log.resize(log_size, 0);
            //  Get log
            CHECK(clGetProgramBuildInfo(clProgram, device, CL_PROGRAM_BUILD_LOG, log_size, &device_build_log[0], NULL));
            //  Print log
            printf(" build error: %s\n", device_build_log.c_str());
        }
        else
        {
            printf("Unknown\n");
        }
        return;
    }
    clKernel = clCreateKernel(clProgram, "kernelSimple", &errcode);
    CHECK(errcode);
    counter2.Start();
    size_t localWorkSize[2], globalWorkSize[2];
    errcode = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void *)&d_C);
    CHECK(errcode);
    errcode = clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void *)&d_A);
    CHECK(errcode);
    errcode = clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void *)&d_B);
    CHECK(errcode);
    localWorkSize[0] = 16;
    localWorkSize[1] = 16;
    globalWorkSize[0] = C->rows;
    globalWorkSize[1] = C->cols;
    cl_event my_event;
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &my_event);
    CHECK(errcode);
    errcode = clWaitForEvents(1, &my_event);
    CHECK(errcode);
    errcode = clEnqueueReadBuffer(clCommandQue, d_C, CL_TRUE, 0,
        2 * sizeof(cl_int) + C->cols * C->rows * sizeof(cl_float),
        &C->rows,
        0, NULL, NULL);
    CHECK(errcode);
    clReleaseMemObject(d_A);
    clReleaseMemObject(d_C);
    clReleaseMemObject(d_B);
    clReleaseKernel(clKernel);
    clReleaseProgram(clProgram);
    clReleaseCommandQueue(clCommandQue);
    std::cout << (cv1 + counter2.Stop()) << " ms.\n";
}

void clMultiplyTile(cl_device_id device, cl_context clGPUContext, Matrix * C, Matrix * A, Matrix * B)
{
    cl_int errcode;
    cl_command_queue clCommandQue;
    cl_program clProgram;
    cl_kernel clKernel;
    cl_mem d_A;
    cl_mem d_B;
    cl_mem d_C;
    Counter counter1;
    Counter counter2;
    std::cout << "Starting tile... ";
    counter1.Start();
    clCommandQue = clCreateCommandQueue(clGPUContext, device, 0, &errcode);
    CHECK(errcode);
    d_C = clCreateBuffer(clGPUContext, CL_MEM_WRITE_ONLY, 2 * sizeof(cl_int) + C->cols * C->rows * sizeof(cl_float), 0, &errcode);
    CHECK(errcode);
    {
        // write input array
        errcode = clEnqueueWriteBuffer(clCommandQue, d_C, CL_TRUE, 0, 2 * sizeof(cl_int) + C->cols * C->rows * sizeof(cl_float), C, 0, NULL, NULL);
        CHECK(errcode);
    }
    d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY, 2 * sizeof(cl_int) + A->cols * A->rows * sizeof(cl_float), 0, &errcode);
    CHECK(errcode);
    {
        // write input array
        errcode = clEnqueueWriteBuffer(clCommandQue, d_A, CL_TRUE, 0, 2 * sizeof(cl_int) + A->cols * A->rows * sizeof(cl_float), A, 0, NULL, NULL);
        CHECK(errcode);
    }
    d_B = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY, 2 * sizeof(cl_int) + B->cols * B->rows * sizeof(cl_float), 0, &errcode);
    CHECK(errcode);
    {
        // write input array
        errcode = clEnqueueWriteBuffer(clCommandQue, d_B, CL_TRUE, 0, 2 * sizeof(cl_int) + B->cols * B->rows * sizeof(cl_float), B, 0, NULL, NULL);
        CHECK(errcode);
    }
    double cv1 = counter1.Stop();
    std::string clMatrixMul = LoadProgram("kernel.cl");
    char * kernel_code = (char*)clMatrixMul.c_str();
    clProgram = clCreateProgramWithSource(clGPUContext, 1, (const char **)&kernel_code, 0, &errcode);
    CHECK(errcode);
    errcode = clBuildProgram(clProgram, 0, NULL, NULL, NULL, NULL);
    CHECK(errcode);
    clKernel = clCreateKernel(clProgram, "kernelTile", &errcode);
    CHECK(errcode);
    counter2.Start();
    size_t localWorkSize[2], globalWorkSize[2];
    errcode = clSetKernelArg(clKernel, 0, sizeof(cl_mem), (void *)&d_C);
    CHECK(errcode);
    errcode = clSetKernelArg(clKernel, 1, sizeof(cl_mem), (void *)&d_A);
    CHECK(errcode);
    errcode = clSetKernelArg(clKernel, 2, sizeof(cl_mem), (void *)&d_B);
    CHECK(errcode);
    localWorkSize[0] = 16;
    localWorkSize[1] = 16;
    globalWorkSize[0] = C->rows;
    globalWorkSize[1] = C->cols;
    cl_event my_event;
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &my_event);
    CHECK(errcode);
    errcode = clWaitForEvents(1, &my_event);
    CHECK(errcode);
    errcode = clEnqueueReadBuffer(clCommandQue, d_C, CL_TRUE, 0,
        2 * sizeof(cl_int) + C->cols * C->rows * sizeof(cl_float),
        &C->rows,
        0, NULL, NULL);
    CHECK(errcode);
    clReleaseMemObject(d_A);
    clReleaseMemObject(d_C);
    clReleaseMemObject(d_B);
    clReleaseKernel(clKernel);
    clReleaseProgram(clProgram);
    clReleaseCommandQueue(clCommandQue);
    std::cout << (cv1 + counter2.Stop()) << " ms.\n";
}

void part1()
{
    Matrix * A = Create_Matrix(16*30, 16*40);
    Matrix * B = Create_Matrix(16*40, 16*60);
    Matrix * C = Create_Matrix(A->rows, B->cols);
    Matrix * C2 = Create_Matrix(A->rows, B->cols);
    Matrix * C3 = Create_Matrix(A->rows, B->cols);
    Matrix * C4 = Create_Matrix(A->rows, B->cols);
    std::vector<concurrency::accelerator> accelerators = concurrency::get_accelerators();
    std::vector<concurrency::accelerator>::iterator it;
    for (it = accelerators.begin(); it != accelerators.end(); ++it)
    {
        std::cout << "has disp " << (*it).get_has_display() << "\n";
        std::cout << "mem " << (*it).get_dedicated_memory() << "\n";
        std::cout << "dev " << wtoc((*it).get_device_path().c_str()) << "\n";
        std::cout << "dev " << wtoc((*it).get_description().c_str()) << "\n";
        if (strcmp(wtoc((*it).get_description().c_str()), "Software Adapter") == 0)
            continue;
        if (strcmp(wtoc((*it).get_description().c_str()), "CPU accelerator") == 0)
            continue;

        accelerator_view acc = it->create_view();
        // Serial baseline.
        for (int i = 0; i < TIMES; ++i)
        {
            for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
            for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
            for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
            withTimeMultiplySerial(C, A, B);
        }
        // Simple (non-tiled) C++ AMP kernel, checked against the serial result.
        for (int i = 0; i < TIMES; ++i)
        {
            for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
            for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
            for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
            for (int i = 0; i < C2->rows * C2->cols; ++i) Data(C2)[i] = rand() % 10;
            MultiplySerial(C, A, B);
            aMultiplySimple(acc, C2, A, B);
            for (int i = 0; i < C->rows * C->cols; ++i)
                if (fabs(Data(C)[i] - Data(C2)[i]) > 0.0001)
                {
                    std::cout << "diff C2\n";
                    break;
                }
        }
        // Explicitly tiled launch without tile_static memory.
        for (int i = 0; i < TIMES; ++i)
        {
            for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
            for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
            for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
            for (int i = 0; i < C3->rows * C3->cols; ++i) Data(C3)[i] = rand() % 10;
            MultiplySerial(C, A, B);
            aMultiplyExplicitSimple(acc, C3, A, B);
            for (int i = 0; i < C3->rows * C3->cols; ++i)
                if (fabs(Data(C)[i] - Data(C3)[i]) > 0.0001)
                {
                    std::cout << "diff C3\n";
                    break;
                }
        }
        // Tiled kernel using tile_static (shared) memory.
        for (int i = 0; i < TIMES; ++i)
        {
            for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
            for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
            for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
            for (int i = 0; i < C4->rows * C4->cols; ++i) Data(C4)[i] = rand() % 10;
            MultiplySerial(C, A, B);
            aMultiplyTile(acc, C4, A, B);
            for (int i = 0; i < C4->rows * C4->cols; ++i)
                if (fabs(Data(C)[i] - Data(C4)[i]) > 0.0001)
                {
                    std::cout << "diff C4\n";
                    break;
                }
        }
    }
}

void part2()
{
    Matrix * A = Create_Matrix(16*30, 16*40);
    Matrix * B = Create_Matrix(16*40, 16*60);
    Matrix * C = Create_Matrix(A->rows, B->cols);
    Matrix * C2 = Create_Matrix(A->rows, B->cols);
    Matrix * C4 = Create_Matrix(A->rows, B->cols);
    cl_context clGPUContext;
    size_t dataBytes;
    size_t kernelLength;
    cl_int errcode;
    cl_platform_id plat_id[20];
    cl_uint numplat;
    errcode = clGetPlatformIDs(20, plat_id, &numplat);
    CHECK(errcode);
    printf("Number of platforms = %d\n", numplat);
    for (int p = 0; p < numplat; ++p)
    {
        char buf[500];
        size_t sz;
        errcode = clGetPlatformInfo(plat_id[p], CL_PLATFORM_PROFILE, 500, buf, &sz);
        CHECK(errcode);
        printf("Platform profile: %s\n", buf);
        errcode = clGetPlatformInfo(plat_id[p], CL_PLATFORM_VERSION, 500, buf, &sz);
        CHECK(errcode);
        printf("Platform version: %s\n", buf);
        errcode = clGetPlatformInfo(plat_id[p], CL_PLATFORM_NAME, 500, buf, &sz);
        CHECK(errcode);
        printf("Platform name: %s\n", buf);
        char vendor[500];
        errcode = clGetPlatformInfo(plat_id[p], CL_PLATFORM_VENDOR, 500, vendor, &sz);
        CHECK(errcode);
        printf("Platform vendor: %s\n", vendor);
        errcode = clGetPlatformInfo(plat_id[p], CL_PLATFORM_EXTENSIONS, 500, buf, &sz);
        CHECK(errcode);
        printf("Platform extensions: %s\n", buf);
        cl_platform_id platform = plat_id[p];
        cl_device_id clDevices[10];
        cl_uint num;
        cl_int err;
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 10, clDevices, &num);
        CHECK(err);
        printf("devices = %d\n", num);
        for (int d = 0; d < num; ++d)
        {
            char tbuf[500];
            size_t sz;
            std::cout << " Device [" << d << "]" << std::endl;
            std::cout << " type = ";
            cl_device_type type;
            errcode = clGetDeviceInfo(clDevices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
            CHECK(errcode);
            if (type & CL_DEVICE_TYPE_DEFAULT) std::cout << "CL_DEVICE_TYPE_DEFAULT ";
            if (type & CL_DEVICE_TYPE_CPU) std::cout << "CL_DEVICE_TYPE_CPU ";
            if (type & CL_DEVICE_TYPE_GPU) std::cout << "CL_DEVICE_TYPE_GPU ";
            if (type & CL_DEVICE_TYPE_ACCELERATOR) std::cout << "CL_DEVICE_TYPE_ACCELERATOR ";
            std::cout << std::endl;
            errcode = clGetDeviceInfo(clDevices[d], CL_DEVICE_NAME, sizeof(tbuf), tbuf, NULL);
            CHECK(errcode);
            std::cout << " name = " << tbuf << std::endl;
//          if (strcmp(tbuf, "BeaverCreek") == 0)
//              continue;
//          if ( !(type & CL_DEVICE_TYPE_GPU ))
//              continue;
            // Choose device.
            cl_device_id device = clDevices[d];
            // create the OpenCL context on this device
            cl_context_properties props[4];
            props[0] = CL_CONTEXT_PLATFORM;
            props[1] = (cl_context_properties)platform;
            props[2] = 0;
            cl_context context = clCreateContext(props, 1, &device, NULL, NULL, &err);
            CHECK(err);
            // Serial baseline.
            for (int i = 0; i < TIMES; ++i)
            {
                for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
                for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
                for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
                withTimeMultiplySerial(C, A, B);
            }
            // Simple (no local memory) OpenCL kernel, checked against the serial result.
            for (int i = 0; i < TIMES; ++i)
            {
                for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
                for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
                for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
                for (int i = 0; i < C2->rows * C2->cols; ++i) Data(C2)[i] = rand() % 10;
                MultiplySerial(C, A, B);
                clMultiplyExplicitSimple(device, context, C2, A, B);
                for (int i = 0; i < C->rows * C->cols; ++i)
                    if (fabs(Data(C)[i] - Data(C2)[i]) > 0.0001)
                    {
                        std::cout << "diff C2\n";
                        break;
                    }
            }
            // Tiled (local-memory) OpenCL kernel.
            for (int i = 0; i < TIMES; ++i)
            {
                for (int i = 0; i < A->rows * A->cols; ++i) Data(A)[i] = rand() % 10;
                for (int i = 0; i < B->rows * B->cols; ++i) Data(B)[i] = rand() % 10;
                for (int i = 0; i < C->rows * C->cols; ++i) Data(C)[i] = rand() % 10;
                for (int i = 0; i < C4->rows * C4->cols; ++i) Data(C4)[i] = rand() % 10;
                MultiplySerial(C, A, B);
                clMultiplyTile(device, context, C4, A, B);
                for (int i = 0; i < C->rows * C->cols; ++i)
                    if (fabs(Data(C)[i] - Data(C4)[i]) > 0.0001)
                    {
                        std::cout << "diff C4\n";
                        break;
                    }
            }
        }
    }
}

int main()
{
    part1();
    part2();
    return 0;
}

================================

// kernel.cl
// Multiply two matrices A * B = C
// Device code.

typedef struct {
    int rows;
    int cols;
    float _data;
} Matrix;

__global float * Data(__global Matrix * matrix)
{
    return (__global float*)&matrix->_data;
}

__kernel void kernelSimple(__global Matrix* C, __global Matrix* A, __global Matrix* B)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int hA = A->rows;
    int wA = A->cols;
    int wB = B->cols;
    int wC = C->cols;
    float sum = 0.0;
    for (int k = 0; k < wA; ++k)
    {
        float a = Data(A)[i * wA + k];
        float b = Data(B)[k * wB + j];
        sum += a * b;
    }
    Data(C)[i * wC + j] = sum;
    //Data(C) = i;
}

__kernel void kernelTile(__global Matrix * C, __global Matrix * A, __global Matrix * B)
{
    int TS = get_local_size(0);
    int wA = A->cols;
    int hA = A->rows;
    int wB = B->cols;
    int hB = B->rows;
    int wC = C->cols;
    int hC = C->rows;
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
    int gr = get_global_id(0);
    int gc = get_global_id(1);
    int lr = get_local_id(0);
    int lc = get_local_id(1);
    float sum = 0.0;
    for (int i = 0; i < hB; i += TS)
    {
#define MAX_BLOCK_SIZE 30
        __local float As[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];
        __local float Bs[MAX_BLOCK_SIZE][MAX_BLOCK_SIZE];
        AS(lr, lc) = Data(A)[gr * wA + lc + i];
        BS(lr, lc) = Data(B)[(lr + i) * wB + gc];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < TS; k++)
            sum += AS(lr, k) * BS(k, lc);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    Data(C)[gr * wC + gc] = sum;
}


You are also counting data transfers as part of the computation. There is also an error in your kernel code, or at least it is sloppy to cast a single float inside a struct to a float pointer and treat it as an array.


Yes, the runtime for the test includes data transfers. In fact, it has to, otherwise it is not a fair comparison. I used array_view<>'s for GPU data in C++ AMP; they perform the data transfer in addition to allocating GPU memory. (I could have used array<>'s, but I didn't because that has its own issues, like not running in debug mode.) The array_view<> template class performs two copies in each direction: first from the buffer in the program to a "staging area", then from the "staging area" to the GPU via DMA. It is possible that Microsoft allocates a pinned area on the host for the DMA transfer, but I just don't know. I don't know how AMD implements clEnqueueWriteBuffer, but it probably does not do the same, since CL_MEM_ALLOC_HOST_PTR is provided for that. To test whether the data transfer could be improved, I implemented a pinned-host zero-copy version (see the updated ZIP file on my website). Unfortunately, while that change improves the runtime, it is still far slower (~360 ms) than the C++ AMP implementation (~58 ms) for the "simple" no-shared-memory kernel.
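Roughly, the zero-copy variant does something like the following (a simplified sketch of the idea rather than the exact code in the ZIP; error checking omitted, and the names are reused from clMultiplyExplicitSimple):

    // Sketch: let the runtime allocate host-visible memory, then map/fill/unmap
    // instead of doing a separate clEnqueueWriteBuffer copy.
    size_t bytes = 2 * sizeof(cl_int) + A->rows * A->cols * sizeof(cl_float);
    cl_mem d_A = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &errcode);
    void * p = clEnqueueMapBuffer(clCommandQue, d_A, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, &errcode);
    memcpy(p, A, bytes);                                        // fill the mapped region directly from the host matrix
    clEnqueueUnmapMemObject(clCommandQue, d_A, p, 0, NULL, NULL); // release the mapping before the kernel uses d_A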

There is no error in the kernel code; otherwise it would not compile and the program would halt. But I'll agree it is "sloppy", for different reasons. The cast is a very old trick for dealing with variable-length aggregate structures: the struct is really a buffer of two integers (the number of rows and the number of columns) followed by the actual element data. I could have implemented the third field as a pointer to a separately allocated block containing the matrix elements, but I chose to just stick the data on the end and use type casting. There is a check after calling C++ AMP or OpenCL that verifies the result against a serial CPU implementation, so the code works.

That said, there are two real issues. First, there are two (different) definitions of struct Matrix. Second, the OpenCL kernel reads and uses matrix C even though its buffer is allocated with CL_MEM_WRITE_ONLY. I fixed both problems, but CL_MEM_WRITE_ONLY appears to go unchecked in OpenCL kernels.


Have you really measured the read-back with the AMP code? Does it change if you add int a = c[0]; after the loop?

I forget what the synchronization behaviour for array_view was with the developer preview (Microsoft has been optimising it for their final release).

The code isn't tremendously efficient, but I can't see anything obvious that would cause behaviour differences. It's more likely that runtime overhead is the problem. You're including almost all of the setup code in your timings, by the look of it, and the AMD OpenCL runtime may be less efficient at dealing with memory allocations (because of the way it does allocation lazily, say) than the AMP or NVIDIA runtimes. OpenCL certainly isn't optimized for doing malloc, launch, free in a tight loop like that. If you pre-run the kernel before starting counter2 and time the second run, do NVIDIA and AMD then become closer in behaviour? I realise that AMP might then differ because of the way you're constructing the array_views.
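Concretely, something like this is what I mean by pre-running (a sketch reusing the names from your clMultiplyExplicitSimple; it deliberately times only the second launch, not the copies or the read-back):

    // Warm-up launch: the first run pays for any lazy allocation / deferred work in the runtime.
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    clFinish(clCommandQue);

    // Time a second, identical launch on its own.
    counter2.Start();
    errcode = clEnqueueNDRangeKernel(clCommandQue, clKernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, &my_event);
    clWaitForEvents(1, &my_event);
    std::cout << counter2.Stop() << " ms (second launch only).\n";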
