15 Replies Latest reply on Sep 1, 2009 5:15 AM by genaganna

    Messing Around With MatrixMultiplication sample, change float to double

    riza.guntur

      I want to test it, can I just replace all occurences of cl_float to cl_double along with those in kernel?

      I've compiled and run it fine, commenting the compare for verification

        • Messing Around With MatrixMultiplication sample, change float to double
          genaganna

          cl_double is supported as cl_khr_fp64 extension as per the OpenCL 1.0 spec.

           

           

           

           

            • Messing Around With MatrixMultiplication sample, change float to double
              riza.guntur

               

              Originally posted by: genaganna cl_double is supported as cl_khr_fp64 extension as per the OpenCL 1.0 spec.

              Which means? Don't use hard words >.<

              I tried cl_khr_fp64 no luck, it won't compile.

                • Messing Around With MatrixMultiplication sample, change float to double
                  genaganna

                  are you able to compile without cl_khr_fp64?

                    • Messing Around With MatrixMultiplication sample, change float to double
                      riza.guntur

                      Why not?

                      I see it in MatrixMultiplication.hpp 

                        • Messing Around With MatrixMultiplication sample, change float to double
                          genaganna

                          OpenCL compiler should complian that double  is supported under extension.

                           

                          could you please post your code including .cl file?

                            • Messing Around With MatrixMultiplication sample, change float to double
                              riza.guntur

                              Here

                              I really only change float to double, cl_float to cl_double

                              MatrixMultiplication.cpp /* ============================================================ Copyright (c) 2009 Advanced Micro Devices, Inc. All rights reserved. Redistribution and use of this material is permitted under the following conditions: Redistributions must retain the above copyright notice and all terms of this license. In no event shall anyone redistributing or accessing or using this material commence or participate in any arbitration or legal action relating to this material against Advanced Micro Devices, Inc. or any copyright holders or contributors. The foregoing shall survive any expiration or termination of this license or any agreement or access or use related to this material. ANY BREACH OF ANY TERM OF THIS LICENSE SHALL RESULT IN THE IMMEDIATE REVOCATION OF ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL. THIS MATERIAL IS PROVIDED BY ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" IN ITS CURRENT CONDITION AND WITHOUT ANY REPRESENTATIONS, GUARANTEE, OR WARRANTY OF ANY KIND OR IN ANY WAY RELATED TO SUPPORT, INDEMNITY, ERROR FREE OR UNINTERRUPTED OPERA TION, OR THAT IT IS FREE FROM DEFECTS OR VIRUSES. ALL OBLIGATIONS ARE HEREBY DISCLAIMED - WHETHER EXPRESS, IMPLIED, OR STATUTORY - INCLUDING, BUT NOT LIMITED TO, ANY IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ACCURACY, COMPLETENESS, OPERABILITY, QUALITY OF SERVICE, OR NON-INFRINGEMENT. IN NO EVENT SHALL ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, PUNITIVE, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, REVENUE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED OR BASED ON ANY THEORY OF LIABILITY ARISING IN ANY WAY RELATED TO THIS MATERIAL, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THE ENTIRE AND AGGREGATE LIABILITY OF ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS SHALL NOT EXCEED TEN DOLLARS (US $10.00). ANYONE REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL ACCEPTS THIS ALLOCATION OF RISK AND AGREES TO RELEASE ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS FROM ANY AND ALL LIABILITIES, OBLIGATIONS, CLAIMS, OR DEMANDS IN EXCESS OF TEN DOLLARS (US $10.00). THE FOREGOING ARE ESSENTIAL TERMS OF THIS LICENSE AND, IF ANY OF THESE TERMS ARE CONSTRUED AS UNENFORCEABLE, FAIL IN ESSENTIAL PURPOSE, OR BECOME VOID OR DETRIMENTAL TO ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR CONTRIBUTORS FOR ANY REASON, THEN ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL SHALL TERMINATE IMMEDIATELY. MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF THIS LICENSE OR ANY AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL. NOTICE IS HEREBY PROVIDED, AND BY REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL SUCH NOTICE IS ACKNOWLEDGED, THAT THIS MATERIAL MAY BE SUBJECT TO RESTRICTIONS UNDER THE LAWS AND REGULATIONS OF THE UNITED STATES OR OTHER COUNTRIES, WHICH INCLUDE BUT ARE NOT LIMITED TO, U.S. EXPORT CONTROL LAWS SUCH AS THE EXPORT ADMINISTRATION REGULATIONS AND NATIONAL SECURITY CONTROLS AS DEFINED THEREUNDER, AS WELL AS STATE DEPARTMENT CONTROLS UNDER THE U.S. MUNITIONS LIST. THIS MATERIAL MAY NOT BE USED, RELEASED, TRANSFERRED, IMPORTED, EXPORTED AND/OR RE-EXPORTED IN ANY MANNER PROHIBITED UNDER ANY APPLICABLE LAWS, INCLUDING U.S. EXPORT CONTROL LAWS REGARDING SPECIFICALLY DESIGNATED PERSONS, COUNTRIES AND NATIONALS OF COUNTRIES SUBJECT TO NATIONAL SECURITY CONTROLS. MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF ANY LICENSE OR AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL. NOTICE REGARDING THE U.S. GOVERNMENT AND DOD AGENCIES: This material is provided with "RESTRICTED RIGHTS" and/or "LIMITED RIGHTS" as applicable to computer software and technical data, respectively. Use, duplication, distribution or disclosure by the U.S. Government and/or DOD agencies is subject to the full extent of restrictions in all applicable regulations, including those found at FAR52.227 and DFARS252.227 et seq. and any successor regulations thereof. Use of this material by the U.S. Government and/or DOD agencies is acknowledgment of the proprietary rights of any copyright holders and contributors, including those of Advanced Micro Devices, Inc., as well as the provisions of FAR52.227-14 through 23 regarding privately developed and/or commercial computer software. This license forms the entire agreement regarding the subject matter hereof and supersedes all proposals and prior discussions and writings between the parties with respect thereto. This license does not affect any ownership, rights, title, or interest in, or relating to, this material. No terms of this license can be modified or waived, and no breach of this license can be excused, unless done so in a writing signed by all affected parties. Each term of this license is separately enforceable. If any term of this license is determined to be or becomes unenforceable or illegal, such term shall be reformed to the minimum extent necessary in order for this license to remain in effect in accordance with its terms as modified by such reformation. This license shall be governed by and construed in accordance with the laws of the State of Texas without regard to rules on conflicts of law of any state or jurisdiction or the United Nations Convention on the International Sale of Goods. All disputes arising out of this license shall be subject to the jurisdiction of the federal and state courts in Austin, Texas, and all defenses are hereby waived concerning personal jurisdiction and venue of these courts. ============================================================ */ #include "MatrixMultiplication.hpp" int MatrixMultiplication::setupMatrixMultiplication() { /* allocate and init memory used by host input0[width][height]*/ cl_uint inputSizeBytes0 = width * height * sizeof(cl_double); input0 = (cl_double *) malloc(inputSizeBytes0); if(input0==NULL) { sampleCommon->error("Failed to allocate host memory. (input0)"); return 0; } /* allocate and init memory used by host input1[depth][width]*/ cl_uint inputSizeBytes1 = depth * width * sizeof(cl_double); input1 = (cl_double *) malloc(inputSizeBytes1); if(input1==NULL) { sampleCommon->error("Failed to allocate host memory. (input1)"); return 0; } /* random initialisation of input */ sampleCommon->fillRandom<cl_double>(input0, width, height, 0, 10); sampleCommon->fillRandom<cl_double>(input1, depth, width, 0, 10); /* allocate memory for output[depth][height] */ cl_uint outputSizeBytes = height * depth * sizeof(cl_double); output = (cl_double *) malloc(outputSizeBytes); if(output==NULL) { sampleCommon->error("Failed to allocate host memory. (output)"); return 0; } /* allocate memory for output[depth][height] of reference implemenation*/ if(verify) { verificationOutput = (cl_double *) malloc(outputSizeBytes); if(verificationOutput==NULL) { sampleCommon->error("Failed to allocate host memory. (verificationOutput)"); return 0; } memset(verificationOutput, 0, outputSizeBytes); } /* * Unless quiet mode has been enabled, print the INPUT arrays */ if(!quiet) { sampleCommon->printArray<cl_double>( "Input0", input0, width, 1); sampleCommon->printArray<cl_double>( "Input1", input1, depth, 1); } return 1; } int MatrixMultiplication::setupCL(void) { cl_int status = 0; size_t deviceListSize; context = clCreateContextFromType( 0, CL_DEVICE_TYPE_CPU, NULL, NULL, &status); if(!sampleCommon->checkVal(status, CL_SUCCESS, "clCreateContextFromType failed.")) return 0; /* First, get the size of device list data */ status = clGetContextInfo( context, CL_CONTEXT_DEVICES, 0, NULL, &deviceListSize); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clGetContextInfo failed.")) return 0; /* Now allocate memory for device list based on the size we got earlier */ devices = (cl_device_id *)malloc(deviceListSize); if(devices==NULL) { sampleCommon->error("Failed to allocate memory (devices)."); return 0; } /* Now, get the device list data */ status = clGetContextInfo( context, CL_CONTEXT_DEVICES, deviceListSize, devices, NULL); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clGetGetContextInfo failed.")) return 0; { /* The block is to move the declaration of prop closer to its use */ cl_command_queue_properties prop = 0; if(timing) prop |= CL_QUEUE_PROFILING_ENABLE; commandQueue = clCreateCommandQueue( context, devices[0], prop, &status); if(!sampleCommon->checkVal( status, 0, "clCreateCommandQueue failed.")) return 0; } inputBuffer0 = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_double) * width * height, input0, &status); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clCreateBuffer failed. (inputBuffer0)")) return 0; inputBuffer1 = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_double) * depth * width, input1, &status); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clCreateBuffer failed. (inputBuffer1)")) return 0; outputBuffer = clCreateBuffer( context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(cl_double) * height * depth, output, &status); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clCreateBuffer failed. (outputBuffer)")) return 0; /* create a CL program using the kernel source */ streamsdk::SDKFile kernelFile; kernelFile.open("MatrixMultiplication_Kernels.cl"); const char * source = kernelFile.source().c_str(); size_t sourceSize[] = { strlen(source) }; program = clCreateProgramWithSource( context, 1, &source, sourceSize, &status); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clCreateProgramWithSource failed.")) return 0; /* create a cl program executable for all the devices specified */ status = clBuildProgram(program, 1, devices, NULL, NULL, NULL); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clBuildProgram failed.")) return 0; /* get a kernel object handle for a kernel with the given name */ kernel = clCreateKernel(program, "matrixMultiplication", &status); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clCreateKernel failed.")) return 0; return 1; } int MatrixMultiplication::runCLKernels(void) { cl_int status; cl_event events[2]; /* * Kernel runs over complete output matrix with blocks of blockSize x blockSize * running concurrently */ size_t globalThreads[2]= {depth, height}; size_t localThreads[2] = {blockSize, blockSize}; long long kernelsStartTime; long long kernelsEndTime; /*** Set appropriate arguments to the kernel ***/ /* output array as the 1st argument : stores product of input0 and input1 */ status = clSetKernelArg( kernel, 0, sizeof(cl_mem), (void *)&outputBuffer); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (outputBuffer)")) return 0; /* the input matrix as 2nd argument - input0 */ status = clSetKernelArg( kernel, 1, sizeof(cl_mem), (void *)&inputBuffer0); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (inputBuffer0)")) return 0; /* the input matrix as 3rd argument - input1 */ status = clSetKernelArg( kernel, 2, sizeof(cl_mem), (void *)&inputBuffer1); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (inputBuffer1)")) return 0; /* local memory as 4th argument of size blockSize x blockSize - local0*/ status = clSetKernelArg( kernel, 3, blockSize*blockSize*sizeof(cl_double), NULL); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (local0)")) return 0; /* local memory as 5th argument of size blockSize x blockSize - local1*/ status = clSetKernelArg( kernel, 4, blockSize*blockSize*sizeof(cl_double), NULL); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (local1)")) return 0; /* width of the input0 matrix as 6th argument - width */ status = clSetKernelArg( kernel, 5, sizeof(cl_int), (void*)&width); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (width)")) return 0; /* width of the input1 matrix as 7th argument - depth*/ status = clSetKernelArg( kernel, 6, sizeof(cl_int), (void*)&depth); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (depth)")) return 0; /* blockSize as 8th argument - blockSize */ status = clSetKernelArg( kernel, 7, sizeof(cl_int), (void*)&blockSize); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clSetKernelArg failed. (blockSize)")) return 0; /*Enqueue a kernel run call */ status = clEnqueueNDRangeKernel( commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clEnqueueNDRangeKernel failed.")) return 0; /* wait for the kernel call to finish execution */ status = clWaitForEvents(1, &events[0]); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clWaitForEvents failed.")) return 0; if(timing) { status = clGetEventProfilingInfo( events[0], CL_PROFILING_COMMAND_START, sizeof(long long), &kernelsStartTime, NULL); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clGetEventProfilingInfo failed.")) return 0; } if(timing) { status = clGetEventProfilingInfo( events[0], CL_PROFILING_COMMAND_END, sizeof(long long), &kernelsEndTime, NULL); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clGetEventProfilingInfo failed.")) return 0; /* Compute total time (also convert from nanoseconds to seconds) */ totalTime = (double)(kernelsEndTime - kernelsStartTime)/1e9; } clReleaseEvent(events[0]); /* Enqueue readBuffer*/ status = clEnqueueReadBuffer( commandQueue, outputBuffer, CL_TRUE, 0, height * depth * sizeof(cl_double), output, 0, NULL, &events[1]); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clEnqueueReadBuffer failed.")) return 0; /* Wait for the read buffer to finish execution */ status = clWaitForEvents(1, &events[1]); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clWaitForEvents failed.")) return 0; clReleaseEvent(events[1]); return 1; } /* * This is a naive O(N^3) CPU implementatio of matrix multiplication */ void MatrixMultiplication::matrixMultiplicationCPUReference( cl_double * output, cl_double * input0, cl_double * input1, const cl_uint height, const cl_uint width, const cl_uint depth) { for(cl_uint i=0; i < height; i++) { for(cl_uint j=0; j < depth; j++) { for(cl_uint k=0; k < width; k++) { output[i*depth + j] += (input0[i*width + k]*input1[k*depth + j]); } } } } int MatrixMultiplication::initialize() { /* Call base class Initialize to get default configuration */ if(!this->SDKSample::initialize()) return 0; /* add an option for getting blockSize from commandline */ streamsdk::Option* blockSizeParam = new streamsdk::Option; if(!blockSizeParam) { sampleCommon->error("Memory Allocation error.\n"); return 0; } blockSizeParam->_sVersion = "b"; blockSizeParam->_lVersion = "blockSize"; blockSizeParam->_description = "Use local memory of dimensions blockSize x blockSize"; blockSizeParam->_type = streamsdk::CA_ARG_INT; blockSizeParam->_value = &blockSize; sampleArgs->AddOption(blockSizeParam); /* default depth */ depth = 64; return 1; } int MatrixMultiplication::setup() { /* Make sure the dimensions are multiples of blockSize */ if(width%blockSize !=0) { width = (width/blockSize + 1)*blockSize; } if(height%blockSize !=0) { height = (height/blockSize + 1)*blockSize; } if(depth%blockSize !=0) { depth = (depth/blockSize + 1)*blockSize; } if(!setupMatrixMultiplication()) return 0; if(!setupCL()) return 0; return 1; } int MatrixMultiplication::run() { /* Arguments are set and execution call is enqueued on command buffer */ if(!runCLKernels()) return 0; if(!quiet) { sampleCommon->printArray<cl_double>("Output", output, depth, 1); } return 1; } int MatrixMultiplication::verifyResults() { if(verify) { /* reference implementation */ int refTimer = sampleCommon->createTimer(); sampleCommon->resetTimer(refTimer); sampleCommon->startTimer(refTimer); matrixMultiplicationCPUReference(verificationOutput, input0, input1, height, width, depth); sampleCommon->stopTimer(refTimer); referenceKernelTime = sampleCommon->readTimer(refTimer); ///* compare the results and see if they match */ //if(sampleCommon->compare(output, verificationOutput, height*depth)) //{ // std::cout<<"Passed!\n"; // return 1; //} //else //{ // std::cout<<"Failed\n"; // return 0; //} } return -1; } void MatrixMultiplication::printStats() { this->SDKSample::printStats(); } int MatrixMultiplication::cleanup() { /* Releases OpenCL resources (Context, Memory etc.) */ cl_int status; status = clReleaseKernel(kernel); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseKernel failed.")) return 0; status = clReleaseProgram(program); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseProgram failed.")) return 0; status = clReleaseMemObject(inputBuffer0); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseMemObject failed.")) return 0; status = clReleaseMemObject(inputBuffer1); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseMemObject failed.")) return 0; status = clReleaseMemObject(outputBuffer); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseMemObject failed.")) return 0; status = clReleaseCommandQueue(commandQueue); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseCommandQueue failed.")) return 0; status = clReleaseContext(context); if(!sampleCommon->checkVal( status, CL_SUCCESS, "clReleaseContext failed.")) return 0; /* release program resources (input memory etc.) */ if(input0) free(input0); if(input1) free(input1); if(output) free(output); if(verificationOutput) free(verificationOutput); /* release device list */ if(devices) free(devices); return 1; } int main(int argc, char * argv[]) { MatrixMultiplication clMatrixMultiplication("OpenCL Matrix Multiplication"); clMatrixMultiplication.initialize(); if(!clMatrixMultiplication.parseCommandLine(argc, argv)) return 0; clMatrixMultiplication.setup(); clMatrixMultiplication.run(); clMatrixMultiplication.verifyResults(); clMatrixMultiplication.cleanup(); clMatrixMultiplication.printStats(); return 0; } MatrixMultiplication.hpp /* ============================================================ Copyright (c) 2009 Advanced Micro Devices, Inc. All rights reserved. Redistribution and use of this material is permitted under the following conditions: Redistributions must retain the above copyright notice and all terms of this license. In no event shall anyone redistributing or accessing or using this material commence or participate in any arbitration or legal action relating to this material against Advanced Micro Devices, Inc. or any copyright holders or contributors. The foregoing shall survive any expiration or termination of this license or any agreement or access or use related to this material. ANY BREACH OF ANY TERM OF THIS LICENSE SHALL RESULT IN THE IMMEDIATE REVOCATION OF ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL. THIS MATERIAL IS PROVIDED BY ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" IN ITS CURRENT CONDITION AND WITHOUT ANY REPRESENTATIONS, GUARANTEE, OR WARRANTY OF ANY KIND OR IN ANY WAY RELATED TO SUPPORT, INDEMNITY, ERROR FREE OR UNINTERRUPTED OPERA TION, OR THAT IT IS FREE FROM DEFECTS OR VIRUSES. ALL OBLIGATIONS ARE HEREBY DISCLAIMED - WHETHER EXPRESS, IMPLIED, OR STATUTORY - INCLUDING, BUT NOT LIMITED TO, ANY IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ACCURACY, COMPLETENESS, OPERABILITY, QUALITY OF SERVICE, OR NON-INFRINGEMENT. IN NO EVENT SHALL ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, PUNITIVE, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, REVENUE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED OR BASED ON ANY THEORY OF LIABILITY ARISING IN ANY WAY RELATED TO THIS MATERIAL, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THE ENTIRE AND AGGREGATE LIABILITY OF ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS SHALL NOT EXCEED TEN DOLLARS (US $10.00). ANYONE REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL ACCEPTS THIS ALLOCATION OF RISK AND AGREES TO RELEASE ADVANCED MICRO DEVICES, INC. AND ANY COPYRIGHT HOLDERS AND CONTRIBUTORS FROM ANY AND ALL LIABILITIES, OBLIGATIONS, CLAIMS, OR DEMANDS IN EXCESS OF TEN DOLLARS (US $10.00). THE FOREGOING ARE ESSENTIAL TERMS OF THIS LICENSE AND, IF ANY OF THESE TERMS ARE CONSTRUED AS UNENFORCEABLE, FAIL IN ESSENTIAL PURPOSE, OR BECOME VOID OR DETRIMENTAL TO ADVANCED MICRO DEVICES, INC. OR ANY COPYRIGHT HOLDERS OR CONTRIBUTORS FOR ANY REASON, THEN ALL RIGHTS TO REDISTRIBUTE, ACCESS OR USE THIS MATERIAL SHALL TERMINATE IMMEDIATELY. MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF THIS LICENSE OR ANY AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL. NOTICE IS HEREBY PROVIDED, AND BY REDISTRIBUTING OR ACCESSING OR USING THIS MATERIAL SUCH NOTICE IS ACKNOWLEDGED, THAT THIS MATERIAL MAY BE SUBJECT TO RESTRICTIONS UNDER THE LAWS AND REGULATIONS OF THE UNITED STATES OR OTHER COUNTRIES, WHICH INCLUDE BUT ARE NOT LIMITED TO, U.S. EXPORT CONTROL LAWS SUCH AS THE EXPORT ADMINISTRATION REGULATIONS AND NATIONAL SECURITY CONTROLS AS DEFINED THEREUNDER, AS WELL AS STATE DEPARTMENT CONTROLS UNDER THE U.S. MUNITIONS LIST. THIS MATERIAL MAY NOT BE USED, RELEASED, TRANSFERRED, IMPORTED, EXPORTED AND/OR RE-EXPORTED IN ANY MANNER PROHIBITED UNDER ANY APPLICABLE LAWS, INCLUDING U.S. EXPORT CONTROL LAWS REGARDING SPECIFICALLY DESIGNATED PERSONS, COUNTRIES AND NATIONALS OF COUNTRIES SUBJECT TO NATIONAL SECURITY CONTROLS. MOREOVER, THE FOREGOING SHALL SURVIVE ANY EXPIRATION OR TERMINATION OF ANY LICENSE OR AGREEMENT OR ACCESS OR USE RELATED TO THIS MATERIAL. NOTICE REGARDING THE U.S. GOVERNMENT AND DOD AGENCIES: This material is provided with "RESTRICTED RIGHTS" and/or "LIMITED RIGHTS" as applicable to computer software and technical data, respectively. Use, duplication, distribution or disclosure by the U.S. Government and/or DOD agencies is subject to the full extent of restrictions in all applicable regulations, including those found at FAR52.227 and DFARS252.227 et seq. and any successor regulations thereof. Use of this material by the U.S. Government and/or DOD agencies is acknowledgment of the proprietary rights of any copyright holders and contributors, including those of Advanced Micro Devices, Inc., as well as the provisions of FAR52.227-14 through 23 regarding privately developed and/or commercial computer software. This license forms the entire agreement regarding the subject matter hereof and supersedes all proposals and prior discussions and writings between the parties with respect thereto. This license does not affect any ownership, rights, title, or interest in, or relating to, this material. No terms of this license can be modified or waived, and no breach of this license can be excused, unless done so in a writing signed by all affected parties. Each term of this license is separately enforceable. If any term of this license is determined to be or becomes unenforceable or illegal, such term shall be reformed to the minimum extent necessary in order for this license to remain in effect in accordance with its terms as modified by such reformation. This license shall be governed by and construed in accordance with the laws of the State of Texas without regard to rules on conflicts of law of any state or jurisdiction or the United Nations Convention on the International Sale of Goods. All disputes arising out of this license shall be subject to the jurisdiction of the federal and state courts in Austin, Texas, and all defenses are hereby waived concerning personal jurisdiction and venue of these courts. ============================================================ */ #ifndef MATRIXMULTIPLICATION_H_ #define MATRIXMULTIPLICATION_H_ #include <CL/cl.h> #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <string.h> #include <SDKUtil/SDKCommon.hpp> #include <SDKUtil/SDKApplication.hpp> #include <SDKUtil/SDKCommandArgs.hpp> #include <SDKUtil/SDKFile.hpp> /** * MatrixMultiplication * Class implements OpenCL Matrix Multiplication sample * Derived from SDKSample base class */ class MatrixMultiplication : public SDKSample { cl_uint seed; /**< Seed value for random number generation */ cl_double totalKernelTime; /**< Time for kernel execution */ cl_double totalProgramTime; /**< Time for program execution */ cl_double referenceKernelTime; /**< Time for reference implementation */ cl_double *input0; /**< Input array */ cl_double *input1; /**< Input array */ cl_double *output; /**< Output Array */ cl_double *verificationOutput; /**< Output array for reference implementation */ cl_uint blockSize; /**< Size of the block used for shared memory */ cl_context context; /**< CL context */ cl_device_id *devices; /**< CL device list */ cl_mem inputBuffer0; /**< CL memory buffer for matrix input0*/ cl_mem inputBuffer1; /**< CL memory buffer for matrix input1*/ cl_mem outputBuffer; /**< CL memory buffer for storing the output*/ cl_command_queue commandQueue; /**< CL command queue */ cl_program program; /**< CL program */ cl_kernel kernel; /**< CL kernel */ public: /** * Constructor * Initialize member variables * @param name name of sample (string) */ MatrixMultiplication(std::string name) : SDKSample(name){ seed = 123; input0 = NULL; input1 = NULL; output = NULL; verificationOutput = NULL; blockSize = 16; } /** * Constructor * Initialize member variables * @param name name of sample (const char*) */ MatrixMultiplication(const char* name) : SDKSample(name){ seed = 123; input0 = NULL; input1 = NULL; output = NULL; verificationOutput = NULL; blockSize = 16; } /** * Allocate and initialize host memory array with random values * @return 1 on success and 0 on failure */ int setupMatrixMultiplication(); /** * OpenCL related initialisations. * Set up Context, Device list, Command Queue, Memory buffers * Build CL kernel program executable * @return 1 on success and 0 on failure */ int setupCL(); /** * Set values for kernels' arguments, enqueue calls to the kernels * on to the command queue, wait till end of kernel execution. * Get kernel start and end time if timing is enabled * @return 1 on success and 0 on failure */ int runCLKernels(); /** * Reference CPU implementation of Matrix Multiplication * @param output stores the output of the multipliced matrices depthxheight * @param input0 input matrix of size width x height * @param input1 input matrix of size depth x width * @param height height of the output matrix * @param width length of the common dimension of the matrices input0 and input1 * @param depth width of the output matrix */ void matrixMultiplicationCPUReference( cl_double * output, cl_double * input0, cl_double * input1, const cl_uint height, const cl_uint width, const cl_uint depth); /** * Override from SDKSample. Print sample stats. */ void printStats(); /** * Override from SDKSample. Initialize * command line parser, add custom options */ int initialize(); /** * Override from SDKSample, adjust width and height * of execution domain, perform all sample setup */ int setup(); /** * Override from SDKSample * Run OpenCL Matrix Multiplication */ int run(); /** * Override from SDKSample * Cleanup memory allocations */ int cleanup(); /** * Override from SDKSample * Verify against reference implementation */ int verifyResults(); }; #endif MatrixMultiplication_Kernels.cl /* * Calculates the naive matrix multiplication using the local memory * loads the blocks along the common dimension to local memories * performs naive matrix multiplication on the blocks that are loaded * */ __kernel void matrixMultiplication(__global double * output, __global double * input0, __global double * input1, __local double * local0, __local double * local1, const uint width0, const uint width1, const uint blockSize) { /* get the block ids in both the directions */ int bx = get_group_id(0); int by = get_group_id(1); /* get the local ids within the block */ int tx = get_local_id(0); int ty = get_local_id(1); /* * width0 is the common dimension between both the matrices * calculating number of blocks along the common dimension */ int n = width0/blockSize; /* initialize accumulator */ double acc = 0; /* For each block along the common dimension for both the matrices */ for(int b= 0; b< n ; ++b) { /* Copy elements of block with blockIds(b,by) of input0 to local memory local0 */ /* each thread writes only one element and waits for others to sync */ /* calculate global ids of the element */ int globalIdx0 = b *blockSize + tx; int globalIdy0 = by*blockSize + ty; int globalId0 = globalIdy0*width0 + globalIdx0; /* copy input0 to the local0 */ local0[ty*blockSize + tx] = input0[globalId0]; /* Copy elements of block with blockIds(bx,b) of input1 to local memory local1 */ /* each thread writes only one element and waits for others to sync */ /* calculate global ids of the element */ int globalIdx1 = bx * blockSize + tx; int globalIdy1 = b * blockSize + ty; int globalId1 = globalIdy1*width1 + globalIdx1; /* copy input1 to the local1 */ local1[ty*blockSize + tx] = input1[globalId1]; /* sync threads of the work group for the blocks to get copied to local memory */ barrier(CLK_LOCAL_MEM_FENCE); /* perform naive matrix multiplication using local memory blocks */ for(int k=0; k < blockSize; ++k) { acc += local0[ty*blockSize + k]*local1[k*blockSize + tx]; } /* sync before the next set of blocks are copied into local memory */ barrier(CLK_LOCAL_MEM_FENCE); } /* find the global location in output */ int x = get_global_id(0); int y = get_global_id(1); int outIndex = y*width1 + x; /* write the accumulator to the output */ output[outIndex] = acc; }

                    • Messing Around With MatrixMultiplication sample, change float to double
                      MicahVillmow
                      riza,
                      in order to support cl_double, the runtime must expose it in the extension string when queried for CL_DEVICE_EXTENSIONS via the clGetDeviceInfo system call. If the string cl_khr_fp64 does not exist, then double is not supported on the runtime.
                      • Messing Around With MatrixMultiplication sample, change float to double
                        MicahVillmow
                        Riza,
                        This is specified in section 6.9 of the OpenCL spec. The usual way to specify is to put #prgam OPENCL EXTENSION cl_khr_fp64 : enable, but that will only work if the runtime reports that it supports that extension.
                            • Messing Around With MatrixMultiplication sample, change float to double
                              riza.guntur

                              Can I put pragma right before kernel call?

                              int MatrixMultiplication::run() { /* Arguments are set and execution call is enqueued on command buffer */ #pragma OPENCL EXTENSION cl_khr_fp64 : enable if(!runCLKernels()) return 0; if(!quiet) { sampleCommon->printArray<cl_double>("Output", output, depth, 1); } return 1; }

                                • Messing Around With MatrixMultiplication sample, change float to double
                                  genaganna

                                  You have to add #pragma in .cl file

                                   #pragma OPENCL EXTENSION cl_khr_fp64 : enable

                                   

                                   

                                    • Messing Around With MatrixMultiplication sample, change float to double
                                      riza.guntur

                                      Like this right?

                                      /* * Calculates the naive matrix multiplication using the local memory * loads the blocks along the common dimension to local memories * performs naive matrix multiplication on the blocks that are loaded * */ #pragma OPENCL EXTENSION cl_khr_fp64 : enable __kernel void matrixMultiplication(__global float * output, __global float * input0, __global float * input1, __local float * local0, __local float * local1, const uint width0, const uint width1, const uint blockSize) { /* get the block ids in both the directions */ int bx = get_group_id(0); int by = get_group_id(1); /* get the local ids within the block */ int tx = get_local_id(0); int ty = get_local_id(1); /* * width0 is the common dimension between both the matrices * calculating number of blocks along the common dimension */ int n = width0/blockSize; /* initialize accumulator */ float acc = 0; /* For each block along the common dimension for both the matrices */ for(int b= 0; b< n ; ++b) { /* Copy elements of block with blockIds(b,by) of input0 to local memory local0 */ /* each thread writes only one element and waits for others to sync */ /* calculate global ids of the element */ int globalIdx0 = b *blockSize + tx; int globalIdy0 = by*blockSize + ty; int globalId0 = globalIdy0*width0 + globalIdx0; /* copy input0 to the local0 */ local0[ty*blockSize + tx] = input0[globalId0]; /* Copy elements of block with blockIds(bx,b) of input1 to local memory local1 */ /* each thread writes only one element and waits for others to sync */ /* calculate global ids of the element */ int globalIdx1 = bx * blockSize + tx; int globalIdy1 = b * blockSize + ty; int globalId1 = globalIdy1*width1 + globalIdx1; /* copy input1 to the local1 */ local1[ty*blockSize + tx] = input1[globalId1]; /* sync threads of the work group for the blocks to get copied to local memory */ barrier(CLK_LOCAL_MEM_FENCE); /* perform naive matrix multiplication using local memory blocks */ for(int k=0; k < blockSize; ++k) { acc += local0[ty*blockSize + k]*local1[k*blockSize + tx]; } /* sync before the next set of blocks are copied into local memory */ barrier(CLK_LOCAL_MEM_FENCE); } /* find the global location in output */ int x = get_global_id(0); int y = get_global_id(1); int outIndex = y*width1 + x; /* write the accumulator to the output */ output[outIndex] = acc; }