2 Replies Latest reply on Jul 18, 2010 6:17 AM by Meteorhead

    MPI and OpenCL: Invalid devices

    Meteorhead

      Hi!

      My problem is that, for some unknown reason, OpenCL commandQueues cannot be created from slave processes when running under MPI.

      I am making a simple OpenCL-MPI example which implements core functionalities of MPI mixed with OpenCL. However, after MPI initialization, the OpenCL platform query, and the program build, I try to create a commandQueue on a different device for each process, and all processes other than the master fail to do so.

      After branching there is an MPI_BARRIER where all processes wait for each other, and after that the master process broadcasts a global variable to all processes (MPI_BCAST). After that come buffer creation, kernel argument setting, and commandQueue creation, which is where the crash occurs.

      After the barrier the master process issues std::cout << "Initial MPI_BARRIER passed!", so there should be no problem up until that point. CommandQueue creation is indexed the same way as context creation (my_rank % numDevices), and I have checked: the indices run from 0 to 5 with 6 valid devices (three 5970s), so this shouldn't be a problem. The error code is -33, invalid device.

      Code is provided in next post.

      (Post modified 2010.07.18 09:50 GMT+1)

      mnagy@grid249:~/Develop/cl_mpi_test$ mpirun -n 6 cl_mpi_test
      Initial MPI_BARRIER passed!
      ERROR: clCreateCommandQueue() (-33)
      ERROR: clCreateCommandQueue() (-33)
      ERROR: clCreateCommandQueue() (-33)
      ERROR: clCreateCommandQueue() (-33)
      ERROR: clCreateCommandQueue() (-33)
      --------------------------------------------------------------------------
      mpirun has exited due to process rank 1 with PID 21699 on node grid249
      exiting without calling "finalize". This may have caused other processes
      in the application to be terminated by signals sent by mpirun (as
      reported here).
      --------------------------------------------------------------------------

        • MPI and OpenCL: Invalid devices
          Meteorhead

          CommandQueue creation is done in exactly the same way on all processes.

          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <time.h>
          #include <iostream>
          #include <fstream>
          #include <string>
          #include <iterator>
          #include <mpi.h>
          #include <CL/cl.h>

          inline void checkErr(cl_int err, const char* name)
          {
              if (err != CL_SUCCESS)
              {
                  std::cerr << "ERROR: " << name << " (" << err << ")" << std::endl;
                  exit(EXIT_FAILURE);
              }
          }

          void printBuffer(cl_uint* input, int number)
          {
              for (int i = 0 ; i < number ; ++i) std::cout << "\n" << input[i];
              std::cout << "\n";
          }

          int main(int argc, char *argv[])
          {
              const int max_message_length = 100;
              const int server_rank = 0;
              char message[max_message_length];
              char name[MPI_MAX_PROCESSOR_NAME];
              MPI_Status status;
              int my_rank, num_procs, length, source, destination, MPI_err;
              int tag[3] = {0,1,2};

              MPI_err = MPI_Init(&argc, &argv);
              MPI_err = MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
              MPI_err = MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
              MPI_err = MPI_Get_processor_name(name, &length);

              cl_int numDevices, CL_err;
              cl_uint numPlatforms;
              cl_platform_id platform = NULL;

              CL_err = clGetPlatformIDs(0, NULL, &numPlatforms);
              checkErr(CL_err, "clGetPlatformIDs(numPlatforms)");
              if (0 < numPlatforms)
              {
                  cl_platform_id* platforms = new cl_platform_id[numPlatforms];
                  CL_err = clGetPlatformIDs(numPlatforms, platforms, NULL);
                  checkErr(CL_err, "clGetPlatformIDs(platforms)");
                  for (unsigned i = 0; i < numPlatforms; ++i)
                  {
                      char pbuf[100];
                      CL_err = clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                                                 sizeof(pbuf), pbuf, NULL);
                      checkErr(CL_err, "clGetPlatformInfo()");
                      platform = platforms[i];
                      if (!strcmp(pbuf, "Advanced Micro Devices, Inc.")) {break;}
                  }
                  delete[] platforms;
              }

              // If we could find our platform, use it. Otherwise pass a NULL and get whatever the
              // implementation thinks we should be using.
              cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};
              // Use NULL for backward compatibility
              cl_context_properties* cprops = (NULL == platform) ? NULL : cps;

              CL_err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, 0, (cl_uint*)&numDevices);
              checkErr(CL_err, "clGetDeviceIDs");
              cl_device_id* devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
              CL_err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, 0);
              checkErr(CL_err, "clGetDeviceIDs");
              // End of platform layer

              // Begin context creation
              cl_context context = clCreateContext(cprops, 1, &devices[my_rank % numDevices],
                                                   NULL, NULL, &CL_err);
              checkErr(CL_err, "clCreateContext()");
              // End of context creation

              // Begin loading kernel source code
              std::ifstream file("/home/mnagy/Develop/cl_mpi_test/cl_mpi_test.cl");
              checkErr(file.is_open() ? CL_SUCCESS : -1, "ifstream() cannot access file");
              std::string prog(std::istreambuf_iterator<char>(file),
                               (std::istreambuf_iterator<char>()));
              // End of loading

              // Begin program build
              cl_program program;
              const char* kernelcode = prog.c_str();
              size_t progsize = prog.size();
              program = clCreateProgramWithSource(context, 1, &kernelcode, &progsize, &CL_err);
              checkErr(CL_err, "clCreateProgramWithSource()");

              // Build program
              CL_err = clBuildProgram(program, 1, &devices[my_rank % numDevices], NULL, NULL, NULL);
              //checkErr(CL_err, "clBuildProgram()");
              if ((CL_err != CL_SUCCESS) && (my_rank == server_rank))
              {
                  if (CL_err == CL_BUILD_PROGRAM_FAILURE)
                  {
                      cl_int logStatus;
                      char* buildLog = NULL;
                      size_t buildLogSize = 0;
                      logStatus = clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                                                        buildLogSize, buildLog, &buildLogSize);
                      checkErr(logStatus, "clGetProgramBuildInfo() failed");
                      buildLog = (char*)malloc(buildLogSize);
                      if (buildLog == NULL) {checkErr(CL_err, "Failed to allocate host memory.(buildLog)");}
                      memset(buildLog, 0, buildLogSize);
                      logStatus = clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                                                        buildLogSize, buildLog, NULL);
                      if (logStatus != CL_SUCCESS)
                      {
                          free(buildLog);
                          checkErr(CL_err, "clGetProgramBuildInfo() failed");
                      }
                      std::cout << " \n\t\t\tBUILD LOG\n";
                      std::cout << " ************************************************\n";
                      std::cout << buildLog << std::endl;
                      std::cout << " ************************************************\n";
                      free(buildLog);
                  }
                  checkErr(CL_err, "clBuildProgram()");
              }

              cl_kernel kernel;
              kernel = clCreateKernel(program, "test", &CL_err);
              checkErr(CL_err, "clCreateKernel()");

              if (my_rank == server_rank)
              {
                  // Memory allocation
                  cl_uint* result = (cl_uint*)malloc(4 * num_procs * sizeof(cl_uint));
                  cl_uint* partial = (cl_uint*)malloc(4 * sizeof(cl_uint));
                  cl_uint globalVar = 1000;

                  MPI_Barrier(MPI_COMM_WORLD);
                  std::cout << "Initial MPI_BARRIER passed!\n\n";
                  MPI_Bcast(&globalVar, 1, MPI_INT, server_rank, MPI_COMM_WORLD);
                  //for (int i = 0 ; i < 4 ; ++i) result[i] = globalVar * my_rank + i;

                  int bufferCount = 3;
                  cl_mem* buffers = (cl_mem*)malloc(bufferCount * sizeof(cl_mem*));
                  buffers[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                              sizeof(cl_uint), &globalVar, &CL_err);
                  checkErr(CL_err, "clCreateBuffer(globalVar)");
                  buffers[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                              sizeof(cl_int), &my_rank, &CL_err);
                  checkErr(CL_err, "clCreateBuffer(my_rank)");
                  buffers[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR,
                                              sizeof(cl_uint) * 4, partial, &CL_err);
                  checkErr(CL_err, "clCreateBuffer(result)");

                  cl_uint globalWorkDim = 2;
                  size_t globalWorkSize[2] = {4,1};
                  size_t localWorkSize[2] = {2,1};

                  CL_err = clSetKernelArg(kernel, 0, sizeof(buffers[0]), &buffers[0]);
                  checkErr(CL_err, "kernel.setArg(0)");
                  CL_err = clSetKernelArg(kernel, 1, sizeof(buffers[1]), &buffers[1]);
                  checkErr(CL_err, "kernel.setArg(1)");
                  CL_err = clSetKernelArg(kernel, 2, sizeof(buffers[2]), &buffers[2]);
                  checkErr(CL_err, "kernel.setArg(2)");

                  cl_command_queue commandQueue =
                      clCreateCommandQueue(context, devices[num_procs % numDevices], 0, &CL_err);
                  checkErr(CL_err, "clCreateCommandQueue()");

                  cl_event readStat, writeStat, kernelStat;
                  CL_err = clEnqueueWriteBuffer(commandQueue, buffers[0], CL_TRUE, 0,
                                                sizeof(cl_uint), &globalVar, 0, NULL, &writeStat);
                  checkErr(CL_err, "clEnqueueWriteBuffer(buffer[0])");
                  CL_err = clEnqueueWriteBuffer(commandQueue, buffers[1], CL_TRUE, 0,
                                                sizeof(cl_uint), &my_rank, 0, NULL, &writeStat);
                  checkErr(CL_err, "clEnqueueWriteBuffer(buffer[1])");
                  CL_err = clEnqueueNDRangeKernel(commandQueue, kernel, globalWorkDim, 0,
                                                  globalWorkSize, localWorkSize, 0, NULL, &kernelStat);
                  checkErr(CL_err, "clEnqueueNDRangeKernel()");
                  clWaitForEvents(1, &kernelStat);
                  CL_err = clEnqueueReadBuffer(commandQueue, buffers[1], CL_TRUE, 0,
                                               sizeof(cl_uint) * 4, partial, 0, NULL, &readStat);
                  checkErr(CL_err, "clEnqueueReadBuffer(buffers[1])");

                  for (int i = 0 ; i < 4 ; ++i) result[i] = partial[i];

                  for (source = 1 ; source < num_procs ; source++)
                  {
                      MPI_err = MPI_Recv(message, max_message_length, MPI_CHAR, MPI_ANY_SOURCE,
                                         tag[0], MPI_COMM_WORLD, &status);
                      printf("%s\n", message);
                      MPI_err = MPI_Recv(message, max_message_length, MPI_CHAR, MPI_ANY_SOURCE,
                                         tag[1], MPI_COMM_WORLD, &status);
                      printf("%s\n", message);
                      MPI_err = MPI_Recv(&result[source * 4], 4 * sizeof(cl_uint), MPI_BYTE,
                                         MPI_ANY_SOURCE, tag[2], MPI_COMM_WORLD, &status);
                      printf("Partial result received from %d.\n", source);
                  }
                  printBuffer(result, num_procs * 4);
              }
              else
              {
                  // Memory allocation
                  cl_uint* partial = (cl_uint*)malloc(4 * sizeof(cl_uint));
                  int globalVar;

                  MPI_Barrier(MPI_COMM_WORLD);
                  MPI_Bcast(&globalVar, 1, MPI_INT, server_rank, MPI_COMM_WORLD);

                  destination = server_rank;
                  sprintf(message, "Process #%d on %s received globalVar.", my_rank, name);
                  MPI_err = MPI_Send(message, strlen(message), MPI_CHAR, destination,
                                     tag[0], MPI_COMM_WORLD);
                  //for (int i = 0 ; i < 2 ; ++i) part[i] = globalVar * my_rank + i;

                  int bufferCount = 3;
                  cl_mem* buffers = (cl_mem*)malloc(bufferCount * sizeof(cl_mem*));
                  buffers[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                              sizeof(cl_uint), &globalVar, &CL_err);
                  checkErr(CL_err, "clCreateBuffer(globalVar)");
                  buffers[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                              sizeof(cl_int), &my_rank, &CL_err);
                  checkErr(CL_err, "clCreateBuffer(my_rank)");
                  buffers[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR,
                                              sizeof(cl_uint) * 4, partial, &CL_err);
                  checkErr(CL_err, "clCreateBuffer(result)");

                  cl_uint globalWorkDim = 2;
                  size_t globalWorkSize[2] = {4,1};
                  size_t localWorkSize[2] = {2,1};

                  CL_err = clSetKernelArg(kernel, 0, sizeof(buffers[0]), &buffers[0]);
                  checkErr(CL_err, "kernel.setArg(0)");
                  CL_err = clSetKernelArg(kernel, 1, sizeof(buffers[1]), &buffers[1]);
                  checkErr(CL_err, "kernel.setArg(1)");
                  CL_err = clSetKernelArg(kernel, 2, sizeof(buffers[2]), &buffers[2]);
                  checkErr(CL_err, "kernel.setArg(2)");

                  cl_command_queue commandQueue =
                      clCreateCommandQueue(context, devices[num_procs % numDevices], 0, &CL_err);
                  checkErr(CL_err, "clCreateCommandQueue()");

                  cl_event readStat, writeStat, kernelStat;
                  CL_err = clEnqueueWriteBuffer(commandQueue, buffers[0], CL_TRUE, 0,
                                                sizeof(cl_uint), &globalVar, 0, NULL, &writeStat);
                  checkErr(CL_err, "clEnqueueWriteBuffer(buffer[0])");
                  CL_err = clEnqueueWriteBuffer(commandQueue, buffers[1], CL_TRUE, 0,
                                                sizeof(cl_uint), &my_rank, 0, NULL, &writeStat);
                  checkErr(CL_err, "clEnqueueWriteBuffer(buffer[1])");
                  CL_err = clEnqueueNDRangeKernel(commandQueue, kernel, globalWorkDim, 0,
                                                  globalWorkSize, localWorkSize, 0, NULL, &kernelStat);
                  checkErr(CL_err, "clEnqueueNDRangeKernel()");
                  clWaitForEvents(1, &kernelStat);
                  CL_err = clEnqueueReadBuffer(commandQueue, buffers[1], CL_TRUE, 0,
                                               sizeof(cl_uint) * 4, partial, 0, NULL, &readStat);
                  checkErr(CL_err, "clEnqueueReadBuffer(buffers[1])");

                  sprintf(message, "Process #%d on %s finished calculation.", my_rank, name);
                  MPI_err = MPI_Send(message, strlen(message), MPI_CHAR, destination,
                                     tag[1], MPI_COMM_WORLD);
                  MPI_err = MPI_Send(partial, 4 * sizeof(cl_uint), MPI_BYTE, destination,
                                     tag[2], MPI_COMM_WORLD);
              }

              MPI_err = MPI_Finalize();
          }

            • MPI and OpenCL: Invalid devices
              Meteorhead

              Ok, I did have the commandQueue creation line messed up. It should have been:

              [num_procs % numDevices] ==> [my_rank % numDevices]

              If you change that, it is a working multi-GPU implementation that works even across a network (such as a cluster; only mpirun needs to be invoked differently).
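              For clarity, here is the corrected line in context (a sketch of the relevant fragment, using the variable names from the code above):

              ```c
              /* Each rank must create its queue on the same device it created its
               * context with. devices[num_procs % numDevices] is the same index on
               * every rank, and on most ranks that device does not belong to this
               * rank's context, hence CL_INVALID_DEVICE (-33). */
              cl_command_queue commandQueue =
                  clCreateCommandQueue(context, devices[my_rank % numDevices], 0, &CL_err);
              checkErr(CL_err, "clCreateCommandQueue()");
              ```
              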

              For a more convenient (and sometimes better) way of distributing the workload and getting the results back to the master process, read up on the MPI_SCATTER and MPI_GATHER commands, which distribute the workload evenly for you and are used the same way as broadcast (MPI_BCAST). All processes must issue the same call, and the magic is done.
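              The scatter/gather pattern can be sketched as a minimal standalone MPI program (hypothetical sizes and values; the doubling loop stands in for the OpenCL kernel work):

              ```c
              #include <stdio.h>
              #include <stdlib.h>
              #include <mpi.h>

              #define CHUNK 4  /* elements handled per process */

              int main(int argc, char *argv[])
              {
                  int my_rank, num_procs;
                  MPI_Init(&argc, &argv);
                  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
                  MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

                  /* Root prepares the full input; every rank allocates its own slice. */
                  int *input = NULL;
                  if (my_rank == 0)
                  {
                      input = malloc(num_procs * CHUNK * sizeof(int));
                      for (int i = 0; i < num_procs * CHUNK; ++i) input[i] = i;
                  }
                  int slice[CHUNK];

                  /* Every rank issues the same call; root distributes, the rest receive. */
                  MPI_Scatter(input, CHUNK, MPI_INT, slice, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

                  /* Each rank works on its slice. */
                  for (int i = 0; i < CHUNK; ++i) slice[i] *= 2;

                  /* Collect the partial results back on the root, again one identical call. */
                  int *output = NULL;
                  if (my_rank == 0) output = malloc(num_procs * CHUNK * sizeof(int));
                  MPI_Gather(slice, CHUNK, MPI_INT, output, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

                  if (my_rank == 0)
                  {
                      for (int i = 0; i < num_procs * CHUNK; ++i) printf("%d ", output[i]);
                      printf("\n");
                      free(input);
                      free(output);
                  }
                  MPI_Finalize();
                  return 0;
              }
              ```

              This replaces the per-rank MPI_Send/MPI_Recv loop in the code above with two collective calls; the root's receive buffer is laid out rank by rank automatically.
              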

              If you were patient enough to read my posts and my messy code, then your reward is a working multi-GPU code. (It needs tuning, though.)