16 Replies Latest reply on Sep 9, 2009 10:27 AM by dinaharchery

    Gather Behavior

    dinaharchery

      Could somebody please help with this "gather()" operation? I am having real issues with it.

      I am trying to learn Brook+ and as such have implemented the sparse matrix-vector example that comes with the beta 1.4 download within Visual Studio 8.  However I am encountering some strange behavior with the gather method - perhaps I don't understand the operation.

      Let Size be 6333 - this is the total number of NON-ZERO elements of the 'a' matrix called 'ahat'. Let Length be Size * NzWidth where NzWidth is the maximum number of NON-ZERO elements of ALL rows of the original 'a' matrix. In the case of my 'a' matrix, NzWidth is 12.
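The relationship between Size, NzWidth and Length can be sketched in plain C++. This is a minimal sketch, assuming Size is used as the number of rows (which is how the reshuffleData() loop below treats it) and that the matrix is held in CSR form with a row-start array. The helper names maxRowWidth and paddedLength are hypothetical and do not come from the Brook+ sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Widest row of a CSR matrix: the largest number of stored non-zeros in any row.
// rowStart has numRows + 1 entries; row i occupies [rowStart[i], rowStart[i + 1]).
unsigned int maxRowWidth(const std::vector<int>& rowStart) {
    assert(!rowStart.empty());
    unsigned int width = 0;
    for (std::size_t i = 0; i + 1 < rowStart.size(); ++i) {
        assert(rowStart[i + 1] >= rowStart[i]);
        width = std::max(width,
                         static_cast<unsigned int>(rowStart[i + 1] - rowStart[i]));
    }
    return width;
}

// Length of the padded arrays: every row is stretched to the same width.
unsigned int paddedLength(unsigned int numRows, unsigned int nzWidth) {
    return numRows * nzWidth;
}
```

With the numbers quoted in this thread, paddedLength(6333, 12) gives 75996, which matches the Length value reported later.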

      The result of calling the gather operation should be at least one value within the result array that is NON-ZERO, but I get NO NON-ZERO elements in the result array.  I have no idea why, but could use some help from the experts.

      Thanks in advance for ANY hints/ideas

      Below is the relevant code:

       

      kernel void gather(float index<>, float x[], out float result<>) {
          result = x[index];
      }

      void reshuffleData(float *&nz, int *&cols, int *&rowStart, float *&Anz, float *&Acols,
                         unsigned int size, unsigned int nzWidth){
          unsigned int i;
          int j;
          for (i = 0; i < size; i++){
              unsigned int offset = 0;
              for (j = rowStart[i]; j < rowStart[i + 1]; j++) {
                  Anz[nzWidth * i + offset] = nz[j];
                  Acols[nzWidth * i + offset] = (float)cols[j];
                  offset++;
              }
              // must pad the rest of the row
              while (offset < nzWidth) {
                  Anz[nzWidth * i + offset] = 0.0f;
                  // this should be an invalid index.... but doesn't have to be
                  // since x multiplied by a zero here
                  Acols[nzWidth * i + offset] = (float)0.0f;
                  offset++;
              }
          }//OUTER FOR-LOOP
      }//reshuffleData()

      void gpuMatVecMult(unsigned int size, unsigned int length, float *&cIdx, float *&aNz,
                         float *&x, float *&y){
          unsigned int i;

          // System Memory:
          Stream<float> AStrm(1, &length);        // Non-Zeros of A
          Stream<float> AStrm2(1, &length);       // Non-Zeros of A
          Stream<float> indices(1, &length);      // Column Indices
          Stream<float> tmp_indices(1, &length);  // Temp. Indices
          Stream<float> xStream(1, &size);
          Stream<float> yStream(1, &size);

          // CPU->GPU:
          indices.read(cIdx);
          xStream.read(x);

          // Kernel Calls:
          gather(indices, xStream, tmp_indices);

          // GPU->CPU:
          indices.write(cIdx);
          for(i = 0; i < length; i++){
              float fv = cIdx[i];
              if(fv != 0.0f){
                  // I should get at least ONE non-zero, but I get
                  // no output here!!
                  cout << "Column[" << i << "]=" << fv << endl;
              }
          }
      }

      // Is size of NON-ZERO array of original 'a' matrix - i.e., 'ahat':
      unsigned int Size = nn;
      unsigned int Length = Size * nzWidth;
      float *cIdx = new float[Length];
      float *Anz = new float[Length];

      // "Reshuffle" Data for STREAMING:
      reshuffleData(ahat, csrCols, csrRows, Anz, cIdx, Size, nzWidth);

      // Use GPU to compute Matrix-Vector Multiplication:
      gpuMatVecMult(Size, Length, cIdx, ahat, p, u);

        • Gather Behavior
          gaurav.garg

          What results do you see with CPU backend?

            • Gather Behavior
              dinaharchery

              Thank you for the quick reply.

              When I run the standard CPU-based code I get the following output - I am only counting the NON-Zero output from the X array:

              x[2459] -> 253045

              x[2459] -> 253045

              x[2459] -> 253045

              When I run the Acols (column vector) from the "reshuffleData()" above (without following up with the kernel "gather" call), the following is a sample of the output:

              aCols[75977]->6332

              ...

              aCols[75990]->2446

              aCols[75991]->6331

              aCols[75992]->6332

               

              The code that implements the CPU-based output is pasted below:

               

              double Parameters::matVecMult(int nn, float *&p, float *&u){
                  try{
                      if(nn <= 0){
                          throw FERTMException("Exception matVecMult(): Invalid number of Nodes!\n");
                      }
                      double time = 0.0;
                      Start(1);
                      for(int i = 0; i < nn; i++){
                          float t = 0.0f;
                          int lb = csrRows[i];
                          int ub = csrRows[i + 1];
                          for(int j = lb; j < ub; j++){
                              int index = csrCols[j];
                              t += ahat[j]*p[index];
                              // OUTPUT for TESTING, where 'p' is actually
                              // 'x' vector:
                              if(p[index] != 0.0f)
                                  cout << "x[" << i << "]->" << p[index] << endl;
                          }//INNER FOR-LOOP
                          u[i] = t;
                      }//OUTER FOR-LOOP
                      Stop(1);
                      time = GetElapsedTime(0);
                      return time;
                  }catch(...){
                      throw FERTMException("Exception matVecMult(): Something went WRONG!\n");
                  }
              }//matVecMult()

                • Gather Behavior
                  dinaharchery

                  Any ideas?

                  Could the fact that I am using Stream<float> as indices play any part? What about the fact that initially all values accessed by the index<> to the x[] stream are 0.0f, except for one index, would the others overwrite the value that was not 0.0f?

                  Maybe someone can give me a step-by-step explanation (or point me to one) of how exactly the kernel function defined as a gather works? I have read the user manual with beta 1.4 but I must be missing something.
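As a step-by-step reading of the gather kernel, the following plain C++ loop is a minimal sketch of what each kernel instance is expected to compute. It is a CPU reference, not Brook+ code, and the name gatherReference is hypothetical. It assumes the float index is truncated to an integer before the lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for:
//   kernel void gather(float index<>, float x[], out float result<>)
// One output element is produced per element of `index`;
// `x` may be read at any position.
std::vector<float> gatherReference(const std::vector<float>& index,
                                   const std::vector<float>& x) {
    std::vector<float> result(index.size());
    for (std::size_t i = 0; i < index.size(); ++i) {
        // Assumption: the float index is truncated to an integer.
        const std::size_t j = static_cast<std::size_t>(index[i]);
        assert(j < x.size());
        result[i] = x[j];  // each output slot is written exactly once
    }
    return result;
}
```

Each output slot is written by exactly one iteration, so under this reading a zero fetched for one index cannot overwrite a non-zero fetched for another.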

                  For the record, the Length variable is 75996 and the Size variable is 6333; the maximum Non-Zero length in any row pointed to by the csrRows array is 12.

                  Please, any ideas? Have I run into a bug?

                  Thank you.

                    • Gather Behavior
                      gaurav.garg

                      I meant the CPU backend of Brook+ runtime. You can run your program with it if you set environment variable BRT_RUNTIME=cpu.

                       

                      Also, can you check for errors on your output streams? Streams provide the methods error() and errorLog() for this purpose. Any sample that comes with the SDK shows how to use them.

                        • Gather Behavior
                          dinaharchery

                          Sorry about the misunderstanding.

                          I set the backend to be "cpu" and got the following output when reading back from the gather kernel:

                          cIdx[29508]=253045

                          cIdx[39240]=253045

                          cIdx[40488]=253045

                          When setting the backend to be "gpu" I get no output - all values from the gather kernel are zero.

                          I ran errorLog() on both input streams and on the one output stream immediately after the gather kernel and got the following error:

                          "Error in tmp_indices: Kernel Execution: Error with input streams"

                          Any ideas? I am new to GPGPU programming, so any help would be fantastic.

                          The relevant code is pasted below:

                          double gpuMatVecMult(unsigned int size, unsigned int length, float *&aNz,
                                               float *&cIdx, float *&x, float *&y){
                              double time = 0.0;
                              unsigned int i;

                              // Where:
                              // length = size*nzWidth (Max. Non-Zero Width of ALL rows original 'a' matrix)
                              // size = size of Non-Zero 'a' array ('aNz'), 'x' array, and 'y' array

                              // System Memory:
                              Stream<float> AStrm(1, &length);        // Non-Zeros of A
                              Stream<float> AStrm2(1, &length);       // Non-Zeros of A
                              Stream<float> indices(1, &length);      // Column Indices
                              Stream<float> tmp_indices(1, &length);  // Temp. Indices
                              Stream<float> xStream(1, &size);
                              Stream<float> yStream(1, &size);

                              // CPU->GPU:
                              indices.read(cIdx);
                              xStream.read(x);
                              AStrm.read(aNz);
                              yStream.read(y);

                              // Kernel Call:
                              gather(indices, xStream, tmp_indices);

                              /////////////////////////////////////////////////////
                              // Check for STREAM Error(s) in Kernel:            //
                              /////////////////////////////////////////////////////
                              if(tmp_indices.error()){
                                  std::cerr << "Error in tmp_indices: " << tmp_indices.errorLog() << std::endl;
                              }
                              if(xStream.error()){
                                  std::cerr << "Error in xStream: " << xStream.errorLog() << std::endl;
                              }
                              if(indices.error()){
                                  std::cerr << "Error in indices: " << indices.errorLog() << std::endl;
                              }

                              // GPU->CPU:
                              tmp_indices.write(cIdx);
                              for(i = 0; i < length; i++){
                                  float fv = cIdx[i];
                                  if(fv != 0.0f){
                                      cout << "cIdx[" << i << "]=" << fv << endl;
                                  }
                              }
                              return time;
                          }//gpuMatVecMult()

                          kernel void gather(float index<>, float x[], out float result<>) {
                              result = x[index];
                          }

                            • Gather Behavior
                              gaurav.garg

                              That means there is an error in one of the kernel's input streams. You should now check for errors on the input streams.

                                • Gather Behavior
                                  dinaharchery

                                  Thank you for the reply.

                                  I did check the input streams and found nothing. However, I fixed the problem: I uninstalled Catalyst 9.7 and installed Catalyst 9.2, and everything worked.

                                  Could it be that the input stream, a 1D array, was too large for the 9.7 version of Catalyst? If so, do you know if there is a way around this in version 9.8 - I don't like using a lower version and would like to avoid it if possible.

                                  Thank you VERY much for your assistance with my problem. I learned a lot that I am sure will be useful in the future.

                                    • Gather Behavior
                                      gaurav.garg

                                      Yes, there is a regression in recent Catalyst versions. It is not fixed in 9.8 either.

                                      The hardware limits 1D buffers to a size of 8192. Brook+ tries to virtualize larger 1D streams with a technique called address translation (AT). AT is not working with recent Catalyst versions.
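The idea behind address translation can be illustrated in plain C++: a long 1D index is folded into a 2D coordinate whose width stays within the hardware limit. This is only a sketch of the general technique under the 8192 limit quoted above, not the actual Brook+ runtime implementation; Coord2D, rowsNeeded and toCoord2D are hypothetical names.

```cpp
#include <cassert>

// A position inside a 2D buffer.
struct Coord2D {
    unsigned int x;  // column within a row
    unsigned int y;  // row number
};

const unsigned int kMaxWidth = 8192;  // 1D limit quoted in this thread

// Rows needed to hold `length` elements: ceil(length / width).
unsigned int rowsNeeded(unsigned int length, unsigned int width = kMaxWidth) {
    assert(width > 0);
    return (length + width - 1) / width;
}

// Fold a linear index into (column, row) for a buffer of the given width.
Coord2D toCoord2D(unsigned int linear, unsigned int width = kMaxWidth) {
    assert(width > 0);
    return Coord2D{linear % width, linear / width};
}
```

For the stream sizes in this thread, a stream of 6333 elements fits in a single row, while the 75996-element streams need 10 rows of 8192. That is consistent with only the longer streams depending on address translation.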

                                        • Gather Behavior
                                          dinaharchery

                                          Thank you very much for the information.

                                          I have some more questions/concerns about Brook+ kernel calls. I wanted to use the "gather" function because I am trying to implement a conjugate gradient solver with assistance from the GPU.  I have found that most of the solver's time is spent in the Matrix-Vector Multiplication, and I thought that implementing it on the GPU should speed up overall execution (using the GPU as a coprocessor).  Thanks to your help, I got the Matrix-Vector Multiplication working; however, the performance is sad - much slower than straight CPU.

                                          The timing of the GPU implemented Matrix-Vector Multiplication (along with the total solver time) follows:

                                          Total Solver: 109.588

                                          Total Matrix-Vector Mult.: 99.6158

                                          Total GPU Read/Write from/to STREAMS: 30.1526

                                          The individual Kernels being called have the following times:

                                          Gather: 4.04329

                                          Multiply: 14.414

                                          SumRows: 51.3612

                                          Do you, or anyone, have any idea why this performance is so bad on the GPU?  I thought it might be the structure of the memory accesses (non-sequential cache accesses?), but I don't know.  The total size of data being manipulated for Non-Zero 'a' matrix is 75996, and the total size for the 'x' and 'y' vectors is 75996 and 75996 respectively. I am sending the GPU the data "reshuffled" as ITPACK format (same as the matrix-vector multiply sample).  Code is pasted below:

                                          Once again, thank you so much for your help.

                                          void CPUGPUConnect::gpuMatVecMult(unsigned int size, unsigned int length, float *&aNz,
                                                                            float *&cIdx, float *&x, float *&y){
                                              Stream<float> AStrm(1, &length);        // Non-Zeros of A
                                              Stream<float> AStrm2(1, &length);       // Non-Zeros of A
                                              Stream<float> indices(1, &length);      // Column Indices
                                              Stream<float> tmp_indices(1, &length);  // Temp. Indices
                                              Stream<float> xStream(1, &size);
                                              Stream<float> yStream(1, &size);

                                              indices.read(cIdx);
                                              xStream.read(x);
                                              AStrm.read(aNz);
                                              yStream.read(y);

                                              // Kernel Calls:
                                              gather(indices, xStream, tmp_indices);
                                              mult(AStrm, tmp_indices, AStrm2);
                                              sumRows(AStrm2, yStream);

                                              yStream.write(y);
                                          }//gpuMatVecMult()

                                          // Kernel Calls from compiled Brook file:
                                          kernel void gather(float index<>, float x[], out float result<>) {
                                              result = x[index];
                                          }

                                          kernel void mult(float a<>, float b<>, out float c<>) {
                                              c = a*b;
                                          }

                                          reduce void sumRows(float nzValues<>, reduce float result<>) {
                                              result += nzValues;
                                          }

                                            • Gather Behavior
                                              gaurav.garg

                                              Two suggestions:

                                              1. It is better to merge multiple kernels into a single kernel where possible. This increases the arithmetic intensity of the kernel and avoids the per-kernel setup time required by the Brook+ runtime. Merge the gather and mult kernels like this:

                                              kernel void mult(float a<>, float index<>, float b[], out float c<> ) 
                                              {
                                                  c = a*b[index];
                                              }

                                              2. The hardware texture units can fetch float4 data in a single instruction. Texture loads are usually the bottleneck in a kernel, so vectorizing your datatypes is a good way to fully utilize the hardware. Also, because ATI hardware has 5-way superscalar shader cores, using vectorized data helps to keep the shader cores fully utilized. Change your kernel and streams to use float4 like this:

                                              kernel void mult(float4 a<>, float index<>, float4 b[], out float4 c<> ) 
                                              {
                                                  c = a*b[index];
                                              }

                                              reduce void sumRows(float4 nzValues<>, reduce float4 result<> ) 
                                              {
                                                  result += nzValues;
                                              }
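For checking GPU results, the combined gather, multiply and row-sum steps suggested above can be written as one plain C++ loop over the padded arrays. This is a CPU reference sketch only. It assumes that sumRows reduces each group of nzWidth consecutive elements into one output, which is what the stream sizes in this thread imply. The name paddedMatVec is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] = sum over k of Anz[i*nzWidth + k] * x[Acols[i*nzWidth + k]]
// Anz and Acols are the padded arrays produced by a routine like reshuffleData().
std::vector<float> paddedMatVec(const std::vector<float>& Anz,
                                const std::vector<float>& Acols,
                                const std::vector<float>& x,
                                std::size_t nzWidth) {
    assert(nzWidth > 0);
    assert(Anz.size() == Acols.size());
    assert(Anz.size() % nzWidth == 0);
    const std::size_t rows = Anz.size() / nzWidth;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < nzWidth; ++k) {
            const std::size_t e = i * nzWidth + k;
            const std::size_t col = static_cast<std::size_t>(Acols[e]);
            assert(col < x.size());
            sum += Anz[e] * x[col];  // gather and multiply in one step
        }
        y[i] = sum;  // per-row total, the role played by sumRows
    }
    return y;
}
```

Padded entries carry a zero value, so whichever column index they point at contributes nothing to the row sum.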

                                                • Gather Behavior
                                                  dinaharchery

                                                  Thanks for both suggestions, I will give them a shot.

                                                  You are very knowledgeable about the GPU. Can you suggest some reading that would help me develop a better understanding?

                                                   

                                                    • Gather Behavior
                                                      gaurav.garg

                                                      The Stream Computing User Guide might be a good start. You can go through these slides as well:

                                                      http://www.pdc.kth.se/education/historical/2008/Stream2008

                                                        • Gather Behavior
                                                          dinaharchery

                                                          Thank you so much for your assistance. I combined the mult and gather kernels into one and the speed was somewhat better, but still very bad. I want to change the float to float4 vectors but am having some difficulty.

                                                          I apologize if this is too dumb a question, but how would I go about converting the original floats to and from float4 data types? The floats are coming from the C++ half of the program and I get problems when I try to read/write the floats using the Stream float4 objects.

                                                          e.g.,

                                                          ...

                                                          Stream<float4> fv4(1, &length);

                                                          fv4.read(fv);  // Where fv is float array

                                                           

                                                          I am hoping there is a built-in Brook+ operation to translate the float and float4 data types.

                                                          Thanks again.

                                                            • Gather Behavior
                                                              gaurav.garg

                                                              A float4 is laid out like a linear array of 4 floats (float[4]). When you read or write data in a stream, make sure the number of bytes is the same on both sides. E.g., in your case, fv should be an array of 4 * length floats.
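The size rule above can be sketched on the host side: a float array is padded to a multiple of four so that it holds exactly four floats per float4 stream element. This is a minimal sketch in plain C++. The helper names float4Count and padToFloat4 are hypothetical, and zero padding is an assumption that suits products and sums but may not suit every kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of float4 elements needed to hold `count` floats.
std::size_t float4Count(std::size_t count) {
    return (count + 3) / 4;
}

// Copy `src` into a buffer whose size is a multiple of 4 floats,
// filling any unused tail with zeros.
std::vector<float> padToFloat4(const std::vector<float>& src) {
    std::vector<float> out(float4Count(src.size()) * 4, 0.0f);
    std::copy(src.begin(), src.end(), out.begin());
    return out;
}
```

A float4 stream would then be created with float4Count(n) elements and read from the padded buffer. For this thread's Length of 75996 that is 18999 float4 elements with no padding needed, while 6333 floats need 1584 float4 elements and three floats of padding.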

                                                                • Gather Behavior
                                                                  dinaharchery

                                                                   Thanks for all your help gaurav.garg, I apologize for being a pain with my questions.

                                                                  I am still getting a BIG slow-down in performance using the GPU. I looked at the brook file via StreamKernelAnalyzer and noticed that the bottleneck for gather/mult (combined kernels) was texture fetching and for sumRows kernel it was the ALU.

                                                                  Perhaps some info on the card I am using could help? I am using an ATI Mobility Radeon HD 4530/4570 with driver 8.5820.0. It has a core speed of 500 MHz, a shader speed of 500 MHz, a memory speed of 700 MHz, a 64-bit memory bus, and no shared memory.

                                                                  Like I said, I am a newbie - so if this has nothing to do with the issue sorry.

                                                                  Thanks again.