
dinaharchery
Journeyman III

Reducing Stream Read/Writes

Please Help,

I have a Conjugate Gradient solver written for the CPU and am trying to move the most computationally heavy part - the matrix-vector multiplication - to the GPU. I have done this thanks to gaurav.garg, however the performance is just BAD. I looked at the Stream KernelAnalyzer and it said the bottleneck is the texture fetches - I assumed this is somehow associated with the stream read/writes?

The data is unstructured, so I won't know the size of the matrix and vector being solved ahead of time. Does anyone know of an effective way to reduce the need for stream read/write calls? I like the idea of using the GPU as a coprocessor, but the performance is horrid.

Applicable code is attached - sorry for the amount.

Any ideas would be wonderful. Thank you in advance.

 

dinaharchery
Journeyman III

Oops,

I forgot to attach the aforementioned code:

void Parameters::reshuffleGPUData(){
    // Allocate Memory - if necessary:
    if(cIdx == NULL) cIdx = new float[Length];
    if(Anz == NULL)  Anz  = new float[Length];

    unsigned int i;
    int j;
    for (i = 0; i < Size; i++){
        int offset = 0;
        // Copy the non-zeros of row i from the CSR arrays into the padded layout:
        for (j = csrRows[i]; j < csrRows[i + 1]; j++) {
            Anz[nzWidth * i + offset]  = ahat[j];
            cIdx[nzWidth * i + offset] = (float)csrCols[j];
            offset++;
        }
        // Must pad the rest of the row:
        while (offset < nzWidth){
            Anz[nzWidth * i + offset] = 0.0f;
            // This should be an invalid index....
            // but doesn't have to be since x is multiplied by a zero here
            cIdx[nzWidth * i + offset] = 0.0f;
            offset++;
        }
    }//OUTER FOR-LOOP
}//reshuffleGPUData()

double Parameters::gpuMatVecMult(int nn, float *&p, float *&u){
    // To interface with the GPU Device:
    CPUGPUConnect gpuConnect;

    // Compute Matrix-Vector Multiplication:
    double gpuTime = gpuConnect.gpuMatVecMult(Size, Length, cIdx, Anz, p, u);

    // Return total GPU time:
    return gpuTime;
}//gpuMatVecMult()

bool Parameters::asCGDiag(int &totiter){
    float *r, *u, *p, *s, *rehat;
    r = u = p = s = rehat = NULL;

    try{
        if(nn == 0) return false;

        // Local variables:
        int iter, i, echocounter;
        float gamma_new, r_0, gamma_0, r_iter, gratio, alphainv, gamma_old;
        float sigma_e, rho_e, gamma_old_inv;

        r     = new float[nn];
        u     = new float[nn];
        p     = new float[nn];
        s     = new float[nn];
        rehat = new float[nn];
        for(i = 0; i < nn; i++)
            r[i] = u[i] = p[i] = s[i] = rehat[i] = 0.0f;

        // ==========================================================
        // We have a pre-conditioner based on L^-1 and L is W^{1/2}.
        // This is the subdomain level nodal array dpc(ndfnodes)
        // ==========================================================
        // Initialization:
        for(i = 0; i < nn; i++){
            presoln[i] = 0.0f;
            r[i] = ldhat[i];
        }
        for(i = 0; i < nn; i++){
            rehat[i] = adiagpre[i]*r[i];
        }
        iter = 0;
        echocounter = 0;

        // Compute gamma_0: Initial gamma:
        // ==============================================================
        // Obtain the accumulated vector s:
        // This involves a single communication operation where the
        // accumulated vector 's' is obtained from the assembled form of
        // the shared node vectors.
        // ==============================================================
        for(i = 0; i < nn; i++) s[i] = rehat[i];
        rho_e   = dotProduct(s, rehat, nn);
        gamma_0 = rho_e;
        for(i = 0; i < nn; i++) p[i] = adiagpre[i]*s[i];

        // Initialize gamma_old:
        gamma_old     = gamma_0;
        gamma_old_inv = 1.0f / gamma_old;

        // Initial Residual Norm: r_0
        r_0 = abs(sqrt(gamma_0));

        // "Reshuffle" data for proper streaming:
        reshuffleGPUData();

        // ================================================================
        // Start Iterations:
        // ================================================================
        do {
            iter++;

            // Generate warning messages when the iteration count exceeds 500.
            if((iter%500) == 0){
                cout << "Warning: Iteration count exceeds " << iter << endl;
            }

            // ===================================================================
            // START IMPLEMENT GPU-BASED CODE HERE:
            // ===================================================================
            ////////////////////////////////////////////////////
            // Determine the matrix vector product Kp:
            ////////////////////////////////////////////////////
            // Using GPU Matrix-Vector Multiplication:
            gpuMatVecMult(nn, p, u);
            ////////////////////////////////////////////////////

            sigma_e  = dotProduct(p, u, nn)*gamma_old_inv;
            alphainv = sigma_e;
            alphainv = 1.0f / alphainv;

            for(i = 0; i < nn; i++){
                presoln[i] = presoln[i] + p[i]*alphainv;
                r[i]       = r[i] - u[i]*alphainv;
                rehat[i]   = adiagpre[i]*r[i];
                s[i]       = rehat[i];
            }//FOR-LOOP

            rho_e     = dotProduct(s, rehat, nn);
            gamma_new = rho_e;

            // Correct for very small negative numbers from the reduction:
            if (gamma_new < 0.0f && abs(gamma_new) > 1.0e6f){
                cout << "Warning: Large negative gamma_new computed in asCGDiag: "
                     << gamma_new << endl;
            }
            r_iter = sqrt(abs(gamma_new));

            // Test for convergence:
            if ((r_iter / r_0) <= tol){
                cout << "iterations to converge " << iter << endl;
                totiter += iter;
                break;
            }

            gratio = gamma_new * gamma_old_inv;

            // Additional step for the pre-conditioner. The "s-vector" has to
            // be multiplied by the preconditioner L^{-1}. Hence s(i) is
            // multiplied by adiagpre(i) in the next line.
            for(i = 0; i < nn; i++){
                p[i] = s[i]*adiagpre[i] + p[i]*gratio;
            }
            // ===================================================================
            // END IMPLEMENT GPU-BASED CODE HERE:
            // ===================================================================
            gamma_old     = gamma_new;
            gamma_old_inv = 1.0f / gamma_old;
        }while(true);
        // ================================================================
        // End Iterations:
        // ================================================================

        // Free Memory:
        if(r != NULL)     delete [] r;
        if(u != NULL)     delete [] u;
        if(p != NULL)     delete [] p;
        if(s != NULL)     delete [] s;
        if(rehat != NULL) delete [] rehat;
        r = u = p = s = rehat = NULL;

        return true;
    }catch(...){
        // Free Memory:
        if(r != NULL)     delete [] r;
        if(u != NULL)     delete [] u;
        if(p != NULL)     delete [] p;
        if(s != NULL)     delete [] s;
        if(rehat != NULL) delete [] rehat;
        r = u = p = s = rehat = NULL;

        return false;
    }
}//asCGDiag()

double CPUGPUConnect::gpuMatVecMult(unsigned int size, unsigned int length,
                                    float *&cIdx, float *&aNz,
                                    float *&x, float *&y){
    double time = 0.0;

    ///////////////////////////////////////////////////////////
    // System Memory:
    ///////////////////////////////////////////////////////////
    Stream<float> AStrm(1, &length);   // Non-Zeros of A
    Stream<float> AStrm2(1, &length);  // Non-Zeros of A
    Stream<float> indices(1, &length); // Column Indices
    Stream<float> xStream(1, &size);
    Stream<float> yStream(1, &size);

    // Start GPU Timer:
    Start(0);

    //////////////////////////////////////////////////////////
    // CPU->GPU:
    //////////////////////////////////////////////////////////
    indices.read(cIdx);
    xStream.read(x);
    AStrm.read(aNz);
    yStream.read(y);

    //////////////////////////////////////////////////////////
    // Kernel Calls:
    //////////////////////////////////////////////////////////
    gatherMult(AStrm, xStream, indices, AStrm2);
    sumRows(AStrm2, yStream);

    //////////////////////////////////////////////////////////
    // GPU->CPU:
    //////////////////////////////////////////////////////////
    yStream.write(y);

    // Stop GPU Timer and compute difference:
    Stop(0);
    time = GetElapsedTime(0);

    return time;
}//gpuMatVecMult()

kernel void gatherMult(float a<>, float b[], float index<>, out float result<>)
{
    result = a*b[index];
}

reduce void sumRows(float nzValues<>, reduce float result<>)
{
    result += nzValues;
}
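
In case it helps anyone reading, here is what reshuffleGPUData() produces for a tiny made-up example (the numbers are purely illustrative, not from my actual data):

// Hypothetical 3x3 matrix with nzWidth = 2:
//
//      | 5  0  1 |       csrRows = {0, 2, 3, 5}
//  A = | 0  2  0 |       csrCols = {0, 2, 1, 0, 2}
//      | 7  0  4 |       ahat    = {5, 1, 2, 7, 4}
//
// After reshuffleGPUData() every row is padded out to nzWidth entries:
//
//  Anz  = {5, 1,   2, 0,   7, 4}
//  cIdx = {0, 2,   1, 0,   0, 2}
//
// gatherMult then computes Anz[k]*x[cIdx[k]] element-wise, and sumRows is
// meant to add each group of nzWidth products into one entry of y = A*x.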


First of all, your kernel has very low arithmetic intensity, so it is not expected to give very high performance. You can probably get some performance gain if you can call multiple kernels with very few stream read/write calls.

One issue in your code is the continuous data transfer over PCIe in each iteration. It would be good if you keep the data on the GPU as much as possible. You should avoid continuous stream read/write calls and try to reuse the same streams across multiple iterations. That would mean moving not just the matrix-vector multiplication to the GPU, but also things like dotProduct and the other computations done in each iteration.
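
For example, a dot product such as dotProduct(s, rehat, nn) can itself be done as a multiply kernel followed by a reduction, so only a single float ever comes back over PCIe. A rough sketch (the kernel and stream names here are made up, and it assumes a reduction down to a one-element stream is fine for your sizes):

kernel void elemMult(float a<>, float b<>, out float c<>)
{
    // Element-wise product, stays on the GPU:
    c = a * b;
}

reduce void reduceSum(float v<>, reduce float result<>)
{
    // Sum of all elements of the input stream:
    result += v;
}

// Host side, assuming sStream and rehatStream (length nn) already live on the GPU:
unsigned int dim = nn;
unsigned int one = 1;
Stream<float> prod(1, &dim);
Stream<float> rhoStream(1, &one);

elemMult(sStream, rehatStream, prod);   // s[i] * rehat[i]
reduceSum(prod, rhoStream);             // sum over all nn products
float rho_e;
rhoStream.write(&rho_e);                // only one float crosses PCIe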


Thank you, gaurav.garg - I really owe you.

Can you give a quick example of this? Not necessarily what I am doing, but a similar example where the same stream is reused across multiple iterations?


In the following, everything inside the loop is a kernel call - the streamReads happen once, before the loop:

streamRead(fs1to4_1, f1to4);
streamRead(fs5to8_1, f5to8);
streamRead(fs9_1, f9);
streamRead(Fs1to4_1, F1to4);
streamRead(Fs5to8_1, F5to8);
streamRead(Fs9_1, F9);
streamRead(ss, s);
streamRead(GEOs, GEO);

step = 1;
Norm1 = 1.0;
Norm2 = 1.0;
error1 = 1.0;   //Init L2-Norm error for velocity
error2 = 1.0;   //Init L2-Norm error for density

while (step < TIMING)
{
    mcollid_adv2_s(Fs1to4_1, Fs5to8_1, Fs9_1, fs1to4_1, fs5to8_1, fs9_1,
                   GEOs, ss, G, mx, my, Fs9_1, Fs5to8_1, Fs1to4_1);
    Fs1to4_1.error();

    advection2_s(Fs1to4_1, Fs5to8_1, Fs9_1, gx, mx, my, bk,
                 Fs9_2, Fs5to8_2, Fs1to4_2);
    Fs9_2.error();

    advection3_s(Fs1to4_2, Fs5to8_2, Fs9_2, gx, mx, my, bk,
                 Fs9_1, Fs5to8_1, Fs1to4_1);
    Fs9_1.error();

    stream_macro_org3_s(Fs1to4_1, Fs5to8_1, Fs9_1, GEOs,
                        fs9_1, fs9_1, fs5to8_1, fs1to4_1);
    fs9_1.error();

    .................
    .........................
}


Exactly. Whereas you are doing something like this:

while (step < TIMING)
{
    streamRead(...)
    kernel call
    streamWrite(...)
}

Try to move the streamRead/write calls out of the loop, or try to minimize the data that you transfer. As I can see in your program, you only need a single value, rho_e, to be calculated each iteration. You can calculate that value on the GPU and then transfer just that scalar. Or, rather than transferring data in every iteration, transfer data only after every 10 or 20 iterations and decide based on that value whether your result is within the error limits. A rough sketch of this restructuring is below.
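
It is only a sketch - the element-wise update kernels and most of the stream names are hypothetical, you would have to write them yourself:

// One-time CPU->GPU transfers, before the loop:
indices.read(cIdx);
AStrm.read(Anz);
pStream.read(p);
rStream.read(r);
adiagpreStream.read(adiagpre);
// ...

do {
    iter++;

    // Inside the loop there are only kernel calls, no streamRead/streamWrite:
    gatherMult(AStrm, pStream, indices, AStrm2);   // products for u = A*p
    sumRows(AStrm2, uStream);                      // row sums
    // ... element-wise kernels to update presoln, r, rehat, s and p ...
    // ... multiply + reduce kernels for sigma_e and rho_e ...

    // Bring back only one scalar, and only every few iterations:
    if ((iter % 10) == 0) {
        rhoStream.write(&rho_e);                   // GPU->CPU: a single float
        if (sqrt(fabs(rho_e)) / r_0 <= tol) break;
    }
} while (iter < maxIter);

// One final GPU->CPU transfer for the solution:
presolnStream.write(presoln);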


You are all right. I will move the read/writes out of the iteration loop. I should have seen that.

So, am I correct in assuming that the texture fetches are associated with the stream read/writes in the code that I posted? Sorry if this is dumb - I am just learning.

Thanks again. I will post results.


Texture fetches are related to reads from input streams inside the kernel, not to stream read/write calls. SKA (the Stream KernelAnalyzer) only analyzes kernel code.
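
To make the distinction concrete in your own code (nothing new here, just annotating what you already posted):

kernel void gatherMult(float a<>, float b[], float index<>, out float result<>)
{
    // b[index] is a gather from an input stream. This read happens on the GPU,
    // inside the kernel, and is what SKA reports as a texture fetch.
    result = a*b[index];
}

// Host side: these are the PCIe transfers. SKA never sees them, because it
// only analyzes the kernel code above.
xStream.read(x);    // CPU -> GPU
yStream.write(y);   // GPU -> CPU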


I moved the stream read/writes outside the convergence loop of the Conjugate Gradient solver, but it was still slow. I did some further investigation into all the kernels involved and found the performance culprit - the reduction kernel called just after the "gatherMult" kernel (i.e., the "sumRows" kernel). I even ran a test that simply called the reduction kernel without writing anything back from the stream, and it was still slow.

Does anyone have any information on how to speed up the "sumRows" reduction kernel? I realize that there is not a whole lot of arithmetic intensity in the matrix-vector multiplication, but I think the speed should be a lot better (it is only 13% of the speed of a standard CPU, regardless of the data size).
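
One thing I am considering (untested, and I am not sure I have the dimension ordering of the Stream constructor right - this is just how I imagine it) is declaring the padded non-zero data as a 2D stream of Size x nzWidth, so that sumRows reduces along the short row dimension instead of over one long 1D stream:

// Hypothetical 2D layout - the dims ordering may need to be swapped, and
// indices2D/xStream would have to be set up accordingly:
unsigned int dims2D[]  = { Size, nzWidth };
unsigned int dimsOut[] = { Size, 1 };
Stream<float> AStrm2D(2, dims2D);      // padded non-zeros, one row per matrix row
Stream<float> prod2D(2, dims2D);       // element-wise products
Stream<float> yStream2D(2, dimsOut);   // one sum per row

// The kernels themselves stay the same; only the stream shapes change:
gatherMult(AStrm2D, xStream, indices2D, prod2D);
sumRows(prod2D, yStream2D);
yStream2D.write(y);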

Thanks again.
