

Question asked by monoton on May 28, 2012
Latest reply on Aug 9, 2012 by kcarney



I was happy to see that there is more information on the 7970 after the new SDK arrived on the 18th of May. I am trying to optimize an SpMV routine for the 7970, however I am having difficulties. I would be glad for some tips on how to get better results.

On Nvidia hardware (older cards, such as the GTX 280) the performance is in the 10-15 GFLOP/s range, single precision. This kernel was originally written for that architecture. When running this kernel unmodified (yes, I know that is not "fair") it is sub 1 GFLOP/s on the HD 7970. I expected the performance to suffer since it is not yet optimized for the 7970; however, I was still a bit surprised, as this new card has very impressive memory bandwidth, and this kernel is highly memory-bandwidth sensitive.


I am quite sure the biggest reason for the performance loss is the memory access pattern: the CUDA hardware reads the 2D arrays in a coalesced manner, a column at a time. I suspect the stride causes too big jumps, and that when running on the 7970 the memory access pattern becomes chaotic, i.e. channel and bank conflicts.


I could not find any information on coalesced reads in the new documentation; is there something I am missing? I have tried to rewrite the kernel so that, to a higher degree, it reads contiguous memory locations inside each work-group. However, due to some bugs it is still not showing results. I would be glad for some pointers on what the best approaches would be.


The matrices this kernel computes are of dimension 80800 and larger, with 500,000 non-zero elements or more. I am running with a local size of 64 and up to 256, one work-item per row of the matrix.


Best regards,




#define large_grid_thread_id() (mul24((uint)get_local_size(0), (uint)get_group_id(0) + mul24((uint)get_group_id(1), (uint)get_num_groups(0))) + (uint)get_local_id(0))

__kernel void kernel_sspmv_ell(
        const int rows,
        const float alpha,      /* accepted but not used below, as posted */
        const int ell_nz_row,
        const int ell_stride,
        __global const int *ell_idx,
        __global const float *ell_val,
        const float beta,
        __global float *d_y,
        __global const float *d_x)
{
    const int row = large_grid_thread_id();

    if (row >= rows)
        return;                         /* mask off padded work-items */

    float sum = 0;
    if (beta)
        sum = beta * d_y[row];

    /* Column-major ELL: entry n of this row sits at n * ell_stride + row. */
    ell_idx += row;
    ell_val += row;

    for (int n = 0; n < ell_nz_row; n++) {
        const float A_ij = *ell_val;

        if (A_ij != 0) {
            const int col = *ell_idx - 1;   /* column indices are 1-based */
            sum += A_ij * d_x[col];
        }

        ell_idx += ell_stride;
        ell_val += ell_stride;
    }

    d_y[row] = sum;
}