Archives Discussions

monoton · ‎05-28-2012

Hi,

I was happy to see that there is more information on the 7970 after the new SDK arrived 18th of May. I am trying to optimize a SpMV routine for the 7970, however I am having difficulties. I would be glad for some tips to get some better results.

While running on Nvidia hardware (older ones, such as the GTX 280) the performance is up in the 10-15 GFLOP/s, single precision. This kernel was originally written for that architecture. When running this kernel unmodified (yes, I know that is not "fair") it is sub 1 GFLOP/s on the HD 7970. I thought the performance would suffer as it is not yet optimized for the 7970, however, was a bit surprised, as this new card has a very impressive memory bandwidth. This kernel is highly memory bandwidth sensitive.

I am quite sure the biggest reason for the performance loss is the memory access pattern, where the CUDA hardware would read the 2D arrays in coalesced manners, a column at the time. I suspect the stride causes too big jumpes, and that when running on the 7970 the memory access pattern gets chaotic, i.e. channel and bank conflicts.

Could not find any information on coalesced reads in the new documentation, maybe something I am missing? Have tried to rewrite the kernel, so that it, to a higher degree, reads continuous memory locations inside each work-group. However, duo to some bugs it is still not showing results. I would be glad for some pointers of what would be the best approaches.

The matrices this kernel is set to compute are of dimensions 80800 and larger, and with 500,000 non-zero elements and more. I am running with local size = 64 and up to 256. One thread per row of the matrix.

Best regards,

Olav

#define large_grid_thread_id(void) (((uint)mul24((uint)get_local_size(0),(uint)get_group_id(0) + (uint)mul24((uint)get_group_id(1),(uint)get_num_groups(0))) + (uint)get_local_id(0)))

__kernel void kernel_sspmv_ell(

const int rows,

const float alpha,

const int ell_nz_row,

const int ell_stride,

__global const int *ell_idx,

__global const float *ell_val,

const float beta,

__global float *d_y,

__global float *d_x

)

{

const int row = large_grid_thread_id();

if(row >= rows){

return;

}

float sum = 0;

if (beta)

sum = beta * d_y[row];

ell_idx += row;

ell_val += row;

for(int n = 0; n < ell_nz_row; n++){

const float A_ij = *ell_val;

if(A_ij != 0){

const int col = *ell_idx - 1;

sum += A_ij * d_x[col];

}

ell_idx += ell_stride;

ell_val += ell_stride;

}

d_y[row] = sum;

}

kcarney · ‎05-31-2012

Howdy,

Below is a reply that I got from Ted Barragy here at AMD.

---

I have not done anything with SpMV on GPUs, though I’ve done a lot of it on CPUs in the past.

So be very wary of my comments below

1. Non-unit strides are ugly on 7970, not sure about other devices

2. I have not looked into any of the more complex ‘image’ type access mechanisms. I’ve tested straight OpenCL ‘global’ with saxpy type ops – what I saw was that I had to have 64 threads accessing 64 x floats in parallel in contiguous memory locations. Take a look at the APP SDK – it should have (under benchmarks?) some MemoryBandwidth & BufferBandwidth test code.

3. Use of the LDS is fairly important on AMD gpus – have never worked with Nvidia – this is very high bandwidth local memory – 32xfloats per clock You get at this by creating a temp array typed as OpenCL local, and explicitly copying to it from the incoming global memory. So you might copy your input vector X into this – looks like you have only 5 – 7 non-zeroes per row, the more the better for reuse.

4. You may need to change your matrix data structure – if structured, then storing by diagonals might work well, going back to Cray vector data structures. Each thread takes ‘a row’, diagonal is read in from contig mem locs & x vector is pulled from LDS

monoton · ‎07-30-2012

Hi again,

Thanks for previous answers.

For reading the data I have managed to get fairly good speeds, however, when writing the result back to global memory the overall performance is severely crippled (i.e. when including the lines for the data-writing in the kernel the performance drops down to ~600 MFLOPS(!), commenting out the writing to global memory it increases to ~11-15 GFLOPS (keep in mind this is memory-bandwidth limited, and in this case the kernel does not complete its job, as the result is not written back)). I have tried several approaches here, for writing the data back to global memory.

This can be either for an SpMV kernel (written for this AMD card), or a more simple AYPX / AXPY kernel.

-> What is the fool-proof method of getting good write performance to global memory on the 7970? I have looked in the programming guide without getting further.. it seems very vague on this matter.

Regards,

Olav

nou · ‎07-30-2012

beware. compiler will optimize away any unneeded reads and computations.

monoton · ‎07-30-2012

Thanks. Good point.

Another thought, the AYPX is called hundreds of times, so I wonder if the launch-overhead plays a big part here. Wish there were good graphical profiling tools for the AMD cards, under Linux. That platform is after-all what is used for clusters.
These problems are kind of frustrating. Due to lack of documentation in the Programming Guide for this architecture (the needed information did not arrive until mid-May - and still it seems too little.. for this card released back in January) I was already very delayed. I think the profiling tools have been suffering too long and the proper documentation also. Just having graphical profiling tools to be used with Microsoft Visual SDK package is not good enough. Need real tools for Linux, if these cards are going to have any momentum for HPC use.

So I guess this problem is related both to the complexity of getting the software properly profiled, and lack of documentation detailed enough (too late arrival), to let people who invested in these cards really utilize it best.

dmeiser · ‎07-30-2012

I've found sprofile pretty useful. The latest version works well with 7970s.

Is your SpMV kernel supposed to be general purpose or are you targeting matrices of a certain type? In the latter case you may have to use structural properties of the matrices to optimize your memory access patterns. I've worked a bit on SpMV on AMD gpus. If you can give us some info on the structure of the matrices I may be able to give more specific tips.

Cheers,

Dominic

monoton · ‎08-06-2012

thanks,

I am having a hard time of finding the parameters/information that is put out from sprofile explained thoroughly.

I keep the SpMV matrices in a hybrid format that is good for GPUs. There is no reason that format should perform badly. I think there must be some substantial latency when kernels are launched hundreds of times on these hardware.. maybe a runtime overhead.

I have attached the output from sprofile when running AYPX. This one knowingly not tuned for AMD, however, can you from this data see the bottlenecks, and explain me the reasoning. Looks like the memory units and memory writes stalls a lot in many cases..

Regards,

Olav

Message was edited by: Olav Fagerlund

I found this, so it is actually available (thankfully):
http://developer.amd.com/tools/AMDAPPProfiler/html/settingscounterselection.html
so I got an explanation for the output now.

However, your opinion on the interpretation of the performance bottle-neck is still something I would like to hear!

Thanks.

dmeiser · ‎08-07-2012

From the profile it looks like there is an issue with how you use the ALUs (specificially the vector ALUs). They are only a few percent busy while the ALU utilization is close to 100%. At the same time the global memory units are only busy at the few percent level, meaning that the ALUs aren't just sitting there waiting for data. The sustained memory bandwidth is only about 1GB/s.

Is it possible that you're doing a lot of integer arithmetic (divisions, mod) for indexing? Those instructions can be surprisingly expensive (a couple hundred clock cycles). If possible you should replace those with bitshift and masking operations.

I assume you're storing your sparse matrix data in column major order? If not, that will probably improve your access patterns for reading the sparse matrix data.

How do you lay out your threads? One thread per row?

How are you using the shared memory for this computation? Did you try without shared memory?

If you post your kernel code I can give more specific tips.

Cheers,

Dominic

kcarney · ‎08-07-2012

I'll pass on your feedback that the Programming Guide needs to include

"parameters/information that is put out from sprofile explained thoroughly."

Do you have any other specific feedback that I can give to the documentation writer in addition the above (which is perfect by the way! It's really hard to improve on "not enough documentation," but specific feedback like you've give are super helpful!)

dmeiser · ‎08-07-2012

I think that the documentation of the AMD APP profiler is in general quite good. In addition to the link posted by monoton I find the material at the following location very valuable:

http://developer.amd.com/tools/AMDAPPProfiler/html/index.html

An issue that you might want to address is that this info appears to be specifically targeted at using the profiler from within visual studio. There is however a lot of information in there that applies equally to the command line version as well as the profiler on different platforms e.g. linux. People ending up at that website may think it doesn't apply to their situation. Perhaps there needs to be a dedicated page for linux or it needs to be made more explicit which parts of the documentation are general and which parts are visual studio specific.

A possible improvement to the compiler output could be to add the information at http://developer.amd.com/tools/AMDAPPProfiler/html/index.html (i.e. a short description of the meaning of the different counters) to the header comments in the csv file. That way it's pretty easy to view the documentation and the counters in the same editor.

Cheers,

Dominic

kcarney · ‎08-09-2012

Fantastic. Thank you very much!

Archives Discussions

ELL SpMV on AMD HD 7970 GPU