I have multiple AMD Radeons in the machine: one is a Cypress and the other is a Tahiti. I started running clAmdBlasTune and it appears to tune only for the first card. How can I tune for the Tahiti, which is the second card?
Unfortunately, at this time clAmdBlasTune does not take any parameters to select the device to tune for. I will file this as an enhancement request in our tracker.
What exactly is the purpose of the --store-kernel option? How does storing the kernels help?
OpenCL is based on an online compilation model: you pass the kernel source as a string to clCreateProgramWithSource() and compile the program on the fly with clBuildProgram(). The very first time the user calls a BLAS API, the clAmdBlas runtime examines the parameters passed to the API, generates a kernel for them, and compiles it. This is a fixed compile cost, which can sometimes be measured in seconds. --store-kernels takes the best kernels that clAmdBlas has found and serializes the binary kernels to disk in the .kdb file. The next time you run your program, the binary kernel is read from that file instead of being compiled again, which saves the compile overhead. If you don't store the kernels to disk, the 'optimal parameters' are still serialized to disk, but the kernel compile will happen the first time a BLAS call is made with that parameter set.
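To make the "load from disk, otherwise compile and store" idea concrete, here is a minimal sketch in plain C. It is not clAmdBlas code: fake_compile() is a hypothetical stand-in for the expensive clBuildProgram() step, get_kernel_binary() and the cache file are made-up names, and the .kdb format is not modeled at all. In real OpenCL code the binary would typically be obtained with clGetProgramInfo(CL_PROGRAM_BINARIES) and reloaded with clCreateProgramWithBinary().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Counts how often the (pretend) expensive compile step runs. */
static int compile_count = 0;

/* Hypothetical stand-in for clBuildProgram(): the "binary" is just a
 * copy of the source bytes, truncated to the buffer capacity. */
static size_t fake_compile(const char *source, unsigned char *out, size_t cap) {
    size_t n = strlen(source);
    if (n > cap) n = cap;
    memcpy(out, source, n);
    compile_count++;
    return n;
}

/* Return the kernel binary in 'out': read it from cache_path if a
 * non-empty cache file exists, otherwise compile and write the cache. */
static size_t get_kernel_binary(const char *cache_path, const char *source,
                                unsigned char *out, size_t cap) {
    assert(cache_path != NULL && source != NULL && out != NULL);

    FILE *f = fopen(cache_path, "rb");
    if (f != NULL) {
        size_t cached = fread(out, 1, cap, f);
        fclose(f);
        if (cached > 0) return cached;   /* cache hit: no compile needed */
    }

    size_t n = fake_compile(source, out, cap);   /* cache miss: pay the cost once */
    f = fopen(cache_path, "wb");
    if (f != NULL) {
        fwrite(out, 1, n, f);
        fclose(f);
    }
    return n;
}
```

Calling get_kernel_binary() twice with the same cache path runs fake_compile() only on the first call; the second call is served from the file, which is the saving that storing kernels is meant to provide.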
Does it tune based purely on execution performance, or would the PCIe bus speed (8x vs 16x) affect the results of tuning?
It tunes only against the kernel execution time, so PCIe bus speed will not affect the results of tuning.
Does the .kdb file always get used automatically by the library when AMD_CLBLAS_STORAGE_PATH is set?
Yes. When the library initializes itself, it checks for the existence of the environment variable and attempts to read the file it points to. If the file exists, the library loads all the optimized information within, including any serialized binary kernels that may have been stored in it.
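If you want to confirm from your own program that the variable is visible before the first BLAS call, a small check along these lines may help. This is only a sketch: clblas_storage_path() and path_is_readable() are hypothetical helper names, not part of clAmdBlas, and the check only tests that the variable is set and that its value can be opened for reading. It does not validate the .kdb contents or reproduce the library's own lookup.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns the value of AMD_CLBLAS_STORAGE_PATH, or NULL if the
 * variable is unset or empty. */
static const char *clblas_storage_path(void) {
    const char *p = getenv("AMD_CLBLAS_STORAGE_PATH");
    return (p != NULL && p[0] != '\0') ? p : NULL;
}

/* Returns 1 if 'path' can be opened for reading, 0 otherwise. */
static int path_is_readable(const char *path) {
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return 0;
    fclose(f);
    return 1;
}
```

A typical use is to call clblas_storage_path() at startup and print a warning if it returns NULL or if path_is_readable() fails, since in that case the kernels would be compiled again on the first BLAS call.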