This is the standard way to measure the execution time of any OpenCL command (kernel execution, memory read/write, etc.). The timer granularity depends on your device and is reported by CL_DEVICE_PROFILING_TIMER_RESOLUTION, which is in the nanosecond range for both AMD GPUs and CPUs.
Section 5.4 of the AMD programming guide (http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf) recommends calling clFinish(command_queue) after clEnqueueNDRangeKernel() returns, so that the kernel has completed before you read its timestamps.
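For reference, here is a rough sketch of the arithmetic involved. It assumes the two timestamps were read with clGetEventProfilingInfo() using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on the kernel's event, with the queue created with CL_QUEUE_PROFILING_ENABLE. The helper name profiling_elapsed_ms is made up for illustration, and plain uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of OpenCL or clAmdFft): converts a pair of
 * profiling timestamps to elapsed milliseconds.
 *
 * Assumed call sequence on the host side, not shown here:
 *   1. create the queue with CL_QUEUE_PROFILING_ENABLE
 *   2. pass a cl_event to clEnqueueNDRangeKernel()
 *   3. clFinish(command_queue)
 *   4. clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, ...)
 *      clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, ...)
 *
 * Both values are device time counters in nanoseconds.
 */
static double profiling_elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    /* END should never precede START for a completed command. */
    assert(end_ns >= start_ns);

    /* nanoseconds -> milliseconds */
    return (double)(end_ns - start_ns) / 1.0e6;
}
```

Only the START and END timestamps of the kernel's own event are used, so time the command spent waiting in the queue is not counted.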
Regarding the clAmdFft timing, I think this is the right way to get the FFT-only time, i.e. excluding any memory transfers to/from the device.
The clAmdFft developers may be able to say more.
The clMath forum folks may have more insights. I am moving this thread there.