On your first question, whether using the binary is faster than compiling from source code: may I suggest that you profile it? Things can behave differently on different devices, with different drivers, and even with different compiler options (such as -cl-opt-disable). When creating a program from a binary, the driver doesn't have to parse the source code, generate the machine code, optimize it, etc., so it should be faster. Still, profile it.
On your second question, how to generate and load the binary: the following is the exact sequence of OpenCL calls that I use (on Windows) to generate, save, and load the binary, which seems to match what you're doing.
save this to disk
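For the save step, a minimal sketch in plain C could look like the following. The helper name `save_binary` and its parameters are hypothetical and not part of the sequence above; `data` stands in for the bytes that `clGetProgramInfo()` returns for `CL_PROGRAM_BINARIES`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write `size` bytes from `data` to the file at `path`.
 * Returns 0 on success and -1 on failure. */
int save_binary(const char *path, const unsigned char *data, size_t size)
{
    assert(path != NULL && data != NULL);

    /* "wb" matters on Windows: text mode would rewrite 0x0A bytes
     * and corrupt the program binary. */
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(data, 1, size, f);
    int close_failed = (fclose(f) != 0);

    return (written == size && !close_failed) ? 0 : -1;
}
```

Opening the file in binary mode is the one detail that tends to bite on Windows; everything else is ordinary file I/O.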
load the binary from disk
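For the load step, a matching sketch in plain C might be the following. The helper name `load_binary` is hypothetical; the buffer and length it returns are what you would then hand to `clCreateProgramWithBinary()`. This sketch treats an empty file as an error, which is an assumption on my part.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole file at `path` into a malloc'd buffer.
 * On success, returns the buffer (the caller frees it) and stores its length
 * in *size_out. Returns NULL on failure or if the file is empty. */
unsigned char *load_binary(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);
    *size_out = 0;

    FILE *f = fopen(path, "rb");   /* binary mode, as when saving */
    if (f == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len <= 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }

    unsigned char *buf = malloc((size_t)len);
    if (buf == NULL) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }

    *size_out = got;
    return buf;
}
```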
and off we go with clCreateKernel() and so on.
Please note that this generates an ELF within an ELF: the outer binary is an ELF with multiple sections, and one of those sections contains another ELF with the actual machine code in it. The 'real' binary that you've mentioned will be in one of those sections. Also note that the compiled code will be specific to your device (or devices). That is, when I create a binary for, say, the R9 290X that I use most of the time, it won't run on later generations of GPUs, such as Fiji or Vega, because the instruction encoding has changed.
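If you want a quick sanity check that what you saved really is an ELF container, you can look at the first four bytes, which for any ELF file are 0x7F 'E' 'L' 'F'. The sketch below uses a hypothetical helper name, `looks_like_elf`, and only inspects the outer container; it does not parse the sections or find the inner ELF. Binaries from other implementations may not be ELF at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the buffer starts with the ELF magic
 * bytes (0x7F 'E' 'L' 'F'), and 0 otherwise. */
int looks_like_elf(const unsigned char *data, size_t size)
{
    assert(data != NULL);
    return size >= 4 &&
           data[0] == 0x7F && data[1] == 'E' &&
           data[2] == 'L'  && data[3] == 'F';
}
```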
Finally, for compiling OpenCL source code to a binary offline and using it later, may I suggest that you try ROCm, if your platform and business constraints allow it? You can find more info about it at https://gpuopen.com/compute-product/rocm/. With ROCm, you can definitely compile your code to a standalone binary that can later be loaded and handled like any other binary.
I think S P has already answered most of your questions. I just want to add one point about the title question.
The program binary can consist of device-specific code, an implementation-specific intermediate representation (IR), or both. When the binary contains IR code, clBuildProgram() must be called after clCreateProgramWithBinary() to convert the IR to device code. So, depending on the binary representation, the binary may or may not work without being built again (in some cases, calling clBuildProgram() is effectively a no-op and the implementation skips the work silently).
While the binary may work directly, skipping the second clBuildProgram() call is not recommended, because the application can lose portability and may fail on the same platform if the implementation changes its behavior in the future.
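To make the recommendation concrete, here is a minimal sketch of loading a binary and always building it afterwards. The function name `program_from_binary` is hypothetical, the sketch assumes a single device, and it needs the OpenCL headers and library from your vendor SDK to compile.

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

/* Hypothetical helper: create a program from a previously saved binary and
 * always build it. `binary` and `size` are the bytes loaded from disk.
 * Returns NULL on failure. */
cl_program program_from_binary(cl_context ctx, cl_device_id dev,
                               const unsigned char *binary, size_t size)
{
    cl_int status = CL_SUCCESS;   /* per-device load status */
    cl_int err = CL_SUCCESS;

    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                &binary, &status, &err);
    if (err != CL_SUCCESS || status != CL_SUCCESS) {
        fprintf(stderr, "clCreateProgramWithBinary failed: err=%d status=%d\n",
                err, status);
        if (prog != NULL)
            clReleaseProgram(prog);
        return NULL;
    }

    /* Build even if the binary already holds device code: for such binaries
     * the call may be cheap, and for IR binaries it is required. */
    err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = malloc(log_size + 1);
        if (log != NULL) {
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                  log_size, log, NULL);
            log[log_size] = '\0';
            fprintf(stderr, "clBuildProgram failed (%d):\n%s\n", err, log);
            free(log);
        }
        clReleaseProgram(prog);
        return NULL;
    }

    return prog;   /* ready for clCreateKernel() */
}
```

Checking both `err` and the per-device `status` is worthwhile, since a binary built for a different device generation is typically rejected at this point rather than at kernel launch.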
Anyway, whatever the case, building a kernel program from a binary is generally faster than building the same program from source.