For your first question of whether using the binary is faster than compiling from the source code, may I suggest that you profile it? Things can behave differently on different devices, with different drivers, and even with different compiler options (such as -cl-opt-disable). When creating a program from a binary, the driver won't have to parse the source code, generate the machine code, optimize it, etc., so it should be faster. Still, profile it.
For your second question on how to generate and load the binary, the following is the exact sequence of OpenCL calls that I use (on Windows) to generate, save, and load the binary, which seems to match what you're doing.
clCreateProgramWithSource()
clBuildProgram()
clGetProgramInfo(...CL_PROGRAM_BINARY_SIZES...)
clGetProgramInfo(...CL_PROGRAM_BINARIES...)
save this to disc
load the binary from disc
clCreateProgramWithBinary()
clBuildProgram()
and off we go with clCreateKernel() and so on.
Please note that this generates an ELF-within-an-ELF binary: an outer ELF with multiple sections, one of which contains an inner ELF holding the actual machine code. The 'real' binary that you've mentioned will be in one of those sections. Also note that the compiled code is specific to your device (or devices). That is to say, when I create a binary for, say, the R9 290X that I use most of the time, it won't run on later generations of GPUs, such as Fiji or Vega, because the instruction encoding has changed.
Finally, for compiling OpenCL source code to binary offline and using it later, may I suggest that you try ROCm, if your platform and business constraints allow it? You can find more info about it at https://gpuopen.com/compute-product/rocm/ . With ROCm, you can compile your code to a standalone binary which can later be loaded and dealt with like any other binary.