Constant elements can only be placed in the constant cache (bandwidth ~2 TB/s) in direct-addressed mode, i.e. when the address is known at compile time. If the address of an element has to be computed, the element is fetched from global memory at runtime.
But constant elements are both L1 and L2 cached, while global variables are only L1 cached. So with an appropriate access pattern you can get much higher performance using the __constant qualifier.
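
To make the two access patterns concrete, here is a rough OpenCL C sketch. The kernel names, the coeffs table and the table argument are made up for illustration and are not taken from any sample. The first kernel reads a constant element with a literal index, the second computes the index per work-item.

```c
// OpenCL C sketch. All names here are hypothetical.
__constant float coeffs[4] = {0.25f, 0.5f, 0.75f, 1.0f};

// Direct-addressed: the index is a literal, so the address of the
// constant element is known at compile time.
__kernel void scale_direct(__global const float *in,
                           __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * coeffs[2];
}

// Computed address: the index depends on the work-item ID, so the
// address has to be worked out at runtime.
__kernel void scale_computed(__global const float *in,
                             __global float *out,
                             __constant float *table)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * table[gid & 3];
}
```

Timing both kernels on the same input should show whether the access pattern is what limits your bandwidth.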
Note: do not exceed the limit on constant buffer allocation. (Use the clinfo sample to find it for your GPU.)
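
If you would rather query the limit from your own host code than read it from clinfo, something like the following should work. It is a minimal sketch that assumes the first platform exposes a GPU device and skips most error handling. It uses the standard clGetDeviceInfo queries CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE and CL_DEVICE_MAX_CONSTANT_ARGS.

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong max_const_bytes = 0; /* max size of one constant buffer */
    cl_uint max_const_args = 0;   /* max __constant kernel arguments */

    /* Assumption: first platform, first GPU device. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL)
            != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL GPU device found\n");
        return 1;
    }

    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const_bytes), &max_const_bytes, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(max_const_args), &max_const_args, NULL);

    printf("Max constant buffer size: %llu bytes\n",
           (unsigned long long)max_const_bytes);
    printf("Max constant arguments:   %u\n", (unsigned)max_const_args);
    return 0;
}
```

Build it against your vendor's OpenCL library (typically by linking with -lOpenCL) and keep the total size of your __constant data below the reported value.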
Can you post the outputs of the Globalbandwidth sample and the constant bandwidth sample? You should be able to see the difference there.