If you're trying to initialize __local memory, just have each consecutive thread write 0.0f to the corresponding consecutive address in the local array. I'm a little concerned that you're asking about calloc(), though, because you can't dynamically allocate memory from within a kernel...
@rick.weber: I don't really need dynamic allocation; I just want a method that can initialize a chunk of memory to zero. Your solution works when the number of memory elements equals the number of threads, but in my case there are more memory elements than threads (each thread has to run a loop).
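For the case of more elements than threads, one common pattern is a strided loop: each work-item starts at its local ID and steps by the work-group size until the array is covered. Below is a minimal sketch of that idea, not a definitive implementation. The helper name zero_local and its parameters are made up for illustration; the local ID and group size are passed in as arguments so the same function can also be compiled and checked as plain C on the host.

```c
/* Outside an OpenCL compiler, pull in size_t and make the address-space
   qualifier a no-op so the helper also builds as ordinary C. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __local
#endif

/* Hypothetical helper (name and signature are assumptions):
   work-item `lid` of a work-group with `lsize` work-items zeroes
   elements lid, lid + lsize, lid + 2*lsize, ... of an n-element array.
   Called by every work-item in the group, the whole array gets zeroed. */
void zero_local(__local float *buf, size_t n, size_t lid, size_t lsize)
{
    for (size_t i = lid; i < n; i += lsize)
        buf[i] = 0.0f;
}

/* Intended use inside a kernel (OpenCL C only):
 *
 *   zero_local(scratch, n, get_local_id(0), get_local_size(0));
 *   barrier(CLK_LOCAL_MEM_FENCE);  // wait until the whole group has finished
 */
```

The barrier matters: without it, a work-item could read an element that another work-item has not zeroed yet. Consecutive work-items still touch consecutive addresses on each pass of the loop, same as in the one-element-per-thread version.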
Have you checked out the OpenCL spec, section 6.11.10, "Async Copies from Global to Local Memory, Local to Global Memory, and Prefetch"? It may be somewhat faster to use async_work_group_copy, although that needs a zero-filled buffer in global memory to copy from.