I don't understand. Is it like LDS? And what about its dimensions?
Take matrix multiplication, for example: why is the block 2D? And why does the function use global_x and global_y to get the thread ID? They operate on *input, so I think they should be 1D.
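To make the question concrete, here is how I currently understand the pattern, written as a small CPU-side Python sketch rather than the real kernel. The names `flat_index`, `WIDTH`, and `HEIGHT` are made up for illustration, and I am assuming row-major storage. If this is right, the 2D thread ID is only a convenient way to name a matrix element, and the memory behind *input is still addressed in 1D:

```python
# Hypothetical CPU-side emulation of a 2D launch; not the real kernel code.
WIDTH, HEIGHT = 4, 3  # assumed matrix shape: 3 rows, 4 columns


def flat_index(global_x, global_y, width):
    """Map a 2D thread ID (column, row) to an offset into a 1D buffer.

    Assumes row-major layout: each row of `width` elements is stored contiguously.
    """
    return global_y * width + global_x


# `data` stands in for *input: a flat, row-major buffer.
data = list(range(WIDTH * HEIGHT))

# Emulate one "thread" per matrix element and record which offset each one touches.
visited = [flat_index(x, y, WIDTH) for y in range(HEIGHT) for x in range(WIDTH)]
```

In this sketch every (global_x, global_y) pair lands on a distinct offset in the flat buffer, so a 1D launch could in principle reach the same elements; the 2D form just avoids recovering the row and column with a division and a modulo.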