I want to know whether my code (CSR matrix multiplication) will get the best performance if I use both of the optimizations below together:

1. float to float4 (current implementation)

2. Blocking (yet to add, i.e., grouping the work into warp-sized blocks)

Does the second optimization make much of a difference to performance?
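To make the blocking idea concrete, here is a minimal CPU-side sketch in C++ of what "one warp-sized group per row" would compute, next to a plain one-thread-per-row reference. It is only an emulation for checking results, not the GPU kernel itself. Several things in it are assumptions and not taken from the actual code: the operation is taken to be a sparse matrix times dense vector product, the warp size is taken to be 32, and the names `Csr`, `spmv_scalar`, `spmv_warp_per_row` and `kWarpSize` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container: row_ptr has rows + 1 entries,
// col_idx and vals hold one entry per nonzero.
struct Csr {
    std::vector<int> row_ptr;
    std::vector<int> col_idx;
    std::vector<float> vals;
};

// Assumption: 32 lanes per warp. Other hardware may use a different width.
constexpr int kWarpSize = 32;

// Reference: one "thread" walks each row sequentially.
std::vector<float> spmv_scalar(const Csr& a, const std::vector<float>& x) {
    assert(!a.row_ptr.empty());
    const std::size_t rows = a.row_ptr.size() - 1;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (int k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k)
            y[r] += a.vals[k] * x[a.col_idx[k]];
    return y;
}

// Emulated blocking: kWarpSize lanes share one row. Lane i handles nonzeros
// i, i + kWarpSize, i + 2 * kWarpSize, ... so neighbouring lanes touch
// neighbouring elements of vals and col_idx.
std::vector<float> spmv_warp_per_row(const Csr& a, const std::vector<float>& x) {
    assert(!a.row_ptr.empty());
    const std::size_t rows = a.row_ptr.size() - 1;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float partial[kWarpSize] = {};
        for (int lane = 0; lane < kWarpSize; ++lane)
            for (int k = a.row_ptr[r] + lane; k < a.row_ptr[r + 1]; k += kWarpSize)
                partial[lane] += a.vals[k] * x[a.col_idx[k]];
        // Tree reduction of the lane partial sums. On a GPU this step would
        // typically be done with shared/local memory or warp shuffles.
        for (int offset = kWarpSize / 2; offset > 0; offset /= 2)
            for (int lane = 0; lane < offset; ++lane)
                partial[lane] += partial[lane + offset];
        y[r] = partial[0];
    }
    return y;
}
```

With this layout, rows that have far fewer nonzeros than the warp size leave most lanes idle, so whether the blocking helps probably depends on the average number of nonzeros per row in the matrix. The two functions can also differ in the last bits of the result, because the reduction adds the products in a different order.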