There are two possibilities here:
- First, you are using exp(), which is a transcendental function. Transcendentals are confined to the t unit of each thread processor, so float4 isn't going to get you more parallelism within a single thread processor: you have run out of functional units that can process the instruction.
- Second, the compiler will sometimes do that transformation for you (float4 instead of float). Whether it does depends on how easily it can discover the opportunity.
I suspect it is the first case that is your bottleneck.
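To make the first point concrete, here is a rough issue-slot model in C. It is only a sketch under assumptions that are mine, not something measured: five lanes per thread processor, exactly one of them (the t unit) able to execute a transcendental, one op per lane per cycle, and all ops independent. The names `LANES`, `cycles_needed` and `cycles_per_element` are hypothetical helpers made up for this illustration.

```c
#include <assert.h>

/* Assumption: 5 lanes per thread processor (x, y, z, w and t). */
enum { LANES = 5 };

/*
 * Cycles needed to issue a bundle of independent ops on one thread
 * processor under the simplified model described above.
 * Simple ops may go to any lane; transcendentals only to the t unit.
 */
unsigned cycles_needed(unsigned simple_ops, unsigned transcendental_ops)
{
    unsigned total = simple_ops + transcendental_ops;
    unsigned by_width = (total + LANES - 1) / LANES; /* limited by lane count */
    unsigned by_t_unit = transcendental_ops;         /* limited by the single t unit */
    return by_width > by_t_unit ? by_width : by_t_unit;
}

/* Cost per output element when one thread produces `elements` results. */
double cycles_per_element(unsigned simple_ops, unsigned transcendental_ops,
                          unsigned elements)
{
    assert(elements > 0);
    return (double)cycles_needed(simple_ops, transcendental_ops) / elements;
}
```

Under this model a float thread doing one exp() costs `cycles_per_element(0, 1, 1)`, i.e. 1 cycle per element, and a float4 thread doing four exp() calls costs `cycles_per_element(0, 4, 4)`, again 1 cycle per element. Four simple ops, by contrast, fit in a single cycle (`cycles_needed(4, 0)`), which is where float4 normally pays off.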
Thanks a lot.
However, when I compare the two projects, simple_matmult and optimized_matmult (with float4 and four blocks), over 100 iterations,
simple_matmult is faster than optimized_matmult. That's very strange.