With that shared memory configuration, I think accessing an 8-byte primitive data type, or any packed data type that is 8 bytes in size, will double the shared memory bandwidth compared with 4-byte accesses. That's why I said packed fp32.
I think that was Arun's point. You need to pack your 32-bit LDS accesses into 64-bit ones in order to maximize bandwidth. This was not the case on prior architectures, so it's an indication that single precision took a back seat on Kepler.
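
For anyone who wants to try this, here is a rough sketch of what "packing" could look like in practice. It is not taken from any posted code: the kernel name `packedReverse`, the helper `runPackedCopy`, and the block size of 128 are all made up for illustration. The idea is to request 8-byte banks with `cudaDeviceSetSharedMemConfig` and have each thread move one `float2` through shared memory, so every shared access is 64 bits wide instead of 32. The sketch only checks correctness; whether you actually see the doubled bandwidth would have to be measured on a Kepler part.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread stores one float2 (two packed fp32 values)
// to shared memory and reads back another thread's slot, so both the store
// and the load are 8-byte shared memory accesses.
__global__ void packedReverse(const float2* in, float2* out)
{
    extern __shared__ float2 tile[];              // blockDim.x elements
    int base = blockIdx.x * blockDim.x;

    tile[threadIdx.x] = in[base + threadIdx.x];   // 64-bit shared store
    __syncthreads();
    out[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];  // 64-bit shared load
}

// Hypothetical host helper. Assumes n is a positive multiple of kThreads.
// Returns true if every block came back reversed as expected.
bool runPackedCopy(int n)
{
    const int kThreads = 128;                     // illustrative block size
    if (n <= 0 || n % kThreads != 0) return false;

    // Ask for 8-byte banks. This matters on devices with configurable bank
    // size (compute capability 3.x); elsewhere the call has no effect.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    std::vector<float2> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) hIn[i] = make_float2((float)i, -(float)i);

    size_t bytes = n * sizeof(float2);
    float2 *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);
    packedReverse<<<n / kThreads, kThreads, kThreads * sizeof(float2)>>>(dIn, dOut);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    // Element t of each block should hold the input from slot kThreads-1-t.
    for (int i = 0; i < n; ++i) {
        int block = i / kThreads, t = i % kThreads;
        const float2& want = hIn[block * kThreads + (kThreads - 1 - t)];
        if (hOut[i].x != want.x || hOut[i].y != want.y) return false;
    }
    return true;
}
```

Calling `runPackedCopy(1024)` from `main` should return true on any CUDA device. To compare against the unpacked case you would write the same kernel with plain `float` loads and time both versions.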