OK, my mind is blown. That would imply Titan gets about the same SGEMM performance as the HD 5870 (nearly 2 TFLOPS).
DGEMM should be fine, though (90%+ of theoretical peak).
Hmmmm, actually Titan throws a wrench into my theory. It gets ~3.2 TFLOPS in cuBLAS SGEMM. That's ~72% of peak, similar to K20. Maybe there's more to GK110 than NVIDIA is letting on, but I can't find anything on observed peak instruction throughput.
http://on-demand.gputechconf.com/gtc-express/2012/presentations/inside-tesla-kepler-k20-family.pdf
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3
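
For anyone who wants to check the percent-of-peak arithmetic, here is a rough sketch. The Titan specs in it (2688 CUDA cores, 837 MHz base clock) are my assumption rather than numbers from the posts above, and the helper names are made up for illustration. It counts one fused multiply-add per core per cycle as 2 FLOPs.

```python
def peak_tflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per core per cycle.

    The default of 2 counts one fused multiply-add as two floating-point ops.
    """
    return cores * clock_ghz * flops_per_core_per_cycle / 1000.0


def efficiency(measured_tflops, peak):
    """Fraction of theoretical peak actually achieved."""
    return measured_tflops / peak


# Assumed GTX Titan specs (not taken from the thread): 2688 cores, 0.837 GHz base clock.
titan_peak = peak_tflops(2688, 0.837)

# ~3.2 TFLOPS is the cuBLAS SGEMM figure quoted above.
titan_eff = efficiency(3.2, titan_peak)

print(f"Titan peak: {titan_peak:.2f} TFLOPS, SGEMM efficiency: {titan_eff:.0%}")
```

With the base clock this lands at roughly 71% of a ~4.5 TFLOPS peak, which is in the same ballpark as the ~72% quoted. Using the boost clock instead would raise the peak and lower the percentage a bit, so the exact figure depends on which clock you assume the card actually ran at.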