The way I understand the article is:
4 DP FLOPS per "core" => 8 SP FLOPS => 4 SP FMAs => he's actually talking about 4 SIMD lanes when he says "core".
He said 8 of those "cores" per SM, so that's 8 x 4 = 32SPs/SM, just like today but apparently organized differently.
He also said 128SMs total, so that's 4096SPs total, or 4096 DP FLOPS per clock. You'd need roughly 2.5GHz to get 10TFLOPS out of that. That won't be easy, but it might be doable.
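To spell out the arithmetic, here is a quick Python sketch. It assumes the 10TFLOPS target counts the 4 DP FLOPS per "core" per clock, since that is the reading that lands near 2.5GHz (counting 8 SP FLOPS per core would halve the required clock). The constant and function names are mine, not from the article.

```python
# Figures taken from the post; names are illustrative only.
DP_FLOPS_PER_CORE_PER_CLOCK = 4   # "4 DP FLOPS per core"
LANES_PER_CORE = 4                # 4 DP FLOPS => 8 SP FLOPS => 4 SP FMAs => 4 SIMD lanes
CORES_PER_SM = 8
SM_COUNT = 128
TARGET_FLOPS = 10e12              # 10 TFLOPS, assumed to be a DP figure


def total_sps(lanes_per_core, cores_per_sm, sm_count):
    """Total SIMD lanes (SPs) on the chip."""
    return lanes_per_core * cores_per_sm * sm_count


def required_clock_ghz(target_flops, flops_per_core_per_clock, cores_per_sm, sm_count):
    """Clock in GHz needed to hit target_flops at the given per-clock throughput."""
    flops_per_clock = flops_per_core_per_clock * cores_per_sm * sm_count
    return target_flops / flops_per_clock / 1e9


sps_per_sm = LANES_PER_CORE * CORES_PER_SM
chip_sps = total_sps(LANES_PER_CORE, CORES_PER_SM, SM_COUNT)
clock = required_clock_ghz(TARGET_FLOPS, DP_FLOPS_PER_CORE_PER_CLOCK, CORES_PER_SM, SM_COUNT)

print(sps_per_sm)        # 32
print(chip_sps)          # 4096
print(f"{clock:.2f}")    # 2.44
```

So the exact figure is a bit under 2.5GHz; running at 2.5GHz would give slightly more than 10TFLOPS.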
At 40nm, you get 512SPs. If the SP count doubles with each node, you should get about 1024 at 28nm, 2048 at 20nm, and finally 4096 at 14nm.
28nm will be available in late-ish 2011, 20nm should follow in late 2013, and finally 14nm in 2015~2016. Let's call it 2017 to allow for delays, and 2018 because things don't always scale perfectly, especially when you add more functionality, and NVIDIA needs pretty high clocks on top of that. So I think 10TFLOPS from NVIDIA in 2018 is realistic. That is, if NVIDIA still exists in 2018.
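The node-by-node guess is just a doubling per shrink. A minimal sketch of that projection, assuming perfect 2x scaling per node (which, as said, is optimistic); the dates are my estimates from above and the variable names are made up for illustration:

```python
# (node, expected availability) pairs as estimated in the post.
NODES = [
    ("40nm", "now"),
    ("28nm", "late-ish 2011"),
    ("20nm", "late 2013"),
    ("14nm", "2015~2016"),
]
BASE_SPS = 512  # SP count at 40nm


def project_sps(base_sps, nodes):
    """Double the SP count at every node after the first (idealized scaling)."""
    projection = []
    sps = base_sps
    for name, when in nodes:
        projection.append((name, sps, when))
        sps *= 2
    return projection


projection = project_sps(BASE_SPS, NODES)
for name, sps, when in projection:
    print(f"{name}: {sps} SPs ({when})")
```

The last entry is the 4096SPs needed, which is why the estimate hangs on when 14nm actually shows up.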
Obviously, AMD should be there a lot sooner, since Cayman will probably break 3TFLOPS next month.