PCGH held a pretty long interview with Bill Dally, Nvidia's chief scientist. Some highlights:
http://www.pcgameshardware.com/aid,...chnology-DirectX-11-and-Intels-Larrabee/News/
Our understanding of Larrabee, which is based on their paper at SIGGRAPH last summer and the two presentations at the Game Developers Conference in April, is that they have fixed-function hardware for texture filtering, but they do not have any fixed-function hardware for rasterization or compositing, and I think that puts them at a very serious disadvantage. Because for those parts of the graphics pipeline, they're gonna have to pay 20 times or more the energy that we will for those computations.
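To put a rough scale on that claim, here's a back-of-the-envelope sketch in Python. Only the 20x factor comes from the quote; the per-pixel energy and fill rate are made-up illustrative values.

```python
# Back-of-the-envelope sketch of the 20x energy argument above.
# Every number here is an illustrative assumption, not from the interview.

FIXED_FUNCTION_PJ_PER_PIXEL = 10.0   # assumed energy of hardware rasterization
SOFTWARE_PENALTY = 20.0              # the "20 times or more" factor quoted above
PIXELS_PER_SECOND = 10e9             # assumed fill rate: 10 Gpixels/s

def raster_power_watts(pj_per_pixel: float) -> float:
    """Power spent on rasterization alone at the assumed fill rate."""
    return pj_per_pixel * 1e-12 * PIXELS_PER_SECOND

print(f"fixed-function: {raster_power_watts(FIXED_FUNCTION_PJ_PER_PIXEL):.1f} W")
print(f"software (20x): {raster_power_watts(FIXED_FUNCTION_PJ_PER_PIXEL * SOFTWARE_PENALTY):.1f} W")
```

With these assumed numbers, a job the fixed-function unit does in a tenth of a watt costs a software pipeline a couple of watts; the absolute values are invented, but the 20x gap is the point of the quote.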
The ratio of texturing to FLOPS actually tends to hold pretty constant, and that's driven by what the shaders we consider important are using. We're constantly benchmarking against different developers' shaders to see what our performance bottlenecks are. If we're gonna be texture-limited on our next generation, we pop another texture unit down. Our architecture is very modular, and that makes it easy to re-balance.
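The balancing act he describes boils down to comparing how many clocks a shader spends on math versus texturing. A minimal sketch, where the unit throughputs and shader mixes are all hypothetical:

```python
# Minimal sketch of the texture/FLOPS balancing act described above.
# Unit throughputs and shader workloads are hypothetical, not real GPU figures.

ALU_OPS_PER_CLOCK = 240   # assumed shader ALU throughput (ops/clock)
TEXELS_PER_CLOCK = 80     # assumed filtered-texel throughput (texels/clock)

shaders = {               # (ALU ops, texture fetches) per pixel, made up
    "skin":  (40, 24),
    "water": (300, 10),
    "hud":   (20, 8),
}

for name, (alu, tex) in shaders.items():
    alu_clocks = alu / ALU_OPS_PER_CLOCK   # clocks the math takes
    tex_clocks = tex / TEXELS_PER_CLOCK    # clocks the texturing takes
    limiter = "texture" if tex_clocks > alu_clocks else "ALU"
    print(f"{name:>5}: ALU {alu_clocks:.2f} clk, TEX {tex_clocks:.2f} clk -> {limiter}-limited")
```

If the shaders that matter keep coming out texture-limited in a survey like this, the modular fix from the quote is to raise the texel rate ("pop another texture unit down") in the next design.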
The ratio of FLOPS to off-chip bandwidth is increasing. This is, I think, driven by two things. One is that, fortunately, the shaders are becoming more complex; that's what they want anyway. The other is that it's just much less expensive to provide FLOPS than it is to provide bandwidth. So you tend to provide more of the thing which is less expensive, and then try to completely saturate the critical, expensive resource, which is the memory bandwidth.
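This trade-off is what a roofline-style calculation captures: the FLOPS-to-bandwidth ratio of the chip tells you how much arithmetic a kernel must do per byte of off-chip traffic before compute, rather than memory, becomes the limit. A quick sketch, using approximate GT200-class peaks (my figures, not from the interview):

```python
# Roofline-style sketch of the FLOPS-vs-bandwidth balance described above.
# Peaks are approximate GT200-class figures, used only for illustration.

peak_flops = 933e9   # ~933 GFLOP/s single precision (approximate)
peak_bw = 141.7e9    # ~141.7 GB/s off-chip bandwidth (approximate)

balance = peak_flops / peak_bw   # FLOPs per byte needed to saturate both
print(f"machine balance: {balance:.1f} FLOPs per byte")

def attainable_gflops(flops_per_byte: float) -> float:
    """Attainable throughput for a kernel with the given arithmetic intensity."""
    return min(peak_flops, flops_per_byte * peak_bw) / 1e9

for ai in (0.5, 2.0, 8.0, 32.0):
    print(f"intensity {ai:5.1f} FLOP/B -> {attainable_gflops(ai):7.1f} GFLOP/s")
```

Kernels below the balance point are bandwidth-bound no matter how many ALUs you add, which is why the cheap resource (FLOPS) keeps growing relative to the expensive one (bandwidth).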
It is critically important to the people who do scientific computing on our GPUs to have double precision. So going forward, the GPUs that we aim at the scientific computing market will have even better double-precision floating point than what's in GT200. That ratio of double precision to single precision, which is now one double-precision operation per eight single-precision operations, will get closer. An ultimate ratio to target is something like two to one.
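Those ratios translate directly into peak throughput. A tiny sketch, assuming GT200's approximate MAD-only single-precision peak (an assumption on my part, not a figure from the interview):

```python
# Arithmetic on the DP:SP ratios quoted above.
# The single-precision peak is an approximate, assumed GT200 figure.

sp_gflops = 624.0   # assumed SP peak (MAD only), GFLOP/s

for label, dp_per_sp in [("GT200-style 1:8", 1 / 8), ("targeted 1:2", 1 / 2)]:
    print(f"{label}: {sp_gflops * dp_per_sp:.0f} GFLOP/s double precision")
```

Moving from 1:8 to 1:2 would quadruple double-precision throughput even if the single-precision peak stood still.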
I think that we're increasingly becoming limited by memory bandwidth on both the graphics and the compute side. And I think there's an opportunity, as we go from the hundreds of processors we're at today to the thousands of cores we're gonna be at in the near future, to build more robust memory hierarchies on chip to make better use of the off-chip bandwidth.
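A standard illustration of what an on-chip hierarchy buys you (my example, not from the interview): tiling a matrix multiply through on-chip memory cuts off-chip traffic roughly in proportion to the tile size.

```python
# Sketch of why on-chip memory hierarchies stretch off-chip bandwidth:
# estimated DRAM traffic for an n x n matrix multiply, naive vs tiled.
# Matrix size and tile size are assumed values for illustration.

n = 4096     # matrix dimension (assumed)
b = 32       # tile edge that fits in on-chip memory (assumed)
word = 4     # bytes per float

# Naive: every multiply-add re-reads its A and B operands from DRAM,
# so traffic grows as O(n^3).
naive_bytes = 2 * n**3 * word

# Tiled: each b x b tile is loaded once and reused b times from on-chip
# storage, giving O(n^3 / b) traffic instead.
tiled_bytes = 2 * n**3 * word / b

print(f"naive traffic: {naive_bytes / 2**30:,.1f} GiB")
print(f"tiled traffic: {tiled_bytes / 2**30:,.1f} GiB  ({b}x less)")
```

The same off-chip bandwidth then feeds b times as much arithmetic, which is exactly the kind of reuse a more robust on-chip hierarchy is meant to enable.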