What is a reasonable resolution, sub 1080p?
I'd say anything up to and including 1920x1200. Beyond that can be considered an "enthusiast" resolution IMO.
And I guess several people said already that the API overhead is usually significantly lower on closed console systems, enabling quite a few more draw calls with the same computing resources.
True, it's obviously not all about draw calls, but I take on board what you're saying there, i.e. in a console the ratio of a particular CPU to GPU will be a little more in favour of the CPU than those same components in a PC, thus for a "balanced" system you don't necessarily need as much CPU power as you would in a PC.
And even if you are right, the devs are not stupid. If there is GPU power to spare (that is also what CPU limited means), they will crank up the effects, the AA, whatever. Typically nobody cares about how much higher than 60 fps you could run on consoles.
I completely agree. I've never really thought much of the balanced system concept anyway, to be honest. If you have an 'excess' of CPU power in a console then the CPU will be given more work to do, like helping out with graphics or additional physics etc., and the same goes for the GPU. Even in the PC the concept doesn't hold water, since you can pile on image quality, resolution, 3D etc.
But outside of that, even FMA brings BD/PD just to parity with Jaguar in a pure FP throughput per clock scenario (what may be attainable is an SMT like speed up of 20% or something through better usage of the available resources). PD gets 16 Flops per module, i.e. 8 flops per core peak with FMA, but only 8 flops per module or 4 flops per core with MULs and ADDs (but flexible in all relative abundances). Jaguar gets 8 flops per core with MULs/ADDs (50:50 mix) and dips to 4 flops per core for solely MULs or ADDs. In any case, throughput per clock is basically never lower on Jaguar than on PD save for the exceptions mentioned above.
Thanks for the improved understanding. So assuming 2 PD/SR modules running at 3.2GHz and 8 Jaguar cores running at 1.6GHz, you effectively have the following extreme scenarios:
All ADD or All MUL code: Even
All FMADD code: Even
A perfect 50-50 split between ADD and MUL: Jaguar has twice the performance
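To sanity check those three scenarios, here is a rough sketch of the peak numbers. It only uses the per-clock figures quoted above; the table layout and function names are my own, and it assumes FMADD work on Jaguar is issued as a separate MUL and ADD. These are theoretical peaks, not measured results.

```python
# Peak flops per core per clock, from the figures quoted above.
# PD/SR: 16 flops per module with FMA (8 per core), 8 per module otherwise (4 per core).
# Jaguar: 8 per core with a 50:50 MUL/ADD mix, 4 per core for all-MUL or all-ADD.
# Assumption: an FMADD on Jaguar is modelled as one MUL plus one ADD (a 50:50 mix).
FLOPS_PER_CORE_PER_CLOCK = {
    "PD": {"all_add_or_mul": 4, "fmadd": 8, "even_mix": 4},
    "Jaguar": {"all_add_or_mul": 4, "fmadd": 8, "even_mix": 8},
}


def peak_gflops(arch, cores, ghz, workload):
    """Theoretical peak GFLOPS for a core count, clock and instruction mix."""
    return FLOPS_PER_CORE_PER_CLOCK[arch][workload] * cores * ghz


def compare(workload, pd_ghz=3.2, jaguar_ghz=1.6):
    """Return (PD GFLOPS, Jaguar GFLOPS, Jaguar/PD ratio) for 2 modules vs 8 cores."""
    pd = peak_gflops("PD", 4, pd_ghz, workload)  # 2 modules = 4 cores
    jaguar = peak_gflops("Jaguar", 8, jaguar_ghz, workload)
    return pd, jaguar, jaguar / pd


if __name__ == "__main__":
    for workload in ("all_add_or_mul", "fmadd", "even_mix"):
        pd, jaguar, ratio = compare(workload)
        print(f"{workload}: PD {pd:.1f}, Jaguar {jaguar:.1f}, ratio {ratio:.2f}")
```

Changing pd_ghz or the core counts shows how quickly the picture shifts once the clocks or module counts move away from that baseline.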
The reality is of course going to be a mix of the above. Starting from the extreme of the Jaguar based CPU being twice as fast, adding FMADDs into the codebase will close the gap, then a lack of balance between ADDs and MULs in the rest of the code will close it further. Now consider that PD in Richland is running at 4.1GHz, so we could assume Steamroller, which would have been the console equivalent, to run at a similar speed (given that the expected total power draw of Kaveri at this speed with 8 CUs is still only 100W, that seems reasonable). That's a 28% speed increase over the baseline position of 3.2GHz.
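The clock arithmetic there is simple enough to sketch. This just applies the 4.1GHz vs 3.2GHz figures to the worst case for PD/SR (the 2x Jaguar lead on a perfect 50-50 mix); the variable names are mine and nothing here is measured.

```python
BASELINE_GHZ = 3.2  # baseline PD/SR clock used in the scenarios
HIGHER_GHZ = 4.1    # desktop APU clock quoted above

# Relative clock increase over the baseline, about 28%.
uplift = HIGHER_GHZ / BASELINE_GHZ - 1.0

# Worst case for PD/SR at baseline: Jaguar at twice the peak throughput.
JAGUAR_LEAD_AT_BASELINE = 2.0

# Peak throughput scales linearly with clock, so the lead shrinks accordingly.
jaguar_lead_after_uplift = JAGUAR_LEAD_AT_BASELINE / (1.0 + uplift)

print(f"clock uplift: {uplift:.1%}")
print(f"Jaguar lead on a 50-50 mix after uplift: {jaguar_lead_after_uplift:.2f}x")
```

So even in the most Jaguar friendly case the lead would drop from 2x to roughly 1.5x before any FMADD or unbalanced code is taken into account.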
When you add it all up, it looks like a 2 module Steamroller could have been fairly comparable in SIMD throughput while still being faster in scalar code and much faster in any kind of single threaded code.
Then consider that the console configuration of Jaguar doubles the PC APU configuration (4 -> 8 cores). If they had done the same for Steamroller we'd have been looking at 4 modules rather than 2. There just wouldn't have been a comparison in that case, even if those 4 modules were running at a much slower clock speed than the desktop variants to save power (say 3.2GHz).
Of course one could argue that PD clocks higher. But what is left from that if you have to fit 8 PD cores in 25W? Or do you really get 4 PD cores in 25W to clock more than twice as high as Jaguar? What does it cost in die size in comparison? Can you port PD easily to the 28nm process of your choice? How does this affect the clocks and power consumption?
This is obviously the key decision point: could you get 2 or 4 PD/SR modules into the APU within an acceptable power and die size envelope? I honestly don't know much about the relative die size or power draw of each core type, other than to assume PD/SR would be much larger and hotter. Although, as I mentioned above, 2 modules fit nicely on a 32nm APU at 4.1GHz with 6 CUs running at 844MHz within 100W, so there certainly seems to be plenty of power budget there for an additional 12 CUs all running at 800MHz on a 28nm APU. Adding another 3 SR modules might be pushing things a bit though.
Note Steamroller is the 28nm version of Piledriver so we should really be talking in SR terms rather than PD.