NVIDIA Kepler speculation thread

Performance/mm^2 does seem quite high, but these are also some of the games where Nvidia fared better in the past (and not at the highest resolution available). No word on real power consumption and overclockability (is it even possible?).

Yeah, those benchmarks are a bit odd. Few games, all with 8xMSAA, mostly Nvidia-slanted benchmarks, and all at 1080p, none at 2560 where the 680's bandwidth deficit should come into play. Wish the review was more thorough and varied.

Plus Dave hinted at a 1 GHz 7970. That alone would make 3DMark kind of a tie and well, BF3 and LP2 are heavily Nvidia-slanted benchmarks so...
 
I'm sure someone will compare some widescreen (multi-monitor) gaming soon...only benchies I care about to be honest. 19x10 or 19x12 is so 2010.
 
Maybe the RF/Scheduler block takes more area than we think? Especially if the schedulers are fully associative to the SIMD lanes, and not bound to a subset, like in GCN, e.g. every scheduler can issue an instruction to any SIMD. That would be a really huge overhead, if true. :???:
And it would be a huge power overhead. Considering nVidia's focus on performance/W and all the talk of minimizing the distance data has to travel, it is not going to happen.

The register files are in close proximity to the vALUs. And instructions can only be issued to a vALU that has access to the register file where all the values of the thread/warp are stored. Therefore, the warps need to be distributed among the schedulers, each of which can only schedule to a subset of the vALUs. A warp can't change its scheduler (its subset of vector ALUs) after it has been assigned to one. Otherwise all of its values would have to be shuffled between the individual register files. That's simply implausible.
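
To put a rough number on that argument, here is a minimal sketch. Everything in it is an assumption for illustration, not known Kepler behaviour: 32 threads per warp, 32-bit registers, four schedulers, and a simple modulo rule for pinning a warp to a scheduler (the names `home_scheduler` and `migration_bytes` are made up).

```python
WARP_SIZE = 32            # assumption: threads per warp
BYTES_PER_REGISTER = 4    # assumption: 32-bit registers
NUM_SCHEDULERS = 4        # assumption: scheduler count, for illustration only


def home_scheduler(warp_id, num_schedulers=NUM_SCHEDULERS):
    """Pin a warp to one scheduler (and its register file slice) for its lifetime.

    The modulo rule is a hypothetical policy, chosen only because it is simple.
    """
    return warp_id % num_schedulers


def migration_bytes(regs_per_thread):
    """Bytes that would have to move if a warp switched register files."""
    return WARP_SIZE * regs_per_thread * BYTES_PER_REGISTER


if __name__ == "__main__":
    for warp_id in range(6):
        print(f"warp {warp_id} -> scheduler {home_scheduler(warp_id)}")
    for regs in (8, 32, 63):
        print(f"{regs} regs/thread -> {migration_bytes(regs)} bytes per migration")
```

Even at a modest 8 registers per thread that is a kilobyte of register state per warp, which is why a fixed warp-to-scheduler assignment looks like the cheaper design.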
 
BF3 graph has reportedly been fixed; the GTX 680 is only 1.7 FPS ahead in it now
bf3zc.png
 
The PolyMorph Engine, which is primarily responsible for tessellation calculations, has also been upgraded to version 2.0 in "Kepler". The integrated tessellator has been updated, and its computational efficiency is 2 times that of "Fermi" and 4 times that of the Radeon HD 7970.

:LOL:
And now I expect the usual excuses from AMD in the style of: "you don't need it now and so much...". Ok, if so, then why is your tessellator Gen. 9? :LOL:
 
Well, show me the games where AMD's tessellator has been the bottleneck and tessellation has been used in an at least remotely sane way? ;)
 
No, but it plays a mean Angry Birds!
Sorry for the OT post but can the iPad 3 finally display websites that use Flash without hacking?

The majority of the sites I visit use Flash. My neighbor complains about her iPad not being able to view many of her favorite sites.

Jobs's holy war against everything he doesn't control has so far turned me off any current Apple product, especially when it affects usability.
 
Apple still doesn't support Flash.
 
And it would be a huge power overhead. Considering nVidia's focus on performance/W and all the talk of minimizing the distance data has to travel, it is not going to happen.

The register files are in close proximity to the vALUs. And instructions can only be issued to a vALU that has access to the register file where all the values of the thread/warp are stored. Therefore, the warps need to be distributed among the schedulers, each of which can only schedule to a subset of the vALUs. A warp can't change its scheduler (its subset of vector ALUs) after it has been assigned to one. Otherwise all of its values would have to be shuffled between the individual register files. That's simply implausible.
Well, since the actual layout of the RF apparently doesn't match the supposed number of SIMDs, it may very well be a plausible assumption. And moreover, if Kepler treats the RF as a single address space, like in Fermi... :rolleyes:
 
Pic regarding the link posted earlier regarding the leaked benchmark results...

pic1hk.jpg


I'm not sure how accurate this is, but what got my attention was the power consumption when there is no load.
 
Is this made from their GTX680 & 7970 reviews put together or...?
Also, any word on powertune settings used?
 
source
That's where I got the chart from. I don't know how accurate it is, but it does suggest that at the desktop it's consuming more power. If that's true, I have to wonder what the RAM is clocked at in 2D mode?
 
So we should be expecting not so great idle and long idle consumption numbers :p
 
Does anyone have a clever idea how this really works?

Not sure if it's clever, but... here's a wild guess: 6 ALUs, one LD/ST and one SFU are right next to each other. This mini-cluster is then replicated 8 times vertically and another 4 times horizontally. I'd guess 1 in 6 (maybe 1 in 3) ALUs are DP capable, and that groups of 4 threads (pixel quads) are handled by a single mini-cluster (to get full ALU utilization, you need SP IPC of 1.5). This way, each set of 8 mini-clusters handles a 32 wide warp per cycle and is associated with a single dual issue scheduler.

Edit: quads assigned to groups of 6 ALUs might also explain why the number of registers is not a multiple of 3. Each mini-cluster would have access to 2048 registers, assuming the slide is correct, which is 64 warps at 8 registers per thread - not sure how that compares to Fermi.
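
For anyone who wants to check the arithmetic behind that guess, here is a quick sketch. The constants are the ones from the post above; the variable names are made up, and none of this says anything about how the hardware is actually laid out.

```python
# Numbers taken from the wild guess above; names are illustrative only.
ALUS_PER_CLUSTER = 6       # ALUs in one mini-cluster
THREADS_PER_CLUSTER = 4    # one pixel quad per mini-cluster
CLUSTERS_VERTICAL = 8      # replicated 8 times vertically
CLUSTERS_HORIZONTAL = 4    # and 4 times horizontally
REGS_PER_CLUSTER = 2048    # registers per mini-cluster, assuming the slide is correct
REGS_PER_THREAD = 8        # occupancy example used in the post

clusters = CLUSTERS_VERTICAL * CLUSTERS_HORIZONTAL
total_alus = clusters * ALUS_PER_CLUSTER
sp_ipc_for_full_use = ALUS_PER_CLUSTER / THREADS_PER_CLUSTER
threads_per_set = CLUSTERS_VERTICAL * THREADS_PER_CLUSTER
total_regs = clusters * REGS_PER_CLUSTER
regs_per_lane = REGS_PER_CLUSTER // THREADS_PER_CLUSTER
warps_resident = regs_per_lane // REGS_PER_THREAD

print("mini-clusters:", clusters)
print("total ALUs:", total_alus)
print("SP IPC needed for full ALU use:", sp_ipc_for_full_use)
print("threads per set of 8 mini-clusters:", threads_per_set)
print("total registers:", total_regs, "(multiple of 3:", total_regs % 3 == 0, ")")
print("warps at 8 registers per thread:", warps_resident)
```

So the guess is at least internally consistent: 32 mini-clusters give 192 ALUs, a set of 8 covers a 32-wide warp, and 512 registers per lane hold 64 warps at 8 registers per thread.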
 