NVIDIA Kepler speculation thread

Well, somebody once claimed that it's fundamentally impossible to have 4x distributed geometry processing without massive latency and gargantuan power consumption. That same someone then inexplicably tried to prove it by pointing at, wait for it: a cell characterization problem that was fixed with a metal patch. (WTF?) Then he murmured something about crossbars too, which is funny, because I didn't expect a crossbar that serves memory to have much to do with distributing geometry at the front end. Right? It even made him speculate that GK104 would only have 2x geometry, because that's what sensible people do.

It all makes me wonder: if Nvidia had used a 2x configuration instead of 4x, how much lower do you think GK104's power consumption could have been?

What a missed opportunity...

Edit: I forgot the best one: the distributed geometry architecture is responsible for increased power consumption during compute. You can't make this up...

My favorite was going from a statement that Fermi would be terrible at tessellation, to claiming Fermi had *too much* tessellation, when it turned out that Fermi was actually quite good at tessellation.
 
I think Charlie was the one who started the nonsense about Fermi's tessellation abilities.

Unfortunately most of that tessellation power is used to tessellate square Jersey barriers in Crysis 2.

Anyway, supposedly they buffed it up in Kepler by a factor of two, so pulling ahead in Heaven is not too surprising.
 
Latency? Crossbar interconnects are the first choice for low-latency communication between a moderately large number of clients.
Yup.

The problem with complex cross-bar interconnects is the accumulation of hotspots due to signal crossings.
Yup, that too. But I'm used to crossbars that are WAY bigger than whatever they use in GPUs. Look at it this way: in a GPU, how much area do you think a crossbar with just 4 units (because that's what we're talking about in this case) takes compared to the rest of the die?

In GF100 the distributed nature of both geometry processing (16x) and primitive setup (4x) called for a very dense wiring mesh. JHH said that this aspect of the architecture was the main reason for the product delays and metal re-spins.
All he said there was that the first spin came back dead because TSMC's process parameters didn't match real life. For a standard-cell-based design, that amounts to a characterization issue. It does not point to a fundamental architecture problem.

The other obstacle was the large transistor leakage variance.
Yes. But that has nothing to do with fundamental architecture issues: a fundamental architecture problem is not something you can fix with even a limited base-layer spin, IMO.
 
Or you might simply ask TechPowerUp's w1zzard if GPU-Z 0.5.9 fully supports Kepler yet. [Hint: it doesn't, as you have noted above w.r.t. shader clocks.]
Detecting basic GPU/memory frequencies should be quite trivial even if the particular chip isn't supported, no?
I mean, it does detect the memory frequency correctly anyway?
 
Yup, that too. But I'm used to crossbars that are WAY bigger than whatever they use in GPUs. Look at it this way: in a GPU, how much area do you think a crossbar with just 4 units (because that's what we're talking about in this case) takes compared to the rest of the die?

So how wide is the offending crossbar?
- 4 In, 4 Out
- how wide is each bus? 32-bits? 128-bits?

It's probably more of a track routing issue than a number-of-transistors issue....
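
Just to put rough numbers on it, here's a back-of-envelope sketch in Python (the bus widths are pure guesses, taken from the question above): a full crossbar needs on the order of inputs × outputs × bus-width wires, which is exactly why it reads as a track-routing problem rather than a transistor-count problem.

```python
# Back-of-envelope wire count for a full crossbar: every input port
# has a path to every output port, so wires scale as n_in * n_out * width.
def crossbar_wires(n_in: int, n_out: int, bus_width_bits: int) -> int:
    return n_in * n_out * bus_width_bits

for width in (32, 128):  # guessed bus widths from the post above
    print(f"4x4 crossbar, {width}-bit buses: {crossbar_wires(4, 4, width)} wires")
```

That's 512 wires at 32 bits and 2048 at 128 bits: the logic behind them is almost nothing, it's the tracks you have to route and cross that cost you.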
 
Detecting basic GPU/memory frequencies should be quite trivial even if the particular chip isn't supported, no?
I mean, it does detect the memory frequency correctly anyway?
There is always the possibility that a new generation introduces new power states or swizzles the old ones, so I would not take for granted anything displayed by a utility that does not support a particular architecture.
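
For what it's worth, here's a minimal sketch of what a clock readout through a public driver API looks like, using NVML via the pynvml bindings (my choice for illustration, not what GPU-Z actually uses). The point is that a tool only sees whatever the driver reports for the current power state:

```python
# Minimal sketch: query current GPU/memory clocks through NVML.
# Assumes the pynvml bindings; GPU-Z uses its own low-level paths,
# so this only illustrates the general idea.
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetClockInfo, NVML_CLOCK_GRAPHICS, NVML_CLOCK_MEM)

nvmlInit()
try:
    dev = nvmlDeviceGetHandleByIndex(0)
    # These values reflect the *current* power state; on a chip the tool
    # doesn't know, a new or reshuffled state can make them look plausible
    # and still be wrong.
    print("core:", nvmlDeviceGetClockInfo(dev, NVML_CLOCK_GRAPHICS), "MHz")
    print("mem: ", nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM), "MHz")
finally:
    nvmlShutdown()
```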
 
Given the theory that nVidia originally intended GK104 for a lower market segment, can we expect unrestricted geometry throughput from its 4 GPCs? GF114 ran at full speed.
 
Given the theory that nVidia originally intended GK104 for a lower market segment, can we expect unrestricted geometry throughput from its 4 GPCs? GF114 ran at full speed.
I don't expect Kepler to be any different from Fermi in consumer vs. professional segmentation -- half-rate setup without tessellation and full-rate with tessellation enabled.
 
So how wide is the offending crossbar?
- 4 In, 4 Out
- how wide is each bus? 32-bits? 128-bits?

It's probably more of a track routing issue than a number-of-transistors issue....
It's not going to change the opinion of the gullible, but 3 years after the fact, it's probably time to settle this once and for all: the issue in GF100-A01 was in a back-end bus that fed the memory controllers. It was not even in the general xbar that interconnects the usual agents. There was a custom-designed cell with a timing violation that was not picked up during characterization.

The net result was a broken MC system (no transactions to external memory at all), but not a bricked chip: major parts could be verified by rendering to PC memory over PCIe. A02 fixed all known bugs, but not those that were hiding behind MC-specific paths, so A03 was needed.

GF100-A01 had no issues at all with distributing geometry across GPCs. Distributed geometry never comes up in discussions about power. I don't think it should surprise anyone with a bit of a brain that SMs+TEX are where the power is.

Also: don't fret so much about crossbars in general. It's under control.

(Crawling back into my bear cave...)
 
It's not going to change the opinion of the gullible, but 3 years after the fact, it's probably time to settle this once and for all: the issue in GF100-A01 was in a back-end bus that fed the memory controllers. It was not even in the general xbar that interconnects the usual agents. There was a custom-designed cell with a timing violation that was not picked up during characterization.
Is that the main reason for the low memory clocks in Fermi?
 
The slide with the "adaptive VSync" appears a bit strange. What they describe is basically vsync'd triple buffering. Otherwise tearing would also appear below the refresh rate. :rolleyes:

From what I understand, that's exactly how it works. Above 60 fps you get vsync, and below it you get tearing but much smoother transitions. This is how many console games work, and it's long overdue IMO. It would be nice if the limit could be set manually to 30 fps as well, though. Although given the choice between the two, especially on a card of this power, I'd take 60 fps any day.
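
For anyone curious, that behavior boils down to something like this toy sketch (hypothetical names, not NVIDIA's actual driver logic): sync to vblank only when the frame fit inside one refresh interval, otherwise present immediately.

```python
# Toy model of adaptive vsync (hypothetical API, not NVIDIA's driver code):
# vsync on when the frame fits in one refresh interval, off when it doesn't.
REFRESH_HZ = 60.0
FRAME_BUDGET_S = 1.0 / REFRESH_HZ  # ~16.7 ms at 60 Hz

def present_adaptive(frame_time_s: float, swap_chain) -> None:
    if frame_time_s <= FRAME_BUDGET_S:
        # Fast enough: wait for vblank, no tearing.
        swap_chain.present(sync_interval=1)
    else:
        # Too slow: present immediately. This tears, but avoids the
        # hard drop to 30 fps that plain vsync would force.
        swap_chain.present(sync_interval=0)
```

A user-settable budget (e.g. 1/30 s instead of 1/60 s) is all it would take to get the manual 30 fps limit wished for above.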

I've gotta say I'm pretty damn excited about Kepler after the recent leaks. Outperforming the 7970 with all the benefits of adaptive vsync, TXAA and PhysX support. And all that at lower power draw/heat and presumably lower noise. I won't be going for the top-end (rip-off) edition, but a 670 Ti should be a massive upgrade over my 4890.
 