Nvidia BigK GK110 Kepler Speculation Thread

Ailuros · Apr 23, 2012

CarstenS said:
Plus as Big-K special sauce:
- one dedicated physx processor per SMK
- a broken and unfixable design
*SCNR*

Don't be so mean

....do you want fries with that?

Ailuros · Apr 23, 2012

iMacmatician said:
Well there's this set of rumors/speculation from 3DCenter (translated) saying 3072 CCs, so that would use lots of transistors. According to that rumor, GK110 seems close to an overall doubled GK104 in terms of basic specs.

ALUs are in a relative sense fairly "cheap" in hw compared to other units like TMUs. Tahiti in comparison packs 2048SPs into barely 365mm2, but here you'd need to note that amongst a multitude of other factors it also has "only" 128TMUs/32 ROPs compared to GK110.

If the information so far is accurate, GK110 might have a slightly higher transistor density than GK104.

Kaotik · Apr 23, 2012

CarstenS said:
3072 ALUs
-> 6x GPCs (à 512 SPs)
--> 4 SMK to each GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two groups share a quad TMU

First way I read it:
2 groups share a quad TMU > 1 SMK has 8 TMUs
4 SMK = 32 TMUs / GPC
6 GPC = 192 TMUs
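
As a quick check of that first reading, here is a minimal sketch that multiplies out the rumored layout. Every number is taken from the quoted post and is speculation, not a confirmed spec; the constant and function names are made up for illustration.

```python
# Hypothetical GK110 layout as quoted in this thread (rumor, not a confirmed spec).
GPCS = 6                  # graphics processing clusters
SMK_PER_GPC = 4           # SMKs per GPC
GROUPS_PER_SMK = 4        # ALU groups per SMK
ALUS_PER_GROUP = 32       # ALUs per group
GROUPS_PER_QUAD_TMU = 2   # "two groups share a quad TMU"
TMUS_PER_QUAD = 4         # a quad TMU is four texture units


def unit_counts():
    """Return ALU and TMU totals implied by the layout above."""
    smks = GPCS * SMK_PER_GPC
    alus_per_smk = GROUPS_PER_SMK * ALUS_PER_GROUP
    tmus_per_smk = (GROUPS_PER_SMK // GROUPS_PER_QUAD_TMU) * TMUS_PER_QUAD
    return {
        "smks": smks,
        "alus_per_smk": alus_per_smk,
        "alus": smks * alus_per_smk,
        "tmus_per_smk": tmus_per_smk,
        "tmus_per_gpc": tmus_per_smk * SMK_PER_GPC,
        "tmus": tmus_per_smk * smks,
    }


counts = unit_counts()
print(counts)
```

Under those assumptions the totals come out to 3072 ALUs and 192 TMUs, matching the hand calculation.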

Yeah, I don't see that happening, especially considering that IIRC they're next to useless on the GPGPU front?

The second, probably wrong, way I read it would give 48 TMUs, which is surely even further off

Any way one looks at it, I don't see how they could, even in theory, fit "double GK104" with added GPGPU + FP64 capabilities into a chip twice the size of GK104.

Ailuros · Apr 23, 2012

For one it's not twice the amount of units on all fronts (50% more raster/trisetups, 50% more TMUs etc.) and as a close second 7b transistors are almost twice as much as there are on GK104.

In fact there's nothing much that speaks against it considering a hypothetical 550mm2@28nm die; the only other case in point is that this time the desktop high end consumer has to pay for far more transistors than in the past which are HPC related and therefore not invested in 3D performance.

With Fermi, GF114 had roughly 35% fewer transistors than GF110, while the performance difference between the two was give or take 40%. If GK110 is now, let's say, 50% faster than GK104 (which isn't absurd at all assuming those hypothetical specs are true, especially considering the shitload of added bandwidth a 512-bit bus grants even with relatively low GDDR5 frequencies) but at the cost of almost twice the transistors, it's a totally different chapter, and one that possibly also affects power consumption.

Kaotik · Apr 23, 2012

GF114 wasn't anywhere close to as stripped from GPGPU capabilities as GK104 is. There's far more things GK110 needs to add over GK104 than 110 had over 114 just for the GPGPU speed

Ailuros · Apr 23, 2012

Kaotik said:
GF114 wasn't anywhere close to as stripped from GPGPU capabilities as GK104 is.

What kind of miraculous HPC capabilities did the GF114 actually have that I've missed? Yes, it is true that GK104 is somewhat more conservative in that regard if one considers the entire enchilada, but unless I'm missing something I don't see any worlds of difference between the two.

Unless someone of course believes that nonsense that floated around during the Fermi era that GF114 is just a GF110 with a rectangular edge chopped off. In reality both GF1x4 and GK104 have 3 SIMD blocks (which already is a milestone for theoretical double precision FLOPs when you have a 2:1 SP/DP ratio for the high end) and obviously don't have that much in common with GF110 dual SIMDs; even worse GF1x4 have 2 quad TMUs/SM while GF1x0 only 1.

One good point would be caches and surrounding logic, but in that regard I'm all eyes for a detailed analysis of how those affected transistor budgets between GF1x4 and GK104.

There's far more things GK110 needs to add over GK104 than 110 had over 114 just for the GPGPU speed

Again at a much higher die area AND transistor count difference between performance (GK104) and high end (GK110). I don't know if my math is broken but I see a huge difference between 1.95/3.0b (365/530mm2) and 3.54/7.0b (294/550mm2) but it's obviously just me.
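
To put numbers on that gap, here is a rough sketch that computes the high-end/performance ratios and the transistor density from the figures quoted in the post. The GK110 entry is the hypothetical 7.0b / 550mm2 case discussed in this thread, and the helper names are only illustrative.

```python
# Transistor counts (billions) and die areas (mm^2) as quoted in the post.
# GK110 is the rumored/hypothetical configuration, not a confirmed spec.
CHIPS = {
    "GF114": {"transistors_b": 1.95, "area_mm2": 365},
    "GF110": {"transistors_b": 3.0, "area_mm2": 530},
    "GK104": {"transistors_b": 3.54, "area_mm2": 294},
    "GK110": {"transistors_b": 7.0, "area_mm2": 550},
}


def ratio(big, small, key):
    """How many times larger `big` is than `small` for the given key."""
    return CHIPS[big][key] / CHIPS[small][key]


def density(chip):
    """Millions of transistors per mm^2."""
    c = CHIPS[chip]
    return c["transistors_b"] * 1000 / c["area_mm2"]


for small, big in (("GF114", "GF110"), ("GK104", "GK110")):
    print(
        f"{big} vs {small}: "
        f"{ratio(big, small, 'transistors_b'):.2f}x transistors, "
        f"{ratio(big, small, 'area_mm2'):.2f}x area, "
        f"{density(small):.1f} -> {density(big):.1f} Mtransistors/mm^2"
    )
```

With these inputs the Fermi pair works out to about 1.54x the transistors on about 1.45x the area, while the hypothetical Kepler pair is about 1.98x the transistors on about 1.87x the area.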

Kaotik · Apr 23, 2012

Disregarding the DP performance, there must be some "special sauce" GK104 is missing, since in so many cases it's getting beaten left and right by 580 on SP workloads too, despite matching or beating 580 on most if not all theoretical metrics

Alexko · Apr 23, 2012

Kaotik said:
Disregarding the DP performance, there must be some "special sauce" GK104 is missing, since in so many cases it's getting beaten left and right by 580 on SP workloads too, despite matching or beating 580 on most if not all theoretical metrics

Cache!

psurge · Apr 23, 2012

Does it make sense to scale up the number of polymorph units (which seems tied to the SMs), rasterizers (tied to the number of GPCs), or TMUs versus GK104?

What about the same 4 GPC, 128 TMU, 8 SMX setup, but with 256 ALUs per SMX - something like 4x (1 scheduler, 1 vec32 SP ALU, 1 vec32 DP ALU, 1 vec32 LD/ST unit)?

tunafish · Apr 24, 2012

psurge said:
Does it make sense to scale up the number of polymorph units (which seems tied to the SMs), rasterizers (tied to the number of GPCs), or TMUs versus GK104?

The non-GPGPU professional segment really likes more polygons. The typical professional GPU app uses very simple shading, but extremely complex geometry. So the distributed geometry might be there exactly because they want to scale it way up in the high-end chip.

What about the same 4 GPC, 128 TMU, 8 SMX setup, but with 256 ALUs per SMX - something like 4x (1 scheduler, 1 vec32 SP ALU, 1 vec32 DP ALU, 1 vec32 LD/ST unit)?

And what about the cache? For a lot of the HPC folks the amount of cache per execution unit is a metric that is more important than pretty much anything else. GK104 is already extremely cache-starved for compute apps, and you want to *increase* the number of execution units per SMX? Scaling up the cache per SMX would be hard, especially given how wide the ports need to be to serve that many threads.

denev2004 · Apr 24, 2012

Alexko said:
Cache!

Probably inner bandwidth is also a factor I guess?

Alexko · Apr 24, 2012

denev2004 said:
Probably inner bandwidth is also a factor I guess?

Yes, but cache size and internal/cache bandwidth are usually correlated.

Static scheduling probably hurts a bit too, but its impact should be nowhere near what we're seeing.

fellix · Apr 24, 2012

L2 cache? GK104 is definitely not BW starved in there.

Alexko · Apr 24, 2012

fellix said:
L2 cache? GK104 is definitely not BW starved in there.

The entire memory hierarchy is quite tight on GK104, but my guess is that L1 and registers are the main culprits.

fellix · Apr 24, 2012

Alexko said:
...but my guess is that L1 and registers are the main culprits.

I doubt it -- since the RF is doubled, the L1 size should be less of a problem. Sharing data will be a tight job for those kernels that rely more on the LDS, though.

boxleitnerb · Apr 24, 2012

Question:
GK110 will have many transistors that are not needed for gaming. Would it be possible to implement a finer-grained power-gating "grid" that could shut off many/most of these transistors on GeForce chips?

Alexko · Apr 24, 2012

fellix said:
I doubt it -- since the RF is doubled, the L1 size should be less of a problem. Sharing data will be a tight job for those kernels that rely more on the LDS, though.

The RF is doubled, but the shader count is quadrupled compared to GF104, or sextupled compared to GF100.

Of course it's not that bad, because the maximum number of threads only increases by 33%, but really, it only increases by 33% precisely because the memory hierarchy couldn't handle any more than that.

So in my opinion, that's the real bottleneck.
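
A small sketch of the per-SM arithmetic behind that argument. The ALU counts follow the figures used in this thread; the register-file sizes (32-bit registers per SM) and the maximum resident threads per SM are assumptions taken from the publicly documented Fermi and Kepler per-SM limits, not from the posts here.

```python
# Per-SM resources. ALU counts follow the thread; register counts and
# max resident threads are assumed from public Fermi/Kepler per-SM limits.
SM = {
    "GF100": {"alus": 32, "regs": 32768, "max_threads": 1536},
    "GF104": {"alus": 48, "regs": 32768, "max_threads": 1536},
    "GK104": {"alus": 192, "regs": 65536, "max_threads": 2048},
}


def scale(new, old, key):
    """Growth factor of a per-SM resource from `old` to `new`."""
    return SM[new][key] / SM[old][key]


def regs_per_alu(chip):
    """32-bit registers available per ALU within one SM."""
    return SM[chip]["regs"] / SM[chip]["alus"]


for old in ("GF104", "GF100"):
    print(
        f"GK104 vs {old}: "
        f"{scale('GK104', old, 'alus'):.0f}x ALUs, "
        f"{scale('GK104', old, 'regs'):.0f}x registers, "
        f"{scale('GK104', old, 'max_threads'):.2f}x max threads"
    )

for chip in SM:
    print(f"{chip}: {regs_per_alu(chip):.0f} registers per ALU")
```

Under these assumptions the register file per ALU shrinks from generation to generation even though the register file per SM doubles, which is the imbalance being pointed at.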

CarstenS · Apr 24, 2012

psurge said:
Does it make sense to scale up the number of polymorph units (which seems tied to the SMs), rasterizers (tied to the number of GPCs), or TMUs versus GK104?

[my bold]
Absolutely, since they're absolutely free in terms of transistor count being only names for marketing slides.

For other things tied to the SMKs: Yes, I think it's time we realized that we need to keep utilization for every given workload as high as possible, since we're power limited now and cannot afford a single transistor sitting around idling and at the same time not contributing to throughput.

Ailuros · Apr 24, 2012

Kaotik said:
Disregarding the DP performance, there must be some "special sauce" GK104 is missing, since in so many cases it's getting beaten left and right by 580 on SP workloads too, despite matching or beating 580 on most if not all theoretical metrics

As Alexko, amongst others, already noted, there's cache missing compared to GF110; but in that regard GK104 doesn't have any significant differences compared to GF1x4. GK104 has the downside, though, that it packs a crapload more SPs within each cluster compared to GF104 (192 vs. 48), if you think in terms of cache per SP.

ninelven · Apr 24, 2012

CarstenS said:
Absolutely, since they're absolutely free in terms of transistor count being only names for marketing slides.

That isn't true.
