Nvidia BigK GK110 Kepler Speculation Thread

Well there's this set of rumors/speculation from 3DCenter (translated) saying 3072 CCs, so that would use lots of transistors. According to that rumor, GK110 seems close to an overall doubled GK104 in terms of basic specs.

ALUs are in a relative sense fairly "cheap" in hw compared to other units like TMUs. Tahiti in comparison packs 2048SPs into barely 365mm2, but here you'd need to note that amongst a multitude of other factors it also has "only" 128TMUs/32 ROPs compared to GK110.

If the information so far is accurate, GK110 might have a slightly higher transistor density than GK104.
 
3072 ALUs
-> 6x GPCs (à 512 SPs)
--> 4 SMK to each GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two groups share a quad TMU
First way I read it:
2 groups share a quad TMU > 1 SMK has 8 TMUs
4 SMK = 32 TMUs / GPC
6 GPC = 192 TMUs
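
A quick sanity check of that first reading, as a minimal Python sketch. All figures are the rumored ones from the list above, not confirmed specs, and the constant and function names are just mine for illustration:

```python
# Rumored GK110 layout from the list above (speculation, not confirmed specs).
GPCS = 6
SMK_PER_GPC = 4
ALUS_PER_SMK = 128
ALUS_PER_GROUP = 32
GROUPS_PER_QUAD_TMU = 2  # first reading: "two groups share a quad TMU"
TMUS_PER_QUAD = 4


def rumored_counts():
    """Derive unit totals from the rumored hierarchy (first reading)."""
    groups_per_smk = ALUS_PER_SMK // ALUS_PER_GROUP        # 4 groups of 32 ALUs
    quads_per_smk = groups_per_smk // GROUPS_PER_QUAD_TMU  # 2 quad TMUs per SMK
    tmus_per_smk = quads_per_smk * TMUS_PER_QUAD           # 8 TMUs per SMK
    tmus_per_gpc = tmus_per_smk * SMK_PER_GPC              # 32 TMUs per GPC
    return {
        "alus_per_gpc": ALUS_PER_SMK * SMK_PER_GPC,
        "alus_total": ALUS_PER_SMK * SMK_PER_GPC * GPCS,
        "tmus_per_smk": tmus_per_smk,
        "tmus_per_gpc": tmus_per_gpc,
        "tmus_total": tmus_per_gpc * GPCS,
    }


if __name__ == "__main__":
    for key, value in rumored_counts().items():
        print(f"{key}: {value}")
```

The ALU side lands on the rumored 3072 total (512 per GPC), and the TMU side on the 192 figure worked out by hand above.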

Yeah, I don't see that happening, especially considering that IIRC they're next to useless on the GPGPU front?

The second way I read it, which is probably wrong, would give 48 TMUs, which is surely even more off :LOL:

Any way one looks at it, I don't see how they could, even in theory, fit "double GK104" with added GPGPU + FP64 capabilities into a chip twice the size of GK104.
 
For one it's not twice the amount of units on all fronts (50% more raster/trisetups, 50% more TMUs etc.) and as a close second 7b transistors are almost twice as much as there are on GK104.

In fact there's nothing much that speaks against it considering a hypothetical 550mm2@28nm die; the only other case in point is that this time the desktop high end consumer has to pay for far more transistors than in the past which are HPC related and therefore not invested in 3D performance.

With Fermi/GF110 it was roughly 35% more transistors compared to GF114 where the performance difference between the two was give or take at 40%. If now GK110 is let's say 50% faster (which isn't absurd at all assuming those hypothetical specs are true especially considering the shitload of added bandwidth a 512bit bus grants even with relatively low GDDR5 frequencies) than GK104 but at the cost of almost twice the transistors, it's a totally different chapter and possibly also affecting power consumption.
 
GF114 wasn't anywhere close to as stripped of GPGPU capabilities as GK104 is. There are far more things GK110 needs to add over GK104 than 110 had over 114 just for the GPGPU speed
 
GF114 wasn't anywhere close to as stripped of GPGPU capabilities as GK104 is.

What kind of miraculous HPC capabilities did the GF114 actually have that I've missed? Yes, it is true that GK104 is somewhat more conservative in that regard if one considers the entire enchilada, but unless I'm missing something I don't see a world of difference between the two.

Unless someone of course believes that nonsense that floated around during the Fermi era that GF114 is just a GF110 with a rectangular edge chopped off. In reality both GF1x4 and GK104 have 3 SIMD blocks (which already is a milestone for theoretical double precision FLOPs when you have a 2:1 SP/DP ratio for the high end) and obviously don't have that much in common with GF110 dual SIMDs; even worse GF1x4 have 2 quad TMUs/SM while GF1x0 only 1.

One good point would be caches and surrounding logic, but in that regard I'm all eyes for a detailed analysis of how those affected transistor budgets between GF1x4 and GK104.

There are far more things GK110 needs to add over GK104 than 110 had over 114 just for the GPGPU speed
Again at a much higher die area AND transistor count difference between performance (GK104) and high end (GK110). I don't know if my math is broken but I see a huge difference between 1.95/3.0b (365/530mm2) and 3.54/7.0b (294/550mm2) but it's obviously just me.
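
For anyone who wants to check that math, here is a rough Python sketch using only the figures quoted in this post. The GK110 entry is the rumored 7.0b / 550mm2, so it is hypothetical, and the names in the code are purely illustrative:

```python
# (transistors in billions, die area in mm2), as quoted in the post above.
# The GK110 entry is a rumor, not a confirmed spec.
CHIPS = {
    "GF114": (1.95, 365),
    "GF110": (3.0, 530),
    "GK104": (3.54, 294),
    "GK110": (7.0, 550),  # hypothetical
}


def step_up(performance_chip, high_end_chip):
    """Return (transistor ratio, area ratio) going from performance to high end."""
    t_small, a_small = CHIPS[performance_chip]
    t_big, a_big = CHIPS[high_end_chip]
    return t_big / t_small, a_big / a_small


def density(chip):
    """Transistor density in millions of transistors per mm2."""
    transistors_b, area_mm2 = CHIPS[chip]
    return transistors_b * 1000 / area_mm2


if __name__ == "__main__":
    for small, big in (("GF114", "GF110"), ("GK104", "GK110")):
        t_ratio, a_ratio = step_up(small, big)
        print(f"{small} -> {big}: {t_ratio:.2f}x transistors, {a_ratio:.2f}x area")
    for chip in CHIPS:
        print(f"{chip}: {density(chip):.1f} Mtransistors/mm2")
```

With these inputs the Fermi step comes out at roughly 1.5x on both transistors and area, while the rumored Kepler step is close to 2x transistors on roughly 1.9x the area.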
 
Disregarding the DP performance, there must be some "special sauce" GK104 is missing, since in so many cases it's getting beaten left and right by 580 on SP workloads too, despite matching or beating 580 on most if not all theoretical meters
 
Disregarding the DP performance, there must be some "special sauce" GK104 is missing, since in so many cases it's getting beaten left and right by 580 on SP workloads too, despite matching or beating 580 on most if not all theoretical meters

Cache!
 
Does it make sense to scale up the number of polymorph units (which seems tied to the SMs), rasterizers (tied to the number of GPCs), or TMUs versus GK104?

What about the same 4 GPC, 128 TMU, 8 SMX setup, but with 256 ALUs per SMX - something like 4x (1 scheduler, 1 vec32 SP ALU, 1 vec32 DP ALU, 1 vec32 LD/ST unit)?
 
Does it make sense to scale up the number of polymorph units (which seems tied to the SMs), rasterizers (tied to the number of GPCs), or TMUs versus GK104?
The non-GPGPU professional segment really likes more polygons. The typical professional GPU app uses very simple shading, but extremely complex geometry. So the distributed geometry might be there exactly because they want to scale it way up in the high-end chip.

What about the same 4 GPC, 128 TMU, 8 SMX setup, but with 256 ALUs per SMX - something like 4x (1 scheduler, 1 vec32 SP ALU, 1 vec32 DP ALU, 1 vec32 LD/ST unit)?

And what about the cache? For a lot of the HPC folks the amount of cache/execution unit is a metric that is more important than pretty much anything else. GK104 is already extremely cache-starved for compute apps, and you want to *increase* the amount of execution units per SMX? Scaling up the cache per SMX would be hard, especially given how wide the ports need to be to serve that many threads.
 
...but my guess is that L1 and registers are the main culprits.
I doubt -- since the RF is doubled, the L1 size should be less of a problem. Sharing data will be a tight job, for those kernels that rely more on the LDS, though.
 
Question:
GK110 will have many transistors that are not needed for gaming. Would it be possible to implement a finer-grained power-gating "grid" that could shut off many/most of these transistors on GeForce chips?
 
I doubt -- since the RF is doubled, the L1 size should be less of a problem. Sharing data will be a tight job, for those kernels that rely more on the LDS, though.

The RF is doubled, but the shader count is quadrupled compared to GF104, or sextupled compared to GF100.

Of course it's not that bad, because the maximum number of threads only increases by 33%, but really, it only increases by 33% precisely because the memory hierarchy couldn't handle any more than that.

So in my opinion, that's the real bottleneck.
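
To put rough numbers on that, a small Python sketch. The per-SM figures are the ones I recall from the CUDA compute capability tables (32-bit registers and maximum resident threads per SM), so treat them as assumptions rather than gospel; the class and function names are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SM:
    name: str
    alus: int         # SP ALUs ("CUDA cores") per SM
    registers: int    # 32-bit registers per SM (assumed figure)
    max_threads: int  # maximum resident threads per SM (assumed figure)


# Assumed per-SM figures, not taken from this thread.
GF100_SM = SM("GF100 SM", 32, 32768, 1536)
GF104_SM = SM("GF104 SM", 48, 32768, 1536)
GK104_SMX = SM("GK104 SMX", 192, 65536, 2048)


def per_alu(sm):
    """Return (registers per ALU, resident threads per ALU) for one SM."""
    return sm.registers / sm.alus, sm.max_threads / sm.alus


if __name__ == "__main__":
    for sm in (GF100_SM, GF104_SM, GK104_SMX):
        regs, threads = per_alu(sm)
        print(f"{sm.name}: {regs:.0f} registers/ALU, {threads:.1f} threads/ALU")
```

Under those assumptions the register file per ALU halves going from GF104 to GK104, and the resident threads per ALU drop to a third.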
 
Does it make sense to scale up the number of polymorph units (which seems tied to the SMs), rasterizers (tied to the number of GPCs), or TMUs versus GK104?
[my bold]
Absolutely, since they're absolutely free in terms of transistor count being only names for marketing slides. ;)

For other things tied to the SMKs: Yes, I think it's time we realized that we need to keep utilization for every given workload as high as possible, since we're power limited now and cannot afford a single transistor sitting around idling and at the same time not contributing to throughput.
 
Disregarding the DP performance, there must be some "special sauce" GK104 is missing, since in so many cases it's getting beaten left and right by 580 on SP workloads too, despite matching or beating 580 on most if not all theoretical meters

As Alexko already noted, amongst others, there's cache missing compared to GF110; but in that regard GK104 doesn't have any significant differences compared to GF1x4. GK104 has the downside, though, that it packs a crapload more SPs within each cluster compared to GF104 (192 vs. 48) if you come to think of cache amounts.
 