Nvidia BigK GK110 Kepler Speculation Thread

Base/boost clocks of 889/980 vs 875/928 for the 780 Ti, so we seem to have a new king of the hill, if only by a marginal amount.

Most reference 780 Ti cards boost way higher than that. Nvidia's boost-clock number is the "warranty", i.e. the guaranteed minimum; in practice the clock speed is much higher. (And I mean reference cards, not AIB versions from EVGA etc.)

What I find a bit surprising is this silent launch. Guru3D and some other sites have said there will be no review of the Titan Black (and it seems no review site will do one)... basically Nvidia doesn't want any reviews done of this Titan Black. Let's hope some serious site does one anyway (computing sites or "consumer" hardware sites).
 
So, it's a 780 Ti with 6 GB of RAM and full-speed DP. Sure, it's a badass part, but just like the original Titan, there are incredibly few folks who would need all of that raw throughput.

Still, as a halo card, it certainly fits the bill.
 

Yeah, we were waiting for this: finally enough RAM outside of Quadro cards.
 
Decent for the madmen who want to buy a card and game on it for 5 years, too.
Hell, five years ago was 2009 :oops:. The cards from back then are still usable, but lack some RAM (and drivers, for the red one..)
 
Madmen indeed; an Nvidia employee said they'll need 1 TB/s+ of bandwidth at 16nm, likely to be achieved with HBM.
16nm GPUs are likely less than two years away, so buying an expensive card now with the intention of keeping it for 4+ years is crazy, given the large jumps in performance we'll see over the next two years alone.
 
It actually depends more on the software, or the games: if nothing meaningful gets released in the meantime, then even a 10x performance jump would be kind of meaningless ;)
 
True, although affordable, high-quality 4K displays are almost around the corner already (not to mention 3D, Eyefinity, 120Hz vblank, and whatnot), so you need quite a bit of horsepower to run even current games at high settings with antialiasing at such huge resolutions and high refresh rates.
 
High refresh rate improves your experience regardless of the GPU's performance. 60Hz is a bit ridiculous, really: often a lag-fest or a tear-fest. Even if you're stuck around 40 fps or so, a 120Hz display will look a lot better (and maybe you can catch a glimpse of 100+ fps when you're looking at your feet or standing in a tight corridor :))

The weak point of high-refresh-rate monitors is that they're 1080p TN panels only (with one exception: the Korean 2560x1440 IPS panels are said to unofficially support 120Hz :oops:. Some 1080p IPS panels support 72Hz.)
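The "40 fps looks better on 120Hz" claim comes down to vsync quantization, which you can sketch with a little arithmetic. This is my own hypothetical illustration, not from the thread, and it assumes a naive vsync model (frame held until the next refresh, no triple buffering):

```python
import math

def present_times(fps, refresh_hz, frames=6):
    """Vsync'd presentation time (ms) of each frame: a finished frame
    is held until the next display refresh, so presentation gets
    quantized to multiples of the refresh interval."""
    frame_time = 1000.0 / fps        # time the GPU needs per frame
    refresh = 1000.0 / refresh_hz    # display refresh interval
    times = []
    for i in range(1, frames + 1):
        ready = i * frame_time                              # frame finished rendering
        times.append(math.ceil(ready / refresh) * refresh)  # next vsync after that
    return times

# 40 fps on a 60 Hz display: frames land on an uneven 17/33/17/33 ms cadence.
print([round(t, 1) for t in present_times(40, 60)])   # [33.3, 50.0, 83.3, 100.0, 133.3, 150.0]
# 40 fps on a 120 Hz display: every frame lands exactly 25 ms apart.
print([round(t, 1) for t in present_times(40, 120)])  # [25.0, 50.0, 75.0, 100.0, 125.0, 150.0]
```

Same average frame rate, but the 120Hz display delivers an even 25 ms cadence while the 60Hz one judders between 16.7 ms and 33.3 ms steps.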
 

This is a great card for all those applications that use the GPU for anything but games. You used to need Quadro cards just to get enough RAM; this card is a cheap way to get enough RAM plus Quadro-style performance in compute and beyond.
I'm glad we waited a few months..
 
Nvidia didn't enable the Quadro features that speed certain things up on Quadro cards, though; it only uncapped the 64-bit compute performance in TITAN/TITAN BLACK. Don't expect Quadro performance at a discount.
 
Yes, Quadros suck for games, and even in random productivity apps a Quadro might be slower and/or worse than a GeForce. On top of that, Quadros have fewer units, lower frequencies and TDPs, or use GK104, which is slower than GK110 at compute (including single precision).

So some people actively seek out a GeForce even for non-gaming use.
Quadro is more for specific CAD software that costs as much as a small flat or house (especially packages that date back to the 80s or 90s; some newer CAD software may target gaming cards, or even run on Direct3D) and some specific uses (quad-buffer stereo, maybe 10-bit per color, streaming/thin-client stuff: the same tech as streaming from Kepler to Tegra or SteamOS, actually, but not necessarily single-user and single-app).
 

If you want to do GPGPU, the only thing actually missing is ECC support, and that even lowers performance (but makes you feel better about the correctness of your week-long or month-long number crunching).
 
Tridan created some fascinating alternative SMX/SMM diagrams at hardware.fr in the GTX 750 Ti review.
I'm really happy with their detail, since they help me understand a lot about actual throughput and limitations. They cleared up a giant misconception I had about Kepler's SMX design: I had always envisioned all four schedulers feeding through a massive crossbar, with any one of the 8 dispatchers able to multiplex its registers into any of the lanes of SPs, SFUs, or LD/ST units. Tridan's diagram shows that it's much simpler and more limited than that, with each scheduler "owning" a set of SPs and an extra set of SPs shared within each of the two scheduler pairs. (Tridan, again THANKS for the diagram! I keep studying it!)


If you look at the GK104 SMX layout, I have some questions I can't resolve though.

I am guessing that each of the 4 register files can issue three registers (3 columns of 32, actually) per clock. Those get distributed to the SPs as needed, with the operand collector handling routing, including a buffer to allow accumulating registers over 2 clocks. So a kernel running all FP32 adds would have four schedulers, each continually issuing two registers to its "own" set of SPs. The extra register would be accumulated by the operand collector, and every other clock it could be sent over to the "shared" set of SPs. The partner scheduler would do the same on the alternate clocks, so it all elegantly works in keeping every SP fed every clock. This would give a throughput of 192 FP32 ops per clock.

NVidia claims that FP32 FMADs also have a throughput of 192, but nobody has ever been able to craft code that actually performs this well. The explanation is simple: with only 3 registers per clock per scheduler, there's not enough bandwidth to feed THREE arguments to every SP each clock, only two. So an FP32 add or mul has a throughput of 192, but an FP32 FMAD has a throughput of 128. Tridan's diagram makes it clear why.
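The operand-bandwidth argument above can be put into back-of-the-envelope form. This is my own arithmetic, assuming the 3-register-reads-per-scheduler-per-clock figure from the diagram discussion; the constants are inferences from that discussion, not official NVidia numbers:

```python
# Rough operand-bandwidth model of a GK104/GK110 SMX, assuming each
# scheduler's register file can issue 3 register columns per clock.
SCHEDULERS = 4
READS_PER_SCHED = 3   # assumed 32-wide register reads per scheduler per clock
SP_PER_SCHED = 48     # own 32 SPs + half of a 32-SP set shared with a partner
WARP_WIDTH = 32       # each register read is one 32-lane column

def peak_ops_per_clock(operands_per_op):
    """Max ops/clock, limited by SP count or by operand bandwidth."""
    operand_lanes = SCHEDULERS * READS_PER_SCHED * WARP_WIDTH  # 384 lanes/clock
    bandwidth_limit = operand_lanes // operands_per_op
    sp_limit = SCHEDULERS * SP_PER_SCHED                       # 192 SPs total
    return min(bandwidth_limit, sp_limit)

print(peak_ops_per_clock(2))  # FP32 add/mul, 2 source operands -> 192
print(peak_ops_per_clock(3))  # FP32 FMAD, 3 source operands    -> 128
```

Under these assumptions the model reproduces exactly the 192 vs 128 split described above; notably, neither limit produces the 160 that puzzles me below.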

We can see NVidia's official table of operation throughputs in the programming guide. Looking at the 3.0 architecture column, we see many throughputs have a logical explanation from the SMX figure. sin/cos/log have a throughput of 32, because there are 32 SFU units, for example.
But the real puzzle in these throughput values is why many operations, like integer adds, have a throughput of 160 operations/clock. This confuses me so much! What would cause this design to have a throughput of 160, as opposed to 192 or perhaps 128? It's not register bandwidth (you can do FP32 adds at 192). It's not that one set of 32 SPs is "special" and limited, because each scheduler only connects to two sets of SPs; if one set were special, you'd get a throughput of 128, not 160, since there are two pairs of schedulers and you'd lose two sets.

So my rambling question is what architectural limit would give this weird 160 ops/clock throughput, not 192 or 128? Or maybe the obvious answer is that NVidia's table is wrong?
 
The hardware.fr SM diagrams are indeed great. The one thing missing, however, is data bus width indicators. Those would let one derive the bandwidth of the various caches etc. Subtle enough hint?
 