Nvidia BigK GK110 Kepler Speculation Thread

Big Kepler adds roughly 3.5 billion transistors over the GK104.

Some of the improvements are:

reorganized processing cores with new instructions
an improved memory system with faster atomic processing and low-overhead ECC

So what additional changes has Nvidia made in Big Kepler that could use up that many transistors?

------------------------


S0642 - Inside Kepler

Stephen Jones (NVIDIA), Lars Nyland (NVIDIA)

In this talk, individuals from the GPU architecture and CUDA software groups will dive into the features of the compute architecture for “Kepler” – NVIDIA’s new 7-billion transistor GPU :oops:. From the reorganized processing cores with new instructions and processing capabilities, to an improved memory system with faster atomic processing and low-overhead ECC, we will explore how the Kepler GPU achieves world leading performance and efficiency, and how it enables wholly new types of parallel problems to be solved.
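
Since the blurb calls out "faster atomic processing", here is a minimal sketch of the kind of code that stands to gain: a global-memory histogram where many threads contend on the same bins. This is plain CUDA C that runs on any current GPU; nothing in it is Kepler-specific, and the kernel and variable names are mine.

#include <cstdio>
#include <cuda_runtime.h>

// Worst case for atomics: every thread increments a bin in global memory.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // contended global atomic
}

int main()
{
    const int n = 1 << 20;
    unsigned char* d_data;
    unsigned int* d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, 256 * sizeof(unsigned int));
    cudaMemset(d_data, 7, n);  // degenerate input: every element hits bin 7
    cudaMemset(d_bins, 0, 256 * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[256];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin 7 = %u (expect %d)\n", h_bins[7], n);
    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}

On Fermi this pattern serializes badly under heavy contention; "faster atomic processing" presumably means exactly this sort of code gets quicker.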

Topic Areas: Parallel Programming Languages & Compilers
Level: Beginner

Day: Wednesday, 05/16
Time: 2:00 pm - 3:20 pm
Source: https://registration.gputechconf.com/?form=schedule
Change the drop-down date to Wednesday 5/16.
 
Well there's this set of rumors/speculation from 3DCenter (translated) saying 3072 CCs, so that would use lots of transistors. According to that rumor, GK110 seems close to an overall doubled GK104 in terms of basic specs.
 
Big K isn't going to just be a doubled GK104. For Nvidia, the HPC/workstation segment is bigger than the high-end GPU one. So Big K will likely emphasize 64-bit throughput, with a healthy helping of caches.
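
To make the 64-bit point concrete, a hedged sketch: DAXPY is about the simplest FP64-bound kernel there is, the sort of HPC workload a compute-focused Big K would presumably target. It is plain CUDA C, runnable on any FP64-capable GPU (compute capability 1.3+); the names are mine and nothing here is GK110-specific.

#include <cuda_runtime.h>

// DAXPY: y = a*x + y in double precision, one FP64 FMA per element.
__global__ void daxpy(int n, double a, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMemset(d_x, 0, n * sizeof(double));  // placeholder data
    cudaMemset(d_y, 0, n * sizeof(double));

    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

For scale: Fermi Tesla parts (GF100/GF110) run FP64 at 1/2 the FP32 rate, while GK104 drops to 1/24. The "64-bit throughput" bet is that Big K brings that ratio back up.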
 
I was thinking along the lines of CC count and bus width, but yeah you're right.

But if the 3072 CC stuff is true, then I'm interested to know how they could squeeze that many CCs into GK110, especially considering the additional compute features would presumably make the die bigger for the same CC count.
 
But if the 3072 CC stuff is true, then I'm interested to know how they could squeeze that many CCs into GK110, especially considering the additional compute features would presumably make the die bigger for the same CC count.
The number of GPCs, SMXs and TMUs is probably not scaling in the same way as in GK104.

More interesting questions are power consumption and the possibility of partly deactivated units on the top SKU.
 
How big will its die be? :???:
If they keep the same transistor density of ~12.04 MTr/mm², then this 7000 M transistor beast will need around 580 mm². :oops:
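
For reference, that density figure is just GK104's: roughly 3540 M transistors in ~294 mm² works out to 3540 / 294 ≈ 12.04 MTr/mm², and 7000 / 12.04 ≈ 581 mm².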
 
It's because the 7 billion transistors (7000 M) are not confirmed yet...

I really doubt Nvidia, with their experience of 28nm, will end up with a 550 mm² chip. And don't forget we are not really speaking about Kepler as it is today; we are speaking about a card that could see the light of day in 5-6 months.
 
merge this pointless thread back with the main one
You mean make this thread disappear in the useless noise of Bitcoin mining, how much tax the EU adds vs the USA, physics jobs in Germany vs the USA, etc., etc.

If anything, the other thread is the bloated, pointless one, especially in relation to the Tesla line.

Having a thread specifically on the BigK GK110 Tesla/HPC GPU, without the above-mentioned useless posts, is useful.

I expect that the GK110 will be fully dedicated to the professional market, and I would like to see what others think the additional 3.5 billion transistors have added over the GK104 GPU.

And if you really like the other thread so much you can stay and post on that one and ignore this one.

Back to the speculation on what is added to make up the +3.5 billion transistors here are the guesses so far:

3072 CCs
64-bit throughput
healthy helping of caches
512-bit memory bus
 
3072 ALUs
-> 6x GPCs (512 SPs each)
--> 4 SMKs per GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two of which are 64-bit capable, re-using data paths from the other ALUs
----> two groups share a quad TMU
----> 4x 32 KiB L1 cache shared among the ALU blocks, configurable as scratchpad memory in block sizes of 32 KiB.

512-bit MI
-> 8x 64-bit memory partitions
-> 4 GiB default memory size for gaming cards, twice that for Tesla/Quadro
-> (probably) 2048, or more likely still 1024 KiB L2 cache

850 MHz core clock plus an advanced turbo (independently clockable GPCs?) and probably 1.40ish GHz GDDR5; not pushing the envelope here as much.
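
Quick sanity check on the memory side (my arithmetic, not part of the rumor): 1.40 GHz GDDR5 is 5.60 Gbps per pin, so a 512-bit bus gives 512 / 8 × 5.60 ≈ 358 GB/s, versus the GTX 680's 256 / 8 × 6.0 ≈ 192 GB/s. The wider bus alone nearly doubles bandwidth without touching memory clocks.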

That makes close to 7 billion transistors and a 550 mm² die size, as agreed upon here.
Hm?
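
On the configurable L1/scratchpad split: Fermi already exposes exactly this trade-off through the CUDA runtime, so here is a minimal sketch of the existing knob, assuming Kepler keeps the same API. The 32 KiB block granularity above is pure speculation, and the kernel below and its names are mine.

#include <cuda_runtime.h>

// A scratchpad-heavy 3-point stencil: the kind of kernel that wants the
// larger shared-memory carve-out rather than a bigger L1.
// Assumes a launch with 256 threads per block.
__global__ void stencil3(const float* in, float* out, int n)
{
    __shared__ float tile[256 + 2];  // block data plus one halo element per side
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid + 1] = (i < n) ? in[i] : 0.0f;
    if (tid == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (tid == blockDim.x - 1)
        tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();
    if (i < n)
        out[i] = (tile[tid] + tile[tid + 1] + tile[tid + 2]) / 3.0f;
}

int main()
{
    // Per-kernel preference: trade L1 for shared memory (the 48/16 KiB split on Fermi)...
    cudaFuncSetCacheConfig(stencil3, cudaFuncCachePreferShared);
    // ...or the reverse for cache-heavy kernels:
    // cudaFuncSetCacheConfig(stencil3, cudaFuncCachePreferL1);
    return 0;
}

If Big K really allows the 4x 32 KiB blocks to be carved up independently, I would expect this interface to grow finer-grained options rather than change shape.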

Plus as Big-K special sauce:
- one dedicated PhysX processor per SMK
- a broken and unfixable design
*SCNR*
 
I thought PhysX was adapted to run on standard shaders, hence there is no dedicated PhysX unit (unless they've put the Ageia stuff on-chip).
scnr ???
 
I thought PhysX was adapted to run on standard shaders, hence there is no dedicated PhysX unit (unless they've put the Ageia stuff on-chip).
scnr ???

Sorry, Could Not Resist.

In other words, the PhysX part was a joke. ;)
 
3072 ALUs
-> 6x GPCs (512 SPs each)
--> 4 SMKs per GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two of which are 64-bit capable, re-using data paths from the other ALUs
----> two groups share a quad TMU
----> 4x 32 KiB L1 cache shared among the ALU blocks, configurable as scratchpad memory in block sizes of 32 KiB.
Seems reasonable to me. I don't think you're describing the intra-SMX 'groups' correctly (a group of two schedulers shares three 32-wide ALUs plus other units in GK104, and there are two such groups per SMX sharing L1/shared memory plus a few other things), and who knows how that'll evolve (see GF100 vs GF104), but the final numbers make a fair bit of sense.

advanced turbo (independently clockable GPCs?)
Extremely unlikely; that makes absolutely no sense for a GPU. Independent clocking makes sense on CPUs because single-threaded performance is key. That should never matter on GPUs, although in practice it might because of static tile allocation to specific GPCs/SMXs. Fermi was certainly static, and I think GK104 is as well, but they haven't really talked about it. They really should just switch to dynamic tile allocation à la SGX! ;)
 
Seems reasonable to me. I don't think you're describing the intra-SMX 'groups' correctly (a group of two schedulers shares three 32-wide ALUs plus other units in GK104, and there are two such groups per SMX sharing L1/shared memory plus a few other things), and who knows how that'll evolve (see GF100 vs GF104), but the final numbers make a fair bit of sense.

I know that GK104 is organized differently, but I think it is possible that Nvidia did not follow the same route for their GPU-Compute optimized chip.

WRT advanced GPU Boost: depending on how high you could go when enough GPCs idle, I think this could make a difference for serial performance. In other words, the more power-limited Big K turns out to be, the higher the possible gains for compiler-identifiable, latency-dominated tasks.
 
An independent clock for all GPCs or for each GPC?

With an ~850 MHz base clock, GK110 could offer a much higher boost in cases where performance is limited by the GPCs.
On the other hand, NV could use this to present a <3072 SP GeForce version with a ~1 GHz clock, since gaming performance favors a faster front-end.
 
What I meant was a common clock throughout each GPC, but individually adjustable, possibly based on available power and maybe even on thread priority or type.

In any case, Nvidia would need to cut down on something if they are going to stay within a 300 W power budget.
 
An independent clock for all GPCs or for each GPC?

With an ~850 MHz base clock, GK110 could offer a much higher boost in cases where performance is limited by the GPCs.
On the other hand, NV could use this to present a <3072 SP GeForce version with a ~1 GHz clock, since gaming performance favors a faster front-end.

I don't think Nvidia is that concerned with gaming performance for big Kepler, so that last point is doubtful if it would impact compute performance.

Likewise, the same could apply to what CarstenS is suggesting with individually clocked GPCs. Don't compute-oriented workloads generally push all compute units relatively uniformly? Hence even the current turbo from GK104 might be deemed unnecessary, and thus a waste of transistors.

IMO, for big Kepler, compute performance will matter most, with gaming performance being secondary. Unlike GK104, where gaming performance was king and compute performance secondary.

Regards,
SB
 