Old 21-Apr-2012, 01:14   #1
A1xLLcqAgt0qc2RyMz0y
Member
 
Join Date: Feb 2010
Posts: 558
Default Nvidia BigK GK110 Kepler Speculation Thread

The Big Kepler adds 3.5 billion transistors to the GK104.

Some of the improvements are:

reorganized processing cores with new instructions
an improved memory system with faster atomic processing and low-overhead ECC

So what are the additional changes that Nvidia has added to the Big Kepler that could use lots of transistors?

------------------------


Quote:
S0642 - Inside Kepler

Stephen Jones (NVIDIA), Lars Nyland (NVIDIA)

In this talk, individuals from the GPU architecture and CUDA software groups will dive into the features of the compute architecture for “Kepler” – NVIDIA’s new 7-billion transistor GPU. From the reorganized processing cores with new instructions and processing capabilities, to an improved memory system with faster atomic processing and low-overhead ECC, we will explore how the Kepler GPU achieves world leading performance and efficiency, and how it enables wholly new types of parallel problems to be solved.

Topic Areas: Parallel Programming Languages & Compilers
Level: Beginner

Day: Wednesday, 05/16
Time: 2:00 pm - 3:20 pm
Source: https://registration.gputechconf.com/?form=schedule
Change Drop Down date to Wednesday 5/16
Old 21-Apr-2012, 02:17   #2
iMacmatician
Member
 
Join Date: Jul 2010
Location: United States of America
Posts: 452
Default

Well there's this set of rumors/speculation from 3DCenter (translated) saying 3072 CCs, so that would use lots of transistors. According to that rumor, GK110 seems close to an overall doubled GK104 in terms of basic specs.
Old 21-Apr-2012, 03:04   #3
tunafish
Member
 
Join Date: Aug 2011
Posts: 408
Default

Big K isn't going to be just a doubled GK104. For Nvidia, the HPC/workstation segment is bigger than the high-end GPU one, so Big K will likely emphasize 64-bit throughput, with a healthy helping of caches.
Old 21-Apr-2012, 03:48   #4
iMacmatician
Member
 
Join Date: Jul 2010
Location: United States of America
Posts: 452
Default

I was thinking along the lines of CC count and bus width, but yeah you're right.

But if the 3072 CC stuff is true, then I'm interested to know how they could squeeze that many CCs into GK110, especially considering the additional compute features would presumably make the die bigger for the same CC count.
Old 21-Apr-2012, 05:49   #5
DavidGraham
Senior Member
 
Join Date: Dec 2009
Posts: 1,059
Default

512-bit memory bus?
Old 21-Apr-2012, 08:15   #6
AnarchX
Senior Member
 
Join Date: Apr 2007
Posts: 1,505
Default

Quote:
Originally Posted by iMacmatician View Post
But if the 3072 CC stuff is true, then I'm interested to know how they could squeeze that many CCs into GK110, especially considering the additional compute features would presumably make the die bigger for the same CC count.
The numbers of GPCs, SMXs and TMUs probably do not scale the same way as in GK104.

The more interesting questions are power consumption and the possibility of partly deactivated units on the top SKU.

Last edited by AnarchX; 21-Apr-2012 at 08:37.
Old 21-Apr-2012, 09:26   #7
UniversalTruth
Former Member
 
Join Date: Sep 2010
Posts: 1,529
Default

How big will its die be?
If they keep the same transistor density of ~12.04 MTr/mm2, then this 7000 M transistor beast will need around 580 mm2.
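
The estimate above is a single division. A minimal Python sketch, assuming the rumored 7000 M transistor count and a GK104-like density both hold (neither is confirmed); the function and constant names are made up for illustration:

```python
# Figures taken from the post above; both are rumored or estimated, not confirmed.
TRANSISTORS_M = 7000.0        # rumored GK110 transistor count, in millions
DENSITY_MTR_PER_MM2 = 12.04   # GK104 density quoted in the post, MTr/mm^2


def die_area_mm2(transistors_m: float, density_mtr_per_mm2: float) -> float:
    """Estimate die area assuming a uniform transistor density across the chip."""
    return transistors_m / density_mtr_per_mm2


area = die_area_mm2(TRANSISTORS_M, DENSITY_MTR_PER_MM2)
print(f"Estimated die area: {area:.0f} mm^2")
```

Real chips do not have uniform density (caches pack tighter than logic or I/O), so this is only a first-order figure.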
Old 21-Apr-2012, 10:07   #8
lanek
Senior Member
 
Join Date: Mar 2012
Location: Switzerland
Posts: 1,182
Default

It's because the 7 billion transistors (7000M) are not confirmed yet...

I really doubt that Nvidia, given their experience with 28nm, will end up with a 550mm2 chip. In reality, don't forget that we are absolutely not talking about Kepler as it exists today. We are talking about a card that could see the light of day in 5-6 months.
Old 21-Apr-2012, 13:10   #9
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,952
Default

merge this pointless thread back with the main one
__________________
Can it play WoW?
Old 21-Apr-2012, 16:12   #10
A1xLLcqAgt0qc2RyMz0y
Member
 
Join Date: Feb 2010
Posts: 558
Default

Quote:
Originally Posted by Jawed View Post
merge this pointless thread back with the main one
You mean make this thread disappear in the useless noise of Bitcoin mining, how much tax the EU adds vs. the USA, physics jobs in Germany vs. the USA, etc., etc., etc.?

If anything the other thread is the bloated pointless thread especially in relation to the Tesla line.

Having a thread specifically on the BigK GK110 Tesla/HPC GPU without the above mentioned useless posts is useful.

I expect that the GK110 will be fully dedicated to the professional market and would like to see what others expect the additional 3.5 billion transistors have added over the GK104 GPU.

And if you really like the other thread so much you can stay and post on that one and ignore this one.

Back to the speculation on what is added to make up the +3.5 billion transistors here are the guesses so far:

3072 CCs
64-bit throughput
healthy helping of caches
512-bit memory bus
Old 21-Apr-2012, 18:50   #11
AlphaWolf
Specious Misanthrope
 
Join Date: May 2003
Location: Treading Water
Posts: 8,123
Default

So should we expect gk110 to be a lot better at bitcoin mining per transistor?
Old 21-Apr-2012, 23:43   #12
jaredpace
Member
 
Join Date: Sep 2009
Posts: 157
Default

Quote:
Originally Posted by Jawed View Post
merge this pointless thread back with the main one
Old 22-Apr-2012, 11:57   #13
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,973
Default

3072 ALUs
-> 6x GPCs (à 512 SPs)
--> 4 SMK to each GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two of which are 64 Bit capable, re-using data-paths from the other ALUs
----> two groups share a quad TMU
----> 4x 32 kiB L1-Cache shared among the ALU blocks, configurable as scratchpad memory in block sizes of 32 kiB.
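
The hierarchy above multiplies out as follows. This is only a sketch of the speculated layout in this post, not a confirmed GK110 configuration, and the variable names are invented for illustration:

```python
# Speculated GK110 layout from the post above (not confirmed by Nvidia).
GPCS = 6                 # graphics processing clusters
SMK_PER_GPC = 4          # SM units per GPC ("SMK" in the post)
GROUPS_PER_SMK = 4       # ALU groups per SMK
ALUS_PER_GROUP = 32
L1_KIB_PER_SMK = 4 * 32  # "4x 32 kiB L1-Cache" per SMK

alus_per_smk = GROUPS_PER_SMK * ALUS_PER_GROUP
alus_per_gpc = SMK_PER_GPC * alus_per_smk
total_alus = GPCS * alus_per_gpc
total_l1_kib = GPCS * SMK_PER_GPC * L1_KIB_PER_SMK

print(f"ALUs per SMK: {alus_per_smk}")
print(f"ALUs per GPC: {alus_per_gpc}")
print(f"Total ALUs:   {total_alus}")
print(f"Total L1:     {total_l1_kib} KiB")
```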

512 Bit MI
-> 8x 64-Bit memory partitions
-> 4 GiB default memory size for gaming cards, twice for Tesla, Quadro
-> (possibly) 2048 kiB, more likely still 1024 kiB of L2-Cache

850 MHz core clock plus advanced turbo (independently clockable GPCs?) and probably 1.40ish GHz GDDR5 speed, not pushing the envelope here as much.
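
For a rough idea of what such a memory setup would deliver, here is a hedged Python sketch. It assumes the "1.40ish" figure is a GDDR5 command clock in GHz and that GDDR5 transfers 4 bits per pin per command clock; the bus width is the speculated 512 bits, and all names are hypothetical:

```python
# Assumptions (not confirmed): 512-bit bus, "1.40ish" read as a 1.40 GHz
# GDDR5 command clock, 4 data transfers per pin per command clock.
BUS_WIDTH_BITS = 512
COMMAND_CLOCK_GHZ = 1.40
TRANSFERS_PER_CLOCK = 4


def peak_bandwidth_gb_s(bus_bits: int, clock_ghz: float,
                        transfers_per_clock: int = TRANSFERS_PER_CLOCK) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    per_pin_gbps = clock_ghz * transfers_per_clock   # Gbit/s on each data pin
    return bus_bits * per_pin_gbps / 8               # 8 bits per byte


big_k_bandwidth = peak_bandwidth_gb_s(BUS_WIDTH_BITS, COMMAND_CLOCK_GHZ)
print(f"Speculated peak bandwidth: {big_k_bandwidth:.1f} GB/s")
```

This is a theoretical ceiling; sustained bandwidth is always lower.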

That makes close to 7 bln transistors and a 550 mm² die size, as agreed upon here.
Hm?

Plus as Big-K special sauce:
- one dedicated physx processor per SMK
- a broken and unfixable design
*SCNR*
__________________
English is not my native tongue. Before flaming please consider the possiblity that I did not mean to say what you might have read from my posts.
Work| Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
Old 22-Apr-2012, 12:04   #14
Davros
Senior Member
 
Join Date: Jun 2004
Posts: 11,075
Default

I thought PhysX was adapted to run on standard shaders, hence there is no dedicated PhysX unit (unless they've put the Ageia stuff on-chip)
scnr ???
__________________
Guardian of the Bodacious Three Terabytes of Gaming Goodness™
Old 22-Apr-2012, 12:24   #15
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,890
Default

Quote:
Originally Posted by Davros View Post
I thought PhysX was adapted to run on standard shaders, hence there is no dedicated PhysX unit (unless they've put the Ageia stuff on-chip)
scnr ???
Sorry, Could Not Resist.

In other words, the PhysX part was a joke.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature
My (currently dormant) blog: Teχlog
Old 22-Apr-2012, 14:14   #16
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,923
Default

Quote:
Originally Posted by CarstenS View Post
3072 ALUs
-> 6x GPCs (à 512 SPs)
--> 4 SMK to each GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two of which are 64 Bit capable, re-using data-paths from the other ALUs
----> two groups share a quad TMU
----> 4x 32 kiB L1-Cache shared among the ALU blocks, configurable as scratchpad memory in block sizes of 32 kiB.
Seems reasonable to me. I don't think you're describing the intra-SMX 'groups' correctly (a group of two schedulers share three 32-wide ALUs plus other units in GK104, and there are two such groups per SMX sharing L1/Shared Memory plus a few other things) and who knows how that'll evolve (see GF100 vs GF104) but the final numbers make a fair bit of sense.

Quote:
advanced turbo (independently clockable GPCs?)
Extremely unlikely; that makes absolutely no sense for a GPU. Independent clocking makes sense on CPUs because single-threaded performance is key. That should never matter on GPUs, although in practice it might because of static tile allocation to specific GPCs/SMXs. Fermi was certainly static, and I think GK104 is as well, but they haven't really talked about it. They really should just switch to dynamic tile allocation à la SGX!
__________________
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Old 22-Apr-2012, 15:13   #17
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,973
Default

Quote:
Originally Posted by Arun View Post
Seems reasonable to me. I don't think you're describing the intra-SMX 'groups' correctly (a group of two schedulers share three 32-wide ALUs plus other units in GK104, and there are two such groups per SMX sharing L1/Shared Memory plus a few other things) and who knows how that'll evolve (see GF100 vs GF104) but the final numbers make a fair bit of sense.
I know that GK104 is organized differently, but I think it is possible that Nvidia did not follow the same route for their GPU-compute-optimized chip.

WRT advanced GPU Boost: depending on how high you could go when enough GPCs idle, I think this could make a difference for serial performance. In other words, the more power-limited Big-K turns out to be, the higher your possible gains for compiler-identifiable, latency-dominated tasks.
Old 22-Apr-2012, 15:51   #18
AnarchX
Senior Member
 
Join Date: Apr 2007
Posts: 1,505
Default

An independent clock for all GPCs or for each GPC?

With ~850MHz base clock, GK110 could offer a much higher Boost, in cases when the performance is limited by the GPCs.
On the other hand NV could use this and present a < 3072SPs GeForce version, with ~1GHz clock, since gaming performance favors a faster front-end.
Old 22-Apr-2012, 17:38   #19
CarstenS
Senior Member
 
Join Date: May 2002
Location: Germany
Posts: 2,973
Default

What I meant was a common clock throughout each GPC, but individually adjustable, possibly based on available power and maybe even on thread priority or type.

In any case, Nvidia would need to cut down on something if they are going to stay within a 300 watt power budget.
Old 22-Apr-2012, 20:51   #20
Silent_Buddha
Regular
 
Join Date: Mar 2007
Posts: 10,437
Default

Quote:
Originally Posted by AnarchX View Post
An independent clock for all GPCs or for each GPC?

With ~850MHz base clock, GK110 could offer a much higher Boost, in cases when the performance is limited by the GPCs.
On the other hand NV could use this and present a < 3072SPs GeForce version, with ~1GHz clock, since gaming performance favors a faster front-end.
I don't think Nvidia is that concerned with gaming performance for big Kepler, hence that last option would be doubtful if it impacts compute performance.

Likewise, the same could be applied to what CarstenS is suggesting with individually clocked GPCs. Don't compute-oriented workloads generally push all compute units relatively uniformly? Hence even the current turbo on GK104 might be deemed unnecessary and hence a waste of transistors.

IMO, for big Kepler, compute performance will matter most, with gaming performance being secondary. Unlike GK104, where game performance was king and compute performance secondary.

Regards,
SB
Old 23-Apr-2012, 02:56   #21
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,556
Default

Quote:
Originally Posted by CarstenS View Post
Plus as Big-K special sauce:
- one dedicated physx processor per SMK
- a broken and unfixable design
*SCNR*
Don't be so mean ....do you want fries with that?
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 23-Apr-2012, 03:03   #22
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,556
Default

Quote:
Originally Posted by iMacmatician View Post
Well there's this set of rumors/speculation from 3DCenter (translated) saying 3072 CCs, so that would use lots of transistors. According to that rumor, GK110 seems close to an overall doubled GK104 in terms of basic specs.
ALUs are in a relative sense fairly "cheap" in hw compared to other units like TMUs. Tahiti in comparison packs 2048SPs into barely 365mm2, but here you'd need to note that amongst a multitude of other factors it also has "only" 128TMUs/32 ROPs compared to GK110.

If the information so far is accurate, GK110 might have a slightly higher transistor density than GK104.
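
To put a number on the density comparison, here is a small Python sketch using only figures quoted earlier in this thread (7000 M transistors, a hypothetical 550 mm2 die, and ~12.04 MTr/mm2 for GK104). All of these are rumors or estimates, and the names are illustrative:

```python
# Figures quoted earlier in the thread; none are confirmed.
GK104_DENSITY = 12.04          # MTr/mm^2
GK110_TRANSISTORS_M = 7000.0   # millions of transistors (rumored)
GK110_AREA_MM2 = 550.0         # hypothetical die size

gk110_density = GK110_TRANSISTORS_M / GK110_AREA_MM2
density_ratio = gk110_density / GK104_DENSITY

print(f"GK110 density: {gk110_density:.2f} MTr/mm^2")
print(f"Relative to GK104: {density_ratio:.3f}x")
```

Under these assumptions the rumored chip would be roughly 6% denser than GK104, which fits the "slightly higher" reading.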
Old 23-Apr-2012, 20:32   #23
Kaotik
Drunk Member
 
Join Date: Apr 2003
Posts: 5,414
Default

Quote:
Originally Posted by CarstenS View Post
3072 ALUs
-> 6x GPCs (à 512 SPs)
--> 4 SMK to each GPC, 128 ALUs/SMK
--> each SMK has
---> 4 groups of 32 ALUs
----> two groups share a quad TMU
First way I read it:
2 groups share a quad TMU > 1 SMK has 8 TMUs
4 SMK = 32 TMUs / GPC
6 GPC = 192 TMUs
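
The first reading can be written out as a quick Python check. It assumes the speculated layout quoted above (4 ALU groups per SMK, one quad TMU per two groups), which is not confirmed; the names are made up for illustration:

```python
# First reading of the speculated layout: every two ALU groups share one quad TMU.
GPCS = 6
SMK_PER_GPC = 4
GROUPS_PER_SMK = 4
GROUPS_PER_QUAD_TMU = 2
TMUS_PER_QUAD = 4

quad_tmus_per_smk = GROUPS_PER_SMK // GROUPS_PER_QUAD_TMU
tmus_per_smk = quad_tmus_per_smk * TMUS_PER_QUAD
tmus_per_gpc = SMK_PER_GPC * tmus_per_smk
total_tmus = GPCS * tmus_per_gpc

print(f"TMUs per SMK: {tmus_per_smk}")
print(f"TMUs per GPC: {tmus_per_gpc}")
print(f"Total TMUs:   {total_tmus}")
```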

Yeah, I don't see that happening, especially considering that IIRC they're next to useless on the GPGPU front?

The second, probably wrong, way I read it would give 48 TMUs, which is surely even more off.

Any way one looks at it, I don't see how they could, even in theory, fit a "double GK104" with added GPGPU + FP64 capabilities into a chip twice the size of GK104.
__________________
I'm nothing but a shattered soul...
Been ravaged by the chaotic beauty...
Ruined by the unreal temptations...
I was betrayed by my own beliefs...
Old 23-Apr-2012, 20:43   #24
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,556
Default

For one it's not twice the amount of units on all fronts (50% more raster/trisetups, 50% more TMUs etc.) and as a close second 7b transistors are almost twice as much as there are on GK104.

In fact there's nothing much that speaks against it considering a hypothetical 550mm2@28nm die; the only other case in point is that this time the desktop high end consumer has to pay for far more transistors than in the past which are HPC related and therefore not invested in 3D performance.

With Fermi/GF110 it was roughly 35% more transistors compared to GF114 where the performance difference between the two was give or take at 40%. If now GK110 is let's say 50% faster (which isn't absurd at all assuming those hypothetical specs are true especially considering the shitload of added bandwidth a 512bit bus grants even with relatively low GDDR5 frequencies) than GK104 but at the cost of almost twice the transistors, it's a totally different chapter and possibly also affecting power consumption.
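
The comparison above can be expressed as performance gained per unit of transistor budget. A rough Python sketch using the approximate figures from this post, with "almost twice the transistors" rounded to 2.0x as an assumption; the function name is hypothetical:

```python
def perf_per_transistor(perf_ratio: float, transistor_ratio: float) -> float:
    """Relative performance per transistor of the big chip vs. the small one."""
    return perf_ratio / transistor_ratio


# GF110 vs. GF114: ~40% faster with ~35% more transistors (figures from the post).
fermi = perf_per_transistor(1.40, 1.35)

# GK110 vs. GK104: hypothetical ~50% faster with ~2x the transistors (assumption).
kepler = perf_per_transistor(1.50, 2.00)

print(f"GF110 vs GF114: {fermi:.2f}x performance per transistor")
print(f"GK110 vs GK104: {kepler:.2f}x performance per transistor")
```

With these inputs the big Fermi roughly breaks even against its smaller sibling, while the speculated GK110 would deliver about three quarters of GK104's 3D performance per transistor.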
Old 23-Apr-2012, 21:12   #25
Kaotik
Drunk Member
 
Join Date: Apr 2003
Posts: 5,414
Default

GF114 wasn't anywhere close to being as stripped of GPGPU capabilities as GK104 is. There are far more things GK110 needs to add over GK104 than GF110 had over GF114, just for the GPGPU speed.
