NVIDIA Kepler speculation thread

Ailuros · Feb 14, 2012

whitetiger said:
I guess you are alluding to the GK110 being to the GF110 as the GK114 is to the GF114?
- meaning the GK110 is a 2048 SP chip with 64 SPs per SM, compared to the 96 SPs per SM of the GK114

So, that would fit with the die sizes of the GK110 being similar to the GF110....

Both chips therefore ending up with 2x SPs, and 25% more bandwidth, of their Fermi antecedent.

I had to get rid of the former "claim" in order to make the above assumption a little bit clearer. Please note that it's truly just an assumption, based of course on the GK104 specifications. And that's exactly why I called bullshit when I saw Lenzfire's claimed 6.4B transistors for the GK110.

If they got rid of a few of the GF110 bottlenecks, this is still a good chip

Depends what you mean with bottlenecks exactly; for the record I don't expect to see a 512bit bus for one.

whitetiger · Feb 14, 2012

Ailuros said:
Depends what you mean with bottlenecks exactly; for the record I don't expect to see a 512bit bus for one.

The GF110 wasn't bw limited AFAIK
- but a GK110 could be (just like the GK114)
- but the bottlenecks that I'm aware of are
1) not enough TMUs in the SM (fixed in the GF104/114)
2) Bus width problem into/out of the SM, limiting the Fill rate
(I can't remember off the top of my head if it was out of, or into the SM - probably out of, if it's fill-rate limited)

But anyway, there were supposed to be some limitations of the architecture, that aren't immediately obvious from a raw functional unit count, although the GF110 was known to be low on ROPs also.

So, basically, one would hope they'd fixed a few of these issues...

6.4B transistors?
50% more than the GK104 would get you there more-or-less, would it not?

Ailuros · Feb 14, 2012

whitetiger said:
The GF110 wasn't bw limited AFAIK
- but a GK110 could be (just like the GK114)
- but the bottlenecks that I'm aware of are
1) not enough TMUs in the SM (fixed in the GF104/114)

"Fixed" it's debatable; GF1x4 were merely designs to have a performance part within a reasonable distance to the lowest top dog salvage part. In a relative sense you could say that texel fillrate and on paper single precision FLOPs were redundant on GF1x4, but the main culprit would always had been bandwidth. You can either think that GF110 had too little texel fillrate (for which I'd need some solid indication I haven't seen so far) or GF114 too much (which is obviously closer to reality since the texel fillrate to bandwidth ratio is quite a bit different compared to GF110).

2) Bus width problem into/out of the SM, limiting the Fill rate
(I can't remember off the top of my head if it was out of, or into the SM - probably out of, if it's fill-rate limited)

Care to elaborate since I don't understand what you mean?

But anyway, there were supposed to be some limitations of the architecture, that aren't immediately obvious from a raw functional unit count, although the GF110 was known to be low on ROPs also.

Why is the GF110 low on ROPs? When ROPs are coupled to the MC like in this case it's normal to expect 48 ROPs when there are 8 ROPs in each partition (6*64bits). Each rasterizer out of the 4 for each GPC is capable of 8 pixels/clock (32 pixels/clock in total), but I don't see what that would have to do with the ROP amount. What am I missing?
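To make the arithmetic in the paragraph above explicit, here is a minimal sketch. The function names are made up for illustration; the per-partition and per-rasterizer figures are simply the ones quoted in this post.

```python
# Illustrative helpers only; the figures are those quoted in the post above.
def rop_count(bus_width_bits, partition_width_bits=64, rops_per_partition=8):
    """ROPs on a Fermi-style chip where ROPs are tied to memory partitions."""
    partitions = bus_width_bits // partition_width_bits
    return partitions * rops_per_partition

def raster_pixels_per_clock(gpcs, pixels_per_raster=8):
    """Peak pixels/clock with one rasterizer per GPC."""
    return gpcs * pixels_per_raster

gf110_rops = rop_count(384)                # 6 partitions * 8 ROPs
gf110_raster = raster_pixels_per_clock(4)  # 4 GPCs * 8 pixels/clock
print(gf110_rops, gf110_raster)
```

The two results come from independent parts of the chip (bus width on one side, GPC count on the other), which is why the ROP count by itself says nothing about the raster rate.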

6.4B transistors?
50% more than the GK104 would get you there more-or-less, would it not?

There's no safe equation for that as long as the exclusive HPC additional functionalities of the top dog are unknown. However twice or almost twice as many transistors as GK104 sounds idiotic, especially considering that the die area estate of GK110 is most likely at ~550mm2 as SA stated.
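For reference, a rough sketch of what the disputed 6.4B figure would imply, using only numbers mentioned in this thread. The constant and function names are hypothetical, and the die area is the rumoured ~550mm2, not a confirmed value.

```python
# Sketch under stated assumptions: both inputs are rumours quoted in the thread.
CLAIMED_GK110_TRANSISTORS = 6.4e9  # Lenzfire figure disputed above
ASSUMED_DIE_AREA_MM2 = 550         # "~550mm2" as attributed to SA

def implied_gk104_transistors(ratio):
    """GK104 transistor count implied if GK110 has `ratio` times as many."""
    return CLAIMED_GK110_TRANSISTORS / ratio

for ratio in (1.5, 2.0):
    billions = implied_gk104_transistors(ratio) / 1e9
    print(f"GK110 = {ratio}x GK104 -> GK104 ~ {billions:.2f}B transistors")

# Density the claim would require, in million transistors per mm^2.
density = CLAIMED_GK110_TRANSISTORS / ASSUMED_DIE_AREA_MM2 / 1e6
print(f"implied density ~ {density:.1f} Mtransistors/mm^2")
```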

TKK · Feb 14, 2012

Oh dear, that's what you get for making rough, uneducated guesses

hkultala said:
Btw. your transistor count for Bulldozer is way off. 1.2G is an impossible number; the correct one is about 1.5G.

It's the 'corrected' number AMD's PR gave out, so it's not my fault. I'm actually aware that it's very unlikely that this number is correct; that's why I added the "officially, at least".

Ailuros said:
You can either think that GF110 had too little texel fillrate (for which I'd need some solid indication I haven't seen so far) or GF114 too much (which is obviously closer to reality since the texel fillrate to bandwidth ratio is quite a bit different compared to GF110).

The only thing I noticed in computerbase reviews is that GF110 takes a slightly higher performance hit when enabling 16xAF compared to Tahiti and to a lesser extent Cayman. Nothing major, though.

Ailuros · Feb 14, 2012

TKK said:
The only thing I noticed in computerbase reviews is that GF110 takes a slightly higher performance hit when enabling 16xAF compared to Tahiti and to a lesser extent Cayman. Nothing major, though.

Do they have a recent review where they exclusively investigated AF without AA? If yes, then I've missed it. Otherwise, if it's the typical 1xAA/1xAF and 4xAA/16xAF tests (amongst others), how can you attribute the higher performance drop in the second case just to filtering performance? The framebuffer difference (2GB vs. 1.5GB) should be enough to make a slightly higher difference for Cayman, mostly for 4xMSAA and not AF. With 8xAA at 1080p and above those two, depending on the case, either break even or Cayman occasionally pulls slightly ahead.

***edit: this one is a wee bit more interesting: http://www.computerbase.de/artikel/...7970-crossfire/6/#abschnitt_leistung_mit_ssaa I'm just not sure if they've offset LOD on GeForces for that comparison, but either way the framebuffer differences are a bit clearer in that one.

mczak · Feb 14, 2012

Ailuros said:
Do they have a recent review where they exclusively investigated AF without AA?

Here you go, separate AA/AF scaling:
http://www.computerbase.de/artikel/...-radeon-hd-7970/7/#abschnitt_skalierungstests
(You can also find the same tests for hd5870/hd6870 unfortunately not for GTX460/560 though I'd really expect them to lose less performance there.)

The difference to Tahiti is barely worth mentioning (don't forget Tahiti actually has the same tmu/alu ratio as GF110 anyway though you could argue GF110 has somewhat more alus as it has dedicated SFUs).
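A quick sketch of why the ratio comes out the same. The unit counts are the public specs of the full chips (not numbers given in this thread), the hot clock is assumed to be exactly twice the TMU clock, and the class and method names are just for illustration.

```python
from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    alus: int
    tmus: int
    alu_clock_multiplier: float  # ALU clock relative to the TMU clock

    def alu_ops_per_texel(self) -> float:
        """ALU lanes available per TMU, normalised to the TMU clock."""
        return self.alus * self.alu_clock_multiplier / self.tmus

# Assumed public specs: GF110 ALUs run at the hot clock (2x), Tahiti has one clock.
gf110 = Chip("GF110", alus=512, tmus=64, alu_clock_multiplier=2.0)
tahiti = Chip("Tahiti", alus=2048, tmus=128, alu_clock_multiplier=1.0)

for chip in (gf110, tahiti):
    print(chip.name, chip.alu_ops_per_texel())
```

The dedicated SFUs on GF110 are not counted here, which is the "somewhat more alus" caveat.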

whitetiger · Feb 14, 2012

Ailuros said:
Care to elaborate since I don't understand what you mean?

It's the path from the SMs to the ROPs - 64-bits per SM

http://www.behardware.com/articles/795-3/report-nvidia-geforce-gtx-460.html

We were also able to gain a better understanding of fillrate limitation on the GF100 and the GF104. The limitation comes from a datapath bottleneck between the SMs and ROPs – 64 bits per cycle.

We don’t know if this is a deliberate architecture limitation decided by NVIDIA, a compromise made along the way or the result of a makeshift solution to some problem. Whatever the reason, we do see it as a significant architecture limitation which prevents the card from benefiting from all ROPs and to a lesser extent from all the available memory bandwidth.

It's why AA on Fermi appears to cost less than on other architectures
- it's actually because it's only with AA that the ROPs can get fully utilised
- without AA the SMs can't supply enough pixels to the ROPs...

So, it may not be an important limitation, given that everyone uses AA anyway...

Ailuros · Feb 14, 2012

whitetiger said:
It's the path from the SMs to the ROPs - 64-bits per SM

http://www.behardware.com/articles/795-3/report-nvidia-geforce-gtx-460.html

It's why AA on Fermi appears to cost less than on other architectures
- it's actually because it's only with AA that the ROPs can get fully utilised
- without AA the SMs can't supply enough pixels to the ROPs...

So, it may not be an important limitation, given that everyone uses AA anyway...

I see what you mean. However the 64bit datapath between SMs and ROPs is the same between GF110 and GF114 and it's more an architectural decision (for whatever reason) than anything else.

It's bleedingly obvious that the 4 raster units on GF110, capable of 8 pixels/clock each, can process only 32 pixels/clock in total (which is exactly what I wrote in one of my former posts). Since ROPs and memory controller aren't decoupled on Fermi, the number of ROPs depends on the bus width in a relative sense. I don't expect the latter to have changed in Kepler, and the only other difference would be that GK104 with 4 GPCs will be capable of 32 pixels/clock from the 4 raster units this time.

Tessellation aside if NV should also use the GK104 for Quadros this time, I don't expect the desktop variant to be capable of as much geometry.

mczak said:
Here you go, separate AA/AF scaling:
http://www.computerbase.de/artikel/...-radeon-hd-7970/7/#abschnitt_skalierungstests
(You can also find the same tests for hd5870/hd6870 unfortunately not for GTX460/560 though I'd really expect them to lose less performance there.)

The difference to Tahiti is barely worth mentioning (don't forget Tahiti actually has the same tmu/alu ratio as GF110 anyway though you could argue GF110 has somewhat more alus as it has dedicated SFUs).

Thank you. Hadn't seen that one.

whitetiger · Feb 14, 2012

Ailuros said:
I see what you mean. However the 64bit datapath between SMs and ROPs is the same between GF110 and GF114 and it's more an architectural decision (for whatever reason) than anything else.

It's bleedingly obvious that the 4 raster units on GF110, capable of 8 pixels/clock each, can process only 32 pixels/clock in total (which is exactly what I wrote in one of my former posts). Since ROPs and memory controller aren't decoupled on Fermi, the number of ROPs depends on the bus width in a relative sense. I don't expect the latter to have changed in Kepler, and the only other difference would be that GK104 with 4 GPCs will be capable of 32 pixels/clock from the 4 raster units this time.

So the GK104 'fixes' this by having 4 GPCs, and it's probably safe to assume that the GK110 'fixes' this by having 8 GPCs, each with 4 SMs, with 64SPs each...

And there, ladies & gentlemen, we have the spec of the GK110!

The 384-bit bus is a given
- so how about TMUs?
- hopefully some sensible increase over the GF110....
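Spelled out, the speculated layout multiplies up as follows. This is only a sketch of the guess in this post, not a confirmed spec, and the function name is hypothetical.

```python
# Speculated GK110 layout from this post; nothing here is confirmed.
def shader_config(gpcs, sms_per_gpc, sps_per_sm):
    """Return (total SMs, total SPs) for a GPC/SM/SP hierarchy."""
    total_sms = gpcs * sms_per_gpc
    return total_sms, total_sms * sps_per_sm

gk110_sms, gk110_sps = shader_config(gpcs=8, sms_per_gpc=4, sps_per_sm=64)
print(gk110_sms, gk110_sps)
```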

mczak · Feb 14, 2012

Ailuros said:
I see what you mean. However the 64bit datapath between SMs and ROPs is the same between GF110 and GF114 and it's more an architectural decision (for whatever reason) than anything else.

Yes. But with GF114 having "fatter" SMs (and half the SMs in total) it's probably more of a problem there (if it's really an issue outside synthetic benchmarks). Also on GF114 there are more ROPs (per SM) so essentially half the (color) ROPs are always idle (I don't think this should change with 4xmsaa in theory at least).
Also, the rasterizer matching that rate is only sort of true: since it's 64 bits/SM, if you've got 4-channel fp16 (for instance) your effective pixel fill rate is now down to 16 pixels/clock (full GF110) or 8 pixels/clock (full GF114). Maybe such trivial shaders, which would need a higher export rate, don't really matter much for performance overall, but with the synthetic tests the limitation is easily visible.
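The export limit above can be sketched like this. The 64 bits per SM per clock comes from the BeHardware article; the SM and ROP counts are the public specs of the full chips rather than numbers from this post, and the names are illustrative.

```python
# Sketch only: SM/ROP counts are assumed public specs of the full chips.
EXPORT_BITS_PER_SM_PER_CLOCK = 64  # SM-to-ROP datapath per the BeHardware article

CHIPS = {
    "GF110": {"sms": 16, "rops": 48},
    "GF114": {"sms": 8, "rops": 32},
}

def export_pixels_per_clock(sms, bits_per_pixel):
    """Peak pixels/clock the SMs can hand to the ROPs for a given format."""
    return sms * EXPORT_BITS_PER_SM_PER_CLOCK // bits_per_pixel

for name, chip in CHIPS.items():
    int8 = export_pixels_per_clock(chip["sms"], 32)  # 4-channel int8
    fp16 = export_pixels_per_clock(chip["sms"], 64)  # 4-channel fp16
    print(f"{name}: {int8} px/clk (int8), {fp16} px/clk (fp16), {chip['rops']} ROPs")
```

With these numbers a full GF114 exports at most 16 int8 pixels/clock against 32 ROPs, which is the "half the ROPs are idle" case.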

Tessellation aside if NV should also use the GK104 for Quadros this time, I don't expect the desktop variant to be capable of as much geometry.

That makes sense, especially since GK110 comes quite a bit later. Nvidia might simply decide to wait with releasing new Quadros though; it's not like AMD is flooding the market with GCN based workstation cards right now which could threaten their high-end cards (if they even can, since Tahiti's geometry throughput is still no match for GF110). (For that matter, no FireStream GCN parts either so far, even though the chip has all the bloat bits needed for that market.)

Ailuros · Feb 14, 2012

whitetiger said:
So the GK104 'fixes' this by having 4 GPCs, and it's probably safe to assume that the GK110 'fixes' this by having 8 GPCs, each with 4 SMs, with 64SPs each...

Jebus, as if it would have been so damn hard to reach such an assumption; and no, I don't know yet if it's true. It makes sense though. Now all you need is to tell me what it actually "fixes", since how many pixels/clock the raster units can process is completely irrelevant to the ROPs, as Damien points out in the article you linked to.

And there, ladies & gentlemen, we have the spec of the GK110!

The 384-bit bus is a given
- so how about TMUs?
- hopefully some sensible increase over the GF110....

How many do you think the GK104 has?

Ailuros · Feb 14, 2012

mczak said:
Yes. But with GF114 having "fatter" SMs (and half the SMs in total) it's probably more of a problem there (if it's really an issue outside synthetic benchmarks). Also on GF114 there are more ROPs (per SM) so essentially half the (color) ROPs are always idle (I don't think this should change with 4xmsaa in theory at least).
Also, the rasterizer matching that rate is only sort of true: since it's 64 bits/SM, if you've got 4-channel fp16 (for instance) your effective pixel fill rate is now down to 16 pixels/clock (full GF110) or 8 pixels/clock (full GF114). Maybe such trivial shaders, which would need a higher export rate, don't really matter much for performance overall, but with the synthetic tests the limitation is easily visible.

Does it really matter that much in the end in a GF114 vs. GF100 comparison, given that the former has way less bandwidth than the latter anyway?

That makes sense, especially since GK110 comes quite a bit later. Nvidia might simply decide to wait with releasing new Quadros though; it's not like AMD is flooding the market with GCN based workstation cards right now which could threaten their high-end cards (if they even can, since Tahiti's geometry throughput is still no match for GF110). (For that matter, no FireStream GCN parts either so far, even though the chip has all the bloat bits needed for that market.)

I don't consider that "quite a bit" as confirmed yet. It'll come down to what shape GK110 is really in and whether, and how many, metal spins it might need. With none (highly unlikely IMO) the "quite a bit" is probably not valid; with just one, chances are high that it might arrive somewhere mid-year. More than that is more in the "quite a bit" region.

whitetiger · Feb 14, 2012

Ailuros said:
Jebus, as if it would have been so damn hard to reach such an assumption; and no, I don't know yet if it's true. It makes sense though. Now all you need is to tell me what it actually "fixes", since how many pixels/clock the raster units can process is completely irrelevant to the ROPs, as Damien points out in the article you linked to.

How many do you think the GK104 has?

If you've got something useful to contribute, by all means go ahead and do so!

mczak · Feb 14, 2012

Ailuros said:
Does it really matter that much in the end in a GF114 vs. GF100 comparison, given that the former has way less bandwidth than the latter anyway?

Well, even taking the bandwidth difference into account, GF114 still has considerably less pixel export capability (as GF110 has roughly 1.5 times the bandwidth but twice the shader export capability), so the picture doesn't really change there.
As said I'm not sure it really makes much of a difference in the real world, but those fillrate tests were typically always bound by memory bandwidth, with ROPs capabilities usually far exceeding what the memory could sustain (well at least int8 is now ROP bound with Tahiti actually). But with GF114 (and GF110 too even) the ROPs capabilities exceed that of shader export (usually by a factor of 2 for GF114), and often it's shader export limiting these tests, not memory bandwidth.
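Put as numbers, using only the rough ratios from this post (variable names are illustrative):

```python
# Rough GF110 / GF114 ratios as stated in the post; not measured values.
bandwidth_ratio = 1.5  # "roughly 1.5 times the bandwidth"
export_ratio = 2.0     # "twice the shader export capability"

# Shader export available per unit of memory bandwidth, GF110 relative to GF114.
export_per_bandwidth = export_ratio / bandwidth_ratio
print(f"GF110 has ~{export_per_bandwidth:.2f}x the export per unit of bandwidth")
```

So even after normalising for bandwidth, GF114 is left with about three quarters of GF110's export capability.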

I don't consider that "quite a bit" as confirmed yet. It'll come down to what shape GK110 is really in and whether, and how many, metal spins it might need. With none (highly unlikely IMO) the "quite a bit" is probably not valid; with just one, chances are high that it might arrive somewhere mid-year. More than that is more in the "quite a bit" region.

I dunno with latest rumor I find it unlikely it's before August. Though maybe that fits your definition of mid-year, in any case it should be "a couple of months later" :smile:.

Ailuros · Feb 14, 2012

whitetiger said:
If you've got something useful to contribute, by all means go ahead and do so!

I thought you like riddles

mczak said:
I dunno with latest rumor I find it unlikely it's before August. Though maybe that fits your definition of mid-year, in any case it should be "a couple of months later" :smile:.

A couple of months later in the strict sense, is 3 months before August

fellix · Feb 14, 2012

Eight setup pipes in Kepler are actually not that far from plausible. Yes, 8 primitives per clock is probably an overkill by a wide margin, and the logic complexity too, but if NV is seeking an easy way to boost scan-out throughput, they could just use simpler setup units with half-rate speed (and 1/4 rate for consumer SKUs). This will keep the logic block size in check and will provide more optimized wiring to the SIMD multi-processors, avoiding critical hotspots.
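As a small sketch of that trade-off (the function name is made up, and the rates are the hypothetical ones from this post, not known Kepler figures):

```python
# Hypothetical setup-unit configurations from the post above.
def peak_primitives_per_clock(setup_units, rate_per_unit):
    """Peak triangle setup rate: number of units times per-unit rate."""
    return setup_units * rate_per_unit

for label, rate in (("full rate", 1.0), ("half rate", 0.5), ("quarter rate", 0.25)):
    print(f"8 units at {label}: {peak_primitives_per_clock(8, rate)} prims/clock")
```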

whitetiger · Feb 14, 2012

Ailuros said:
I thought you like riddles

Well, I think the GK114 will have the same ratio of TMU to SP as the GF114
- and the GK114 has 4x the SPs, but it depends on which side of the hotclock the GF114 TMUs were on.
- as I said the GF104/114 fixed the problem that the GF100 had with not having enough TMUs ...

On a different note:
Yikes:
http://techreport.com/discussions.x/22478

Man from Atlantis · Feb 14, 2012

The more important question is whether Nvidia will decouple the ROPs from the memory controller on GK110. As Ailuros points out, the supposed GK110 has 8 GPCs, 32 SMs and 2048 CCs with a 384-bit wide bus. On Fermi you couldn't use 32 SMs and 64 ROPs with a 384-bit bus; you'd need 512-bit for it.

fellix · Feb 14, 2012

NV still has room for improvements for the ROPs on some surface formats that are now half or 1/4 rate compared to AMD's architecture. Boosting the count (even if that would mean decoupling) isn't imperative, methinks. Also, decoupling means yet another mesh of wires for the cross-bar (i.e. a hot-spot), which NV had particularly bad dealings with during Fermi's design process, especially with a large number of end-points.

ninelven · Feb 14, 2012

@ Man from Atlantis, they could just increase the number of ROPs per memory channel. But I don't really think ROPs are that much of a performance issue...
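The two options discussed in the last few posts can be sketched as follows, assuming Fermi-style coupling of ROPs to 64-bit memory partitions. The helper names are hypothetical, and the higher per-partition counts are illustrations, not rumoured specs.

```python
import math

PARTITION_WIDTH_BITS = 64  # one Fermi-style memory partition

def bus_width_for_rops(rops, rops_per_partition=8):
    """Bus width needed if each 64-bit partition carries a fixed number of ROPs."""
    partitions = math.ceil(rops / rops_per_partition)
    return partitions * PARTITION_WIDTH_BITS

def rops_on_bus(bus_width_bits, rops_per_partition):
    """ROP count on a given bus for a chosen number of ROPs per partition."""
    return (bus_width_bits // PARTITION_WIDTH_BITS) * rops_per_partition

print(bus_width_for_rops(64))  # coupled design, 8 ROPs per partition
for per_partition in (8, 12, 16):
    print(per_partition, rops_on_bus(384, per_partition))
```

Note that exactly 64 ROPs is not reachable on six partitions with a whole number of ROPs per partition, so raising the per-channel count would land on a different total.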
