NVIDIA Kepler speculation thread

Arty · Mar 16, 2012

Kaotik said:
What's up with hiding the BF3 7970 performance? The bar is cut short and number hidden

Pending "Doubele Confirmation". (The scale is wrong)

DarthShader · Mar 16, 2012

Caches:

768kb L2 cache
64kb Shared Memory/L1 cache per SMX
Texture Cache
Uniform Cache
65536 x 32bit registers per SMX

4 Schedulers with 8 dispatch units per SMX
8 SMX inside a GK104 chip.

Man from Atlantis · Mar 16, 2012

i was foolishly hoping for 1MB L2 but 768KB is good enough, 50% more L2 and off-chip bandwidth should be enough for the ROPs.. 4 schedulers can be seen clearly unlike in the GF100 dieshot btw..

Kaotik said:
Whatever it is, it's hiding 7970 performance completely on purpose, it's not being done on any other graph, and the box entitling game isn't showing on them either.

if 7970 were faster, the graph wouldn't be stopped at 80 FPS.. an earlier GDC rumor indicates +10% for GTX680 so 65FPS is likely for 7970

EDIT: BTW it seems that nvidia thought it was brute force to go higher fillrate for small triangles and they try to balance the tessellator system.. they are now similar to AMD but they should still be better at smaller triangles.. i wonder how it fares against GF110 in pure tess benchmarks(tessmark) at same clock..

EDIT2: it looks like it's nearly 2x faster than my overclocked GTX460, i just wish it was cheaper and i could get a msi lightning soon

A1xLLcqAgt0qc2RyMz0y · Mar 16, 2012

Kaotik said:
Whatever it is, it's hiding 7970 performance completely on purpose

Tin-foil hat time

A1xLLcqAgt0qc2RyMz0y · Mar 16, 2012

DarthShader said:
Caches:

768kb L2 cache
64kb Shared Memory/L1 cache per SMX
Texture Cache
Uniform Cache
65536 x 32bit registers per SMX

4 Schedulers with 8 dispatch units per SMX
8 SMX inside a GK104 chip.

These slides do not specify any cache sizes except the L1.

http://imgur.com/a/aQmuA#EFjJN

Do you have a link for where the "768kb L2 cache" size specification is stated?

The Instruction Cache size is also unknown.

iMacmatician · Mar 16, 2012

Man from Atlantis said:
another 640M review

Can anyone estimate die size from this picture?

Man from Atlantis · Mar 16, 2012

iMacmatician said:
Can anyone estimate die size from this picture?

compare it to GF108 (116 mm²)

Gipsel · Mar 16, 2012

Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical and replicated subunits. That would somehow fit the 4 schedulers, but what is in there?

The only way (I can think of right now) one can distribute the units would be that each dual-issue scheduler delivers its instructions to a set of 3 vec16 ALUs, 8 L/S units and 8 SFUs. That basically means one SMX would be a package of four GF104-style SMs (somewhat reminiscent of G80/GT200) where the hotclock and one scheduler got lost (and the local memory, TMUs and some other stuff are shared). The scheduler can issue two instructions from one thread each cycle and alternates each cycle between "even" and "odd" threads (the same would then be true for the register access; maybe that's why one can identify 8 vector register files in each SMX: even and odd threads have separate register files). Or maybe a better picture: a scheduler issues up to 4 instructions from two threads every two clock cycles. Or the scheduler issues a single instruction from each of two threads every cycle (and the vecALU the instruction got issued to is blocked in the next cycle, because one can issue an instruction for a 32-element warp only every second cycle to a vALU with 16 lanes). The last version would basically work like the two single-issue schedulers in a GF100/110 SM, just that the schedulers run at the same clock as the ALUs and can therefore supply more of them.

Does anyone have a clever idea how this really works?
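As a rough sanity check of the partition guessed at above, here is a small Python tally. It assumes each of the 4 schedulers owns 3 vec16 ALUs, 8 L/S units and 8 SFUs, which is speculation from the die shot and not something the slides confirm. The constant names are made up for illustration.

```python
# Hypothetical per-scheduler partition of one SMX.
# These numbers are the guess from the post, not an official specification.
SCHEDULERS_PER_SMX = 4
VEC16_ALUS_PER_SCHEDULER = 3
LANES_PER_VEC_ALU = 16
LS_UNITS_PER_SCHEDULER = 8
SFUS_PER_SCHEDULER = 8
WARP_SIZE = 32


def smx_totals():
    """Sum the guessed per-scheduler resources over one SMX."""
    alu_lanes = SCHEDULERS_PER_SMX * VEC16_ALUS_PER_SCHEDULER * LANES_PER_VEC_ALU
    return {
        "alu_lanes": alu_lanes,
        "ls_units": SCHEDULERS_PER_SMX * LS_UNITS_PER_SCHEDULER,
        "sfus": SCHEDULERS_PER_SMX * SFUS_PER_SCHEDULER,
        # A 32-wide warp on a 16-lane ALU occupies it for this many cycles.
        "cycles_per_warp_instruction": WARP_SIZE // LANES_PER_VEC_ALU,
    }


totals = smx_totals()
print(totals)
```

If the guess is right, the totals should line up with the official 6x32 ALUs, 32 L/S and 32 SFUs per SMX, with each vec16 ALU busy for two cycles per warp instruction.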

PS:
If they didn't make a similar mistake in those slides as during the Fermi presentation, the total register space is the same as with GF100/GF110 (2 MB), so really tiny compared to Tahiti (8 MB). I have a hard time believing that number, considering the similarity of the ALU count of GK104 and Tahiti. I would expect double the value given in that slide (4 MB), i.e. 512 kB per SMX or 128 kB per scheduler.
But also the local memory/L1 is quite small (still 64kB) considering how many threads/workgroups on one SMX have to share it.
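The register numbers above can be checked with a few lines of Python. This is only a sketch: the per-SMX register count and SMX count come from the slide figures quoted earlier in the thread, the 8 MB Tahiti total is taken from the post as-is, and the "doubled" values are the guess, not a confirmed spec.

```python
KIB = 1024
MIB = 1024 * KIB

REGISTERS_PER_SMX = 65536      # from the slide
BYTES_PER_REGISTER = 4         # 32-bit registers
SMX_PER_GK104 = 8              # from the slide
SCHEDULERS_PER_SMX = 4         # from the slide
TAHITI_REGISTER_BYTES = 8 * MIB  # figure quoted in the post

# Register space if the slide is correct.
slide_per_smx = REGISTERS_PER_SMX * BYTES_PER_REGISTER
slide_total = slide_per_smx * SMX_PER_GK104

# Register space if the real value is double the slide (the guess above).
guess_per_smx = 2 * slide_per_smx
guess_per_scheduler = guess_per_smx // SCHEDULERS_PER_SMX
guess_total = guess_per_smx * SMX_PER_GK104

print(f"slide: {slide_per_smx // KIB} KiB per SMX, {slide_total // MIB} MiB total")
print(f"guess: {guess_per_smx // KIB} KiB per SMX, "
      f"{guess_per_scheduler // KIB} KiB per scheduler, {guess_total // MIB} MiB total")
print(f"Tahiti / slide total: {TAHITI_REGISTER_BYTES // slide_total}x")
```

With the slide figures, Tahiti would have four times the register space of GK104; with the doubled guess the gap shrinks to a factor of two.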

DarthShader · Mar 16, 2012

A1xLLcqAgt0qc2RyMz0y said:
Do you have a link for where the "768kb L2 cache" size specification is stated?

Page 3 of the Chinese preview, what else could I have?

psolord · Mar 16, 2012

doob said:
And in the end both will also beat the consumer in performance/price.

+1000^1000

If these 256bit Kepler cards are priced as ridiculously as the Tahiti cards, I will file a complaint for price fixing.

Gipsel · Mar 16, 2012

CarstenS said:
Actually, what I'm seeing is different. Comparing the latest high-end cards, which both have a 384-bit memory interface and where Nvidia has the advantage of 50% more ROPs, AMD loses less performance when going from 4x MSAA to 8x MSAA.

Probably because something else goes wrong on Fermi. The whole export limitation of the SMs is quite a fuckup in my opinion. Practically, GF100/110 does not have 50% more ROPs; it effectively has the same number (but lower clocked) or even fewer ROPs than a Cayman/Tahiti if you factor that in.

SimBy · Mar 16, 2012

What I find ridiculous are claims that Tahiti is priced ridiculously.

CarstenS · Mar 16, 2012

Gipsel said:
Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical and replicated subunits. That would somehow fit the 4 schedulers, but what is in there?

That looks strange indeed. I've been staring at this image for a while also and did not come up with something conclusive yet.

CarstenS · Mar 16, 2012

Gipsel said:
Probably because something else goes wrong on Fermi. The whole export limitation of the SMs is quite a fuckup in my opinion. Practically, GF100/110 does not have 50% more ROPs; it effectively has the same number (but lower clocked) or even fewer ROPs than a Cayman/Tahiti if you factor that in.

Right, it (Fermi GF100/b) has a ROP excess, so to speak. But since the ROPs only do 4x MSAA single-cycle and loop for 8x, that should make for an even smaller performance hit when switching to 8x. But it doesn't.

fellix · Mar 16, 2012

Gipsel said:
Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical and replicated subunits. That would somehow fit the 4 schedulers, but what is in there?

Maybe the RF/scheduler block takes more area than we think? Especially if the schedulers are fully associative to the SIMD lanes and not bound to a subset like in GCN, i.e. every scheduler can issue an instruction to any SIMD. That would really be a huge overhead, if true. :???:

silent_guy · Mar 16, 2012

SimBy said:
What I find ridiculous are claims that Tahiti is priced ridiculously.

If Tahiti and GK104 are performing the same and they cost the same, then they are both priced ridiculously or not. The rest is personal opinion. I find $550 a bit much for a GPU if you can get a new iPad (it's gorgeous!) for the same price, but that's just me.

Man from Atlantis · Mar 16, 2012

The PolyMorph Engine, which is primarily responsible for tessellation calculations, has also been upgraded to version 2.0 in the "Kepler" architecture. The integrated tessellator has been updated, with twice the computational efficiency of "Fermi" and a 4x advantage over the Radeon HD7970.

A1xLLcqAgt0qc2RyMz0y · Mar 16, 2012

DarthShader said:
Page 3 of the Chinese preview, what else could I have?

Actually it was page 2 that had 768 mentioned. And these pages take forever to load.

http://www.hkepc.com/7672/page/2#view

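Here is a direct quote:

"In 'GK-104', 2 SMX form 1 GPC, and the core integrates a total of 4 GPCs and 4 Raster Engines sharing a 768KB L2 Cache; the cache specification is the same as the existing 'Fermi' series. However, 'Kepler' has added support for the PCI-E 3.0 specification, raising the transfer bandwidth between the GPU and the motherboard. NVIDIA has also changed the Memory Controller specification of the 'GK-104' core: the core integrates only 4 64bit Memory Controllers, supporting 256bit memory in total, which is lower than the 384bit of the previous-generation GF110 and of its main rival, AMD's 'Tahiti' core."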

How do we know that the 768 refers to Kepler and not Fermi?

Rangers · Mar 16, 2012

jaredpace said:
AMD is beat on performance/watt and performance/mm2. Finally!

Yeah, only took them what, 5 years?

About time they took a turn.

Their absolute performance leadership is going to take a major hit though, if it exists at all.

Mize · Mar 16, 2012

silent_guy said:
If Tahiti and GK104 are performing the same and they cost the same, then they are both priced ridiculously or not. The rest is personal opinion. I find $550 a bit much for a GPU if you can get a new iPad (it's gorgeous!) for the same price, but that's just me.

Yeah, but can an iPad play Crysis on max at 50 fps?
