NVIDIA Kepler speculation thread

Caches:

768 KB L2 cache
64 KB Shared Memory/L1 cache per SMX
Texture Cache
Uniform Cache
65536 x 32-bit registers per SMX

4 Schedulers with 8 dispatch units per SMX
8 SMX inside a GK104 chip.
 
I was foolishly hoping for 1 MB of L2, but 768 KB is good enough; 50% more L2 and off-chip bandwidth should be enough for the ROPs. The 4 schedulers can be seen clearly, unlike in the GF100 die shot, btw.

Whatever it is, it's hiding the 7970's performance completely on purpose; it's not being done on any other graph, and the box naming the game isn't showing on them either.

If the 7970 were faster, the graph wouldn't stop at 80 FPS. An earlier GDC rumor indicated +10% for the GTX 680, so 65 FPS is likely for the 7970.


EDIT: BTW, it seems that NVIDIA thought it was brute force to go for a higher fillrate with small triangles, and they tried to balance the tessellator system. They are now similar to AMD, but they should still be better at smaller triangles. I wonder how it fares against GF110 in pure tessellation benchmarks (TessMark) at the same clock.

EDIT2: It looks like it's nearly 2x faster than my overclocked GTX 460. I just wish it was cheaper and I could get an MSI Lightning soon :D
 
Another 640M review

01xed0k.jpg
Can anyone estimate die size from this picture?
 
Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical, replicated subunits. That would somehow fit the 4 schedulers, but what is in there?

The only way (I can think of right now) to distribute the units would be for each dual-issue scheduler to deliver its instructions to a set of 3 vec16 ALUs, 8 L/S units and 8 SFUs. That basically means one SMX would be a package of four GF104-style SMs (somewhat reminiscent of G80/GT200) where the hot clock and one scheduler got lost (and the local memory, TMUs and some other stuff are shared). The scheduler can issue two instructions from one thread each cycle and alternates each cycle between "even" and "odd" threads (the same would then be true for the register access; maybe that's why one can identify 8 vector register files in each SMX: even and odd threads have separate register files). Or maybe a better picture: a scheduler issues up to 4 instructions from two threads every two clock cycles. Or the scheduler issues, each cycle, a single instruction from two threads (and the vALU the instruction was issued to is blocked in the next cycle, because one can issue an instruction for a 32-element warp only every second cycle to a vALU with 16 lanes). The last version would basically work like the two single-issue schedulers in a GF100/110 SM, just that the schedulers run at the same clock as the ALUs and can therefore supply more of them.
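
To sanity-check the arithmetic of that guess, here is a minimal Python sketch. The per-scheduler unit counts are the speculative partition described above, not a confirmed layout, and the constant and function names are made up for illustration.

```python
# Speculative per-scheduler partition from the post above (an assumption,
# not a confirmed hardware layout).
SCHEDULERS_PER_SMX = 4
VEC16_ALUS_PER_SCHEDULER = 3
LANES_PER_VEC_ALU = 16
LS_UNITS_PER_SCHEDULER = 8
SFUS_PER_SCHEDULER = 8
WARP_SIZE = 32


def smx_totals():
    """Sum the hypothesized per-scheduler units over one SMX."""
    return {
        "alu_lanes": SCHEDULERS_PER_SMX * VEC16_ALUS_PER_SCHEDULER * LANES_PER_VEC_ALU,
        "ls_units": SCHEDULERS_PER_SMX * LS_UNITS_PER_SCHEDULER,
        "sfus": SCHEDULERS_PER_SMX * SFUS_PER_SCHEDULER,
    }


def cycles_per_warp(warp_size=WARP_SIZE, lanes=LANES_PER_VEC_ALU):
    """Cycles a vector ALU is occupied by one warp instruction (ceiling division)."""
    return -(-warp_size // lanes)


# Official per-SMX figures quoted in the thread: 6x32 ALUs, 32 L/S, 32 SFUs.
official = {"alu_lanes": 6 * 32, "ls_units": 32, "sfus": 32}

print(smx_totals() == official)
print(cycles_per_warp())
```

If the totals match the official figures, the partition is at least numerically consistent; it says nothing about whether the die is really organized that way.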

Does anyone have a clever idea how this really works?

PS:
If they didn't make a similar mistake in those slides as during the Fermi presentation, the total register space is the same as with GF100/GF110 (2 MB), so really tiny compared to Tahiti (8 MB). I have a hard time believing that number, considering the similarity of the ALU counts of GK104 and Tahiti. I would expect double the value given in that slide (4 MB), i.e. 512 kB per SMX or 128 kB per scheduler.
But the local memory/L1 is also quite small (still 64 kB), considering how many threads/workgroups on one SMX have to share it.
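
For reference, the register-space numbers can be reproduced with a quick calculation. This is only a sketch using figures quoted in this thread (65536 x 32-bit registers per SMX, 8 SMX per GK104, 4 schedulers per SMX, 8 MB for Tahiti); the helper name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def register_file_bytes(registers, bits_per_register=32):
    """Size in bytes of a register file with the given number of registers."""
    return registers * bits_per_register // 8


# Figures as quoted in the thread.
REGISTERS_PER_SMX = 65536
SMX_PER_GK104 = 8
SCHEDULERS_PER_SMX = 4
TAHITI_TOTAL_BYTES = 8 * MIB  # the 8 MB stated in the post above

per_smx = register_file_bytes(REGISTERS_PER_SMX)
gk104_total = per_smx * SMX_PER_GK104

# The poster's expectation: double the slide value.
expected_total = 2 * gk104_total
expected_per_smx = expected_total // SMX_PER_GK104
expected_per_scheduler = expected_per_smx // SCHEDULERS_PER_SMX

print(per_smx // KIB, "KiB per SMX according to the slide")
print(gk104_total // MIB, "MiB total according to the slide")
print(TAHITI_TOTAL_BYTES // gk104_total, "x smaller than Tahiti")
print(expected_per_smx // KIB, "KiB per SMX if doubled")
print(expected_per_scheduler // KIB, "KiB per scheduler if doubled")
```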
 
Actually, what I'm seeing is different. Comparing the latest high-end cards, which both have a 384-bit memory interface and where Nvidia has the advantage of 50% more ROPs, AMD loses less performance when going from 4x MSAA to 8x MSAA.
Probably because something else goes wrong on Fermi. The whole export limitation of the SMs is quite a fuckup in my opinion. Practically, GF100/110 does not have 50% more ROPs; it has effectively the same number (but lower clocked) or even fewer ROPs than Cayman/Tahiti if you factor that in.
 
Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical, replicated subunits. That would somehow fit the 4 schedulers, but what is in there?

That looks strange indeed. I've been staring at this image for a while also and did not come up with something conclusive yet.
 
Probably because something else goes wrong on Fermi. The whole export limitation of the SMs is quite a fuckup in my opinion. Practically, GF100/110 does not have 50% more ROPs; it has effectively the same number (but lower clocked) or even fewer ROPs than Cayman/Tahiti if you factor that in.

Right, it (Fermi GF100/b) has a ROP excess, so to speak. But since the ROPs only do 4x MSAA in a single cycle and loop for 8x, that should make for an even smaller performance hit when switching to 8x. But it doesn't.
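
A toy model of that argument, as a minimal Python sketch. It assumes pixel throughput is bounded only by the SM export rate and by the ROPs (memory bandwidth and everything else is ignored), and that 8x MSAA takes two passes through ROPs that handle 4 samples per cycle. The unit counts are illustrative: 48 vs 32 ROPs mirrors the "50% more ROPs" mentioned above, while the export limit of 32 pixels per clock is an assumption, not a measured figure.

```python
import math


def rop_passes(samples, native_samples=4):
    """Passes a ROP needs per pixel if it handles native_samples per cycle."""
    return math.ceil(samples / native_samples)


def pixel_rate(rops, export_limit, samples):
    """Pixels per clock: the lower of the export limit and the ROP rate."""
    return min(export_limit, rops / rop_passes(samples))


def msaa8_hit(rops, export_limit):
    """Fractional throughput loss when going from 4x to 8x MSAA."""
    rate_4x = pixel_rate(rops, export_limit, 4)
    rate_8x = pixel_rate(rops, export_limit, 8)
    return 1 - rate_8x / rate_4x


# Illustrative numbers only (assumptions, see the text above).
with_excess = msaa8_hit(rops=48, export_limit=32)     # export-limited at 4x
without_excess = msaa8_hit(rops=32, export_limit=32)  # ROPs match the export rate

print(with_excess, without_excess)
```

In this model the chip with excess ROPs loses a smaller fraction going to 8x, which is the expectation stated above; since the measured behaviour is the opposite, the bottleneck presumably lies somewhere this model does not cover.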
 
Btw., does anyone else have trouble reconciling the die shot with the official version of 6x32 ALUs, 32 L/S, and 32 SFUs? I mean, each "SMX" appears to have 8 physically separate register files aligned along the vector ALU lanes. Even when one considers that half of the register banks are on the left side and the other half on the right of a vALU, each SMX then has a set of 4 identical, replicated subunits. That would somehow fit the 4 schedulers, but what is in there?
Maybe the RF/scheduler block takes more area than we think? Especially if the schedulers are fully associative to the SIMD lanes and not bound to a subset as in GCN, i.e. every scheduler can issue an instruction to any SIMD. That would be a really huge overhead, if true. :???:
 
SimBy said:
What I find ridiculous are claims that Tahiti is priced ridiculously.
If Tahiti and GK104 perform the same and cost the same, then either both are priced ridiculously or neither is. The rest is personal opinion. I find $550 a bit much for a GPU if you can get a new iPad (it's gorgeous!) for the same price, but that's just me.
 
The PolyMorph Engine, which is primarily responsible for tessellation, has also been upgraded to version 2.0 in the "Kepler" architecture. The integrated tessellator has been updated, with 2 times the computational efficiency of "Fermi" and a 4 times advantage over the Radeon HD 7970.
 
Page 3 of the Chinese preview, what else could I have?
Actually it was page 2 that had 768 mentioned. And these pages take forever to load.

http://www.hkepc.com/7672/page/2#view

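Here is a direct quote (translated):

"'GK-104' builds one GPC from 2 SMX units; the core integrates a total of 4 GPCs and 4 Raster Engines, sharing a 768KB L2 cache, with the cache specification the same as the existing 'Fermi' series. However, 'Kepler' has been updated with support for the PCI-E 3.0 specification, raising the transfer bandwidth between the graphics core and the motherboard. NVIDIA has also changed the memory controller specification of the 'GK-104' core: the core integrates only 4 64-bit memory controllers, supporting 256-bit memory in total, which is lower than the 384-bit of the previous-generation GF110 and of its main rival, AMD's 'Tahiti' core."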

How do we know that the 768 refers to Kepler and not Fermi?
 
If Tahiti and GK104 perform the same and cost the same, then either both are priced ridiculously or neither is. The rest is personal opinion. I find $550 a bit much for a GPU if you can get a new iPad (it's gorgeous!) for the same price, but that's just me.

Yeah, but can an iPad play Crysis on max at 50 fps? :)
 