NVIDIA Kepler speculation thread

I guess you could have something like 4 GPCs, 4 SMs per GPC, 32-wide SMs, and 6 TMUs per SM. That keeps the ALU-to-TMU ratio close to 6:1, as in the GF104 design. Not that I'm convinced or anything, but why would the warp width have to be divisible by the TMU count, given that texturing is decoupled from computation? Also, I assume the OP means 1536 threads per SM, which matches GF100 exactly. 1536 threads for the whole chip cannot be right, and 1536 warps gives 96 threads per ALU (assuming the warp width is still 32, which I think is a safe bet because changing it is liable to hose the performance of a lot of CUDA code). That sounds high.
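
Spelling out the arithmetic behind that guess (back-of-the-envelope only; every number here is from this post, not from any confirmed source):

Code:
// Back-of-the-envelope check of the speculated configuration above.
enum {
    GPCS        = 4,
    SMS_PER_GPC = 4,
    ALUS_PER_SM = 32,
    TMUS_PER_SM = 6,
    WARP_WIDTH  = 32
};

enum {
    TOTAL_SMS  = GPCS * SMS_PER_GPC,       // 16 SMs
    TOTAL_ALUS = TOTAL_SMS * ALUS_PER_SM,  // 512 ALUs
    TOTAL_TMUS = TOTAL_SMS * TMUS_PER_SM   // 96 TMUs
};

// ALU:TMU ratio: 512 / 96 = 5.33, in the same ballpark as GF104's 48:8 = 6.
// 1536 threads per SM (as on GF100) would be 16 * 1536 = 24576 chip-wide.
// 1536 warps chip-wide instead: 1536 * 32 / 512 = 96 threads per ALU.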
 
Not that I'm convinced or anything, but why would the warp width have to be divisible by the TMU count, given that texturing is decoupled from computation?

It doesn't have to be, but it requires some funky combinations. As mczak pointed out, it's doable with 12 TMUs per SM.
 
Assuming "funky combination" means a*b TMUs where b still divides the warp width, then my question is - why not something that's totally coprime to the warp width, like say 5 TMUs? It's not like you are guaranteed to get a warp widths worth of requests at a time - for example, a shader might contain if (foo) { texlookup() }.
 
Well, not really; by "funky" I meant "not seen before". As for why 5 TMUs doesn't make sense, it's really hard to imagine any configuration that isn't quad-aligned.
 
For interest only, a user named Seronx on the SemiAccurate forums claims to have GK104 and GK100/110/112 specs from an "undisclosed source."

These specs (including a 2x hot clock, a 0.9-1.0 GHz core clock, 512 CCs for GK104, and 1024 CCs for GK100/110/112) are quite different from those in most rumors, and the "1536" number is the number of separate threads the scheduler can work with.

Update: There's more later in the thread.
fake.
 
And how are they going to hide latency with just 1536 threads? That's 3 threads per ALU.

That's NVIDIA's(tm) InstaExec(tm) Technology. The data gets tunnelled through warped space-time into NVIDIA's(tm) supercomputer in Jen-Hsun's closet, executed there, and tunnelled back with the appropriate time adjustment so that it arrives in the destination registers exactly one clock cycle later.

The mid-life kicker will of course feature AnticipaExec(tm) with no latency at all, basically allowing for predicted execution with a 100.0% hit rate. Ub0r-next-gen will then come with all possible computations already executed and stored in dimensional rift memory, from where they only need to be fetched, so that Jen-Hsun can use his closet normally again. :)

Seriously, why are we even discussing this random post by someone with no apparent track record in industry sources? I mean, it's not even Charlie himself, whose predictions do contain more than a spark-o-truth most of the time (you only need to unwarp his interpretation of the facts). ;)
 
From what I've seen, we can roughly expect a doubling of transistor density from the 40nm -> 28nm transition alone. A real-life 1.95x was the number quoted somewhere around these forums, compared to 2.04x in theory. That's a pretty large gain.
It depends on how you count and under what circumstances. SRAM probably shrinks close to theoretical scaling; most logic won't. Do you remember the discussion about gate-first vs. gate-last and GF claiming about 10% higher density than TSMC?
You only get the 1.95 scaling (TSMC indeed gives this number) when you compare 40nm and 28nm with a special set of layout rules. In my opinion, that number is a bit made up and not that relevant for a lot of cases. TSMC also gives the scaling without those rules (i.e. a more conventional layout without putting redundant structures in to get it as regular as possible) and then the claimed density scaling reduces to a mere 1.6. As an average (logic and SRAM mixed on a chip and the layout pays at least some attention to the 28nm layout peculiarities), I think a ~1.8 scaling is somewhat realistic (which also matches the claim of 10% better scaling with GF's 28nm HKMG processes).
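
For reference, the "in theory" 2.04 is just the square of the linear shrink; a quick summary of where the quoted factors sit (the 1.95/1.6/1.8 values are the claims discussed above, not measurements of mine):

Code:
// Where the "in theory" number comes from: ideal area scaling is the square
// of the linear shrink, (40/28)^2 ~= 2.04.
static const double ideal_40_to_28nm = (40.0 / 28.0) * (40.0 / 28.0);

// The other factors quoted above (claims, not measurements):
//   1.95x  TSMC, with the restrictive 28nm layout rules
//   1.6x   TSMC, conventional layout
//   ~1.8x  rough average for a logic + SRAM mix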
 
I'm new to posting on this forum; I've lurked for years but never entered a discussion before. I have a question: does anyone know if the rumors of a change to VLIW have any merit? And if they do, could it be VLIW3, with 3 SPs to a CUDA core? Or is that totally off base?

It's very unlikely. There's a reason AMD decided to abandon VLIW, so it's hard to see NVIDIA moving in the opposite direction.
 
The warp size is a power-of-two multiple of the SIMD size, and a warp gets executed over multiple cycles. On G80, the warp size was already 32. The SIMD size was only 8 (or was it 16?)

On AMD's VLIW architectures the wavefront size was 64, also executed over 4 cycles. (Still true for GCN, I believe.)
I remember Nvidia saying the VS warp size was 16 at one point, so maybe that's still true, but from a programmer's perspective for CUDA my understanding is that warp size equals SIMD size. Also, for Southern Islands it doesn't matter that a wavefront executes over 4 cycles (still true); it's still a 64-wide SIMD, as 64 threads are executing the same instruction.
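
Rough summary of the warp-size vs. SIMD-width arithmetic being thrown around here (the per-architecture widths are my reading of the public material, so treat them as assumptions):

Code:
// warp (or wavefront) size = SIMD width * cycles spent issuing one instruction
enum {
    G80_SIMD   = 8,  G80_WARP      = 32,  // 32 / 8  = 4 clocks per warp
    FERMI_SIMD = 16, FERMI_WARP    = 32,  // 32 / 16 = 2 hot clocks per warp
    AMD_SIMD   = 16, AMD_WAVEFRONT = 64   // 64 / 16 = 4 clocks per wavefront
};
// From the programmer's point of view all 32 (or 64) threads still execute the
// same instruction, which is why the warp behaves like the SIMD width.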
 
I remember Nvidia saying the VS warp size was 16 at one point, so maybe that's still true
This was true for G80 (the advantage was a lower warp size; the disadvantage was that you couldn't co-issue to the special function unit), but it isn't true anymore for Fermi.
 
It's very unlikely. There's a reason AMD decided to abandon VLIW, so it's hard to see NVIDIA moving in the opposite direction.
Actually (Ailuros already hinted in that direction above), nVidia said in some presentations that they will use (V)LIW3 for the Einstein architecture, which is the basis for Echelon. For single precision it looks like two vec2 ALUs with one L/S; for DP it's just two DP ALUs + L/S.
 
The warp size is a power-of-two multiple of the SIMD size, and a warp gets executed over multiple cycles. On G80, the warp size was already 32. The SIMD size was only 8 (or was it 16?)
Yes, that's true. The key thing, however, is that nvidia cannot further increase the SIMD width without changing the warp size (well, they can by dropping the hot clock, but that's it(*)). Hence weird configurations with more ALUs per SM would definitely require more SIMDs (with more complex logic for dispatch etc.), not just the "easier" option of the same number of SIMDs with larger width.
(*) Strictly speaking, I guess dropping the hot clock isn't necessary to be able to double the SIMD width; dispatching instructions at hot clock would work too. At that point pretty much everything in the SM would run at hot clock, though...
In a followup post he says that the TMU and ROP counts are from him and not from the source. "The source only told me core count, price, process, general clock, and memory."
That doesn't make it a lot more convincing...
 
Actually (Ailuros already hinted in that direction above), nVidia said in some presentations that they will use (V)LIW3 for the Einstein architecture, which is the basis for Echelon. For single precision it looks like two vec2 ALUs with one L/S; for DP it's just two DP ALUs + L/S.

In single precision it seems fairly similar to a VLIW5 setup, just slightly less flexible. I wonder what problems there would be with hiding the LIW-ness of the compute unit by just mapping a quad of pixels to a thread, one pixel per SIMD lane. One benefit would be that if you have, say, some expensive computation inside a rarely taken branch, the compiler could loop over that code block instead of vectorizing it, and a thread taking the branch would temporarily have 4x the compute resources at its disposal. In such a case, even if the compiler produces code with less-than-stellar ALU usage, I can't see how it could be worse than idling 3 of 4 SP ALUs.

Also, I wonder if the current "scalar" setups even have truly independent lanes - from what I understand, the pixels in a quad need to exchange data to compute texture coordinate derivatives. For that reason, I imagine that SP ALUs are probably already arranged in groups of 4. I'm guessing the SFU and load/texture-request/store resources sit alongside them (since you need to feed them register values, and you don't want to send those very far for power reasons). All of which doesn't sound so very far removed from what's described in the Einstein presentation. The most interesting bits of Echelon to me are the things that are only vaguely described: configurable cache hierarchy, active messages, L2 cache slices no longer attached to memory controllers, multiple coherency domains...
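
As an illustration of the quad exchange (a hedged sketch only: it uses CUDA-style warp shuffles, a much later mechanism, purely to mimic the data sharing, which is not necessarily how the fixed-function derivative hardware actually does it; the lane-to-pixel mapping is assumed):

Code:
// Lane-to-pixel mapping assumed: lane 0 = (x,y),   lane 1 = (x+1,y),
//                                lane 2 = (x,y+1), lane 3 = (x+1,y+1).
__device__ void quad_derivatives(float v, float &ddx, float &ddy)
{
    unsigned lane = threadIdx.x & 31u;

    // XOR with 1 swaps horizontally within the quad, XOR with 2 vertically.
    float v_h = __shfl_xor_sync(0xffffffffu, v, 1);
    float v_v = __shfl_xor_sync(0xffffffffu, v, 2);

    // Orient the difference so all four lanes in the quad get the same sign.
    ddx = (lane & 1u) ? (v - v_h) : (v_h - v);
    ddy = (lane & 2u) ? (v - v_v) : (v_v - v);
}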
 
(*) Strictly speaking, I guess dropping the hot clock isn't necessary to be able to double the SIMD width; dispatching instructions at hot clock would work too. At that point pretty much everything in the SM would run at hot clock, though...

My (100%) guess is that for Kepler they will in fact dispatch one 32-wide warp per clock per SM. The hierarchical scheduling scheme they've described in papers (which I speculate has been implemented) means that far fewer threads must be looked at per clock, which should allow for a simpler, faster, lower-power scheduler.
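
For what it's worth, the dispatch-rate arithmetic behind that footnote and this guess (all widths assumed, nothing confirmed):

Code:
// With a 2x hot clock, a 16-wide SIMD consumes a 32-wide warp in 2 hot clocks
// = 1 base clock, so one dispatch per base clock per scheduler is enough.
// A 32-wide SIMD would finish a warp every hot clock, so either the scheduler
// dispatches at hot clock too, or the hot clock goes away entirely.
enum {
    WARP            = 32,
    FERMI_SIMD_LIKE = 16,  // hot clocks per warp: 32 / 16 = 2
    FULL_WIDTH_SIMD = 32   // hot clocks per warp: 32 / 32 = 1
};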
 