I guess you could have something like 4 GPCs, 4 SMs per GPC, 32-wide SMs, and 6 TMUs per SM. That keeps the ALU-to-TMU ratio close to 6:1, as in the GF104 design. Not that I'm convinced or anything, but why would the warp width have to be divisible by the TMU count, given that texturing is decoupled from computation? Also, I assume the OP means 1536 threads per SM, which matches GF100 exactly. 1536 threads for the whole chip cannot be right, and 1536 warps for the whole chip would give 96 threads per ALU (assuming the warp width is still 32, which I think is a safe bet, because changing it is liable to hose the performance of a lot of CUDA code). That sounds high.
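For what it's worth, a quick back-of-the-envelope check of those numbers (all the per-SM figures are just the hypothetical layout above, nothing confirmed):

# Hypothetical layout: 4 GPCs x 4 SMs per GPC, 32-wide SMs, 6 TMUs per SM
GPCS = 4
SMS_PER_GPC = 4
ALUS_PER_SM = 32           # "32-wide" SM
TMUS_PER_SM = 6
WARP_WIDTH = 32            # assuming warp width stays 32

total_alus = GPCS * SMS_PER_GPC * ALUS_PER_SM        # 16 SMs * 32 = 512 ALUs
alu_tmu_ratio = ALUS_PER_SM / TMUS_PER_SM            # ~5.3:1, near GF104's 48:8 = 6:1

# Reading 1: 1536 threads per SM (same as GF100)
threads_per_alu_per_sm = 1536 / ALUS_PER_SM          # 48 threads per ALU

# Reading 2: 1536 warps for the whole chip
threads_per_alu_chip = 1536 * WARP_WIDTH / total_alus  # 49152 / 512 = 96 threads per ALU

print(alu_tmu_ratio, threads_per_alu_per_sm, threads_per_alu_chip)

So the 1536-threads-per-SM reading gives a perfectly ordinary 48 threads per ALU, while the 1536-warps-per-chip reading gives the 96 figure, which is why it sounds high.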