NVIDIA Kepler speculation thread

Gipsel · Mar 17, 2012

The slide about "adaptive VSync" looks a bit strange. What they describe is basically vsync'd triple buffering; otherwise tearing would also appear below the refresh rate.
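To make the distinction concrete, here is a rough Python sketch of the displayed frame rate under three presentation policies. It is a simplified model and not NVIDIA's implementation: the function names, the 60 Hz default, and the assumption of a constant render rate are all illustrative.

```python
import math

def double_buffered_vsync_fps(render_fps, refresh_hz=60.0):
    # Plain double-buffered vsync: a finished frame waits for the next
    # vblank, so each frame occupies a whole number of refresh intervals.
    vblanks_per_frame = max(1, math.ceil(refresh_hz / render_fps))
    return refresh_hz / vblanks_per_frame

def triple_buffered_vsync_fps(render_fps, refresh_hz=60.0):
    # Vsync with a third buffer: the GPU keeps rendering while a frame
    # waits, so the displayed rate tracks the render rate up to refresh.
    return min(render_fps, refresh_hz)

def vsync_off_below_refresh(render_fps, refresh_hz=60.0):
    # Hypothetical policy: sync only while the GPU keeps up with the
    # display. Returns (displayed_fps, may_tear).
    may_tear = render_fps < refresh_hz
    return min(render_fps, refresh_hz), may_tear

for fps in (45.0, 60.0, 90.0):
    print(fps,
          double_buffered_vsync_fps(fps),
          triple_buffered_vsync_fps(fps),
          vsync_off_below_refresh(fps))
```

In this model a 45 fps render rate drops to 30 fps with double buffering, while the other two policies stay at 45 fps. Only the last one permits tearing there, which is why a tear-free result below the refresh rate reads like triple buffering.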

Gipsel · Mar 17, 2012

fellix said:
I wonder why they left the LDS/L1 combo size unchanged?

No GPGPU focus for GK104?

rpg.314 · Mar 17, 2012

fellix said:
Cleaned up for your geekish pleasure:

The command processor in the middle is strangely smeared. Looks like NV doesn't want anyone to peek in the gigathread factory.

The bar on the right edge is strange. It can't be ROPs or a memory controller, as the pads are on the other edges and there's no point in isolating them here. The layout is Fermi-ish, so it has to be something new. PhysX hw.

fellix · Mar 17, 2012

rpg.314 said:
The bar on the right edge is strange. It can't be ROPs or a memory controller, as the pads are on the other edges and there's no point in isolating them here. The layout is Fermi-ish, so it has to be something new. PhysX hw.

Fermi packed everything that was not part of the SMs right into the middle section of the die. That was a major reason for the dense wire cluttering that plagued the Fermi design. Now, in Kepler, we see a pretty radical redistribution of all the blocks away from each other. Setup and thread dispatching, memory controllers & ROPs, and misc. I/O are all placed apart, each in its own area.

rpg.314 · Mar 17, 2012

Gipsel said:
PS:
If they didn't make a similar mistake in those slides as during the Fermi presentation, the total register space is the same as with GF100/GF110 (2 MB), so really tiny compared to Tahiti (8 MB). I have a hard time believing that number, considering the similarity of the ALU count of GK104 and Tahiti. I would expect double the value given in that slide (4 MB), i.e. 512 kB per SMX or 128 kB per scheduler.
But the local memory/L1 is also quite small (still 64 kB) considering how many threads/workgroups on one SMX have to share it.

It seems really unbalanced in terms of compute/(cache+reg file) vs GF104. There is 4x more compute per core and only 2x more reg file and same shared mem. I guess they must be relying on something else to compensate.
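The ratios being discussed can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 2 MB and 4 MB totals and the 64 kB shared memory come from the posts above, the SMX and scheduler counts are implied by the "512 kB per SMX or 128 kB per scheduler" remark, and the per-core ALU and register file figures are assumptions that are not stated in the thread.

```python
KB = 1024

# Implied by the thread: 4 MB -> 512 kB per SMX -> 128 kB per scheduler.
SMX_COUNT = 8
SCHEDULERS_PER_SMX = 4

def per_smx_and_scheduler_kb(total_mb):
    # Split a chip-wide register file total into per-SMX and per-scheduler kB.
    per_smx = total_mb * 1024 / SMX_COUNT
    return per_smx, per_smx / SCHEDULERS_PER_SMX

# Assumed per-core figures (not given in the thread):
# ALU count, register file size in kB, shared memory/L1 size in kB.
GF104_SM = {"alus": 48, "rf_kb": 128, "lds_kb": 64}
GK104_SMX = {"alus": 192, "rf_kb": 256, "lds_kb": 64}

def bytes_per_alu(core, key):
    # On-chip storage of the given kind available per ALU, in bytes.
    return core[key] * KB / core["alus"]

print("slide (2 MB):   ", per_smx_and_scheduler_kb(2))
print("expected (4 MB):", per_smx_and_scheduler_kb(4))
for key in ("rf_kb", "lds_kb"):
    ratio = bytes_per_alu(GK104_SMX, key) / bytes_per_alu(GF104_SM, key)
    print(key, "per ALU, GK104 relative to GF104:", ratio)
```

Under these assumptions each ALU gets half the register space and a quarter of the shared memory it had on GF104, which is the imbalance the post points at.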

Devildoll · Mar 17, 2012

Excuse me if this has already been discussed, but that hkepc website uploaded a fancy YouTube video.

At around 30 seconds they show a shot of the GPU and some specs, including "22nm".

HOLY SHIT

Of course I'm not serious about thinking it's true... jeez

psurge · Mar 17, 2012

Gipsel said:
That is an 8-way scheduler (or dual 4-way). My last paragraph with the "quarter warps" was describing exactly this. It would basically result in scheduling with a granularity of just 8 work items.

I think it's extremely similar to what you were proposing. Think of it this way... your scheme: 3 groups of 2 ALUs each execute an instruction over a quad every other cycle. My scheme: one block of 4 ALUs executes an instruction over a pixel quad every cycle, one block of 2 ALUs executes a quad every other cycle. I'm still completely in the dark as to why the scheduling for what I'm thinking is so much more complex (or what you are saying about granularity). My scheme would result in reduced latency for executing an instruction over a warp in some cases (not sure whether that matters much). But it also seems like it makes RF access more complex, which might be a bigger problem.

All in all, I do think you're more likely to be right than I am.

silent_guy · Mar 17, 2012

jimbo75 said:
That's pretty much exactly as I see it as well. Don't get me wrong - if it is optional and it works then I think it's fair enough for Nvidia to be exploring these options. But yeah...turbo and "lower fps side screens" makes me wonder a lot about how it's gonna make them look better in benchmarks.

Interesting point on the mouse, I guess they'd need the cursor to be decoupled from fps in most games for that to work ok.

I thought most games and the desktop use a HW cursor, that is, one rendered independently from the main frame buffer. So you could still move the cursor at the full frame rate while rendering the frame at half.

This may not work in all games where cursor movement (as opposed to clicking) is directly tied to the frame buffer but that's what profiles are for...

In the case of Civ that SB mentioned you could tie the full speed render to the screen where your mouse is located?

Shouldn't lower frame rate be fine for a racing game? Is it really more distracting to have half the rate on the side than lower frame rates across the board?

Probably worthy of healthy experimentation.

psurge · Mar 17, 2012

@rpg14 - yeah on the face of it I agree. Regarding the registers only... off the top of my head, the only thing that comes to mind is that if that register file caching mechanism Dally et al. wrote about isn't really a cache, but is compiler visible (which I believe is one setup they explored in the paper), then perhaps it can be used to reduce the number of registers required in the main RF for typical kernels.

trinibwoy · Mar 17, 2012

So it looks like Kepler is still very much Fermi just with things laid out a little differently and a DDR speed increase. The unfixable wasn't all that broken after all.

rpg.314 · Mar 17, 2012

psurge said:
@rpg14 - yeah on the face of it I agree. Regarding the registers only... off the top of my head, the only thing that comes to mind is that if that register file caching mechanism Dally et al. wrote about isn't really a cache, but is compiler visible (which I believe is one setup they explored in the paper), then perhaps it can be used to reduce the number of registers required in the main RF for typical kernels.

RF cache doesn't reduce the number of registers needed. It just changes their access pattern. It's a cache after all.

gongo · Mar 17, 2012

Seems strange... Kepler GTX 680 looks like something AMD would do...
So the hotclocks are PowerTune in reverse..?

It will be interesting to see the 7970 1 GHz edition going h2h with the GTX 680...
Sucks for us that the price war ain't... happening...
Boring times for DX11.1 GPUs.

Silent_Buddha · Mar 17, 2012

silent_guy said:
Shouldn't lower frame rate be fine for a racing game? Is it really more distracting to have half the rate on the side than lower frame rates across the board?

Probably worthy of healthy experimentation.

Doubtful. It's similar to some games that have had shadow or reflection updates at half the speed of full screen rendering. It's already fairly noticeable and distracting. Now take that and blow it up to where it's 2/3s of everything being rendered.

As Jawed mentioned it would be a bit of a Jello effect where every other frame, your side monitors will snap into sync with your main monitor. Much much worse if your eyes happen to glance at the boundary between monitors. In this situation a racing game would be the absolute worst example of the effect.

And god forbid if you drop down to 30 FPS and the side monitors were suddenly chugging along at 15 FPS. It's not like there'd be the option to have the side monitors suddenly match the primary monitor in framerate as supposedly they'd be doing this to try to maintain a high framerate on the primary monitor.

Regards,
SB

A1xLLcqAgt0qc2RyMz0y · Mar 17, 2012

ECH said:
I can only think of some sort of game profiling for 3Dmark11.

I seriously doubt board power is determined by profiling.

For it to switch so quickly it really must be hardware based.

trinibwoy said:
So it looks like Kepler is still very much Fermi just with things laid out a little differently and a DDR speed increase. The unfixable wasn't all that broken after all.

It was only broken in Charlie's and his disciples' minds.

For those wishing for a 7970 1 GHz (1.2 GHz) edition to compete against the GTX680, I really don't see the purpose.

-----------------

Let's say AMD produces a factory-overclocked 1.2 GHz 7970 to outperform the GTX680.

Wouldn't the TDP of 250 watts have to be raised?
If so, to what number?
If it's 300 watts, isn't that dual 8-pin power then?

Now what about pricing?
It would have to be higher than the current $549, and that means an even lower number of units sold.

And since AMD owners claim they could overclock the 7970 to that same 1.2 GHz, why would they ever buy the factory-overclocked 1.2 GHz version?

And what about the thermals that a 1.2 GHz card would produce? That would mean more/bigger/faster fans and lots of noise.

And after all this, all Nvidia has to do is either release a GTX685 with higher clocks or release the upcoming GK110.

So all in all I do not see AMD releasing a factory-overclocked 7970.

rpg.314 · Mar 17, 2012

A1xLLcqAgt0qc2RyMz0y said:
For those wishing for a 7970 1 GHz (1.2 GHz) edition to compete against the GTX680, I really don't see the purpose.

Let's say AMD produces a factory-overclocked 1.2 GHz 7970 to outperform the GTX680.

Wouldn't the TDP of 250 watts have to be raised?
If so, to what number?
If it's 300 watts, isn't that dual 8-pin power then?

Now what about pricing?
It would have to be higher than the current $549, and that means an even lower number of units sold.

And since AMD owners claim they could overclock the 7970 to that same 1.2 GHz, why would they ever buy the factory-overclocked 1.2 GHz version?

And what about the thermals that a 1.2 GHz card would produce? That would mean more/bigger/faster fans and lots of noise.

Binning?

kalelovil · Mar 17, 2012

A1xLLcqAgt0qc2RyMz0y said:
I seriously doubt board power is determined by profiling.

For it to switch so quickly it really must be hardware based.

It was only broken in Charlie's and his disciples' minds.

For those wishing for a 7970 1 GHz (1.2 GHz) edition to compete against the GTX680, I really don't see the purpose.

-----------------

Let's say AMD produces a factory-overclocked 1.2 GHz 7970 to outperform the GTX680.

Wouldn't the TDP of 250 watts have to be raised?
If so, to what number?
If it's 300 watts, isn't that dual 8-pin power then?

Now what about pricing?
It would have to be higher than the current $549, and that means an even lower number of units sold.

And since AMD owners claim they could overclock the 7970 to that same 1.2 GHz, why would they ever buy the factory-overclocked 1.2 GHz version?

And what about the thermals that a 1.2 GHz card would produce? That would mean more/bigger/faster fans and lots of noise.

And after all this, all Nvidia has to do is either release a GTX685 with higher clocks or release the upcoming GK110.

So all in all I do not see AMD releasing a factory-overclocked 7970.

Lowering the voltage by 5-7% while raising the clock to ~1075 MHz should be possible using binning. Power usage should be about the same or only slightly higher.
Call it the Radeon HD7970 GHz+ Edition.

I don't see why it would have to be higher priced. It's the same die after all, and AMD has a lot of experience with binning chips for different SKUs.
Based on reviews and experiences of the HD7970, most or many of the chips currently being sold in HD7970s have very conservatively set clocks and voltages anyway.
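The "about the same or only slightly higher" power estimate can be sanity-checked with the usual first-order rule that dynamic power scales with frequency times voltage squared. The sketch below is only that rule applied to the numbers in the post; the 925 MHz baseline is an assumption (the post does not give the current clock), and leakage and board-level losses are ignored.

```python
def relative_dynamic_power(f_new_mhz, f_old_mhz, voltage_scale):
    # First-order CMOS model: dynamic power ~ frequency * voltage^2.
    # Leakage and board-level losses are deliberately ignored.
    return (f_new_mhz / f_old_mhz) * voltage_scale ** 2

BASE_CLOCK_MHZ = 925.0     # assumption: baseline clock, not given in the post
TARGET_CLOCK_MHZ = 1075.0  # from the post

for undervolt_pct in (5, 7):
    scale = 1.0 - undervolt_pct / 100.0
    ratio = relative_dynamic_power(TARGET_CLOCK_MHZ, BASE_CLOCK_MHZ, scale)
    print(f"-{undervolt_pct}% voltage: {ratio:.3f}x baseline dynamic power")
```

With that baseline the model gives roughly 5% more dynamic power at a 5% undervolt and under 1% more at 7%. A different baseline clock shifts the result, so treat the numbers as illustrative.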

silent_guy · Mar 17, 2012

Silent_Buddha said:
And god forbid if you drop down to 30 FPS and the side monitors were suddenly chugging along at 15 FPS. It's not like there'd be the option to have the side monitors suddenly match the primary monitor in framerate as supposedly they'd be doing this to try to maintain a high framerate on the primary monitor.

I'm seeing this as orthogonal to vsync: left and right at 1/2 speed by frame doubling. I know nothing about how our brain would deal with this and I take your word for it that it would probably be terrible. But I'd still like to see an actual realization of the idea to make sure. Sounds like a nice student project for an Unreal Engine hack.
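As a starting point for such an experiment, the frame-doubling schedule itself is easy to write down. The Python sketch below is purely hypothetical (the function names and the `divisor` parameter are made up for illustration, and nothing here reflects how a driver would do it): it lists which rendered frame each side monitor would show and how stale that image can get relative to the center monitor.

```python
def side_monitor_schedule(num_frames, divisor=2):
    # For each center-monitor frame index, return the index of the frame
    # the side monitors display when they update every `divisor` frames
    # and repeat the last image in between (frame doubling for divisor=2).
    return [(i // divisor) * divisor for i in range(num_frames)]

def side_monitor_fps(center_fps, divisor=2):
    # Effective update rate of the side monitors.
    return center_fps / divisor

def worst_case_lag_ms(center_fps, divisor=2):
    # Largest age difference between the side image and the center image.
    return (divisor - 1) * 1000.0 / center_fps

print(side_monitor_schedule(6))  # side image index per center frame
for fps in (60.0, 30.0):
    print(fps, side_monitor_fps(fps), round(worst_case_lag_ms(fps), 1))
```

At 60 fps on the center monitor the sides update at 30 fps and lag by up to one center frame (about 16.7 ms); at 30 fps they fall to 15 fps with about 33.3 ms of lag, which is the case Silent_Buddha is worried about.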

silent_guy · Mar 17, 2012

kalelovil said:
... should be possible using binning. ...

I don't see why it would have to be higher priced.

In this context, 'binning' is the opposite of 'not a higher price'. Yes, it's a die with the same logic and area, but binning by its very definition excludes inadequate dies from the total pool. How could that not lead to price differences? (Yes, I leave the door open to keeping the top price the same and lowering the price of the rejects, but that means that AMD would make less money.)

3dilettante · Mar 17, 2012

One scenario that comes to mind might be fast motion on high-contrast images transitioning between screens, or possibly very rapid strobe flashes (for some kind of rave-themed shooter, I dunno).

The worst case might be a very rapid flash where the brightest frame falls on a displayed frame in the center monitor, but falls between the frames rendered on the side displays. Even if the flash is too fast to be noticed, the monitors would appear to be off (as in mismatched) in their lighting.

A1xLLcqAgt0qc2RyMz0y · Mar 17, 2012

kalelovil said:
Lowering the voltage by 5-7% while raising the clock to ~1075MHz should be possible using binning.

Please tell me exactly how many of these magic 7970 dies can be produced vs the standard 7970 dies per wafer?

If a significant number of these magic dies could have been produced, then the 7970 would have been clocked higher than 975 MHz.

The 975 MHz was chosen as a balance of power consumed, thermal power and number of good dies.

Increase the frequency and power/temps go up and the number of good dies goes down.

And again if AMD goes down this road so can Nvidia. Quid Pro Quo.
