NVIDIA Kepler speculation thread

A1xLLcqAgt0qc2RyMz0y · Feb 13, 2012

CarstenS said:
What exactly is it btw, that keeps GPUs from reaching frequencies that CPUs have been more or less comfortable at for years, i.e. 3 GHz-ish?

The same thing that prevented the 10 GHz NetBurst from ever being produced: the limits of power and heat.

http://en.wikipedia.org/wiki/NetBurst_(microarchitecture)

With this microarchitecture, Intel looked to attain clock speeds of 10 GHz, but because of rising clock speeds, Intel faced increasing problems with keeping power dissipation within acceptable limits.

TKK · Feb 13, 2012

A1xLLcqAgt0qc2RyMz0y said:
The same thing that prevented the 10 GHz NetBurst from ever being produced: the limits of power and heat.

http://en.wikipedia.org/wiki/NetBurst_%28microarchitecture%29

With this microarchitecture, Intel looked to attain clock speeds of 10 GHz, but because of rising clock speeds, Intel faced increasing problems with keeping power dissipation within acceptable limits.

I think what Carsten is getting at is: why can a 32nm 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?

My guess would be it's mainly transistor density. Tahiti has 4.3 billion transistors, BD only has 1.2 (officially, at least). BD's clockspeed is nearly 4 times as high, while its transistor density is roughly 3.5 times lower.

3dilettante · Feb 13, 2012

TKK said:
I think what Carsten is getting at is: why can a 32nm 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?

My guess would be it's mainly transistor density. Tahiti has 4.3 billion transistors, BD only has 1.2 (officially, at least). BD's clockspeed is nearly 4 times as high, while its transistor density is roughly 3.5 times lower.

Bulldozer isn't entirely mass produced at 3.6 GHz. There are a lot of SKUs that do not reach that clock speed. Tahiti has all of 2 standard SKUs on a slower process and with far less custom circuit design.

Bulldozer's target market makes things like IO, less-dense and complex logic, and better RAS more important.
It should also be noted that the 7970's TDP includes the entire board. Shave 30 or so watts off the total and a 7950 is still higher, but not massively higher, considering that the CPU's RAM and associated logic are not included.

silent_guy · Feb 13, 2012

TKK said:
My guess would be it's mainly transistor density. Tahiti has 4.3 billion transistors, BD only has 1.2 (officially, at least). BD's clockspeed is nearly 4 times as high, while its transistor density is roughly 3.5 times lower.

You're switching cause and effect. Full custom design can be made much faster and has a much higher density. But high density doesn't automatically result in much faster speeds.

edit: you're actually arguing something other than what I thought, but I don't know what. You're talking about densities without using area, so it's not densities at all, just the absolute number of transistors. That also doesn't have a first order impact on max clock speed, though max power would be a major second order one.

Alexko · Feb 13, 2012

silent_guy said:
You're switching cause and effect. Full custom design can be made much faster and has a much higher density. But high density doesn't automatically result in much faster speeds.

edit: you're actually arguing something other than what I thought, but I don't know what. You're talking about densities without using area, so it's not densities at all, just the absolute number of transistors. That also doesn't have a first order impact on max clock speed, though max power would be a major second order one.

Tahiti is 365mm², Orochi is 315mm² if I recall correctly, so density ≈ number of transistors in this case.
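For reference, the figures quoted so far can be checked with a few lines of Python. This is only a rough sketch: the transistor counts, die areas and clocks are the ones posted in this thread, and the variable and function names are made up for illustration.

```python
# Figures as quoted in the thread (not independently verified here).
chips = {
    "Tahiti": {"transistors": 4.3e9, "area_mm2": 365.0, "clock_mhz": 925.0},
    "Bulldozer": {"transistors": 1.2e9, "area_mm2": 315.0, "clock_mhz": 3600.0},
}


def density_per_mm2(chip):
    """Transistors per square millimetre for one chip entry."""
    return chip["transistors"] / chip["area_mm2"]


tahiti, bulldozer = chips["Tahiti"], chips["Bulldozer"]

# Ratio of absolute transistor counts (ignores die area).
count_ratio = tahiti["transistors"] / bulldozer["transistors"]
# Ratio of true densities (transistors per mm^2).
density_ratio = density_per_mm2(tahiti) / density_per_mm2(bulldozer)
# Ratio of clock speeds, Bulldozer over Tahiti.
clock_ratio = bulldozer["clock_mhz"] / tahiti["clock_mhz"]

print(f"transistor count ratio: {count_ratio:.2f}")
print(f"density ratio:          {density_ratio:.2f}")
print(f"clock ratio:            {clock_ratio:.2f}")
```

With these inputs the count ratio comes out near 3.6 and the per-area density ratio near 3.1, so the two are close but not identical because the die areas differ by about 16%.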

CarstenS · Feb 13, 2012

TKK said:
I think what Carsten is getting at is: why can a 32nm 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?

Yes and no.

I was mainly wondering whether there are specific things about graphics computations (which I don't know, that's why I asked - it was an honest question) that prevent GPUs from reaching speeds as high as CPUs'. I mean, even when power was not so much a concern, GPUs only ran at, what, 20-30 percent (at most) of CPU clocks. That was probably 1998 (CPUs ~400 MHz, GPUs ~100 MHz); in 1999 CPUs really ran away, reaching a GHz seemingly easily, while GPUs stayed below 200 MHz - more complex ones even below 140 MHz.

I do not want to derail this thread any further, sorry.

whitetiger · Feb 13, 2012

CarstenS said:
Yes and no. I was mainly wondering whether there are specific things about graphics computations (which I don't know, that's why I asked - it was an honest question) that prevent GPUs from reaching speeds as high as CPUs'. I mean, even when power was not so much a concern, GPUs only ran at, what, 20-30 percent (at most) of CPU clocks. That was probably 1998 (CPUs ~400 MHz, GPUs ~100 MHz); in 1999 CPUs really ran away, reaching a GHz seemingly easily, while GPUs stayed below 200 MHz - more complex ones even below 140 MHz.

I do not want to derail this thread any further, sorry.

It's also the design methodology
- GPUs are ASICs, made up of standard cells, or 'components'
- whereas CPUs have, up until now, been highly optimised, with close ties between the process technology, and the design.
- Intel is still very much doing this, whereas AMD is moving away from this model, and Bulldozer gives you a feel for how things go when you move to a more ASIC-type approach.
- AMD used to have a continuous improvement model, whereby improvements and refinements in the process technology were fed back into the design.
- this can't happen with a sub-contract manufacturer - things have to be done more at arm's length.

With CPUs they spend a lot more time optimising not only the layout, but also the transistor dimensions for critical parts, to get speed where it's needed, or reduce power where it's not.

With GPUs, the rate at which they have to produce new designs means that there really isn't the time to do this...

One way of showing this is the time taken from when a CPU is first demo'd until its actual commercial availability - with Intel it's usually well over a year
- for a GPU it's a few months at best, or in NV's case about 6 weeks!

I think NV went down the route of using much more carefully laid out & optimised designs in order to get their 'hot-clocks'
- which were about 2x what AMD/ATI were achieving
- but you can also see the problems they had delivering products using these higher-clocking designs
- i.e. they took a lot longer to design, and get working ...

fellix · Feb 13, 2012

whitetiger said:
I think NV went down the route of using much more carefully laid out & optimised designs in order to get their 'hot-clocks'
- which were about 2x what AMD/ATI were achieving
- but you can also see the problems they had delivering products using these higher-clocking designs
- i.e. they took a lot longer to design, and get working ...

Similar to what Intel was doing with their line of NetBurst processors. The double-pumped ALU pipeline was 100% hand-crafted cell design down to a single transistor. Everything else was pretty much IC library automation, with some exceptions for the branch predictor, which was a notoriously delicate and timing-sensitive piece of logic, for obvious reasons.

whitetiger · Feb 14, 2012

fellix said:
Similar to what Intel was doing with their line of NetBurst processors. The double-pumped ALU pipeline was 100% hand-crafted cell design down to a single transistor. Everything else was pretty much IC library automation, with some exceptions for the branch predictor, which was a notoriously delicate and timing-sensitive piece of logic, for obvious reasons.

I saw a lecture from Stanford given by an architect on the original P4
- he felt that the super-pipelining they did for the original P4 had good engineering decisions behind it
- they went to a 20-stage pipeline, and doubled the clock frequency, which gave a 40% real increase in performance
- then marketing realised that they could really sell the chips based on these higher clock speeds because everybody loved higher clock speeds
- so they demanded even higher clock frequencies - which pushed good engineering too far, and resulted in the failed Netburst with its massively power-hungry 30-stage pipeline!
- he left before the failed Netburst was finished ..

3dcgi · Feb 14, 2012

CarstenS said:
I do not want to derail this thread any further, sorry.

Too late. :smile:

CarstenS said:
So, general consensus here seems to be that it's got nothing to do with graphics related functions, but rather with power issues.

There have been multiple good answers and power matters, but it is not THE reason. Custom design comes into play as well, but a large part of it is GPUs simply don't try to hit high clock speeds so there are far more levels of logic in a pipe stage than in a CPU. I won't quote numbers, but it's amazing how much work you can do in a clock cycle with modern processes.

Unless you want to have a lot of clocks in a chip the rate is set by the lowest common denominator and any calculations, like addressing, that require feedback are more performant when done in a single cycle. GPUs have a lot of varying logic and scale with more units so it's easier to design a massively parallel system with a more modest clock rate than to spend a lot of effort pushing clocks. Easier = quicker time to market which is a good thing.

FWIW I don't think Nvidia's shaders use much if any custom design. At least if they do they're not very good at it and they employ a lot of smart people which reinforces the idea that it's not a hand placed layout.

tunafish · Feb 14, 2012

TKK said:
I think what Carsten is getting at is: why can a 32nm 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?

This is a misunderstanding of terms.

Clock speeds of different architectures are not directly comparable. The clock speed of a GPU or a CPU is not the speed that individual transistors switch, but the speed where the longest critical path of transistors inside the chip can switch.

To put this really simply: suppose we both build different chips on the same process. In your chip the longest critical path is 10 FO4, and in my chip the longest critical path is 20 FO4. (FO4 is a process-independent metric for transistor delay -- basically, the time it takes for a single inverter that drives 4 copies of itself to switch.) If the process allows individual transistors to switch at 20GHz (or, an FO4 takes 50ps), then your chip will run at 2GHz and mine will run at 1GHz.
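The arithmetic in that example can be written out as a tiny Python sketch. The function name and parameters are made up for illustration, and it ignores everything except the critical path (no clock skew, latch overhead or wire delay).

```python
def max_clock_ghz(critical_path_fo4, fo4_delay_ps):
    """Upper bound on clock rate when one cycle must cover the whole critical path.

    critical_path_fo4: length of the longest path, in FO4 delays
    fo4_delay_ps:      how long one FO4 takes on this process, in picoseconds
    """
    period_ps = critical_path_fo4 * fo4_delay_ps
    # 1000 ps = 1 ns, and a 1 ns period corresponds to 1 GHz.
    return 1000.0 / period_ps


# Numbers from the example above: same process (50 ps per FO4), different designs.
print(max_clock_ghz(10, 50))  # chip with a 10 FO4 critical path
print(max_clock_ghz(20, 50))  # chip with a 20 FO4 critical path
```

Both calls use the same 50 ps FO4 delay, so the factor-of-two difference in clock comes entirely from the design's critical path length.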

The clock speed difference between GPUs and CPUs has almost nothing directly to do with the process, transistor densities, and all that jazz, and everything to do with the fact that GPUs are designed for more complex critical paths and lower clock speeds.

Why are they designed that way? Because spending transistors to make things twice as fast generally costs *a lot* more transistors and power than spending transistors to do two things at a time. High-end CPUs push this *way* past the knee of the curve -- in the past, they have repeatedly accepted design decisions that give 5% more clockspeed for 10% more transistors. Given how hard multi-core programming is, this makes sense. But when you are designing a device for an embarrassingly parallel task like rendering, this does not make sense.
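A rough way to see the trade-off in numbers, as a Python sketch. The 5%/10% figures are the ones from the post; the assumption that an embarrassingly parallel workload scales linearly with the number of units is an idealisation added here, and the function name is invented for illustration.

```python
def throughput_per_transistor(speedup, transistor_growth):
    """Relative throughput per transistor after a design change (1.0 = unchanged)."""
    return speedup / transistor_growth


# Option A: spend 10% more transistors to gain 5% clock speed.
faster_clock = throughput_per_transistor(speedup=1.05, transistor_growth=1.10)

# Option B: spend the same 10% on more parallel units.
# Assumption: perfectly parallel work, so throughput grows with unit count.
more_units = throughput_per_transistor(speedup=1.10, transistor_growth=1.10)

print(f"faster clock: {faster_clock:.3f}")
print(f"more units:   {more_units:.3f}")
```

Under that idealised assumption the "more units" option keeps efficiency flat while the "faster clock" option loses roughly 4.5%. For serial code the second option gives no speedup at all, which is why the CPU trade-off looks different.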

hkultala · Feb 14, 2012

TKK said:
I think what Carsten is getting at is: why can a 32nm 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?

My guess would be it's mainly transistor density. Tahiti has 4.3 billion transistors, BD only has 1.2 (officially, at least). BD's clockspeed is nearly 4 times as high, while its transistor density is roughly 3.5 times lower.

Mostly wrong.

It's all about pipeline length.

Bulldozer's pipeline is long enough that there are far fewer transistors (and much less wire length) in series in any one pipeline stage.

The following is somewhat oversimplified, but explains the principles:

I.e. the transistors are capable of switching state in about 10 picoseconds (100 GHz), but there are maybe 25 of those transistors in series in each pipeline stage on Bulldozer, meaning every pipeline stage takes at least 250 picoseconds, putting the clock speed at about 4 GHz.

In AMD GPUs, if the transistors are equally fast, but there are 100 transistors in series in each pipeline stage, then each pipeline stage takes at least 1 nanosecond, putting the clock speed at about 1 GHz.

In Nvidia GPUs, if the transistors are equally fast, but there are 65 transistors in series in each pipeline stage (in the shader/hot clock domain), then each pipeline stage takes at least 650 picoseconds, putting the clock speed at around 1540 MHz.

In reality, wire lengths and the delays they cause might have more effect than the transistor delays, but the principles are still the same. And the GPU might be manufactured on a slightly slower process, which might mean that the GPU transistors take 12.5 picoseconds to change state and there are only about 80 of them in series on ATI and 52 on Nvidia.
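The oversimplified model above is easy to check in Python. The transistor counts per stage are the illustrative guesses from this post, not measured values, and the function name is made up here.

```python
def stage_clock_mhz(transistors_in_series, switch_time_ps):
    """Clock limit if one pipeline stage is a chain of equally slow transistors."""
    stage_time_ps = transistors_in_series * switch_time_ps
    # A 1 ps period would be 1e6 MHz, so divide by the stage time in ps.
    return 1e6 / stage_time_ps


# Equal 10 ps transistors, different logic depth per stage.
for name, depth in [("Bulldozer", 25), ("AMD GPU", 100), ("Nvidia hot clock", 65)]:
    print(f"{name}: {stage_clock_mhz(depth, 10.0):.0f} MHz")

# Slower 12.5 ps transistors with shallower stages give the same clocks.
for name, depth in [("AMD GPU", 80), ("Nvidia hot clock", 52)]:
    print(f"{name} (12.5 ps): {stage_clock_mhz(depth, 12.5):.0f} MHz")
```

The second loop shows why the two sets of numbers in the post are interchangeable: 80 x 12.5 ps and 100 x 10 ps both add up to a 1 ns stage, and 52 x 12.5 ps and 65 x 10 ps both add up to 650 ps.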

Btw. your transistor count for Bulldozer is way off. 1.2G is an impossible number; the correct figure is about 1.5G.

The reason for the different transistor densities is that different transistors in different structures take up different amounts of space.

In CPUs most space is consumed by "dedicated logic transistors" doing something complex; each transistor has to be positioned "for its job".

Only the register files (a very small part of the chip) and caches in CPU chips are very tightly packed, and >80% of the transistor count comes from the caches, even though only about half of the die area comes from the caches.

In GPUs most space is consumed by register files, which are very regular structures and can be packed very tightly. Also, the logic can be packed more tightly in GPUs because most of it is highly symmetric vector units.

But of course 28nm allows packing more transistors to same space than 32nm.

tunafish · Feb 14, 2012

heh, muropaketti to the rescue to correct semiconductor design misapprehensions.

dkanter · Feb 14, 2012

I think you guys are missing a big part of the issue. CPUs have to run fast to have low latency. Running fast requires that you use larger than minimum size transistors.

In contrast, for a GPU it always makes sense to use minimum size transistors and have as many shader copies as possible.

David

Ailuros · Feb 14, 2012

By the way, since leaks will slowly start to pile up, and I've had at least one case of someone biting the bullet on the GK110/4096SP stuff, it doesn't make much sense not to say that trinibwoy has damn good instincts.

Arty · Feb 14, 2012

And who would that be? Theo's betting his money on 2304 ..

psurge · Feb 14, 2012

My prediction: dynamic warp splitting will show up in Kepler: http://www.google.com/patents/US20110219221

That and the hierarchical register file/scheduling scheme in Dally's paper.

Ailuros · Feb 14, 2012

Arty said:
And who would that be? Theo's betting his money on 2304 ..

Also wrong; here you go: http://forum.hardware.fr/hfr/Hardware/2D-3D/nvidia-geforce-kepler-sujet_891447_73.htm#t8216477

On which the GK104 TDP is as ridiculous as on every other fake that has circulated so far.

Just for the record's sake, how many SPs did the original chiphell specification table state? Coincidentally, 2304SPs with 6GPCs. Or even better, why would you go for an uneven amount of SIMDs/SM on an HPC-oriented core like GK110? There was never any halfway reasonable speculation about GK110 compared to GK104, and that's probably the reason why no one was able to think of something that makes a wee bit more sense.

Now it's time for the gentlemen who are creating tables and fake photoshopped slides to start thinking about whether there could be some common aspects between GK104 and GK110, as there were between GF114 and GF110. There's a good chance that arithmetic throughput isn't too far apart on paper between the first two (just as with the latter two), with texel fillrate actually ending up higher on paper for the performance part.

whitetiger · Feb 14, 2012

Ailuros said:
Now it's time for the gentlemen who are creating tables and fake photoshopped slides to start thinking about whether there could be some common aspects between GK104 and GK110, as there were between GF114 and GF110. There's a good chance that arithmetic throughput isn't too far apart on paper between the first two (just as with the latter two), with texel fillrate actually ending up higher on paper for the performance part.

I guess you are alluding to the GK110 being to the GF110 as the GK114 is to the GF114?
- meaning the GK110 is a 2048 SP chip with 64 SPs per SM, compared to the 96 SPs per SM of the GK114

So, that would fit with the die sizes of the GK110 being similar to the GF110....

Both chips therefore ending up with 2x SPs, and 25% more bandwidth, of their Fermi antecedent.

If they got rid of a few of the GF110 bottlenecks, this is still a good chip

trinibwoy · Feb 14, 2012

psurge said:
My prediction: dynamic warp splitting will show up in Kepler: http://www.google.com/patents/US20110219221

That and the hierarchical register file/scheduling scheme in Dally's paper.

Far simpler and cheaper than the dynamic warp formation proposed in another paper. It would be very cool but probably won't benefit games much. I can't imagine there are many cases in the average game where all warps are stalled for significant amounts of time.

General compute tasks would benefit though as demonstrated in this paper.

http://hps.ece.utexas.edu/pub/TR-HPS-2010-006.pdf

Edit: Here's the corresponding DWS paper with more detail on the approach and benefits.

http://www.cs.virginia.edu/~skadron/Papers/meng_simd_isca10.pdf
