NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. A1xLLcqAgt0qc2RyMz0y

    Veteran Regular

    Joined:
    Feb 6, 2010
    Messages:
    1,589
    Likes Received:
    1,490
    The same thing that prevented the 10 GHz NetBurst from ever being produced: the limits of power and heat.

    http://en.wikipedia.org/wiki/NetBurst_(microarchitecture)

    With this microarchitecture, Intel looked to attain clock speeds of 10 GHz, but because of rising clock speeds, Intel faced increasing problems with keeping power dissipation within acceptable limits.
     
  2. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    I think what Carsten is getting at is this: why can a 32nm, 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm, 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?

    My guess would be it's mainly transistor density. Tahiti has 4.3 billion transistors, BD only has 1.2 (officially, at least). BD's clockspeed is nearly 4 times as high, while its transistor density is roughly 3.5 times lower.
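    (A quick sanity check of that comparison in Python, using only the die sizes and transistor counts quoted in this thread; the Bulldozer count is the disputed "official" figure, and with these particular numbers the density ratio comes out nearer 3x than 3.5x:)

        # Figures quoted in this thread (the Bulldozer count is the disputed "official" one).
        tahiti_transistors, tahiti_area_mm2 = 4.3e9, 365.0
        orochi_transistors, orochi_area_mm2 = 1.2e9, 315.0

        tahiti_density = tahiti_transistors / tahiti_area_mm2  # ~11.8 Mtransistors/mm²
        orochi_density = orochi_transistors / orochi_area_mm2  # ~3.8 Mtransistors/mm²
        print(round(tahiti_density / orochi_density, 1))       # ~3.1x with these numbers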
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,797
    Location:
    Well within 3d
    Bulldozer isn't entirely mass produced at 3.6 GHz. There are a lot of SKUs that do not reach that clock speed. Tahiti has all of 2 standard SKUs on a slower process and with far less custom circuit design.

    Bulldozer's target market makes things like IO, less-dense and complex logic, and better RAS more important.
    It should also be noted that the 7970's TDP includes the entire board. Shave 30 or so watts off the total and a 7950's is still higher, but not massively higher, considering that the CPU's RAM and associated logic are not included in its figure.
     
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    You're switching cause and effect. Full custom design can be made much faster and has a much higher density. But high density doesn't automatically result in much faster speeds.

    edit: you're actually arguing something else than what I thought, but I don't know what. You're talking about densities without using area, so it's not densities at all, but just the absolute number of transistors. That also doesn't have a first-order impact on max clock speed, though max power would be a major second-order one.
     
    #1704 silent_guy, Feb 13, 2012
    Last edited by a moderator: Feb 13, 2012
  5. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,532
    Likes Received:
    957
    Tahiti is 365mm², Orochi is 315mm² if I recall correctly, so density ≈ number of transistors in this case.
     
  6. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,656
    Likes Received:
    3,661
    Location:
    Germany
    Yes and no. :) I was mainly wondering if there are maybe things specific to graphics computations (which I don't know, that's why I asked - it was an honest question) that prevent GPUs from reaching clock speeds as high as CPUs'. I mean, even when power was not so much a concern, GPUs only ran at, what, 20-30 percent (at most) of CPU speeds. That was probably 1998 (CPUs ~400 MHz, GPUs ~100 MHz); in 1999 CPUs really ran away, reaching a GHz seemingly with ease, while GPUs stayed below 200 MHz - the more complex ones even below 140 MHz.

    I do not want to derail this thread any further, sorry.
     
  7. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    It's also the design methodology:
    - GPUs are ASICs, made up of standard cells, or 'components'
    - whereas CPUs have, up until now, been highly optimised, with close ties between the process technology and the design.
    - Intel is still very much doing this, whereas AMD is moving away from this model, and Bulldozer gives you a feel for how things go when you move to a more ASIC-like approach.
    - AMD used to have a continuous improvement model, whereby improvements and refinements in the process technology were fed back into the design.
    - this can't happen with a sub-contract manufacturer - things have to be done more at arm's length.

    With CPUs they spend a lot more time optimising not only the layout but also the transistor dimensions of critical parts, to get speed where it's needed or reduce power where it's not.

    With GPUs, the rate at which they have to produce new designs means that there really isn't the time to do this...

    One way of showing this is the time taken from when a CPU is first demoed until its actual commercial availability - with Intel it's usually well over a year
    - for a GPU it's a few months at best, or in NV's case about 6 weeks!

    I think NV went down the route of using much more carefully laid out & optimised designs in order to get their 'hot-clocks'
    - which were about 2x what AMD/ATI were achieving
    - but you can also see the problems they had delivering products using these higher-clocking designs
    - i.e. they took a lot longer to design, and get working ...
     
    #1707 whitetiger, Feb 13, 2012
    Last edited by a moderator: Feb 13, 2012
  8. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,537
    Likes Received:
    496
    Location:
    Varna, Bulgaria
    Similar to what Intel was doing with their line of NetBurst processors. The double-pumped ALU pipeline was 100% hand-crafted cell design, down to the single transistor. Everything else was pretty much IC library automation, with some exceptions for the branch predictor, which was a notoriously delicate and time-sensitive piece of logic, for obvious reasons.
     
  9. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    I saw a lecture from Stanford given by an architect who worked on the original Pentium 4
    - he felt that the super-pipelining they did for that original design had good engineering decisions behind it
    - they went to a 20-stage pipeline and doubled the clock frequency, which gave a 40% real increase in performance
    - then marketing realised that they could really sell the chips based on these higher clock speeds, because everybody loved higher clock speeds
    - so they demanded even higher clock frequencies - which pushed good engineering too far, and resulted in the failed later NetBurst parts with their massively power-hungry 30-stage pipeline...!
    - he left before those were finished.
     
  10. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    Too late. :smile:

    There have been multiple good answers and power matters, but it is not THE reason. Custom design comes into play as well, but a large part of it is GPUs simply don't try to hit high clock speeds so there are far more levels of logic in a pipe stage than in a CPU. I won't quote numbers, but it's amazing how much work you can do in a clock cycle with modern processes.

    Unless you want to have a lot of clock domains in a chip, the rate is set by the lowest common denominator, and any calculations that require feedback, like addressing, perform better when done in a single cycle. GPUs have a lot of varying logic and scale with more units, so it's easier to design a massively parallel system with a more modest clock rate than to spend a lot of effort pushing clocks. Easier = quicker time to market, which is a good thing.

    FWIW I don't think Nvidia's shaders use much, if any, custom design. At least if they do, they're not very good at it - and since they employ a lot of smart people, that reinforces the idea that it's not a hand-placed layout.
     
  11. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    This is a misunderstanding of terms.

    Clock speeds of different architectures are not directly comparable. The clock speed of a GPU or a CPU is not the speed at which individual transistors switch, but the speed at which the longest critical path of transistors inside the chip can switch.

    To put this really simply, if we both build different chips on the same process where in your chip the longest critical path is 10 FO4 (FO4 is a process-independent metric for transistor delay -- basically, FO4 is the time it takes for a single inverter that drives 4 copies of itself to switch.), and in my chip the longest critical path is 20 FO4, and the process allows individual transistors to switch at 20GHz (or, a FO4 takes 50ps), then your chip will run at 2GHz and mine will run at 1GHz.
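    (As a minimal sketch, the same arithmetic in Python, using the purely illustrative numbers from the example above -- a 50 ps FO4 delay and path depths of 10 vs 20 FO4:)

        # Toy estimate: cycle time = critical-path depth (in FO4) * FO4 delay.
        FO4_DELAY_PS = 50.0  # illustrative figure from the example, not a real process number

        def max_clock_ghz(path_depth_fo4, fo4_delay_ps=FO4_DELAY_PS):
            cycle_time_ps = path_depth_fo4 * fo4_delay_ps
            return 1000.0 / cycle_time_ps  # 1000 ps = 1 ns, so 1/ns = GHz

        print(max_clock_ghz(10))  # 2.0 GHz -- the 10 FO4 "CPU-like" design
        print(max_clock_ghz(20))  # 1.0 GHz -- the 20 FO4 "GPU-like" design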

    The clock speed difference between GPUs and CPUs has almost nothing directly to do with the process, transistor densities, and all that jazz, and everything to do with the fact that GPUs are designed for more complex critical paths and lower clock speeds.

    Why are they designed that way? Because spending transistors to make something twice as fast generally costs *a lot* more transistors and power than spending them to do two things at a time. High-end CPUs push this *way* past the knee of the curve -- in the past, they have repeatedly accepted design decisions that give 5% more clock speed for 10% more transistors. Given how hard multi-core programming is, this makes sense. But when you are designing a device for an embarrassingly parallel task like rendering, it does not.
     
  12. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    38
    Location:
    Herwood, Tampere, Finland
    mostly wrong.

    It's all about pipeline length.

    Bulldozer's pipeline is long enough that there are far fewer transistors (and far less wire length) in series within one pipeline stage.

    The following is somewhat oversimplified, but explains the principles:

    i.e. the transistors are capable of switching state in about 10 picoseconds (100 GHz), but there are maybe 25 of those transistors in series in each pipeline stage on Bulldozer, meaning every pipeline stage takes at least 250 picoseconds, putting the clock speed at about 4 GHz.

    In AMD GPUs, if the transistors are equally fast but there are 100 transistors in series in each pipeline stage, then each pipeline stage takes at least 1 nanosecond, putting the clock speed at about 1 GHz.

    In an Nvidia GPU, if the transistors are equally fast but there are 65 transistors in series in each pipeline stage (in the shader/hot clock domain), then each pipeline stage takes at least 650 picoseconds, putting the clock speed at around 1540 MHz.


    In reality, wire lengths and the delays they cause might have more effect than the transistor delays, but the principle is the same. And since the GPU might be manufactured on a slightly slower process, it might be that the GPU transistors take 12.5 picoseconds to change state and there are only around 80 of them in series on ATI, 52 on Nvidia.
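    (The same oversimplified principle as a short Python sketch, plugging in the illustrative per-stage transistor counts and switching delays from this post -- none of these are real measured values:)

        # Stage delay = transistors in series per stage * per-transistor switching delay.
        def stage_clock_ghz(serial_transistors, switch_delay_ps):
            return 1000.0 / (serial_transistors * switch_delay_ps)

        # Illustrative values from this post, not measurements:
        print(stage_clock_ghz(25, 10.0))   # Bulldozer                  -> 4.0 GHz
        print(stage_clock_ghz(100, 10.0))  # AMD GPU                    -> 1.0 GHz
        print(stage_clock_ghz(65, 10.0))   # Nvidia hot clock           -> ~1.54 GHz
        print(stage_clock_ghz(80, 12.5))   # AMD GPU, slower process    -> 1.0 GHz
        print(stage_clock_ghz(52, 12.5))   # Nvidia hot clock, slower   -> ~1.54 GHz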



    Btw, your transistor count for Bulldozer is way off. 1.2G is an impossible number; the correct figure is about 1.5G.

    The reason for the different transistor densities is that transistors in different kinds of structures consume different amounts of space.

    In CPUs most space is consumed by "dedicated logic transistors" doing something complex; each transistor has to be positioned "for its job".

    Only the register files (a very small part of the chip) and the caches in CPUs are very tightly packed, and >80% of the transistor count comes from the caches, even though only about half of the die area does.

    In GPUs most space is consumed by register files, which are very regular structures and can be packed very tightly. The logic can also be packed more tightly in GPUs, because most of it is highly symmetric vector units.

    But of course 28nm allows packing more transistors into the same space than 32nm does.
     
  13. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    heh, muropaketti to the rescue to correct semiconductor design misapprehensions.
     
  14. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    I think you guys are missing a big part of the issue. CPUs have to run fast to have low latency. Running fast requires that you use larger than minimum size transistors.

    In contrast, for a GPU it always makes sense to use minimum size transistors and have as many shader copies as possible.

    David
     
  15. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,474
    Likes Received:
    190
    Location:
    Chania
    By the way, since leaks will start to pile up slowly, and I've had at least one case of biting the bullet on the GK110/4096SP stuff, it doesn't make much sense not to say that trinibwoy has damn good instincts :wink:
     
  16. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    And who would that be? Theo's betting his money on 2304 ..
     
  17. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    953
    Likes Received:
    51
    Location:
    LA, California
  18. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,474
    Likes Received:
    190
    Location:
    Chania
    Also wrong; here you go: http://forum.hardware.fr/hfr/Hardware/2D-3D/nvidia-geforce-kepler-sujet_891447_73.htm#t8216477

    On which the GK104 TDP is as ridiculous as on every other fake that has circulated so far.

    Just for the record's sake, how many SPs did the original chiphell specification table state? Coincidentally, 2304 SPs with 6 GPCs. Or even better: why would you go, on an HPC-oriented core like GK110, with an uneven number of SIMDs per SM? There was never any halfway reasonable speculation about GK110 compared to GK104, and that's probably the reason why no one was able to come up with something that makes a wee bit more sense.

    Now it's time for the gentlemen who are creating tables and fake photoshopped slides to start thinking about whether there could be common aspects between GK104 and GK110, as there were between GF114 and GF110. There's a good chance that arithmetic throughput isn't too far apart on paper between the first two (just as with the latter two), with texel fillrate actually ending up higher on paper for the performance part.
     
  19. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    I guess you are alluding to the GK110 being to the GF110 as the GK114 is to the GF114?
    - meaning the GK110 is a 2048 SP chip with 64 SPs per SM, compared to the 96 SPs per SM of the GK114

    So, that would fit with the die sizes of the GK110 being similar to the GF110....

    Both chips would therefore end up with 2x the SPs of, and 25% more bandwidth than, their Fermi antecedents.

    If they got rid of a few of the GF110 bottlenecks, this is still a good chip
    :grin:
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,676
    Likes Received:
    2,594
    Location:
    New York
    Far simpler and cheaper than the dynamic warp formation proposed in another paper. It would be very cool but probably won't benefit games much. I can't imagine there are many cases in the average game where all warps are stalled for significant amounts of time.

    General compute tasks would benefit though as demonstrated in this paper.

    http://hps.ece.utexas.edu/pub/TR-HPS-2010-006.pdf

    Edit: Here's the corresponding DWS paper with more detail on the approach and benefits.

    http://www.cs.virginia.edu/~skadron/Papers/meng_simd_isca10.pdf
     