Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

  1. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,634
    Likes Received:
    5,210
    There was no GT4e with Haswell or Broadwell.
    It's the 72 EU + EDRAM model that approaches the performance of a mobile GM107, not the 48 EU + EDRAM.
    If the mobile GP108's performance gets close to the mobile GM107, then it'll be close to the GT4e too.

    There is a Skylake 45W Core i5 with the GT4e too.
     
  2. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    94
    I'm well aware of that..which is why I specifically mentioned Broadwell or Haswell GT3e. And neither of them was used in place of GK107/GM108, btw.

    You can hardly say that the GT4e approaches the performance of a GM107. Its 3DMark Fire Strike Graphics score is less than half of a GM107's - http://www.pcworld.com/article/3074...n-nuc-smashes-all-mini-pc-preconceptions.html

    I do expect the GP108 to be close to the GM107..which would put it ahead of a GT4e.
    Again, I already mentioned that the low-end GPUs are usually paired with mid-range i5s and not the GT3e/GT4e variants. Either way..can you name any laptop using such a chip?
     
  3. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,634
    Likes Received:
    5,210
    A synthetic benchmark score comparing a 45W APU to a 60W desktop GTX 750 Ti + 91W CPU is meaningless.
    If you want to compare the mobile Iris 580 Pro to a mobile GM107, you'll have to go to notebookcheck, scroll down to the game scores and compare with e.g. the 850M/950M.


    Plus, the GP108 wouldn't even go against Skylake's GT4e. There's a chance it would face Kaby Lake's iGPUs during most of its lifetime.
     
  4. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    265
    Likes Received:
    162
    And why would you not compare it to 860/960M?

    Anyway, I followed your suggestion and I can still see the 950M routinely being at least 50% faster in most places. The 960M seems to be close to 100% faster and sometimes substantially more. Of course, the game results on that page are a mess and it's difficult to find the 950M and 960M in many of the lists... but the overall picture is clear: GM107 mobile GPUs are definitely faster than the Iris Pro 580.
     
    Erinyes likes this.
  5. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    864
    Likes Received:
    266
    CodeXL produces everything (min, max, add, etc.) except v_mul_16 for Fiji; it unpacks and does v_mul_32 instead. For Polaris, v_mul_16 is generated, though.
     
  6. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    I keep seeing this misconception from multiple posters that fast INT8 and INT16 support is new to Pascal. It's not. All Kepler, Maxwell, and Pascal parts have the key 4x-rate INT8 MAD and 2x-rate INT16 MAD, plus other functions like min/max and shift. These are commonly used in CUDA and are labeled "scalar video instructions".
    What's new and unique to GP102 and GP104 is the DP4A and DP2A instructions, which do a 32-bit accumulate of 4 or 2 INT8 dot products.
    What Kepler had natively, and Maxwell and Pascal dropped, is a bunch of more complex SIMD integer instructions (mostly used for implementing video codec encoders); these were replaced with multi-instruction emulation microcode, mostly because the newer parts have better fixed-function encoders/decoders.
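    To make the DP4A point concrete, here's a minimal sketch (untested, kernel name and shapes made up by me) of the INT8 dot product it accelerates, with the pre-GP102/GP104 fallback spelled out. __dp4a ships with CUDA 8 and maps to the dp4a PTX instruction on sm_61:

    ```cuda
    // Hypothetical example: INT8 dot product over byte-packed int arrays.
    // On sm_61 (GP102/GP104/GP106) __dp4a does four INT8 multiplies plus a
    // 32-bit accumulate in one instruction; older parts unpack manually.
    #include <cuda_runtime.h>

    __global__ void dot_int8_packed(const int* a, const int* b, int n4, int* out)
    {
        int acc = 0;
        for (int i = threadIdx.x; i < n4; i += blockDim.x) {
    #if __CUDA_ARCH__ >= 610
            acc = __dp4a(a[i], b[i], acc);      // 4 INT8 MADs, one instruction
    #else
            for (int k = 0; k < 4; ++k) {       // unpack each byte and MAD
                int av = (signed char)(a[i] >> (8 * k));
                int bv = (signed char)(b[i] >> (8 * k));
                acc += av * bv;
            }
    #endif
        }
        atomicAdd(out, acc);                    // simple block-wide reduction
    }
    ```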
     
    Ext3h and pharma like this.
  7. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    94
    Once again you fail to see that I explicitly compared only the Graphics scores of both parts. Also, the clock of the mobile 960M is higher than the desktop 750 Ti's, FYI. But granted..the lower TDP might still result in slightly lower performance. Hardly significant, though, when the score is less than half.

    Anyway to humour you I did go to notebookcheck and they did a NUC review here - http://www.notebookcheck.net/Intel-Iris-Pro-Graphics-580.160664.0.html

    They got a Fire Strike Graphics score of 1836 for the GT4e NUC vs 4304 for a 960M

    And in actual gaming benchmarks, for the 960M vs GT4e (from your link):

    Overwatch 1920x1080 ultra - 39.73 FPS vs 23.1 FPS
    ROTR 1920x1080 high - 28.06 FPS vs 12.2 FPS
    MGS V 1920x1080 ultra - 37.1 FPS vs 14.9 FPS
    Metro Last Light 1920x1080 ultra - 30.69 FPS vs 14.3 FPS
    Bioshock Infinite 1920x1080 ultra - 44.8 FPS vs 14.9 FPS
    We haven't even seen GT4e in a laptop yet. Either way..given the above numbers..I'm sure NV isn't worried.
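    (For scale - my arithmetic from the numbers above - those gaps work out to roughly 1.7x, 2.3x, 2.5x, 2.1x and 3.0x in the 960M's favour.)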
     
  8. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,634
    Likes Received:
    5,210
    Okay, Skylake's GT4e isn't quite there yet. Kaby Lake's might be.
    However, 1080p scores greatly hurt the Iris 580. The difference at 720/768p is a lot smaller from what I've seen: 25-30%.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Which is where ASICs come in.

    There's a lot of binary in the brain :mrgreen:
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Instead of "raster" I think it would be better to think in terms of "work group and work item despatch and scheduling". Instead of ROPs you might think of "global atomics" and general memory operations. And instead of TMU-filters it's better to think of algorithms that are served by the texturing cache hierarchy/swizzling. All of these things are relevant to pure compute.

    Part of the reason for my question was: imagine that the priority for the chip was double precision. Is it possible to put more DP into the chip, regardless of SP and HP and stay within the power budget?
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Maybe this is the real reason?

    That's only if you think of SP as being like the actual HP implementation. Intel uses ganging. In other words, the SP is two real lanes and DP is two lanes working in concert.

    No doubt, power has been a tight constraint for quite a while now: but NVidia keeps telling us that computation is not the power hog, it's routing data into and out of the ALUs. Routing and area must interact. It seems likely to me that routing either to SP or DP ALUs and then routing results back hinders power-efficiency (larger overall candidate area spanned by the data).

    Having dedicated SP and DP ALUs allows one or the other to be turned off while the other is working. On the other hand, multipliers built from repeating blocks of functionality and used for both SP and DP can turn off the blocks that are only needed for DP while doing SP.

    GP102 is probably Maxwell-like in the quantity of DP it offers. Does GP102 have more SP ALU capability than GP100?

    Is GP102 power limited in its SP capability?

    Isn't that precisely what GP100 is?

    I would expect a modern design to switch off the paths that aren't required in SP mode. Intel's design (being multi-precision) is the obvious place where this should be the case. But does anyone know if that's what's happening?

    Absolutely. I talked about this earlier (NVidia doing the noble thing, Intel buying bums on seats.)

    Volta isn't that far away it seems, so that all sounds reasonable.

    But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.
     
  12. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    I've been hacking some compute kernels on an Intel Broadwell GT3e lately and it's a fascinating GPU.

    I'm still tracking down some seemingly odd GEN codegen for 64-bit load/stores to local memory but, otherwise, for my use case performance seems to be competitive with a similarly spec'd discrete Maxwell v1 GPU.

    FP16 support has not yet shown up in the Windows driver, but in theory it should enable double-rate FMA throughput (16 FP16 FMAs/clock per EU x 48 EUs x 1.15 GHz x 2 ops/FMA = 1766 FP16 GOPS).

    I will have my FP16x2 support in this life or the next! :runaway:
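    Until GEN's FP16 shows up, here's what the packed-FP16 path looks like in CUDA terms - a rough sketch of mine (kernel name and shapes invented), requiring sm_53 or later (Tegra X1, GP100), where one __hfma2 retires two FP16 FMAs and gives the 2x rate:

    ```cuda
    // Hypothetical FP16x2 example: packed half2 y = a*x + y. Needs a GPU
    // with native FP16 arithmetic (__CUDA_ARCH__ >= 530); __hfma2 comes
    // from cuda_fp16.h and does two FP16 fused multiply-adds per call.
    #include <cuda_fp16.h>

    __global__ void haxpy(__half2 a, const __half2* x, __half2* y, int n2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2)
            y[i] = __hfma2(a, x[i], y[i]);   // two FP16 FMAs in one op
    }
    ```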
     
    liolio, homerdog, spworley and 6 others like this.
  13. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It's late here so I cannot look back at the reference info for it, but a good card to look at from that perspective is the original Kepler GTX Titan, as it had two modes for DP: in the 1/24-rate mode it ran at a higher clock speed, while enabling its full 1/3-rate DP support reduced the clock.
    I assume that's for the power budget, as you say, but it might be worth checking out the clock difference. Unfortunately we are talking about Kepler rather than Maxwell, but it gives some idea.

    Cheers
     
    spworley and nnunn like this.
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,018
    Likes Received:
    114
    Well, I think that would sort of work - like DDR3 does today with the 940M (I'm still quite amazed what GM108 gets out of this no-bandwidth solution). The fastest DDR3 used on these chips is 1000 MHz (often just 900 MHz) - so 50% more with DDR4-3000, plus the alleged 20% improvement due to better compression, should help quite a lot (rough numbers at the end of this post). A GP108 chip would still be similarly bandwidth-limited as GM108, though that also depends on whether it's just higher clocks or adds another cluster.
    That's not what I'd call "taking GDDR5 seriously". I've seen exactly zero notebooks with a 940MX featuring GDDR5 memory - yes, the option is there, but it's optional. If Nvidia is serious about this they need to give it a higher model number, otherwise there's plenty of evidence (not just with this chip) that no one is going to bother (not that things are any better in the red camp wrt DDR3/GDDR5 variant naming). The fastest GM108 part to date is still the one in the Surface Book (albeit with just 1GB of GDDR5 memory).
    Yes, possible. Clamshell configurations don't seem to be popular in the low-end segment. I absolutely agree, though, that it would make sense...
    I think you're probably right that the graphics might be mostly the same.
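    Rough numbers for the bandwidth point above, assuming the 64-bit bus these low-end chips use (my arithmetic): DDR3-2000 gives 2000 MT/s x 8 B = 16 GB/s; DDR4-3000 gives 3000 MT/s x 8 B = 24 GB/s, i.e. +50%. Folding in the claimed ~20% from compression gives 1.5 x 1.2 = 1.8, so roughly 80% more effective bandwidth.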
     
    Lightman likes this.
  15. A1xLLcqAgt0qc2RyMz0y

    Veteran Regular

    Joined:
    Feb 6, 2010
    Messages:
    1,063
    Likes Received:
    390
    Be careful with the "in 5 years marks the end of GPUs in HPC of any kind" as you don't want to do a "charlie".

     
    Bob, homerdog, spworley and 5 others like this.
  16. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    647
    Likes Received:
    94
    I agree..the fact that it has less bandwidth than many SoCs today shows just how lacking in bandwidth it is. DDR4-3000 with the better compression would give 80% more bandwidth than DDR3-2000. While that would definitely be a lot better..it still falls short of what it needs..especially with the ~40-50% higher clocks expected this gen. I do expect one more SM for GP108 (384 to 512 CCs)..so the increase in bandwidth would barely keep up with the increased graphics resources. It definitely needs GDDR5 to show its full potential.
    Actually, I have seen a few 940MX models with GDDR5..but I totally agree..they need to stop making it optional and separate the DDR3/4 & GDDR5 variants with different model numbers, say 1020M and 1030M.

    E.g. http://www.newegg.com/Product/Produ...-cables-_-na-_-na&Item=N82E16834315422&cm_sp=

    https://www.amazon.com/Acer-Aspire-...scsubtag=d9b0fc28548711e689b1cedd434cefcf0INT
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany
    The problem with GT3e, even at 14 nm, is that power is 40-50ish watts for the GT alone under load. So mobile parts won't have a hard time beating it in perf/watt.

    Interesting! Any reason why there's no 32-bit sort for the Quadro? Would it skew the diagram's scale?
     
    #1917 CarstenS, Jul 28, 2016
    Last edited: Jul 28, 2016
  18. constant

    Newcomer

    Joined:
    Feb 9, 2014
    Messages:
    22
    Likes Received:
    8
    Careful there: GPUs in HPC and GPUs in gaming go hand in hand, i.e. the same architectures have been reused over the years with little hardware segmentation. Just like Intel Xeons are able to reuse functionality from the consumer Core i7 parts.

    The point is that both GPUs in HPC (Tesla) and Intel Xeons are driven by the consumer market; they are both just riding piggyback on the gaming / consumer segments.

    This is why the Xeon Phi series has been predicted to fail in the long term: it's a component very much separate from the consumer market. The same goes for the future Xeon + FPGA segment; it's completely directed towards HPC.

    As chief scientist Bill Dally said in a session I went to, "GPUs have a day job in gaming; in their spare time on the weekends they go and play in their cool rock band doing HPC computations".

    As long as Intel doesn't, for example, get the Xeon Phi into the gaming segment (as was originally planned with Larrabee), it is far more likely to be dead than GPUs are in the HPC segment. Intel has a lot of cash and can keep it afloat for a long time, but even they can't spend billions on research year after year without getting a payback (and no, a couple hundred thousand Xeon Phi sales is not going to cover it).
     
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany
    Those are the programmable parts, yes; even in DP mode it's not only the ALUs and PRFs working full time. But there's quite a bit of fixed-function hardware in those stages as well, which won't be consuming much energy while the chip churns through DP warps. Hence I explicitly mentioned raster, not the whatever-threaded command processor.

    That's a tough nut to crack. I'd say: Insufficient data as of yet.

    --

    Of course, but wasn't that what you were proposing?


    It surely is a delicate balance.

    AFAIR, no one has yet gotten an answer out of Nvidia about whether or not the DP units are actually inside some select SMs (all of them in GP100, for example) and are in fact just fatter multipliers and adders taking over, sharing the datapaths of two SP units, once it's a DP warp. Technically, they would then still fulfill what Nvidia termed "separate units, off to the side" (which is their official and most detailed answer yet, AFAIR).


    Not with FP32 and FP64 in separate ALU blocks. What I meant here was what you talk about later: multi-precision ALUs throughout the chip, sacrificing a bit of FP32 (and power in FP32) for maximum FP64. It would also make all the more sense the closer the delivery date for the government-funded exascale architecture looms.

    Obviously, but it's only clock-, not power-gating, I would guess. Power gating inside each multiplier (and probably adder) seems rather prohibitively expensive in terms of transistor budget.

    That may be the case - if GPUs do not evolve a bit as well in the meantime. I don't know, though, where exactly FPGAs sit on the 3D curve of throughput, power and configurability. On any two of them they are pretty strong, but does that apply to the third dimension as well?
     
    #1919 CarstenS, Jul 28, 2016
    Last edited: Jul 28, 2016
  20. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,634
    Likes Received:
    5,210
    Maybe not for long.
    The creation of GP102 + GP100 may be setting a precedent for that differentiation.

    At least on the nvidia side.
     