Nvidia Pascal Announcement

Ok so how many laptops with Broadwell or even Haswell GT3e then?
There was no GT4e with Haswell or Broadwell.
It's the 72 EU + EDRAM model that approaches the performance of a mobile GM107, not the 48 EU + EDRAM.
If the mobile GP108's performance gets close to the mobile GM107, then it'll be close to the GT4e too.

These "bottom of the barrel" GPUs are usually paired with mid range i5 processors and not the much more expensive gt3e/gt4e variants
There is a Skylake 45W Core i5 with the GT4e too.
 
There was no GT4e with Haswell or Broadwell.
It's the 72 EU + EDRAM model that approaches the performance of a mobile GM107, not the 48 EU + EDRAM.
If the mobile GP108's performance gets close to the mobile GM107, then it'll be close to the GT4e too.

I'm well aware of that..which is why I specifically mentioned Broadwell or Haswell GT3e. And neither of them was used in place of GK107/GM108, btw.

You can hardly say that the GT4e approaches the performance of a GM107. Its 3DMark Firestrike Graphics score is less than half of a GM107's - http://www.pcworld.com/article/3074...n-nuc-smashes-all-mini-pc-preconceptions.html

I do expect the GP108 to be close to the GM107..which would put it ahead of a gt4e.
There is a Skylake 45W Core i5 with the GT4e too.

Again, I already mentioned that the low end GPUs are usually paired with mid range i5s and not the gt3e/gt4e variants. Either way..can you name any laptop using such a chip?
 
You can hardly say that the GT4e approaches the performance of a GM107. Its 3DMark Firestrike Graphics score is less than half of a GM107's - http://www.pcworld.com/article/3074...n-nuc-smashes-all-mini-pc-preconceptions.html

That score of a synthetic benchmark comparing a 45W APU to a 60W desktop GTX 750 Ti + 91W CPU is meaningless.
If you want to compare the mobile Iris 580 Pro to a mobile GM107, you'll have to go to notebookcheck, scroll down to the game scores and compare with e.g. the 850M/950M.


Plus, the GP108 wouldn't even go against Skylake's GT4e. There's a chance it would face Kaby Lake's iGPUs during most of its lifetime.
 
That score of a synthetic benchmark comparing a 45W APU to a 60W desktop GTX 750 Ti + 91W CPU is meaningless.
If you want to compare the mobile Iris 580 Pro to a mobile GM107, you'll have to go to notebookcheck, scroll down to the game scores and compare with e.g. the 850M/950M.


Plus, the GP108 wouldn't even go against Skylake's GT4e. There's a chance it would face Kaby Lake's iGPUs during most of its lifetime.

And why would you not compare it to 860/960M?

Anyway I followed your suggestion and I can still see the 950M routinely being at least 50% faster in most places. The 960M seems to be close to 100% faster and sometimes substantially more. Of course, the game results on that page are a mess and it's difficult to find the 950M and 960M in many of the lists... but the overall picture is clear: GM107 mobile GPUs are definitely faster than the Iris Pro 580.
 
AMD's entire desktop lineup already supports FP16/INT16 (Tonga, Fiji, Polaris 10/11). I am sure Nvidia will follow suit when there are real performance gains to be seen in most games.

CodeXL produces everything (min, max, add, etc.) except v_mul_16 for Fiji; it unpacks and does v_mul_32 instead. Polaris does get v_mul_16 generated, though.
 
AMD's entire desktop lineup already supports FP16/INT16 (Tonga, Fiji, Polaris 10/11). I am sure Nvidia will follow suit when there are real performance gains to be seen in most games.
I keep seeing this misconception from multiple posters that fast INT8 and INT16 support is new to Pascal. It's not. All Kepler, Maxwell, and Pascal parts have the key 4x rate INT8 MAD and 2x rate INT16 MAD, plus other functions like min/max and shift. These are commonly used in CUDA and are labeled "scalar video instructions".
What's new and unique to GP102 and GP104 is the DP4A and DP2A instructions, which do a 4-way INT8 (or 2-way INT16xINT8) dot product with a 32-bit accumulate.
What's old to Kepler is a bunch of more complex native SIMD integer functions (mostly used for implementing video codec encoders), which were replaced with multi-instruction emulation sequences on Maxwell and Pascal, mostly because those have better fixed-function encoders/decoders.
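To make the distinction concrete, here is a minimal CUDA sketch (not from the thread; the kernel and buffer names are made up for illustration). __dp4a is the CUDA 8 intrinsic for the new DP4A instruction and needs an sm_61 target (GP102/GP104), while __vabsdiff4 is one of the older SIMD "video" intrinsics that has been exposed since the Kepler days (native there, emulated later, as described above):

```cuda
// Sketch only: packed INT8 dot product via DP4A on sm_61, with a manual
// unpack-and-MAC fallback elsewhere, plus one Kepler-era SIMD video intrinsic.
#include <cuda_runtime.h>

__global__ void int8_dot_demo(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int acc = 0;
#if __CUDA_ARCH__ >= 610
    // Four signed INT8 lanes per int; DP4A does the 4-way dot product and
    // accumulates into a 32-bit integer in one instruction (new in GP102/GP104).
    acc = __dp4a(a[i], b[i], acc);
#else
    // Pre-GP102/GP104: unpack each byte and multiply-accumulate with the
    // ordinary integer path.
    for (int k = 0; k < 4; ++k) {
        int av = (signed char)(a[i] >> (8 * k));
        int bv = (signed char)(b[i] >> (8 * k));
        acc += av * bv;
    }
#endif

    // One of the long-standing SIMD "video" intrinsics: per-byte absolute
    // difference, packed back into a 32-bit word.
    unsigned int sad = __vabsdiff4((unsigned int)a[i], (unsigned int)b[i]);

    out[i] = acc + (int)sad;   // combine both results so neither is optimized away
}
```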
 
That score of a synthetic benchmark comparing a 45W APU to a 60W desktop GTX 750 Ti + 91W CPU is meaningless.
If you want to compare the mobile Iris 580 Pro to a mobile GM107, you'll have to go to notebookcheck, scroll down to the game scores and compare with e.g. the 850M/950M.

Once again you fail to see that I explicitly compared only the Graphics scores of both parts. Also, the clock of the mobile 960M is higher than the desktop 750 Ti's, FYI. But granted..the lower TDP might still result in slightly lower performance. Hardly significant though when the score is less than half.

Anyway to humour you I did go to notebookcheck and they did a NUC review here - http://www.notebookcheck.net/Intel-Iris-Pro-Graphics-580.160664.0.html

They got a Firestrike Graphics score of 1836 for the GT4e NUC vs 4304 for a 960M

And in actual gaming benchmarks, for 960M vs GT4e (From your link):-

Overwatch 1920x1080 ultra - 39.73 FPS vs 23.1 FPS
ROTR 1920x1080 high - 28.06 FPS vs 12.2 FPS
MGS V 1920x1080 ultra - 37.1 FPS vs 14.9 FPS
Metro Last Light 1920x1080 ultra - 30.69 FPS vs 14.3 FPS
Bioshock Infinite 1920x1080 ultra - 44.8 FPS vs 14.9 FPS
Plus, the GP108 wouldn't even go against Skylake's GT4e. There's a chance it would face Kaby Lake's iGPUs during most of its lifetime.

We haven't even seen GT4e in a laptop yet. Either way..given the above numbers..I'm sure NV isn't worried.
 
Okay, Skylake's GT4e isn't quite there yet. Kaby Lake might be.
However, 1080p scores greatly hurt the Iris 580. The difference at 720/768p is a lot smaller from what I've seen: 25-30%.
 
FWIW, I think that once you include raster, ROPs and TMU filters, which are more likely to be used in conjunction with SP than with DP (I guess here it's mostly the data path from the TMU fetchers that comes into play), the highest power draw would be seen in SP loads.
Instead of "raster" I think it would be better to think in terms of "work group and work item despatch and scheduling". Instead of ROPs you might think of "global atomics" and general memory operations. And instead of TMU-filters it's better to think of algorithms that are served by the texturing cache hierarchy/swizzling. All of these things are relevant to pure compute.

But it'd be hard to test, for two reasons: first, I know of no DP workload that also uses most of the fixed-function stuff. Second, and more important: it's highly likely that even in SP with boost, the card will run into its power limit. If it does so for DP workloads as well, we cannot be sure which one would cause the higher power draw if unthrottled.
Part of the reason for my question was: imagine that the priority for the chip was double precision. Is it possible to put more DP into the chip, regardless of SP and HP and stay within the power budget?
 
[...]I think Nvidia is playing most of their bets rather safe.
Maybe this is the real reason?

Without major stunts (actually I cannot think of anything that would work) you would lose half your potential FP32 throughput as soon as you cannot find paired instructions anymore, in addition to the more complex instruction routing.
That's only if you think of SP as being like the actual HP implementation. Intel uses ganging. In other words, SP runs on two real lanes and DP is those two lanes working in concert.

Of course I could be wrong with a high likelihood, but for a couple of generations now, power seems to be the main concern, not area anymore.
No doubt, power has been a tight constraint for quite a while now: but NVidia keeps telling us that computation is not the power hog, it's routing data into and out of the ALUs. Routing and area must interact. It seems likely to me that routing either to SP or DP ALUs and then routing results back hinders power-efficiency (larger overall candidate area spanned by the data).

Having dedicated SP and DP ALUs allows one or the other to be turned off while the other is working. On the other hand, multipliers built from repeating blocks of functionality and used for both SP and DP can turn off the blocks that are only needed for DP while doing SP.

GP102 is probably Maxwell-like in the quantity of DP it offers. Does GP102 have more SP ALU capability than GP100?

Is GP102 power limited in its SP capability?

So you would have to have a whole line of your GPUs totally dedicated to HPC,
Isn't that precisely what GP100 is?

or your other chips (think about power) would have to carry the more complex multipliers (and adders and muxes) in their guts as well. Correct me if I'm wrong, but even 53x53 MULs would be ok for DP, right? For iterating over the SP ALUs, you'd need 27x27, which is a ~26.5% increase over the 24-bit MULs (multiplier area grows roughly with the square of the operand width, and 27^2/24^2 is ~1.27). I am not sure if you can effectively mask the additional bits out so they do not use any energy anymore.
I would expect a modern design to switch off the paths that aren't required in SP mode. Intel's design (being multi-precision) is the obvious place where this should be the case. But does anyone know if that's what's happening?

Intel already has these AVX-based ALUs and plans on using them in their regular Xeon processors as well. They need their code compatibility which has been touted as (one of) the big advantages of going Intel from day one.
Absolutely. I talked about this earlier (NVidia doing the noble thing, Intel buying bums on seats.)

My feeling is that maybe even with Volta, we might see a completely separated lineup for HPC (FP64+FP32 focus), deep learning (FP16+INT8 focus) and other uses such as gaming (FP32 focus plus whatever you can cram in for free, INT8?).
Volta isn't that far away it seems, so that all sounds reasonable.

But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.
 
Okay, Skylake's GT4e isn't quite there yet. Kaby Lake might be.

I've been hacking some compute kernels on an Intel Broadwell GT3e lately and it's a fascinating GPU.

I'm still tracking down some seemingly odd GEN codegen for 64-bit load/stores to local memory but, otherwise, for my use case performance seems to be competitive with a similarly spec'd discrete Maxwell v1 GPU.

FP16 support has not yet shown up in the Windows driver but in theory it should enable double-rate FMA throughput (16 FP16 FMAs/clock per EU x 48 EUs x 1.15 GHz = 1766 FP16 GOPS, counting each FMA as two ops).

I will have my FP16x2 support in this life or the next! :runaway:
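For comparison, this is roughly what packed FP16 math looks like from the CUDA side today (a sketch, not the Intel/GEN path being discussed; __hfma2 needs a part with native FP16 arithmetic such as sm_53 or GP100's sm_60, and the fallback branch is what you effectively get elsewhere):

```cuda
// Sketch: FP16x2 FMA, which is where the "double-rate" FP16 numbers come from.
#include <cuda_fp16.h>

__global__ void fp16x2_fma_demo(const __half2* a, const __half2* b,
                                const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

#if __CUDA_ARCH__ >= 530
    // One instruction issues two FP16 FMAs per thread.
    out[i] = __hfma2(a[i], b[i], c[i]);
#else
    // No native FP16 arithmetic: unpack to FP32, FMA, repack.
    float2 af = __half22float2(a[i]);
    float2 bf = __half22float2(b[i]);
    float2 cf = __half22float2(c[i]);
    out[i] = __floats2half2_rn(fmaf(af.x, bf.x, cf.x),
                               fmaf(af.y, bf.y, cf.y));
#endif
}
```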
 
Part of the reason for my question was: imagine that the priority for the chip was double precision. Is it possible to put more DP into the chip, regardless of SP and HP and stay within the power budget?
It's late here so I cannot look back at reference info for it, but a good card to look at from that perspective is the original Kepler GTX Titan: it had two modes for DP, and in the 1/24 ratio this gave a higher clock speed, while enabling its full 1/3-rate DP support reduced the clock.
I assume that was for the power budget, as you say, but it might be worth checking out the clock difference. Unfortunately we are talking about Kepler rather than Maxwell, but it gives some ideas.

Cheers
 
Yep..64-bit DDR3 would not cut it anymore. Even higher-clocked DDR4 (say 3000 MHz) would not be enough IMHO.
Well I think that would sort of work - like ddr3 does today with 940m (I'm still quite amazed what gm108 gets out of this no-bandwidth solution). Fastest ddr3 used on these chips is 1000Mhz (often just 900Mhz) - so 50% more with ddr4-3000, plus the alleged 20% improvement due to better compression should help quite a lot (so, a gp108 chip would be similarly bandwidth limited as gm108, though that would also depend if it's just higher clocks or adding another cluster).
They'd have to move to GDDR5 and there's evidence they are moving in this direction as we saw with the 940MX.
That's not what I'd call "taking gddr5 seriously". I've seen exactly zero notebooks with a 940MX featuring gddr5 memory - yes the option is there, but it's optional. If nvidia is serious about this they need to give it a higher model number, otherwise there's plenty of evidence (not just with this chip) that no one is going to bother (not that things are any better in the red camp wrt ddr3/gddr5 variant naming). The fastest gm108 part to date is still the one in the Surface Book (albeit with just 1GB gddr5 memory).
4 GB with clamshell should be possible.
Yes, possible. Clamshell configurations don't seem to be popular in the low-end segment. I absolutely agree though that it would make sense...
FWIW I don't expect Kaby Lake to be much of an improvement. We'll find out soon enough though..Kaby Lake is shipping to OEMs already.
I think you're probably right the graphics might be mostly the same.
 
But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.

Be careful with the "in 5 years marks the end of GPUs in HPC of any kind" as you don't want to do a "charlie".

May 11, 2011

I would ask the question in a more general sense. Will GPUs exist in 5 years. The answer there would be no.

The low end dies this year, or at least starts to do a PeeWee Herman at the end of Buffy the Vampire Slayer (the movie, not the show). There goes the volume. 2012 sees the same happening for the high end. The middle isn't enough to sustain NV.

They have 2 years to make compute and widgets profitable. Good luck there guys.

-Charlie

http://www.semiaccurate.com/forums/showpost.php?p=48497&postcount=10
 
Well I think that would sort of work - like ddr3 does today with 940m (I'm still quite amazed what gm108 gets out of this no-bandwidth solution). Fastest ddr3 used on these chips is 1000Mhz (often just 900Mhz) - so 50% more with ddr4-3000, plus the alleged 20% improvement due to better compression should help quite a lot (so, a gp108 chip would be similarly bandwidth limited as gm108, though that would also depend if it's just higher clocks or adding another cluster).

I agree..the fact that it has less bandwidth than many SoCs today shows just how lacking in bandwidth it is. DDR4 3000 with the better compression would give 80% more bandwidth than DDR3 2000. While that would definitely be a lot better..it still falls short of what it needs..especially with the ~40-50% higher clocks expected this gen. I do expect one more SM for GP108 (384 to 512 CCs)..so the increase in b/w would barely keep up with the increased graphics resources. It definitely needs GDDR5 to show its full potential.
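A quick host-side sanity check of that 80% figure (the 64-bit bus, the memory clocks and the ~20% compression gain are just the assumptions from this post):

```cuda
// Back-of-the-envelope bandwidth arithmetic; compiles with any C++ compiler or nvcc.
#include <cstdio>

int main()
{
    const double bus_bytes   = 64.0 / 8.0;                 // 64-bit interface
    const double ddr3_2000   = 2000e6 * bus_bytes / 1e9;   // ~16 GB/s raw
    const double ddr4_3000   = 3000e6 * bus_bytes / 1e9;   // ~24 GB/s raw
    const double compression = 1.20;                       // the claimed ~20% gain

    const double effective = ddr4_3000 * compression;
    std::printf("DDR3-2000: %.0f GB/s, DDR4-3000 + compression: ~%.0f GB/s (+%.0f%%)\n",
                ddr3_2000, effective, 100.0 * (effective / ddr3_2000 - 1.0));
    // Prints roughly: DDR3-2000: 16 GB/s, DDR4-3000 + compression: ~29 GB/s (+80%)
    return 0;
}
```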
That's not what I'd call "taking gddr5 seriously". I've seen exactly zero notebooks with a 940MX featuring gddr5 memory - yes the option is there, but it's optional. If nvidia is serious about this they need to give it a higher model number, otherwise there's plenty of evidence (not just with this chip) that no one is going to bother (not that things are any better in the red camp wrt ddr3/gddr5 variant naming). The fastest gm108 part to date is still the one in the Surface Book (albeit with just 1GB gddr5 memory).
Actually I have seen a few 940MX with GDDR5..but I totally agree..they need to stop making it optional and separate the DDR3/4 & GDDR5 variants with different model nos, say 1020M and 1030M.

Eg- http://www.newegg.com/Product/Produ...-cables-_-na-_-na&Item=N82E16834315422&cm_sp=

https://www.amazon.com/Acer-Aspire-...scsubtag=d9b0fc28548711e689b1cedd434cefcf0INT
 
There was no GT4e with Haswell or Broadwell.
It's the 72 EU + EDRAM model that approaches the performance of a mobile GM107, not the 48 EU + EDRAM.
If the mobile GP108's performance gets close to the mobile GM107, then it'll be close to the GT4e too.


There is a Skylake 45W Core i5 with the GT4e too.
The problem with GT3e, even at 14 nm, is that power is 40-50ish watts for the GT alone under load. So mobile parts won't have a hard time beating it on perf/watt.

I've been hacking some compute kernels on an Intel Broadwell GT3e lately and it's a fascinating GPU.

I'm still tracking down some seemingly odd GEN codegen for 64-bit load/stores to local memory but, otherwise, for my use case performance seems to be competitive with a similarly spec'd discrete Maxwell v1 GPU.

FP16 support has not yet shown up in the Windows driver but in theory it should enable double-rate FMA throughput (16 FP16 FMAs/clock per EU x 48 EUs x 1.15 GHz = 1766 FP16 GOPS, counting each FMA as two ops).

I will have my FP16x2 support in this life or the next! :runaway:
Interesting! Any reason why there's no 32-bit sort for the Quadro? Would it skew the diagram's scale?
 
Maybe this is the real reason?
But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.

Careful there, GPUs in HPC and GPUs in gaming go hand in hand, i.e. the same architectures have been reused over the years with little HW segmentation. Just like Intel Xeons are able to reuse functionality from the consumer Core i7 components.

The point is that both GPUs in HPC (Tesla) and Intel Xeons are driven by the consumer market; they are both just riding piggyback on the gaming/consumer segments.

This is why the Xeon Phi series has been predicted to fail in the long term: it's a very separate component from the consumer market. The same goes for the future Xeon + FPGA segment; it's completely directed towards HPC.

As Nvidia's chief scientist Bill Dally said in a session I went to, "GPUs have a day job in gaming; in their spare time on the weekends they go and play in their cool rock band doing HPC computations".

As long as Intel doesn't, for example, get the Xeon Phi into the gaming segment (as was originally planned with Larrabee), it's by far more likely to be dead than GPUs are in the HPC segment. Intel has a lot of cash and can keep it afloat for a long time, but even they can't spend billions on research year after year without getting a payback (and no, a couple of hundred thousand Xeon Phi sales are not going to cover it).
 
Instead of "raster" I think it would be better to think in terms of "work group and work item despatch and scheduling". Instead of ROPs you might think of "global atomics" and general memory operations. And instead of TMU-filters it's better to think of algorithms that are served by the texturing cache hierarchy/swizzling. All of these things are relevant to pure compute.
Those are the programmable parts, yes; even in DP mode it's not only the ALUs and PRFs working full time. But there's quite a bit of fixed-function hardware in those stages as well, which won't be consuming much energy while the chip churns through DP warps. Hence I explicitly mentioned raster, not the whatever-threaded command processor.

Part of the reason for my question was: imagine that the priority for the chip was double precision. Is it possible to put more DP into the chip, regardless of SP and HP and stay within the power budget?
That's a tough nut to crack. I'd say: Insufficient data as of yet.

--

That's only if you think of SP as being like the actual HP implementation. Intel uses ganging. In other words, SP runs on two real lanes and DP is those two lanes working in concert.
Of course, but wasn't that what you were proposing?


No doubt, power has been a tight constraint for quite a while now: but NVidia keeps telling us that computation is not the power hog, it's routing data into and out of the ALUs. Routing and area must interact. It seems likely to me that routing either to SP or DP ALUs and then routing results back hinders power-efficiency (larger overall candidate area spanned by the data).
It surely is a delicate balance.

Having dedicated SP and DP ALUs allows one or the other to be turned off while the other is working. On the other hand, multipliers built from repeating blocks of functionality and used for both SP and DP can turn off the blocks that are only needed for DP while doing SP.
AFAIR, no one has yet gotten an answer out of Nvidia on whether the DP units are actually inside some select SMs (all of them in GP100, for example) and are in fact just fatter multipliers and adders taking over, sharing the datapaths of two SP units whenever a DP warp comes along. Technically, that would still fulfill what Nvidia termed "separate units, off to the side" (which is their official and most detailed answer yet, AFAIR).

GP102 is probably Maxwell-like in the quantity of DP it offers. Does GP102 have more SP ALU capability than GP100?

Is GP102 power limited in its SP capability?


Isn't that precisely what GP100 is?
Not with FP32 and FP64 in separate ALU blocks. What I meant here was what you talk about later: multi-precision ALUs throughout the chip, sacrificing a bit of FP32 throughput and FP32 power efficiency for maximum FP64. It would also make more sense the closer the delivery date for the government-funded exascale architecture looms.

I would expect a modern design to switch off the paths that aren't required in SP mode. Intel's design (being multi-precision) is the obvious place where this should be the case. But does anyone know if that's what's happening?
Obviously, but it's only clock-gating, not power-gating, I would guess. Power gating inside each multiplier (and probably adder) seems rather prohibitively expensive in terms of transistor budget.

But x86 or Phi coupled to on-package or on-die FPGA functionality shouldn't be more than 5 years away. And that, in my view, marks the end of GPUs in HPC of any kind.
That may be the case - if GPUs do not evolve as well in the meantime. I don't know, though, where exactly FPGAs sit on the 3D curve of throughput, power and configurability. On any two of those they are pretty strong, but does that hold for the third dimension as well?
 
Careful there, GPUs in HPC and GPUs in gaming go hand in hand, i.e. the same architectures have been reused over the years with little HW segmentation.
Maybe not for long.
The creation of GP102 + GP100 may be setting a precedent for that differentiation.

At least on the nvidia side.
 