Nvidia Pascal Announcement

Makes sense, as that was the largest demographic behind the enthusiast complaints and the case put forward; benchmark record-setting, and benchmark applications generally, usually have the best support for 3- and 4-way mGPU implementations.
Cheers
 
http://videocardz.com/60977/nvidia-to-enable-34-way-gtx-1080-sli-only-for-selected-applications

Original article

http://www.pcper.com/news/Graphics-...3-Way-and-4-Way-SLI-will-not-be-enabled-games


Looks like they aren't completely dropping 3-way and 4-way SLI.
Just games will not work, lol; well, they didn't scale much past 2-way SLI anyway.
They already made it clear at the launch that 3- and 4-way SLI won't be dropped (even if it's not really supported either).
 
ScottGray has posted an update on his experience with the 1080 FE with regard to fp support:
I finally got to play with a 1080 today and I thought I'd post my findings.

So the int8/16 dot product instructions (dp4a and dp2a) are indeed full throughput and work just as advertised. The fp16 support, however, has a lot of functionality that isn't documented:

  1. HFMA2.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>,<->b.<H0_H0|H1_H1>,<->c.<H0_H0|H1_H1|F32>;
  2. <HADD2|HMUL2>.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>,<->b.<H0_H0|H1_H1>;


On sm_61 this of course runs at 1/64 throughput (the instruction, not the math) so all this is of limited use. But it might be good to be aware of it if you get ahold of hardware that isn't crippled ( sm_60, sm_62? ).

Interestingly, there's a lot of mixed fp32/fp16 precision support. Though the mode I was most interested in (fp16x2 dot product with fp32 accumulate) isn't supported. Any time you mix in fp32 registers the instruction can only work on one of the packed fp16 values in other registers (as far as I can tell anyway).

The H0_H0 and H1_H1 flags are there to facilitate a SIMD matrix multiply outer product, but are also used to select the packed values in single throughput mode. The merge flags don't currently work on sm_61 but it's clear they're meant to facilitate BFI like functionality.

Oh, and one other thing.. it looks like there's a performance bug in the compiler. If you're loading from memory in fp16 and then converting to fp32 for compute, ptxas tries to be clever and uses HADD2 instead of F2F to do the conversion. But on sm_61 hardware this is a 16x slower pathway. On sm_60 hardware this makes a lot more sense since it's a 4x speedup. I've already submitted a bug for this.
https://devtalk.nvidia.com/default/...e-gtx-1080-amp-gtx-1070/post/4898563/#4898563
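
For anyone who wants to poke at those paths from CUDA C rather than raw SASS, here's a minimal sketch of my own (not ScottGray's code), assuming the CUDA 8 __dp4a / __hfma2 intrinsics and an sm_61 build; the 1/64-rate caveat above applies to the fp16 half of it.

  // Minimal sketch (my own test code, not ScottGray's) exercising the same paths
  // from CUDA C instead of SASS. Assumes CUDA 8+ and nvcc -arch=sm_61.
  #include <cstdio>
  #include <cuda_fp16.h>

  __global__ void demo(int* dp_out, float* h_out)
  {
      // dp4a: four int8 products accumulated into an int32 (the full-rate path on sm_61).
      int a = 0x01020304;             // packed bytes 4, 3, 2, 1
      int b = 0x01010101;             // packed bytes 1, 1, 1, 1
      *dp_out = __dp4a(a, b, 0);      // 4 + 3 + 2 + 1 = 10

      // Packed fp16 FMA on both halves at once; compiles to HFMA2, i.e. the
      // 1/64-rate instruction on sm_61 described above.
      __half2 ha = __floats2half2_rn(1.0f, 2.0f);
      __half2 hb = __floats2half2_rn(3.0f, 4.0f);
      __half2 hc = __floats2half2_rn(0.5f, 0.5f);
      __half2 hd = __hfma2(ha, hb, hc);
      h_out[0] = __low2float(hd);     // 1*3 + 0.5 = 3.5
      h_out[1] = __high2float(hd);    // 2*4 + 0.5 = 8.5
  }

  int main()
  {
      int* d_i; float* d_f;
      cudaMalloc(&d_i, sizeof(int));
      cudaMalloc(&d_f, 2 * sizeof(float));
      demo<<<1, 1>>>(d_i, d_f);
      int dp; float h[2];
      cudaMemcpy(&dp, d_i, sizeof(int), cudaMemcpyDeviceToHost);
      cudaMemcpy(h, d_f, 2 * sizeof(float), cudaMemcpyDeviceToHost);
      printf("dp4a = %d, hfma2 = (%g, %g)\n", dp, h[0], h[1]);
      return 0;
  }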

Anyway, along with some other aspects reported, I wonder if some of this comes down to driver bugs.
Cheers
 
What's much more surprising to me is that the 1070m is using the same maxed-out 8000 MT/s GDDR5 data rate as the desktop part. Mobile GDDR5 tends to be clocked 20%-30% slower than desktop (i.e. 5000 versus 7010 MT/s for GM204).
Could it be using underclocked GDDR5X instead?
The slower GDDR5 clocks in notebooks are mostly down to mobile chips nearly exclusively using low-voltage GDDR5. But I don't think you can get anything faster than 6 Gbps or so at 1.35V (it's next to impossible to figure out from the never up-to-date, piss-poor websites of Samsung, Hynix, and Micron). I'm not even sure you can get "regular" (1.5V) 8 Gbps GDDR5 (vs. the factory-overclocked 1.55-1.6V parts they used in the past for the fastest speed grades, which would definitely be undesirable for notebooks).
 
Detailed import data of GP106
[Image: nVidia-GP106-Lieferungen-Juni-2016-Zauba.vorschau2.png]

Poor man's translation ....
This puts nVidia's GP106 chip in a very good position to compete at eye level with (or even beyond) AMD's Polaris 10 graphics chip. Whether GP106 carries only half of GP104's processing units (i.e. 1280 shader units) or somewhat more (up to 1408 shader units), raw compute growth is not a concern for the Pascal chips given the clock headroom of TSMC's 16nm process. The bigger question mark for GP106 was an anticipated bandwidth limitation if it stayed at a 128-bit memory interface like the preceding GM206 chip (GeForce GTX 950 & 960), since GDDR5X is, for cost reasons, currently not an option in this market segment. With a 50% wider memory interface on the GP106 chip, nVidia sidesteps all of these problems and ensures that the expected competitive computing power actually translates into performance.

In addition, a 6 GB memory configuration is quite attractive in an (assumed) price range of 200 to 250 euros, where AMD will offer both 8 GB and 4 GB solutions. With 6 GB the nVidia counterpart sits nominally in the middle, but the benefits work out very differently: going from 4 GB to 6 GB is an almost fundamental improvement, while the difference between 6 GB and 8 GB is hardly significant. Either way, 6 GB is, from today's perspective, a good memory configuration for the Full HD resolution these cards target - not necessarily a particularly long-lived one (that applies rather to 8 GB), but still clearly better than "only" 4 GB. Depending on how attractively AMD prices its 8 GB solutions and whether the trade press eventually judges those memory sizes differently, AMD's 8 GB solutions may of course still come out ahead - but 6 GB of memory is certainly not genuinely bad in this market segment.
http://www.3dcenter.org/news/nvidia...mit-6-gb-speicher-und-192-bit-speicherinterfa

https://www.zauba.com/import-gp106-hs-code.html
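
Quick sanity check on the bandwidth argument in the translation above; the 8 GT/s GDDR5 speed is my assumption (GTX 1070-class), not a figure from the article:

  // Back-of-the-envelope GDDR5 bandwidth for 128-bit vs 192-bit buses.
  // 8 GT/s is an assumed data rate, not a confirmed GP106 spec.
  #include <cstdio>

  static double gddr5_gb_per_s(int bus_width_bits, double gt_per_s)
  {
      return bus_width_bits / 8.0 * gt_per_s;   // bytes per transfer * 10^9 transfers/s = GB/s
  }

  int main()
  {
      printf("128-bit @ 8 GT/s: %.0f GB/s\n", gddr5_gb_per_s(128, 8.0));   // 128 GB/s (GM206-style bus)
      printf("192-bit @ 8 GT/s: %.0f GB/s\n", gddr5_gb_per_s(192, 8.0));   // 192 GB/s (rumoured GP106 bus)
      return 0;
  }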
 
Makes sense, as that was the largest demographic behind the enthusiast complaints and the case put forward; benchmark record-setting, and benchmark applications generally, usually have the best support for 3- and 4-way mGPU implementations.
Cheers

Well, scaling of 4-way SLI/CFX has always been pretty limited, and not just by "game support": you need to OC your CPU way higher than most gamers will, given the stability and heat-dissipation demands of long gaming sessions. I didn't have much problem with scaling when I had my CPU under subzero cooling (with a beefy phase-change unit). In fact, you had two cases: either inverse scaling (not supported at all, which in general was also the case for 2-way) or good scaling. That said, you need to know how to work with different profiles, and it has never been easy "out of the box".

When you see gamers and reviewers who for years tried quad CFX/SLI on 1080p monitors, with a mid-range quad-core CPU on air... well, that was funny.

What is strange is that now, with 4K and VR, I'm pretty sure the CPU bottleneck could well be less of a limit than ever.

This said, 3-way SLI and CFX was way simpler to use, and in general scaling was not so bad.

At the end of the day, it is a bit like 4-socket CPU systems: having more CPUs doesn't mean you can launch a compute/render application and get a magical 4x power boost if you don't know how to set up the software to use all four CPUs, how to configure the system memory management, or how to get the OS to really use all the cores and threads available.
 
Well, scaling of 4-way SLI/CFX has always been pretty limited, and not just by "game support": you need to OC your CPU way higher than most gamers will, given the stability and heat-dissipation demands of long gaming sessions. I didn't have much problem with scaling when I had my CPU under subzero cooling (with a beefy phase-change unit). In fact, you had two cases: either inverse scaling (not supported at all, which in general was also the case for 2-way) or good scaling. That said, you need to know how to work with different profiles, and it has never been easy "out of the box".

When you see gamers and reviewers who for years tried quad CFX/SLI on 1080p monitors, with a mid-range quad-core CPU on air... well, that was funny.

What is strange is that now, with 4K and VR, I'm pretty sure the CPU bottleneck could well be less of a limit than ever.

This said, 3-way SLI and CFX was way simpler to use, and in general scaling was not so bad.

Yeah, totally agree.
3-way and 4-way do scale reasonably well in benchmark-testing competitions, though?
Not my cup of tea, but it seems to be a healthy segment of the PC industry, along with world-record OCing etc. and at events; from the enthusiast perspective anyway.
Which is why I assume Nvidia will continue to support this.
Cheers
 
Just wanted to give a heads-up: a member here has just linked an Nvidia publication showing what looks like a GP102 variant being applied to the Drive PX2.
Page 11 describes a large-die Pascal with more cores than P100 but with less DP.
https://forum.beyond3d.com/threads/nvidia-tegra-architecture.52270/page-183#post-1923510
Yeah, it may be just a concept, but it seems like pretty solid presentation information.
I do think the TFLOPS examples in the presentation could be confusing though, as they are showing FP64 in both examples and not the FP32 that some may expect instead; note the DGX-1 needs 8x P100s to get the 40 TFLOPS FP64.
It also raises the question of what happened to the GP106 being used for that, or whether it was just a placeholder/smokescreen/mistake, either in mentioning GP106 or in what looks like a GP102 variant.

Posting here as I think quite a few would miss where it is currently posted.
Kudos to Itaru.

And it may explain why we are hearing rumours about GP102 arriving earlier than expected; just like the Kepler GK110 GPU, it could fit into the Tesla/Quadro/Titan-Ti range.
Cheers

Edit:
Being a bit dense due to my headache.
It could still be a GP100 variant, given the unusual ratio of FP64 to FP16.
Taking FP32 as half of the FP16 rate would give it a better FP64 ratio than P100's 1:2, which seems wrong.
Maybe they do something unusual there, which is why FP32 is never mentioned in the presentation.

Cheers
 
I remain skeptical. Wasn't PX2 touted as 24 DL-TOPS - not FLOPS? Could be INT8, making 2×8 for the discrete Pascals in PX2 plus 2×4 for the integrated ones plus the CPU cores.

The 8 TFLOPS I'd rather view as FP32, with 2×4 for two GP106, each GP106 having 1,280 ALUs at roughly 1.6 GHz. Why would a Drive PX2 with a Neural Network target have high DP-rate in any case?

For whatever weird reason, 3× INT8 rate does not seem to be totally out of this world.
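
Spelling that reading out in numbers (everything here is just the guess above restated; the 1,280 ALUs and ~1.6 GHz per discrete GPU are assumptions, not confirmed specs):

  // Restating the PX2 guess above as arithmetic -- all hypothetical figures.
  #include <cstdio>

  int main()
  {
      // FP32: two GP106-class discrete GPUs, 1280 ALUs each, ~1.6 GHz, 2 FLOPs per FMA.
      double per_gpu_tflops = 1280 * 2 * 1.6e9 / 1e12;
      printf("2 x GP106 FP32: %.1f TFLOPS\n", 2.0 * per_gpu_tflops);   // ~8.2 TFLOPS

      // DL TOPS: 2 x 8 for the discrete Pascals plus 2 x 4 for the integrated ones.
      printf("DL TOPS (before CPU cores): %d\n", 2 * 8 + 2 * 4);       // 24
      return 0;
  }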
 
I remain skeptical. Wasn't PX2 touted as 24 DL-TOPS - not FLOPS? Could be INT8, making 2×8 for the discrete Pascals in PX2 plus 2×4 for the integrated ones plus the CPU cores.

The 8 TFLOPS I'd rather view as FP32, with 2×4 for two GP106, each GP106 having 1,280 ALUs at roughly 1.6 GHz. Why would a Drive PX2 with a Neural Network target have high DP-rate in any case?
The only one that seems dubious is the 24 TFLOPS; you're right, the live event says 24 DL TOPS while the presentation has it as 24 DL TFLOPS.
The presentation shows the 8 TFLOPS (which has FP64 next to it in the presentation) in the same context as the 8x P100 with 40 TFLOPS a few pages later, and that's based upon FP64.
That would be a big mistake to make, not only in the live presentation where they said 8 TFLOPS but also in the separate presentation linked in the other thread.

So I would say the 8 TFLOPS figure is FP64.
Although it could be a humongous mistake, along with the presentation saying 2 x 3840-core Pascal GPU.
http://emit.tech/EMiT2016/Ramirez-EMiT2016-Barcelona.pdf
That is quite a lot of mistakes to make, so I feel it is correct.
Also the presentation is pretty recent and was shown at EmiT 2016, Day 1 (Thursday 2nd June 2016).

But those figures are definitely a head scratcher.
Cheers
 
Why would a Drive PX2 with a Neural Network target have high DP-rate in any case?
Depends upon the compute requirements of DriveWorks and DriveNet.
The system needs sensing, detection/identification, localisation, and maps.
It is not just about identifying objects, but about calculating distances/speeds/paths of moving objects etc., and processing various data from the sensors, including lidar & radar, spatial information, etc.

Cheers
 
For whatever weird reason, 3× INT8 rate does not seem to be totally out of this world.

I think my headache is clouding my thinking.
Yeah, INT8 is their DL TeraOPS, which makes the maths add up in your context; my thinking was skewed by focusing on the TFLOPS shown in the recent presentation and not thinking straight about the event's DL TOPS.

Well, WTH was Alex Ramirez talking about with his specification, then, presenting an additional FP64 example in the same context in that presentation, sigh.
He is not in marketing but a professor of computer science with a heavy background in scientific/engineering research.
Really weird.
Cheers
 
Detailed import data of GP106
[Image: nVidia-GP106-Lieferungen-Juni-2016-Zauba.vorschau2.png]
A memory size of 3 or 6GB implies that the memory bus is 192 bits wide.
This does not match the Drive PX 2 board, which shows a GP106 with 8 GDDR5 chips and therefore implies a 128-bit wide bus.

One way to reconcile the clues: GP106 could be using asymmetric memory chip sizes or counts to support both 4 GB and 6 GB SKUs with the same GPU. NVidia used that trick before with GF116, GK106, and GK104. Those asymmetric SKUs all had a 192-bit bus. GM206 had a 128-bit bus and did not have any asymmetric SKUs.

192-bit GDDR5 at GTX 1070/RX 480 speeds of 8 GT/s would give GP106 192 GB/s of memory bandwidth. That'd help GP106 compete with the surprisingly wide 256-bit RTG RX 480 with 256 GB/s. GP106 may not need more than 192 GB/s since that already gives it a better bandwidth/compute ratio than both the GTX 1070 and GTX 1080.
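
A quick ratio check on that last claim; the established-card specs are from memory and the GP106 line uses the hypothetical 1280 ALUs at 1.6 GHz with 192-bit 8 GT/s GDDR5, so treat the numbers as approximate:

  // Rough bandwidth-per-TFLOP comparison; specs from memory, GP106 row is hypothetical.
  #include <cstdio>

  static void ratio(const char* name, int alus, double ghz, double bw_gbps)
  {
      double tflops = alus * 2 * ghz / 1e3;   // ALUs * 2 FLOPs/FMA * GHz = GFLOPS; /1000 = TFLOPS
      printf("%-10s %4.1f TFLOPS  %3.0f GB/s  %4.1f GB/s per TFLOP\n",
             name, tflops, bw_gbps, bw_gbps / tflops);
  }

  int main()
  {
      ratio("GTX 1080", 2560, 1.733, 320.0);   // 256-bit GDDR5X @ 10 GT/s
      ratio("GTX 1070", 1920, 1.683, 256.0);   // 256-bit GDDR5 @ 8 GT/s
      ratio("RX 480",   2304, 1.266, 256.0);   // 256-bit GDDR5 @ 8 GT/s
      ratio("GP106?",   1280, 1.600, 192.0);   // hypothetical: 192-bit GDDR5 @ 8 GT/s
      return 0;
  }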
 
It's anyone's guess what the "f" stands for, but GP107 is said to be in direct competition with AMD's Polaris-11 chip.
A total guess to explain the extra GP107 variant: supporting a single HBM2 module would make for a powerful 4 GB discrete mobile GPU, and HBM2 would make the package tiny, lower wattage, and mechanically easier to cool. Not sure the HBM2 cost for a mobile part makes economic/business sense, but it'd be very attractive for a high-end but compact laptop. It'd also be a nice part for a dense 4x or even 8x GRID card using an onboard PLX switch.
 