Nvidia Pascal Announcement

Even the revision number seems different: on the "Blue" matte one you have 1540A1, on the other one 1543 (if those are the rev numbers, of course).

Even the second line (the series number, I think) is quite different in nomenclature (NC410.000 versus something like NOP000.L0P).
 
I just realised that for simple shaders without heavy register pressure, the number of warps available per FP32 FMA has doubled (since it's the same number of warps but spread amongst 64 ALUs instead of 128). This means 2x higher latency tolerance not just when register-limited but in *all* cases. This could make it more beneficial for gaming (even for games without much compute).
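Quick back-of-the-envelope on that point. This assumes both the Maxwell SMM and the GP100 SM allow up to 64 resident warps per multiprocessor (which I believe is the case), with the FP32 lane count dropping from 128 to 64:

```python
# Resident-thread-per-lane arithmetic for the latency-hiding argument above.
# Assumed figures: 64 max resident warps per multiprocessor on both
# architectures; 128 FP32 lanes (Maxwell SMM) vs 64 (GP100 SM).
WARP_SIZE = 32

def threads_per_fp32_lane(max_warps, fp32_lanes):
    """Resident threads available to hide latency, per FP32 lane."""
    return max_warps * WARP_SIZE / fp32_lanes

maxwell = threads_per_fp32_lane(64, 128)  # 16 threads per lane
pascal = threads_per_fp32_lane(64, 64)    # 32 threads per lane
print(pascal / maxwell)                   # 2.0 -> latency tolerance doubles
```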

Do we know how HBM2 latency compares to GDDR5? I guess at 1.4GHz it must be quite high in absolute terms, while the increase in GPU clock speed also increases the relative latency in clock cycles. So I wonder how much of the register file increase (and therefore area increase!!!) is just to offset that. I'm sure relative latency tolerance will have improved (no way memory latency has gone up by 2x!) but I am not sure by how much.
 
I just realized, Pascal's multiprocessor has now become very similar to GCN's design -- 64 FP32 lanes, 256KB GPRF, 64KB LDS and a single TMU quad. Well, everything else is still quite different, but an amusing fact nonetheless.
 
The specs look unimpressive for a 610mm2 chip. Perhaps they will make another 600mm2 chip for gaming purposes.

The clockspeed is pretty impressive. 2GHz air-cooled overclocked cards incoming?
Specs may look unimpressive, but it delivers in the only "benchmark" they have released to date - Alexnet fits their narrative.
As a very rough ballpark (calculated using the Titan X) Maxwell M40 manages around 2,700 images/sec, and the new Pascal-cuDNNV5 replacement manages around 4,500 images/sec.

Cheers
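Those ballpark figures work out to roughly a 1.67x throughput gain:

```python
# Quick ratio from the rough Alexnet numbers quoted above.
m40_images_per_sec = 2700    # Maxwell M40 (Titan X-derived estimate)
gp100_images_per_sec = 4500  # Pascal + cuDNN v5
speedup = gp100_images_per_sec / m40_images_per_sec
print(round(speedup, 2))     # ~1.67x
```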
 
One explanation could be that Nvidia is more conservative with HPC and automotive systems.
I don't think we'll see 2GHz air cooled, and it'll be a big stretch with forced cooling. But it's not completely outrageous.
Yeah, and it's worth remembering just how quickly the performance-vs-TDP/heat ceiling is reached at the upper limits of clocking (Maxwell was good, but it lost efficiency once the clocks passed a certain point, and maybe we will see the same again).
Cheers
 
But FinFET has a sharper voltage/frequency curve (a higher knee, but a sharper one) for either foundry process. The big chip gets a higher clock versus 28nm, the smaller chips get a lower clock; that's simply the math of it. And considering it's a new process, the first "big" GPU off it already has disabled modules to increase yields, and the process is already more expensive (per mm^2) than the previous one, this is going to be a very expensive and possibly quite limited card.

What consumer, or otherwise, cards there are will have to wait until June. And since there's no mention of price/availability/etc. in the announcement, only limited guesses can be made.

Where's this math coming from? I doubt any chip at a given die size is going down in clockspeed compared to the previous process, except for product differentiation.
Maybe you meant to imply that the relative increases would be highest on the biggest chips?

Expensive is very relative when you're really talking about the difference of a few percentage points more or less on what's probably a gross margin of 80%.

Not expensive in terms of money but of die size. A chip 30-50% smaller from AMD could boast these SP/DP specs. In theory, of course.

IIRC Nvidia was behind when it came to hardware support for double precision for GPGPU stuff: Kepler could do it at 1/3 of SP rate, while AMD could turn in 1/2 rate with Hawaii.


Yeah, there is a big difference between theory and practice. Still, I hope that Nvidia does a gaming chip. Unless they've got some big plans for DP/DLTOPS(?) GameWorks in the future.
 
Hi guys. Regarding the TDP, I'm thinking that it will only actually reach those levels in DP operations. Remember that on the GK110-based Titan, enabling the 1:3 DP ratio had consequences:

"The penalty for enabling full speed FP64 mode is that NVIDIA has to reduce clockspeeds to keep everything within spec. For our sample card this manifests itself as GPU Boost being disabled, forcing our card to run at 837MHz (or lower) at all times. And while we haven't seen it first-hand, NVIDIA tells us that in particularly TDP constrained situations Titan can drop below the base clock to as low as 725MHz. This is why NVIDIA’s official compute performance figures are 4.5 TFLOPS for FP32, but only 1.3 TFLOPS for FP64. The former is calculated around the base clock speed, while the latter is calculated around the worst case clockspeed of 725MHz. The actual execution rate is still 1/3."

Source

I figure that 1:2 DP ratio must have a larger impact, but shouldn't have an impact under FP32 operation. Thoughts?
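The quoted figures check out against the usual GK110 Titan specs, assuming 2688 FP32 cores, 896 FP64 units (the 1:3 ratio) and 2 FLOPs per FMA per clock:

```python
# Sanity check of the quoted Titan compute figures. Assumed specs:
# 2688 FP32 cores, 896 FP64 units, 2 FLOPs per FMA per clock.
def tflops(units, clock_ghz):
    return units * 2 * clock_ghz / 1000.0  # units * FLOPs/clk * GHz -> TFLOPS

fp32 = tflops(2688, 0.837)  # base clock in DP mode, per the quote
fp64 = tflops(896, 0.725)   # worst-case throttled clock
print(round(fp32, 1), round(fp64, 1))  # ~4.5 and ~1.3 TFLOPS, as quoted
```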
 
I do not see any issues with the Pascal card; there was a chart as part of the presentation showing those performance figures along with clocks, and only Pascal had boost clocks as part of the values, so it seems this is considered.
Cheers
 

I'm not saying there's any issues. What I'm saying is that instead of operating at lower frequencies like they did on Titan, they simply increased the TDP.
 
I am aware of Nvidia's reasoning for GK110's DP mode, but I didn't buy it back then.

It somehow eludes me how you would use more energy with only two thirds of the calculations and data moved around. AFAIAA, the SP units could not be fed by the register file while the DP units were busy, because it was tailored to the latter's needs.
 

So what was the reasoning for lowering clocks so much, then? GK110-based Teslas also operated at 700-750 MHz, no boost. Fermi and Maxwell Teslas were clocked closer to their consumer counterparts.

Just trying to make sense of it all.
 
I have no idea what the actual reason was. I found that OC indeed worked better and more reliably with a Titan in DP mode, because you could set the clock speed without boost interfering.

Probably, it was just precautionary?
 

You could reach higher clocks than boost was reaching in SP mode? Because otherwise that sounds like just placebo to me, no offense.

Anyway, regarding this:

It somehow eludes me, how you should use more energy with only two thirds of the calculations and data moved around.

Didn't dispatch/register file "inefficiencies" basically keep Kepler at 2/3rds of its capacity anyway?
 
I don't know, it just was more stable.
Kepler was at 2/3rds of peak throughput if you actually tried to read from 3 different registers for a, b and c. If you re-used one slot, you could come close to the maximum rate. So, instead of a × b + c = d, you'd have to do something like a × b + a = d.
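A toy model of that operand-bandwidth limit. The 384-reads-per-clock figure below is my assumption, chosen so the register file feeds only 2/3 of the 192 lanes at 3 distinct reads per FMA:

```python
# Toy model of Kepler's SMX operand-bandwidth limit, as described above.
# Assumed: 192 FP32 lanes per SMX; register file delivers 384 operand
# reads per clock (an assumption that reproduces the 2/3-rate behaviour).
FP32_LANES = 192
OPERAND_READS_PER_CLK = 384

def fma_rate(unique_operands_per_fma):
    """FMAs issued per clock, limited by lanes or by operand reads."""
    return min(FP32_LANES, OPERAND_READS_PER_CLK // unique_operands_per_fma)

print(fma_rate(3))  # a*b + c: 128 of 192 lanes busy, i.e. 2/3 of peak
print(fma_rate(2))  # a*b + a: operand re-use, full 192-lane rate
```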
 
Ah sorry, I interpreted your post as meaning they will not hit their higher clock speed for FP64/FP32.
Cheers
 
I don't know, it just was more stable.
Kepler was at 2/3rds of peak throughput if you actually tried to read from 3 different registers for a, b and c. If you re-used one slot, you could come close to the maximum rate. So, instead of a × b + c = d, you'd have to do something like a × b + a = d.

Doesn't that kind of explain this?

with only two thirds of the calculations and data moved around

In both cases only 2/3rds of the calculations (compared to the theoretical maximum) and data are moved around. Right? Although the DP:SP ratio is supposed to be 1:3, doesn't that limitation make it more like 1:2 in practice?

And then, is the data moved around really the same? Doesn't DP increase it even just a little bit? Does the 1:2 rate translate to just requiring half the number of threads or warps (excuse my ignorance), and thus potentially half the required operands, or do you in practice still require a higher number of warps to keep the execution units busy? To me it would look like statistically DP would prompt more memory reads, hence stalls, meaning more than half the work items are required to maintain the same occupancy. (Real sorry if I used the terms wrong; I hope the concept is still understandable.)
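On the raw operand traffic at least, the arithmetic does come out at 2/3 for full-rate DP versus full-rate SP, assuming GK110 chip-level totals of 2880 FP32 lanes and 960 FP64 units with 3 source operands per FMA:

```python
# Operand traffic at full rate: SP vs DP on a full GK110 (assumed totals:
# 2880 FP32 lanes, 960 FP64 units, 3 source operands per FMA).
sp_bytes_per_clk = 2880 * 3 * 4  # 4 bytes per FP32 operand
dp_bytes_per_clk = 960 * 3 * 8   # 8 bytes per FP64 operand
print(dp_bytes_per_clk / sp_bytes_per_clk)  # ~0.67: DP moves 2/3 the data
```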
 
To put the context around how GP100 compares to Titan X in terms of raw metrics, I've put the following numbers together (both based on rated boost clocks):

Pixel Fill Rate: + 37.5%
Texturing: + 60.5%
Compute: + 60.5%
Geometry: + 37.5%
Bandwidth: + 114.5%

This assumes no change in geometry throughput and no increase in ROPs. It's also before efficiency improvements and likely higher desktop clocks. So it certainly shouldn't be a slouch compared to the current generation.
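Two of those deltas can be roughly reproduced from the announced specs. The card figures below are my assumptions (Titan X: 3072 cores at ~1075 MHz boost, 336.5 GB/s; GP100: 3584 cores at 1480 MHz boost, 720 GB/s):

```python
# Rough reproduction of the compute and bandwidth deltas listed above.
# Assumed specs: Titan X 3072 cores @ ~1.075 GHz boost, 336.5 GB/s;
# GP100 3584 cores @ 1.48 GHz boost, 720 GB/s of HBM2.
compute_gain = (3584 * 1480) / (3072 * 1075) - 1
bandwidth_gain = 720 / 336.5 - 1
# Prints roughly +60% and +114%, close to the figures quoted above.
print(round(compute_gain * 100, 1), round(bandwidth_gain * 100, 1))
```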
 
Doesn't that kind of explain this?
In both cases only 2/3rds of the calculations (compared to the theoretical maximum) and data are moved around. Right? Although the DP:SP ratio is supposed to be 1:3, doesn't that limitation make it more like 1:2 in practice?
My hunch is the actually hard part was to design the register file accordingly, and it was specced to fully saturate the 960 DP units (each with 3 64b reads and 1 64b write; maybe the write port could even be a shared port, since the write occurs after the read). With fully independent operands, that would suffice for two SP FMAs instead of one 64b FMA, thus leaving 64 out of 192 SP ALUs running dry.

And then, is the data moved around really the same? DP doesn't increase it even just a little bit? Does 1:2 rate translate to just requiring half the amount of threads or warps (excuse my ignorance), and thus potentially 1/2 the required operands, or do you in practice still require a higher number of warps to keep the execution units busy? To me it would look like statistically DP would prompt more memory reads, hence stalls, meaning more than 1/2 the work items required to mantain same occupancy. (real sorry if I used the terms wrong, I hope the concept is still understandable)
How can it? It's 64 bits per operand instead of 32, but with only 960 instead of 2880 ALUs, so 2/3rds.

---
To add to the reasons why Nvidia would probably not boost in DP mode: chances are that when using DP you're running something scientific, which would create a more constant and, first and foremost, much longer load on the ALUs than the highly variable load of a game or normal application.
 
I really don't get how NV did FP64 on Pascal.

According to some presentation, Pascal can issue a pair of fp16 ops per clock, an fp32 op per clock, and an fp64 op every two clocks.
The devblog still states that there are dedicated fp64 units on Pascal.

If Pascal has only half the fp64 units, and can only issue an fp64 op every two clocks, doesn't this lead to a 1:4 ratio rather than a 1:2 ratio?
Can somebody explain this?
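The two possible readings of the spec sheet, put into numbers (assumed GP100 per-SM figures: 64 FP32 lanes, 32 FP64 units; I'm not asserting which reading is correct):

```python
# The two readings of "half the fp64 units, one fp64 op every two clocks".
# Assumed per-SM figures for GP100: 64 FP32 lanes, 32 FP64 units.
def dp_to_sp_ratio(dp_units, sp_units, dp_issue_rate, sp_issue_rate=1.0):
    """DP:SP throughput as (units x issue rate) on each side."""
    return (dp_units * dp_issue_rate) / (sp_units * sp_issue_rate)

print(dp_to_sp_ratio(32, 64, 0.5))  # half units AND half issue rate -> 1:4
print(dp_to_sp_ratio(32, 64, 1.0))  # half units at full issue rate  -> 1:2
```

So the advertised 1:2 only works out if the 32 FP64 units per SM can accept an instruction every clock; the two readings differ by exactly that factor.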
 