I'm having trouble following your train of thought as well.

OK. So to narrow it down, because I'm still confused about your position: either you don't think locality has a great effect (from your last post, this does not seem likely), or you don't think DP would decrease locality significantly (to the point of increasing power by 10%, that is). I'd bet it's the latter, for obvious reasons. But I can't see how it wouldn't: there are half as many operands to choose from. And I don't agree that the ALUs are going to be sitting around doing nothing while a higher-level memory access is in flight; surely they'll find another thread/warp with a higher level of residency to work on...
It might just as well be the delicate timing involved in feeding the DP ALUs serially, which, as long as the hardware doesn't have 64-bit-wide data paths, takes twice as long as a regular (SP) operation.
The simplest way to think about FP16 instructions is that they really are instructions using 32-bit wide registers, just like any FP32 or INT32 operation.
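A minimal CUDA sketch of what that looks like in practice (assuming an sm_53+ GPU and the standard cuda_fp16.h intrinsics; the kernel name is just for illustration): two FP16 values are packed into one 32-bit register as a __half2, so the FP16 FMA below reads and writes full 32-bit registers, exactly the width an FP32 FMA would use.

#include <cuda_fp16.h>

// Each __half2 holds a pair of FP16 values packed into one 32-bit register.
__global__ void fp16_fma(const __half2* a, const __half2* b,
                         const __half2* c, __half2* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One instruction performs two FP16 fused multiply-adds on 32-bit-wide operands.
        d[i] = __hfma2(a[i], b[i], c[i]);
    }
}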
I don't think the GP100, or rather the boards, even HAVE a DisplayPort, or any display output at all.
I'm having trouble following your train of thought as well.
To make things clear:
- I’m talking about the reason why Geforce GTX Titan is not allowed to boost in DP-mode and has a lower baseclock there as well
- I’m arguing that this is more of a precautionary measure than an actual problem with power consumption in DP mode
- I reason that this is because the chip has to feed only 1/3rd the number of ALUs compared to SP-Mode on twice as much/wide/many data paths
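To put rough numbers on that last point, here is a back-of-the-envelope sketch (my own figures, assuming GK110's 192 SPFP and 64 DPFP ALUs per SMX and an FMA reading three source operands per instruction):

#include <cstdio>

int main()
{
    const int sp_alus = 192, dp_alus = 64;   // per GK110 SMX
    const int operands_per_fma = 3;          // a, b, c
    const int sp_bytes = 4, dp_bytes = 8;    // FP32 vs FP64 operand size

    // Peak operand traffic per clock if every ALU issued an FMA.
    int sp_traffic = sp_alus * operands_per_fma * sp_bytes;   // 2304 bytes/clk
    int dp_traffic = dp_alus * operands_per_fma * dp_bytes;   // 1536 bytes/clk

    printf("SP: %d bytes/clk, DP: %d bytes/clk\n", sp_traffic, dp_traffic);
    // DP peak moves only 2/3 of the SP operand traffic: 1/3 of the ALUs
    // on operands that are twice as wide.
    return 0;
}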
Since 1/3rd of ALUs are idling most of the time, for all intents and purposes, on DP-mode the chip is feeding half the number of ALUs compared to SP-Mode on twice as much/wide/many data paths. So the amount of work done and data moved is "at the very least" the same in both SP and DP.
No, and that line alone is without the context I already gave multiple times, which you acknowledge first in the part below only to immediately discard it.

But do you see how your posts are conflicting? For example, you insist on this line:
True, the DPFP ALUs should be (a bit?) more power hungry than the SP ones. I'll throw another thing in the mix then: rasterizers, geometry, tessellators, ROPs. All mostly unused in what's a typical application with DPFP, I'd wager. Plus I am quite positive that Ld/St and SFUs cannot be tasked in parallel with the DPFP ALUs. Edit: This is probably not true, since the Kepler Whitepaper says: "Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110/210 allows double precision instructions to be paired with other instructions." (just not necessarily all other instructions). I don't know off the top of my head whether or not the INT functions might require larger-than-minimally-viable adders and multipliers, as well as data paths, from the SPFP ALUs, but I would tend to believe so.

But you've made it abundantly clear, and it's a point that has been made multiple times here on B3D by multiple posters, that Kepler can't actually use all of its ALUs most of the time. And the times in which it can, it's because it can do so without increasing the data being moved around. Hence Kepler's ALU utilization is limited entirely by how much data can be moved around. Since 1/3rd of ALUs are idling most of the time, for all intents and purposes, on DP-mode the chip is feeding half the number of ALUs compared to SP-Mode on twice as much/wide/many data paths. So the amount of work done and data moved is "at the very least" the same in both SP and DP. Now, correct me if I'm wrong, but DP ALUs consume a little more than 2x as much as SP ALUs; factor in a little bit of extra memory traffic due to decreased locality, and I don't see how that wouldn't result in slightly higher power consumption...
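In the same back-of-the-envelope style, here is what the reasoning in the last paragraph comes out to (again assuming 192 SPFP / 64 DPFP ALUs per GK110 SMX; the ~2/3 SP utilisation and the "a little more than 2x" per-op energy are the figures claimed in the posts here, not measurements):

#include <cstdio>

int main()
{
    const double sp_alus = 192, dp_alus = 64;
    const double sp_util = 2.0 / 3.0;                 // "can't feed more than 2/3rds"
    const double effective_sp = sp_alus * sp_util;    // 128 ALUs actually being fed

    // Operand traffic per clock, 3 source operands per FMA.
    const double sp_traffic = effective_sp * 3 * 4;   // FP32: 1536 bytes/clk
    const double dp_traffic = dp_alus * 3 * 8;        // FP64: 1536 bytes/clk

    // Per-op energy assumed ~2.2x for DP vs SP ("a little more than 2x").
    const double rel_alu_power = (dp_alus * 2.2) / effective_sp;   // ~1.1

    printf("traffic: SP %.0f vs DP %.0f bytes/clk, relative ALU power %.2f\n",
           sp_traffic, dp_traffic, rel_alu_power);
    return 0;
}

Under those assumptions the data moved is indeed the same and the ALU power lands roughly 10% higher in DP, which is the ~10% figure mentioned earlier in the thread.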
No, and that line alone is without the context I already gave multiple times, which you acknowledge first in the part below only to immediately discard it.
What's true is that Kepler is not able to feed more than 2/3rds of its SPFP-ALUs in the worst case. You, otoh, make this the default behaviour („for all intents and purposes ...“), which it is definitely not.
Kepler was at 2/3rds of peak throughput if you actually tried to read from 3 different registers for a, b and c. If you re-used one slot, you could come close to the maximum rate. So, instead of a × b + c = d, you'd have to do something like a × b + a = d.
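In CUDA terms, the difference is roughly the one between the two sketch kernels below (kernel names are mine, purely for illustration): the first FMA reads three distinct source registers per thread, while the second re-uses one operand, which is the pattern that lets Kepler approach its peak rate.

__global__ void fma_three_sources(const float* a, const float* b,
                                  const float* c, float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], c[i]);   // a × b + c: three distinct source registers
}

__global__ void fma_reused_source(const float* a, const float* b,
                                  float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], a[i]);   // a × b + a: one source register re-used
}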
To add to the reasons why Nvidia would probably not boost in DP mode: chances are that when you're using DP you're running something scientific, which creates a more constant and, first and foremost, much longer load on the ALUs than the highly variable load of a game or a normal application.
This is actually an assumption we have been making for some time, and it is obviously becoming reality. Although we did not have official confirmation from Nvidia, several external sources confirmed that our intuition was correct and that there will be no GeForce based on the first big Pascal GPU.
This is not to say that Nvidia is abandoning gamers, on the contrary! A dedicated GPU, clearly aimed at the high end, is being finalized and should be announced soon according to our information. It should make do with GDDR5 or GDDR5X and devote all of its transistors to real-time rendering.
Not sure I buy this... I mean, why bother with texturing at all if it is strictly an HPC part?
I think the non-HBM2 cards are HBM1 prototype units. I suspect that NVidia decided to "pack" the chips like that so that they could apply active cooling according to the spec for HBM2.

The weird thing is that on some P100 modules the HBM stacks are obviously smaller than on the others, and the smaller ones don't reflect the light:
[...]
We know that HBM2 is supposed to have a larger die area than HBM1, so what's the deal with the disparity here? Is Nvidia using mechanical samples with dummy filler for the missing memory stacks to cover all eight sockets for this demo unit?
The yields must truly be down the drain, if this is the case.