Nvidia Pascal Speculation Thread

One of the great gifts of hardware.fr is this page, where all recent SMs are laid out with their internal architecture next to each other, and you can immediately see the changes as you mouse over the different links. Compare a GK110 against a GK104 and it's obvious that, in broad strokes, there isn't much architectural difference. There is no reason to believe that something similar can't be done for the Maxwell SM. In other words, I do think that it's mostly a matter of tacking on more FP64 units.
I don't see why the cache should be in any way different, nor why the register file and the way the array is fed would have to change significantly. At best, FP64 will have half the performance of FP32, so the logical way to go about it is to simply use 2 adjacent 32-bit registers and fetch them sequentially. And if Pascal implements FP16 the way it's done in Tegra X1 (see this Anandtech article), it doesn't require major architectural plumbing either.
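To make the FP16x2 idea concrete, here's a minimal CUDA sketch using the half2 intrinsics from cuda_fp16.h (needs a part with native FP16, e.g. Tegra X1; the kernel name and shapes are mine, purely illustrative). Two FP16 values share one 32-bit register and a single instruction operates on both lanes, which is exactly why it needs so little extra plumbing:

#include <cuda_fp16.h>

// Two FP16 lanes packed into each 32-bit register (Tegra X1-style FP16x2).
// One __hfma2 issues a fused multiply-add on both halves at once.
__global__ void axpy_half2(int n, __half2 a, const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);  // 2 FP16 FMAs per instruction
}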


Can they be done simultaneously in the same pipeline? (without the huge performance penalty we see right now)

This is what I was talking about; it took me a while to find the document:

http://docs.nvidia.com/gameworks/co...daexperiments/kernellevel/pipeutilization.htm
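In the spirit of that document, a rough microbenchmark (my own sketch, not NVIDIA's code) would be to run dependent FP32 and FP64 FMA chains in one kernel and compare against variants with either line deleted; if the two precisions issue to independent pipelines, the mixed kernel should run in roughly the max, not the sum, of the two single-type times:

// Dependent FP32 and FP64 FMA chains in one kernel; delete one of the
// two FMA lines to get the single-precision-type baselines.
__global__ void mixed_fma(int iters, float* fout, double* dout)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float  f = (float)tid;
    double d = (double)tid;
    for (int i = 0; i < iters; ++i) {
        f = fmaf(f, 1.000001f, 0.5f);  // FP32 pipe
        d = fma (d, 1.000001,  0.5);   // FP64 pipe
    }
    fout[tid] = f;  // store both results so the compiler keeps both chains
    dout[tid] = d;
}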
 
From WCCFTech: "NVIDIA Updates Pascal GPU Board – Four HBM2 Stacks and Massive Die Previewed Ahead of Launch in 2016, 200 GB/s NVLINK Interconnect."

[Image: NVIDIA-Pascal-GPU-20151.jpg]


WCCFTech points out that this picture is not the same as the previously shown Pascal picture (shown below).

[Image: NVLink.jpg]


EDIT: I tried to estimate the die size by comparing it to a Fiji picture. Due to the nature of the two images, I couldn't get an accurate estimate, but the two do seem to be similar in size. If I had to pick an inequality (or equality), I would say size(this Pascal) < size(Fiji).

Attached is my attempt. Comments/suggestions?
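For what it's worth, the arithmetic behind the estimate as a trivial host program; every pixel value below is a placeholder for whatever you measure off the photos (Fiji's ~596 mm² is the only hard number), and it assumes both photos were first rescaled to a common scale, which is exactly the shaky part:

#include <stdio.h>

int main(void)
{
    const double fiji_mm2    = 596.0;  // known Fiji die area
    const double fiji_px_w   = 410.0;  // hypothetical pixel measurements --
    const double fiji_px_h   = 350.0;  // substitute your own readings
    const double pascal_px_w = 400.0;
    const double pascal_px_h = 345.0;

    // mm^2 per pixel^2, valid only if both images share one scale
    const double mm2_per_px2 = fiji_mm2 / (fiji_px_w * fiji_px_h);
    printf("Pascal estimate: %.0f mm^2\n",
           mm2_per_px2 * pascal_px_w * pascal_px_h);
    return 0;
}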
 

Attachments

  • Processor dies.014.png (839.9 KB)
Looks like an early mechanical sample. The interposer seems to be covered with some thick "mask" for protection.
 
WCCFTech points out that this picture is not the same as the previously shown Pascal picture (shown below).
The original picture has no interposer. So it certainly couldn't have been HBM. It could have been HMC, which would be amusing.

But I think it's just a photochop.
 
I believe at that point they had already publicly stated that they were going to use HBM.

The device was a glued mockup.

This time around without using wood screws?
 
So, what will Pascal actually change?

  • New mixed precision ALU.
    Most probably VLIW4 or VLIW2 for half precision at virtually the same latency as single precision.
    Double precision most likely at quadruple latency, but no longer on dedicated DP ALUs.
    No details released yet, but sounds the most reasonable given the targeted use cases.
  • All new lower memory stack. Meaning HBM, new memory controller, new L2, all new MMU.
  • New NvLink, also implying: Reworked copy engines, probably revised grid management unit.
  • If, but only if, the grid management unit was reworked: async compute also working in DX12.

Stuff probably not changing:
  • Graphics command processor; that thing hasn't been touched since Tesla, by the looks of it.
  • SMX/M/- layout. They would have announced it if they had changed anything there.
    Assuming VLIW4 or VLIW2 for half precision, there is no need to introduce new 16bit lines at all.
    Which means everything can stay pretty much as it is.

Stuff I would personally hope for:
  • Dedicated L1 and LSHM memory. One(!) of the reasons why mixed compute/3D loads still suck on Maxwell.
  • Bigger work distributor. That thing is so tiny in comparison to what AMD has to offer; a crude way to probe it is sketched after this list.
    (Queue depth of 32 in terms of grids in flight, versus a total of 640 grids in flight on AMD's Fiji if you count the GPC plus all ACEs.)
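The probe mentioned in the last bullet could be as crude as this (entirely my own sketch; the spin kernel, stream count and cycle budget are made up): launch trivial grids into independent streams and watch how total runtime scales once you pass the queue depth.

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }  // busy-wait to occupy the SM
}

int main()
{
    const int N = 64;                       // sweep this above and below 32
    cudaStream_t s[N];
    for (int i = 0; i < N; ++i) cudaStreamCreate(&s[i]);
    for (int i = 0; i < N; ++i) spin<<<1, 32, 0, s[i]>>>(10 * 1000 * 1000);
    cudaDeviceSynchronize();                // time this region with events or nvprof
    for (int i = 0; i < N; ++i) cudaStreamDestroy(&s[i]);
    return 0;
}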
 
Where was the new ALU confirmed?
Let me put it like this: They are announcing 4x the half precision throughput. But they can't decrease single precision throughput, or customers would play hangman.

The new manufacturing process also gives them ONLY about 60% more density, and that's if, and only if, they put all the gains from FF+ into size reduction and none into power efficiency.

Which means: HP and SP can only be the very same ALU. And since they also announced increased DP performance, it means they can no longer use dedicated function units for that.

There's simply no die space left to do that. Count in the savings from trashing the GDDR5 controller, and from getting rid of the dedicated DP ALUs, and that's barely enough to double the number of SP ALUs. That's not good enough for the numbers claimed by Nvidia.

Which means mixed mode ALUs, similar to the ones you had in older CPU generations, minus the latch / scheduler since that adds latency and costs more space. Plus they also hired a research team focusing on such ALUs about 1 1/2 years ago, if my memory isn't off.
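The area budget, spelled out as arithmetic (all inputs are this post's own ballpark figures, nothing measured or official):

#include <stdio.h>

int main(void)
{
    const double density_gain = 1.6;  // best case from FF+, all gains to area
    const double sp_target    = 2.0;  // doubling the SP ALU count
    // Relative die area needed for 2x SP if nothing else were removed:
    printf("%.2fx\n", sp_target / density_gain);  // 1.25x, i.e. a ~25% hole
    // That hole has to be paid for by the reclaimed GDDR5 PHY and the
    // dropped dedicated DP ALUs -- hence shared mixed-precision ALUs.
    return 0;
}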
 
  • If, but only if, the grid management unit was reworked: async compute also working in DX12.
Earlier NV slides say straightforwardly that they're still going with just pre-emption, although finer-grained than before.
 
Earlier NV slides say straightforwardly that they're still going with just pre-emption, although finer-grained than before.
That would in fact mean a reworked GMU.

That thing was already supposed to support async compute with seamless interleaving / parallelism at the grid level, and "preemption" only at the SMM/SMX level. The latter because apparently there is a resource conflict between waves emitted by the GPC and regular compute shaders, most likely due to that shared memory block for L1 and LSHM, according to my understanding.

The async disaster with Maxwell originates from the fact that they can't use the GMU for DX12 for undisclosed reasons. So they are currently in the fallback mode dating back to Tesla, where the GPCs need to be turned off to switch into compute mode - which results in full command list level "preemption" and the horrid performance.

It is already working like that if the compute workload is scheduled with CUDA, which can in fact use the GMU. But CUDA isn't DX12; in fact, DX12 compute engines have a few features which CUDA doesn't have.

("Preemption" because it isn't even real preemption. The hardware can't do that for any grid which is already in flight. Less so on the SMM level, they can only wait for completion. They are lying outright on this point.)

Still much worse than AMD though, as AMD can in fact interleave 3D and compute issued workload on every single level in hardware, no exceptions. (Except for the front end with ACEs not handling draw calls.)
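What the current "preemption" actually buys can be seen with CUDA stream priorities, exposed since GK110 (the sketch is mine; grid sizes and cycle counts are arbitrary): the high-priority grid never interrupts blocks already running, it just grabs SMs as the low-priority grid's blocks drain, i.e. wait-for-completion rather than true preemption.

__global__ void spin_blocks(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }  // busy-wait per block
}

int main()
{
    int lo, hi;
    cudaDeviceGetStreamPriorityRange(&lo, &hi);
    cudaStream_t slow, fast;
    cudaStreamCreateWithPriority(&slow, cudaStreamNonBlocking, lo);
    cudaStreamCreateWithPriority(&fast, cudaStreamNonBlocking, hi);

    spin_blocks<<<1024, 128, 0, slow>>>(50LL * 1000 * 1000);  // floods the GPU
    spin_blocks<<<8,    128, 0, fast>>>( 1LL * 1000 * 1000);  // "urgent" grid
    cudaDeviceSynchronize();  // in a profiler, the fast grid only starts
    return 0;                 // as blocks of the slow grid finish
}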
 
Wrong Thread.
 

They seem to be saying it's a lower-end to mid-range option, although it all seems like speculation. GDDR5X sounds very interesting though, if it's really offering 2x the bandwidth for the same speed/bus. It would actually be a pretty decent competitor for HBM2.
 
It would actually be a pretty decent competitor for HBM2.
Not really. Maybe in terms of memory bandwidth when using a 512-bit interface, but not in terms of die space, and most certainly not in terms of energy efficiency. Not to mention that this would probably mean larger PCBs for mid-range than for high-end cards.
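Roughly, the bandwidth math behind that (per-pin rates are ballpark figures for the standards as announced, not any product's spec):

#include <stdio.h>

int main(void)
{
    // bytes/s = bus width in bits / 8 * per-pin data rate
    double gddr5      = 256 / 8.0 * 7e9;   // 256-bit GDDR5 @ 7 Gbps       -> 224 GB/s
    double gddr5x     = 256 / 8.0 * 14e9;  // GDDR5X, doubled prefetch     -> 448 GB/s
    double gddr5x_512 = 512 / 8.0 * 14e9;  // the 512-bit case above       -> 896 GB/s
    double hbm2       = 4096 / 8.0 * 2e9;  // 4 stacks x 1024-bit @ 2 Gbps -> 1 TB/s
    printf("%.0f / %.0f / %.0f / %.0f GB/s\n",
           gddr5 / 1e9, gddr5x / 1e9, gddr5x_512 / 1e9, hbm2 / 1e9);
    return 0;
}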
 