Nvidia’s architects, on the other hand, chose to focus on increasing performance in the Pascal GPU architecture, which led to greater design complexity, a higher transistor count, and a larger die size for the “data center” version of Pascal. With over 15 billion transistors and a die size of 610 mm^2, this first Pascal GPU is one of the largest processors ever manufactured, and the largest so far in a 16 nanometer FinFET process.
The Pascal GPU implements a scalable compute pipeline. Nvidia organizes its CUDA cores into Streaming Multiprocessors, or SMs, which gives software developers a manageable unit of parallel processing resources to code and debug. Pascal’s 32-bit single precision floating point (FP32) core design can perform two half precision floating point (FP16) operations simultaneously at the same clock speed as one FP32 operation, using the same compute path. Pascal implements FP32 and double precision floating point (FP64) units in a 2:1 ratio, so in the time one FP64 operation executes, the Pascal architecture can execute four simultaneous FP16 operations. However, only one type of FP instruction – FP16, FP32, or FP64 – may be executed simultaneously within a single Pascal SM.
Note that Nvidia’s FP16 compute format is fully compliant with IEEE 754-2008, using round-to-nearest-even for all arithmetic. FP16 fully supports subnormal values and they run at the same speed as normalized values.
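To make the packed FP16 path concrete, here is a minimal CUDA sketch of FP16x2 arithmetic using the half2 intrinsics from cuda_fp16.h. The kernel and variable names are illustrative, not from the article, and the code assumes a Pascal-class build target (e.g. nvcc -arch=sm_60).

```cuda
// Minimal sketch of packed FP16 ("half2") arithmetic; requires a GPU with
// native FP16 support (compute capability 5.3+, e.g. GP100).
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Fill x and y with test values, two FP16 lanes per __half2 element.
__global__ void init(__half2 *x, __half2 *y, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        x[i] = __floats2half2_rn(1.0f, 2.0f);
        y[i] = __floats2half2_rn(0.5f, 0.5f);
    }
}

// One __hfma2 instruction performs two FP16 multiply-adds, which is how
// Pascal reaches twice the FP32 rate through the same compute path.
__global__ void axpy_fp16x2(const __half2 *x, __half2 *y, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a = __floats2half2_rn(3.0f, 3.0f);
    if (i < n2) y[i] = __hfma2(a, x[i], y[i]);     // y = a*x + y, two lanes at once
}

// Unpack one result to FP32 so the host can print it.
__global__ void readback(const __half2 *y, float2 *out) {
    *out = __half22float2(y[0]);
}

int main() {
    const int n2 = 1 << 19;                        // 2^19 half2 elements = 2^20 FP16 values
    __half2 *x, *y;
    float2 *out;
    cudaMalloc(&x, n2 * sizeof(__half2));
    cudaMalloc(&y, n2 * sizeof(__half2));
    cudaMallocManaged(&out, sizeof(float2));
    init<<<(n2 + 255) / 256, 256>>>(x, y, n2);
    axpy_fp16x2<<<(n2 + 255) / 256, 256>>>(x, y, n2);
    readback<<<1, 1>>>(y, out);
    cudaDeviceSynchronize();
    printf("y[0] = (%f, %f)\n", out->x, out->y);   // expect (3.500000, 6.500000)
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```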
Also new to the Pascal architecture is unified memory spanning CPU and GPU physical and virtual memory. A 49-bit virtual address space enables GPU and CPU memory to exist in a single address space, while a hardware “page migration engine” globally manages page faults across the unified memory space.
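As a rough illustration of what this looks like to a programmer, the sketch below (hypothetical names, standard CUDA runtime calls) allocates a single managed buffer that both the CPU and the GPU touch through the same pointer; on Pascal, the page migration engine services the resulting faults and moves pages on demand.

```cuda
// Minimal Unified Memory sketch; assumes a Pascal-class GPU so that page
// faults are handled by the hardware page migration engine.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one pointer, one address space
    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU touches the pages first
    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n); // GPU faults migrate pages to GPU memory
    cudaDeviceSynchronize();
    printf("data[0] = %f\n", data[0]);              // CPU access migrates the page back
    cudaFree(data);
    return 0;
}
```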
Nvidia also added “compute preemption” to the Pascal architecture – which is a fundamental enabling feature for an operating system or virtual machine to control task execution on Pascal GPUs. Compute preemption coupled with unified memory across CPUs and GPUs will enable Pascal generation GPUs to look and act like a virtualized, composable pool of physical compute resources. The result is that HPC customers can start talking about GPUs in the context of sharable cloud resources, just like CPUs. This will have a huge impact from a software development point of view.
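There is no simple API for exercising preemption itself (the driver and scheduler handle that), but the CUDA 8 runtime does expose a device attribute that lets software check whether a GPU supports it; a minimal sketch:

```cuda
// Query whether device 0 reports compute preemption support (CUDA 8+ runtime).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int preemptible = 0;
    cudaDeviceGetAttribute(&preemptible, cudaDevAttrComputePreemptionSupported, 0);
    printf("Compute preemption supported: %s\n", preemptible ? "yes" : "no");
    return 0;
}
```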
...
The P100 module is as much of an innovation as HBM memory. Instead of mounting the CoWoS package to an add-in board with a standard PCIe Gen3 x16 interconnect, Nvidia surface-mounted the CoWoS package to a dedicated module with its own connectors. This enabled Nvidia to:
- Vastly increase the number of signal pins to support four NVLink connections instead of only 16 PCIe Gen3 lanes
- Lower the profile of the host chassis
- Supply 300W of power through the motherboard instead of via separate “over the top” wires for power delivery
- Move the cost of delivering 300W to the CoWoS package onto the P100 module itself, in the form of the large, dedicated power management chips (PMICs, shown in the photo)
- Eliminate over-the-top power cabling, allowing mechanically simple, large passive heat sinks for both the CoWoS package and the PMICs
Air cooling these 300W modules at 35°C ambient temperature is a significant feat of design.