Nvidia Pascal Speculation Thread

The SFUs are dedicated to handling higher-order, non-pipelined operations like RCP, RSQ, SIN, COS, MOV, attribute interpolation and similar. In some past NV architectures, the SFU was able to execute a parallel FMUL op under certain conditions.

Okay, I didn't express myself precisely: what about the functions in the SFU? Are they still the same, and can they run ops without "closing" the affected pipe for the duration?

The SFUs are still there and haven't really changed. There are 8 per SMM partition.

http://images.anandtech.com/doci/7764/SMMrecolored.png

As said above: didn't mean the numbers, but thx. :)
 
This slide:
VrptSfb.jpg

seems to imply a 4:2:1 ratio for HP : SP : DP, optimized so that all precisions make maximum use of on-chip throughput.
 
Well, one could theoretically interpret that slide as half-precision having 4x throughput of single-precision (with double-precision throughput undefined), but that would be really really odd...! :p

I'm sure DP:SP:HP won't REALLY be 1:2:4 in consumer boards though, because this is evil fucking nvidia we're talking about here. They'll find a way to knock it down to like, 1 op per 32 cycles or some shit like that in the Geforce line.
 
It's a single GPU solution ....

Probably, entirely within bounds of reason if it's 1:2 as well. Theoretically a Fury X would already get 4 teraflops DP if it had the hardware to go 1:2. Trouble is, an increase to 8 teraflops SP is, theoretically, only a 30% increase from a Titan X. That would normally be plenty big, but not for a jump as big as is expected from 28nm to 16nm. I'd have expected a jump of 50% or more... maybe their 1:2 ratio really does take a lot of silicon though.
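For reference, the back-of-the-envelope math behind those figures (a quick sketch only; shader counts and base clocks are the commonly listed specs, the usual 2 FLOPs per shader per clock is assumed, and the 1:2 DP rate is hypothetical):

Code:
# Theoretical peak: shaders * 2 FLOPs/clock * clock (GHz) / 1000 -> TFLOPS
def peak_tflops(shaders, clock_ghz, dp_ratio=0.5):
    sp = shaders * 2 * clock_ghz / 1000.0
    return sp, sp * dp_ratio

# Fury X: 4096 shaders @ ~1.05 GHz; Titan X (Maxwell): 3072 shaders @ ~1.0 GHz base
fury_sp, fury_dp = peak_tflops(4096, 1.05)   # ~8.6 TF SP, ~4.3 TF DP *if* it were 1:2
titan_sp, _ = peak_tflops(3072, 1.0)         # ~6.1 TF SP at base clock
print(f"Fury X: {fury_sp:.1f} TF SP, hypothetical 1:2 DP: {fury_dp:.1f} TF")
print(f"8 TF SP over Titan X base: {8.0 / titan_sp - 1:.0%}")  # ~30%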
 
Why doesn't it match HBM2 bandwidth then?
If the DP rate were 1/4 with 2 GPUs, bandwidth would be rated at 2 TBytes/second with 2x HBM2. K80 bandwidth perfectly matches its flops in the charts on page 7 - http://www.ecmwf.int/sites/default/files/HPC-WS-Posey_0.pdf
I didn't know about this PDF. I just asked SMCI at SC15, and the guy at their booth said they will build some servers for the next dual-GPU Tesla, and it will provide 4 TFLOPS DP.
 
Probably, entirely within bounds of reason if it's 1:2 as well. Theoretically a Fury X would already get 4 teraflops DP if it had the hardware to go 1:2. Trouble is, an increase to 8 teraflops SP is, theoretically, only a 30% increase from a Titan X. That would normally be plenty big, but not for a jump as big as is expected from 28nm to 16nm. I'd have expected a jump of 50% or more... maybe their 1:2 ratio really does take a lot of silicon though.
GM200 is a 600 mm2 chip on a very mature and relatively inexpensive process. I wonder if Nvidia is being more conservative with GP100, since it is their first 16nm product.
 
Probably, entirely within bounds of reason if it's 1:2 as well. Theoretically a Fury X would already get 4 teraflops DP if it had the hardware to go 1:2. Trouble is, an increase to 8 teraflops SP is, theoretically, only a 30% increase from a Titan X. That would normally be plenty big, but not for a jump as big as is expected from 28nm to 16nm. I'd have expected a jump of 50% or more... maybe their 1:2 ratio really does take a lot of silicon though.
The GK104 SMX is 16 mm^2 and the GK110 SMX is 22-23 mm^2 (my estimates), which is a 40% increase in SMX area. To my understanding there are other changes between the two chips besides DP rate, but I still expect a decent increase in area from a GM20x SMM to a Pascal 1:2 DP SM, ignoring the process change.
 
The GK104 SMX is 16 mm^2 and the GK110 SMX is 22-23 mm^2 (my estimates), which is a 40% increase in SMX area. To my understanding there are other changes between the two chips besides DP rate, but I still expect a decent increase in area from a GM20x SMM to a Pascal 1:2 DP SM, ignoring the process change.

Yet, if we just go by DP and SP rate, a Fury X replacement could increase by just 25% to, say, 10 teraflops SP, have the same room to increase DP to 1:2 on their high-end compute-focused cards as Nvidia does (as both Fury X and Titan X are around the same PCB size, with the same gaming-first, DP-almost-not-at-all focus), and still come out ahead of Nvidia by 25% in DP rate.

Nvidia, with its CUDA environment and 12GB of RAM, did well in compute with the Titan X, and thanks to some large bottleneck in the Fury X's design it did better in gaming as well. But assuming any such bottleneck is removed in a new high-end AMD GPU design, they'd have a clear performance advantage for gaming, and since both vendors are using HBM2 with presumably the same yields, they'd end up with the same RAM size for both chips. So 4 teraflops for Nvidia's highest-end compute part seems, at least in speculation, a comparatively low target. Especially if AMD manages to hit the "2x perf per watt over the previous generation" that they've stated, which would, at a 250 watt TDP, put their top-end GPU at 67% higher performance than a Fury X. I'm discounting a huge share of this, assuming they mean gaming performance only and that a large part of it comes from no longer having a design bottleneck choking performance. But even assuming that, it would seem AMD might have a very large advantage.
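For what it's worth, here's one way the 67% figure can fall out (a rough sketch; the ~300W reference point for Fury X is my assumption, not something stated above):

Code:
# "2x perf per watt" scaled to a 250W part, against Fury X's ~8.6 TF SP
fury_tflops = 8.6
fury_watts = 300.0    # assumed reference board power; ~275W would give ~82% instead
new_watts = 250.0
new_tflops = 2.0 * (fury_tflops / fury_watts) * new_watts   # ~14.3 TF
print(f"~{new_tflops:.1f} TF SP, i.e. {new_tflops / fury_tflops - 1:.0%} over Fury X")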
 
GM200 is a 600 mm2 chip on a very mature and relatively inexpensive process. I wonder if Nvidia is being more conservative with GP100, since it is their first 16nm product.
NVidia's first 40nm chip was the tiny 57mm^2, 16-core GT218 in November 2009. The 529mm^2 GF100 didn't ship until March 2010, 6 months later.
NVidia's first 28nm chip was GK104 in March 2012. GK110 didn't come until November 2012 (in the form of Tesla K20), 7 months later.

So it would be reasonable to see NVidia launching 14nm with GP104, and GP100 following 6 months later.
The introduction of HBM2 memory could disrupt this, though. We don't know if GP104 will have HBM2 or not, and if it doesn't, perhaps NVidia would launch the flagship first to boost the Pascal performance marketing halo.
 
GM200 is a 600 mm2 chip on a very mature and relatively inexpensive process. I wonder if Nvidia is being more conservative with GP100, since it is their first 16nm product.

It'll be really disappointing if it's only 8TF. With 1TB/s of memory bandwidth (3x Titan X), 8TF (the Fury X already has that beat) sounds pretty pathetic.
 
NVidia's first 40nm chip was the tiny 57mm^2, 16-core GT218 in November 2009. The 529mm^2 GF100 didn't ship until March 2010, 6 months later.
NVidia's first 28nm chip was GK104 in March 2012. GK110 didn't come until November 2012 (in the form of Tesla K20), 7 months later.

So it would be reasonable to see NVidia launching 14nm with GP104, and GP100 following 6 months later.

No delay this time. The GP100 will come nearly at the same time as the GP104 because Nvidia needs it for HPC and also to compete against Intel.
 
It'll be really disappointing if it's only 8TF. With 1TB/s of memory bandwidth (3x Titan X), 8TF (the Fury X already has that beat) sounds pretty pathetic.
1. Aren't the figures for Tesla parts generally quoted at base clocks? The K80, for example, ships with a meager 560MHz base clock for an FP64 theoretical throughput of 1.87 TF (close to the graph/slide shown earlier), yet it boosts to 875MHz (theoretical max of 2.91 TF) depending upon thermal/power limits (spec PDF). The K40 plot point in the graph/slide/PDF posted earlier also pertains to the base clock (745MHz) and not the 810/875MHz boost.
B0laHsv.jpg

2. I'd be wary of translating Tesla numbers to the GeForce product line. GK110B in Tesla has pretty conservative clocks (the K40 has a 745MHz base, 810 or 875MHz boost as I just noted); the GeForce GK110B's run somewhat higher, at 889MHz base and 980MHz nominal boost for the Titan Black. Incidentally, even though boost is disabled for the Titan Black when executing FP64 at its native 1:3 ratio, the Titan Black's theoretical throughput is higher (1.7TF) than the K40's (1.43TF). The FP32 figures of course are vastly different: the K40's specification is 4.29TF, while the Titan Black's nominal spec is 5.1TF, which would rise to 5.6 - 5.9TF with actual observed boost applied.
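To make the base-vs-boost arithmetic concrete, here's a quick sketch of where those theoretical numbers come from (GK110/GK210 does FP64 at 1:3 of FP32; core counts and clocks are the published specs):

Code:
def kepler_tflops(cores, clock_mhz, dp_ratio=1/3):
    sp = cores * 2 * clock_mhz / 1e6   # TFLOPS
    return sp, sp * dp_ratio

# K40: 2880 cores; K80: 2 x 2496 cores; Titan Black: 2880 cores (boost off with full-rate FP64)
for name, cores, mhz in [("K40 base", 2880, 745), ("K40 boost", 2880, 875),
                         ("K80 base", 2 * 2496, 560), ("K80 boost", 2 * 2496, 875),
                         ("Titan Black", 2880, 889)]:
    sp, dp = kepler_tflops(cores, mhz)
    print(f"{name:12s} {sp:.2f} TF SP  {dp:.2f} TF DP")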
 
No delay this time. The GP100 will come nearly at the same time as the GP104 because Nvidia needs it for HPC and also to compete against Intel.

Reading this makes me think that the GP100-based gaming GPU will still come 6-8 months after the GP104-based one (like GK110 did back then)... but it will be available for the HPC/workstation market before that.

Or do you think they will release the GP100 gaming GPU at the same time as the GP104 one?
 
*Very long*

Huh, different reporting methods are entirely possible. I would have assumed Nvidia would put its absolute best foot forward for PR purposes, but maybe they only quote base clock for compute-centric cards out of perf/watt concerns, etc.

Regardless, GP100 will almost certainly come out as a compute/professional card first, with a slightly binned consumer version coming later. That's been the case for both the Titan and Titan X, so I don't see a reason it wouldn't continue. WHEN it will come out will be dictated far more by yield numbers than by competition concerns. As the biggest chip it'll need the most reliable yields, unless Nvidia wants to throw away a huge number of defective cards just for the sake of market share. I'm sure, with their profits this year, they'll feel confident in waiting a few months for yields to improve if that's what it takes. They've already pulled back their promises for RAM over yield concerns with HBM2, after all.
 
I remember the 2.91 TFLOPS figure for the K80 being thrown around in the media, which was decidedly at the boost clock.

http://images.anandtech.com/doci/8729/TK80.jpg

Also see the table here,

http://www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu

:LOL:

But in those graphs I haven't seen a discrepancy like the one above. Here's one with both the K80 and K40 at boost clocks,

http://cdn.wccftech.com/wp-content/...d-computing-path-forward-page-004-635x357.jpg

Maxwell-based Tesla cards don't seem to be that handicapped relative to their GeForce versions; in fact the Titan X-based Tesla hits even higher numbers, 7TF vs. 6.7 from TR's review. Some of that might be due to having the same amount of VRAM (besides having DP at the same rate), but then HBM would allow better clocks as well. The base clock for the M40 is 948MHz according to Nvidia's site (though the boost is lower than what AT reported, but still higher than the TX's),

http://images.nvidia.com/content/tesla/pdf/tesla-m40-product-brief.pdf

A 4096-shader part with a 1GHz base clock, 8TF SP and 4TF DP seems a good bet. It'd also make for much better marketing to compare at base clock in those graphs vs. the K80.
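That guess is easy to sanity-check with the same 2-FLOPs-per-shader-per-clock arithmetic (the shader count, clock and 1:2 DP rate are all speculative):

Code:
shaders, base_ghz = 4096, 1.0
sp = shaders * 2 * base_ghz / 1000.0   # ~8.2 TF SP
dp = sp / 2                            # ~4.1 TF DP at 1:2
print(f"{sp:.1f} TF SP / {dp:.1f} TF DP at base clock")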
 
.... but then HBM would allow better clocks as well. The base clock for the M40 is 948MHz according to Nvidia's site (though the boost is lower than what AT reported, but still higher than the TX's)


HBM doesn't automatically allow higher clocks; higher clocks come from the design and the node, not from the choice of VRAM. It might allow more silicon to be spent on the shader array, though.
 
Huh, different reporting methods are entirely possible. I would have assumed Nvidia would put its absolute best foot forward for PR purposes, but maybe they only quote base clock for compute-centric cards out of perf/watt concerns, etc.

Regardless, GP100 will almost certainly come out as a compute/professional card first, with a slightly binned consumer version coming later. That's been the case for both the Titan and Titan X, so I don't see a reason it wouldn't continue. WHEN it will come out will be dictated far more by yield numbers than by competition concerns. As the biggest chip it'll need the most reliable yields, unless Nvidia wants to throw away a huge number of defective cards just for the sake of market share. I'm sure, with their profits this year, they'll feel confident in waiting a few months for yields to improve if that's what it takes. They've already pulled back their promises for RAM over yield concerns with HBM2, after all.


Well, the other side of bringing out the professional series first is that they do have competition from Knights Landing, so I think it's right that we will see those first, before the consumer cards.
 
HBM doesn't automatically allow higher clocks
Maybe it allows for lower clocks, in a way, since the RAM now sits right under the heatspreader with the GPU core, meaning it gets grilled to whatever temp the core is running at... I'm no expert on these things, but hasn't it been said that DRAM performance characteristics are affected by its operating temperature?
 