Nvidia Pascal Announcement

Not sure I buy this.... I mean why bother with texturing at all if it is strictly an HPC part.
Perhaps there are still algorithms where texture-based memory accesses on contemporary NVidia chips outperform plain global-memory accesses. This has certainly been true in the past.

Also, given NVidia's emphasis on image processing for deep learning (to the extent that they seem to have caught up with GCN's image-processing ops for working on 8-bit data, for example), texture units provide very fast access paths for sub-32-bit data types.
 
Not sure I buy this.... I mean why bother with texturing at all if it is strictly an HPC part.

Dumb question: considering TMUs are integrated into SMMs, would it have been relatively easy to remove them?
 
There are plenty of examples in the CUDA SDK that use textures: to read images and 3D textures, to use the interpolation HW, etc.
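For concreteness, here is a minimal sketch (not one of the actual SDK samples; names and sizes are made up) of the kind of use being described: a 2D texture object sampled through the bilinear filtering hardware. The same path also covers sub-32-bit formats, e.g. uchar data read back as normalized floats via cudaReadModeNormalizedFloat.

Code:
#include <cuda_runtime.h>
#include <vector>

__global__ void sample(cudaTextureObject_t tex, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        // Sample between texel centres so the filtering hardware blends neighbours.
        out[y * w + x] = tex2D<float>(tex, x + 0.75f, y + 0.75f);
}

int main()
{
    const int w = 256, h = 256;
    std::vector<float> host(w * h, 1.0f);          // dummy image

    cudaArray_t arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, host.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode     = cudaFilterModeLinear;       // the interpolation HW
    td.readMode       = cudaReadModeElementType;    // NormalizedFloat for 8-bit formats

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float *out;
    cudaMalloc(&out, w * h * sizeof(float));
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    sample<<<grid, block>>>(tex, out, w, h);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(out);
    return 0;
}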
 
I make it the default behavior because, per your own words, for all intents and purposes and with regard to power consumption, it is the default behavior. Unless what you said here is false, that is:
Kepler was 2/3rds of peak throughput if you actually tried to read from 3 different registers for a, b and c. If you re-used one slot, you could come close to maximum rate. So, instead of a × b + c = d, you'd have to do something like a × b + a = d
I bolded the relevant part for your convenience. And that largely makes the rest of your lengthy posting obsolete, since you insist on calling the worst case a typical case. And with that, I rest my case - I have nothing to gain by convincing you, especially since I by no means see my opinion as set in stone or the absolute truth.
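Purely to illustrate the two operand patterns in that quote (a made-up kernel, not a benchmark from this discussion; whether a, b and c actually land in conflicting register banks is decided by the compiler and only visible in the SASS):

Code:
#include <cuda_runtime.h>
#include <vector>

// Contrasts the two FMA forms from the quote above.
__global__ void fma_forms(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = in[i];
        float b = a * 0.5f;
        float c = a * 0.25f;
        float worst = fmaf(a, b, c);   // a * b + c: three distinct source operands
        float best  = fmaf(a, b, a);   // a * b + a: the addend slot re-uses a
        out[i] = worst + best;
    }
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    fma_forms<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}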

How are rasterizers, geometry units, tessellators, ROPs, etc. at all relevant? We are talking about FP32 vs. FP64 compute here, afaik. What are those used for in FP32 computing that couldn't be used when computing FP64?
But of course they could. I just have yet to see an example where they actually are.
 
Dumb question: considering TMUs are integrated into SMMs, would it have been relatively easy to remove them?
I already wondered why Nvidia revived the term „TPC“ for Pascal. Haven't seen it since GT200, IIRC. In Fermi an SM took over the role of a TPC for all intents and purposes. Maybe they're not so integrated any more and are one of the architecture's variable parts?

And btw: How many transistors does GP100-Pascal have?
15.3 bln as said in the devblog:
https://devblogs.nvidia.com/parallelforall/inside-pascal/

18 bln as said in the Pascal architecture overview?
https://developer.nvidia.com/pascal

or 150 (!) bln, as stated by the GPU-architecture microsite:
http://www.nvidia.com/object/gpu-architecture.html

While the latter has to somehow include the DRAM dies, it also clearly states: "With 150 billion transistors built on bleeding-edge 16 nanometer FinFET fabrication technology, Pascal GPU is the world's largest FinFET chip ever built"
 
My theory is that Nvidia always disables multiprocessors at TPC level, whether a TPC is defined by a single (Fermi ~ Maxwell) or multiple (G80 ~ GT200) multiprocessors.
Now in Pascal, a TPC consists of two MPs, so salvage parts should scale down by disabling pairs of multiprocessors.
 
since you insist on calling the worst case a typical case.

I didn't do that. I said that the worst case and the best case are probably not as different as you make them out to be with regard to total power consumption, since the bulk of the consumption comes from data movement, and data movement remains the same in both the best and worst cases.

IIRC Nvidia quotes 10 pJ/DP flop @ 1000 MHz @ 16nm in their presentations. From that I'd say 10 pJ for single precision @ 900 MHz @ 28nm would be a good guess, if not on the high side. So for 4.5 TFLOPS in the Titan, that would amount to 45W consumed by the entirety of the FP32 ALUs working at once, in a 250W card. 1/3rd of that, or 15W, in the best-case scenario doesn't seem like much (8-10W on average?), and I can totally see an increase in data movement producing a larger impact on power consumption.
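Writing that estimate out, under the 10 pJ/flop guess above: $10\,\mathrm{pJ/flop} \times 4.5\times10^{12}\,\mathrm{flop/s} = 45\,\mathrm{J/s} = 45\,\mathrm{W}$, and the best-vs-worst-case delta being argued over is at most $\tfrac{1}{3}\times 45\,\mathrm{W} = 15\,\mathrm{W}$ out of a 250 W board.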
 
I already wondered why Nvidia revived the term „TPC“ for Pascal. Haven't seen it since GT200, IIRC. In Fermi an SM took over the role of a TPC for all intents and purposes. Maybe they're not so integrated any more and are one of the architecture's variable parts?

Since, as ninelven said above, removing them from the clusters is trivial, and considering Jawed's legitimate points above: would they really need 240 TMUs for texture-based memory accesses and/or imaging stunts? If not, why have a quad TMU for each cluster rather than dumbing it down to one quad TMU per two clusters? 120 fewer TMUs is, in theory, not exactly small in terms of die area.

And btw: How many transistors does GP100-Pascal have?
15.3 bln as said in the devblog:
https://devblogs.nvidia.com/parallelforall/inside-pascal/

18 bln as said in the Pascal architecture overview?
https://developer.nvidia.com/pascal

or 150 (!) bln, as stated by the GPU-architecture microsite:
http://www.nvidia.com/object/gpu-architecture.html

While the latter has to somehow include the DRAM dies, it also clearly states: "With 150 billion transistors built on bleeding-edge 16 nanometer FinFET fabrication technology, Pascal GPU is the world's largest FinFET chip ever built"

When those transistors get wet, they blow up like beans :D Jokes aside, there are still a ton of question marks about GP100. As a reminder, up until recently we had 17 bln transistors quoted for Pascal, so it probably comes down to how creative marketing can be in ranging from 15.3 to 150 bln transistors. IMHO the chip itself is at 15.3 bln, which gives a transistor density of 25.08 M/mm², roughly 86% higher than GM200 @ 28HP TSMC, plus the higher frequencies.
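As a quick sanity check of that density figure (using the commonly quoted die sizes of roughly 610 mm² for GP100 and 601 mm² for GM200 and ~8 bln transistors for GM200, none of which are stated in this thread): $\frac{15.3\times10^{9}}{610\,\mathrm{mm^2}} \approx 25.1\,\mathrm{M/mm^2}$ versus $\frac{8\times10^{9}}{601\,\mathrm{mm^2}} \approx 13.3\,\mathrm{M/mm^2}$, a ratio of about $1.88$, i.e. in the same ballpark as the ~86% above.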

One of the further question marks is whether smaller chips will follow a similar trend, or whether they've instead invested in higher densities with smaller frequency increases over their predecessors.
 
Since, as ninelven said above, removing them from the clusters is trivial, and considering Jawed's legitimate points above: would they really need 240 TMUs for texture-based memory accesses and/or imaging stunts? If not, why have a quad TMU for each cluster rather than dumbing it down to one quad TMU per two clusters? 120 fewer TMUs is, in theory, not exactly small in terms of die area.
[…]
One of the further question marks is whether smaller chips will follow a similar trend, or whether they've instead invested in higher densities with smaller frequency increases over their predecessors.
I was not trying to suggest the existence or non-existence of TMUs, just musing generally about the resurfacing of TPC, which, btw, I have also seen written out as „thread processing cluster“.
 
Now, correct me if I'm wrong but DP ALUs consume a little more than 2x as much as SP ALUs
Multipliers scale with n². A 64-bit multiplier is 4x the size of a 32-bit multiplier, but the input/output data size is only 2x, so the power cost is likely closer to 2x than 4x.
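As a back-of-the-envelope version of that argument (treating the datapath as a plain n-bit array multiplier and ignoring the exponent/mantissa split of real FP units): $\frac{\mathrm{area}(64)}{\mathrm{area}(32)} \approx \frac{64^2}{32^2} = 4$, while the operand and result width only doubles, $\frac{64}{32} = 2$. If, as argued above, a large share of the energy goes into moving those operands rather than into the multiplier array itself, the per-op energy lands somewhere between $2\times$ and $4\times$.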
 
Now that the dust has settled and we know the GP100 will not be for us mere mortals, what can we expect as the next Pascal within reach?
That would be the GP104.
What about 4 GPCs with 10 SMs each, for a total of 40×64 = 2560 FP32 cores and 160 TMUs?
Perhaps 64 ROPs and 256-bit GDDR5X.
Without DP and NVLink it would be in the 300-400 mm² range.
A clock of around 1.6 GHz.
Such a configuration would still be good enough to slightly outrun the current high-end Maxwell GM200.
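A rough check of that last claim (taking a full GM200/Titan X at 3072 FP32 lanes and its stock boost clock of about 1.07 GHz, figures not stated in this thread): $2560 \times 2 \times 1.6\,\mathrm{GHz} \approx 8.2\,\mathrm{TFLOPS}$ FP32, versus $3072 \times 2 \times 1.07\,\mathrm{GHz} \approx 6.6\,\mathrm{TFLOPS}$ for GM200, so on paper such a part would indeed come out ahead.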
 
Now that the dust has settled and we know the GP100 will not be for us mere mortals, what can we expect as the next Pascal within reach?
That would be the GP104.
What about 4 GPCs with 10 SMs each, for a total of 40×64 = 2560 FP32 cores and 160 TMUs?
Perhaps 64 ROPs and 256-bit GDDR5X.
Without DP and NVLink it would be in the 300-400 mm² range.
A clock of around 1.6 GHz.
Such a configuration would still be good enough to slightly outrun the current high-end Maxwell GM200.

How about a higher transistor density and lower clocks instead?
 
I think we won't see a Compute Capability v6.0 GPU for consumer SKUs. Instead there would be a v6.1 ISA refresh with still-unknown changes to the SM configuration (apart from reduced DP throughput), and probably an additional TMU quad and a doubled L1T cache.
 
How about a higher transistor density and lower clocks instead?
We already know yields are pretty low, given that 4 defective SMs are accounted for with GP100.
Even with a die half that size, 2 SMs would still need to be disabled. So going large doesn't seem very economical.
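For reference, the usual first-order yield model makes the same point (the defect density below is a made-up illustrative number, not anything published): with defect density $D$ and die area $A$, $Y \approx e^{-D A}$, so the fraction of fully working dies drops exponentially with area. At, say, $D = 0.2\,\mathrm{cm^{-2}}$, a $6.1\,\mathrm{cm^2}$ die yields only $e^{-1.22} \approx 30\%$ perfect dies, which is why salvage SKUs with a few SMs disabled are planned in from the start.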
 
Hmm, it doesn't work that way: the number of defects in a die grows as the size of the die increases (you tend to get more defects at the outer limits of a chip), which is normal because there is more area at the outside. So bigger chips have a higher chance of containing more defects than smaller chips.
 
Hmm, it doesn't work that way: the number of defects in a die grows as the size of the die increases (you tend to get more defects at the outer limits of a chip), which is normal because there is more area at the outside. So bigger chips have a higher chance of containing more defects than smaller chips.
What?
 