Nvidia Pascal Announcement

Market segmentation is one thing; but here NV is in effect segmenting one particular segment of its market against itself.

That doesn't sound very wise.

Have you compared the featuresets of various SKUs within the same product family? S vs T vs K vs no-suffix, it's a mess. ECC, HT, at one point even the various VT technologies were segregated by SKU.
 
Having thought about it, I spotted your trap. GP100 has no dedicated FP16 because it has dedicated FP64 units.
Nope, no trap intended.

If it makes sense to make a dedicated DP SIMD to sit alongside a dedicated SP SIMD, then why doesn't it make sense to build a dedicated HP SIMD?

A DP multiplier contains enough bits in mantissa processing to easily support paired SP (53 bits versus 2 x 24 bits). The exponent datapath is a different story: paired SP amounts to 2x 8-bit, while DP only provides 11-bit exponent processing. But exponent processing is trivial (addition).
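Spelling the numbers out, assuming the standard IEEE 754 binary32/binary64 field widths:

```latex
% Standard IEEE 754 widths: binary32 = 24-bit significand (incl. implicit bit),
% 8-bit exponent; binary64 = 53-bit significand, 11-bit exponent.
\begin{align*}
\text{paired SP significands:}\quad 2 \times 24 &= 48 \le 53 && \text{fits in the DP multiplier}\\
\text{paired SP exponents:}\quad 2 \times 8 &= 16 > 11 && \text{needs extra (but cheap) exponent adders}
\end{align*}
```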

So a DP/SP multiplier has overheads in routing for SP mode and routing from look-up tables (which will be different for SP and DP). Those overheads can be measured in at least three ways: latency (interaction with the RF), area and power. Now, how do those overheads compare between the dedicated-DP/dedicated-SP and multi-precision implementations, bearing in mind that dedicated SP at double rate requires distinct area from DP, which implies extra routing and additional power overhead?

Put another way, if Intel is building a single throughput SIMD ALU in Knights Landing that supports 8-bit/16-bit/32-bit/64-bit operands why is NVidia not? Will NVidia go multi-precision, eventually? Or will Intel go dedicated?

Feeding out of the same set of registers endlessly? Rasterizers, ROPs and TMUs running empty?
That would be a data point, but I'm interested in what's the worst-case power usage for real work and whether SP or DP causes the most throttling. Was that part of NVidia's balancing act when determining the ratio of SP and DP, or even a key point that led to the decision to go with dedicated SP and DP SIMDs?
 
Regarding the expense of trying to support multiple precisions for things like Deep Learning + general HPC: in Deep Learning even FP16 is no longer relevant; they're able to do it all with INT8 operations (supported in the next release of cuDNN).

The new vec4 INT8 operations are supported by all the recent parts, GP102, GP104, GP106 (sm_61), etc., but not GP100 (sm_60). There's really no point for them to support FP16 with respect to Deep Learning: the INT8 throughput is insanely high and, most importantly, those simple integer ALUs require very little die space compared to those beefy FPUs.
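A minimal CUDA 8 sketch of that vec4 INT8 path, just to illustrate. The kernel name and launch geometry here are made up; only the __dp4a intrinsic (sm_61 only) is the real thing:

```cuda
#include <cuda_runtime.h>

// Illustrative only: __dp4a multiplies the four signed 8-bit values packed
// into each 32-bit operand pairwise and accumulates the sum into a 32-bit
// integer -- four INT8 MACs per instruction on GP102/GP104/GP106 (sm_61).
__global__ void dp4a_dot(const int* a, const int* b, int* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);  // 4 INT8 MACs, 32-bit accumulate
}
// Build with: nvcc -arch=sm_61 dp4a.cu   (CUDA 8.0 or later)
```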

Regarding KNL Deep Learning performance, there is so far zero evidence that KNL could come close to competing with modern GPUs in Deep Learning, except for one Intel marketing slide that the very biased/untalented journalist Nicole Smith at NextPlatform recently used in her Intel puff piece. Given that KNL doesn't even support FP16, and the fact that regular high-end Xeons keep getting destroyed in third-party benchmarks vs. GPUs, I'm pretty darn certain KNL doesn't have a snowflake's chance in hell of competing in this DL segment.

Intel's real push in DL will arrive with their CPU+FPGA servers that are hardwired for Deep Learning processing; that is when Intel might actually become competitive.

Now, is KNL a threat to Nvidia in the conventional HPC space? Most likely so! But right now they are actually not stealing market share from Nvidia; they are expanding the market by getting customers who were previously on the fence about accelerators.

Both Nvidia and Intel are reporting huge growth within HPC for their accelerators; the market is expanding.
 
Regarding the expense of trying to support multiple precisions for things like Deep Learning + general HPC: in Deep Learning even FP16 is no longer relevant; they're able to do it all with INT8 operations (supported in the next release of cuDNN).

The new vec4 INT8 operations are supported by all the recent parts, GP102, GP104, GP106 (sm_61), etc., but not GP100 (sm_60). There's really no point for them to support FP16 with respect to Deep Learning: the INT8 throughput is insanely high and, most importantly, those simple integer ALUs require very little die space compared to those beefy FPUs.

Not quite so. Training requires floating point: FP32, and with some difficulty also partly FP16.
GPUs have been used in Deep Learning almost exclusively for training purposes.

Once the NN is computed (trained), it can be used (i.e. inferencing).
This can be done to some extent with INT8.
 
That would be a data point, but I'm interested in what's the worst-case power usage for real work and whether SP or DP causes the most throttling. Was that part of NVidia's balancing act when determining the ratio of SP and DP, or even a key point that led to the decision to go with dedicated SP and DP SIMDs?
One hint would be the original Titan, which didn't have Boost enabled in DP-Mode. But then, Boost was very new at the time and maybe Nvidia just didn't feel comfortable with it yet and didn't want to take any risks with the professional crowd.

FWIW, I think that including raster, ROPs and TMU filters, which are more likely to be used in conjunction with SP rather than DP (I guess here mostly the data path from the TMU fetchers comes into play), the highest power would be seen in SP loads.

But it'd be hard to test, for two reasons. First, I know of no DP workload that also uses most of the fixed-function stuff. Second, and more important: it's highly likely that even in SP with Boost, the card will run into its power limit. If it does so for DP workloads as well, we cannot be sure which one would cause the higher power draw if unthrottled.
 
Nope, no trap intended.
If it makes sense to make a dedicated DP SIMD to sit alongside a dedicated SP SIMD, then why doesn't it make sense to build a dedicated HP SIMD?
One thing that comes to mind: both the SP and the DP blocks are there already. Pascal really does not seem like a major overhaul of the architecture to me, just little bits added here and there. And I also think that P100 might have started development a bit too early to be really sure whether or not a completely separate HPC chip would be economically viable. All of Jen-Hsun's talk about exploring new markets aside, I think Nvidia is playing most of its bets rather safe.

A DP multiplier contains enough bits in mantissa processing to easily support paired SP (53 bits versus 2 x 24 bits). The exponent datapath is a different story: paired SP amounts to 2x 8-bit, while DP only provides 11-bit exponent processing. But exponent processing is trivial (addition).
Without major stunts (actually I cannot think of anything that would work), you would lose half your potential FP32 throughput as soon as you can no longer find paired instructions, on top of the more complex instruction routing.

Of course there is a high likelihood that I'm wrong, but for a couple of generations now power seems to be the main concern, not area anymore. So you would either have to have a whole line of your GPUs totally dedicated to HPC, or your other chips (think about power) would have to carry the more complex multipliers (and adders and muxes) in their guts as well. Correct me if I'm wrong, but even 53x53 MULs would be OK for DP, right? For iterating over the SP ALUs, you'd need 27x27, which is a ~26.5% increase over the 24-bit MULs. I am not sure if you can effectively mask the additional bits out so they do not use any energy anymore.
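Assuming multiplier cost grows roughly with the square of the operand width, that figure comes from:

```latex
% Multiplier cost assumed to grow roughly quadratically with operand width.
\[
\frac{27 \times 27}{24 \times 24} = \frac{729}{576} \approx 1.27
\]
```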


So a DP/SP multiplier has overheads in routing for SP mode and routing from look-up tables (which will be different for SP and DP). Those overheads can be measured in at least three ways: latency (interaction with the RF), area and power. Now, how do those overheads compare between the dedicated-DP/dedicated-SP and multi-precision implementations, bearing in mind that dedicated SP at double rate requires distinct area from DP, which implies extra routing and additional power overhead?

Put another way, if Intel is building a single throughput SIMD ALU in Knights Landing that supports 8-bit/16-bit/32-bit/64-bit operands why is NVidia not? Will NVidia go multi-precision, eventually? Or will Intel go dedicated?
Intel already has these AVX-based ALUs and plans on using them in their regular Xeon processors as well. They need their code compatibility which has been touted as (one of) the big advantages of going Intel from day one.

My feeling is that maybe even with Volta we might see a completely separate lineup for HPC (FP64+FP32 focus), Deep Learning (FP16+INT8 focus) and other uses such as gaming (FP32 focus plus whatever you can cram in for free, INT8?).
 
Have you compared the featuresets of various SKUs within the same product family? S vs T vs K vs no-suffix, it's a mess. ECC, HT, at one point even the various VT technologies were segregated by SKU.
Yeah, Intel market segmentation is pretty messy, but you don't need to buy two different Xeon SKUs to get full performance out of your setup if you're, say, in the oil and coal prospecting business (or whatever).
 
You actually cannot. But you might in the future. Just think about a high-speed serial processor (traditional Xeon) and the parallel version (KNL). It should be pretty useful to have the best of both worlds combined in one machine. WRT Deep Learning, you probably would use different machines in the field for inferencing than you would use in the lab for training.
 
Have you compared the featuresets of various SKUs within the same product family? S vs T vs K vs no-suffix, it's a mess. ECC, HT, at one point even the various VT technologies were segregated by SKU.
Where Tesla competes, and also for the Pascal Titan narrative, you need to look at Knights Landing rather than just traditional Xeon/PC segmentation.
That is where the focus should be when looking at Intel, rather than traditional Xeons.
Cheers
 
That's a really interesting and perceptive question. The energy needed to compute an FP64 FMA is about four times the energy needed for FP32, so half-rate FP64 ALUs could in theory use twice the wattage of FP32! But is that increase minor compared to the significant overhead of data transfer and SRAM register-file access? My immediate guess is "the power difference is negligible", but there's evidence that it's not. Modern high-core-count Xeons have to power-gate and downclock by 20% or more when running AVX code, showing the ALUs use a large fraction of the Xeon's power budget. A GPU is even more ALU-dense, so it should be even more sensitive.
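As back-of-the-envelope math, using the ~4x energy figure above (purely illustrative):

```latex
% Sustained ALU power per unit time, where P_FP32 = E_FP32 * R_FP32
% (FP32 energy per FMA times FP32 FMA rate).
\[
P_{\mathrm{FP64}} \approx \underbrace{4\,E_{\mathrm{FP32}}}_{\text{energy/FMA}} \times \underbrace{\tfrac{1}{2}\,R_{\mathrm{FP32}}}_{\text{FMA rate}} = 2\,P_{\mathrm{FP32}}
\]
```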

An easy way to test this is to take a Kepler Titan, Quadro, or Tesla (with the 1/3 FP64 rate unlocked) and run, say, both SGEMM and DGEMM and look at the wall-socket power use. Any power difference will be even more distinct on P100 with its 1/2 FP64 rate.
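A rough sketch of what that test could look like with cuBLAS and NVML instead of a wall-socket meter. Matrix size, iteration count and the single power sample per precision are illustrative choices, not a rigorous methodology:

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nvml.h>

// Rough SGEMM vs DGEMM power comparison. Matrices are zero-initialised for
// brevity; a serious test would use random data and a proper sampling loop.
static unsigned int sample_power_mw(nvmlDevice_t dev)
{
    unsigned int mw = 0;
    nvmlDeviceGetPowerUsage(dev, &mw);   // board power in milliwatts
    return mw;
}

int main()
{
    const int n = 8192, iters = 20;

    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    cublasHandle_t h;
    cublasCreate(&h);

    float  *af, *bf, *cf;
    double *ad, *bd, *cd;
    size_t bytes_f = sizeof(float)  * (size_t)n * n;
    size_t bytes_d = sizeof(double) * (size_t)n * n;
    cudaMalloc((void**)&af, bytes_f); cudaMalloc((void**)&bf, bytes_f); cudaMalloc((void**)&cf, bytes_f);
    cudaMalloc((void**)&ad, bytes_d); cudaMalloc((void**)&bd, bytes_d); cudaMalloc((void**)&cd, bytes_d);
    cudaMemset(af, 0, bytes_f); cudaMemset(bf, 0, bytes_f);
    cudaMemset(ad, 0, bytes_d); cudaMemset(bd, 0, bytes_d);

    const float  onef = 1.0f, zerof = 0.0f;
    const double oned = 1.0,  zerod = 0.0;

    // Keep the GPU busy with SGEMMs, sample power while they run, then sync.
    for (int i = 0; i < iters; ++i)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &onef, af, n, bf, n, &zerof, cf, n);
    printf("SGEMM: ~%u mW\n", sample_power_mw(dev));
    cudaDeviceSynchronize();

    // Same again in double precision.
    for (int i = 0; i < iters; ++i)
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &oned, ad, n, bd, n, &zerod, cd, n);
    printf("DGEMM: ~%u mW\n", sample_power_mw(dev));
    cudaDeviceSynchronize();

    cublasDestroy(h);
    nvmlShutdown();
    return 0;
}
// Build with something like: nvcc power_gemm.cu -lcublas -lnvidia-ml
// (library/include paths for NVML may vary by system)
```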
Can you really compare the Kepler DP CUDA core to that of the P100, and how would you calculate the efficiency improvement going Kepler-Maxwell-Pascal?
Unfortunately, answers to the various recent questions really need someone with a P100 to post test results, but we are not going to see them anytime soon on the Nvidia Devblog, nor any reviews/independent benchmarks looking at its performance/efficiency.
Thanks
 
My feeling is that maybe even with Volta we might see a completely separate lineup for HPC (FP64+FP32 focus), Deep Learning (FP16+INT8 focus) and other uses such as gaming (FP32 focus plus whatever you can cram in for free, INT8?).
Games will be using mixed FP16 + FP32 as soon as we get consoles that support FP16. When developers get their shader code bases optimized for FP16 (and INT16), there will be a clear advantage in games for GPUs that support FP16/INT16 (double-rate math + half the register storage). AMD's entire desktop lineup already supports FP16/INT16 (Tonga, Fiji, Polaris 10/11). I am sure Nvidia will follow suit when there are real performance gains to be seen in most games.
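For illustration, the same idea expressed in CUDA rather than shader code (the kernel name and shapes are made up; the __hfma2 intrinsic is the real packed-FP16 instruction):

```cuda
#include <cuda_fp16.h>

// Illustrative only: one __half2 register holds two 16-bit values, and
// __hfma2 performs an FMA on both lanes with a single instruction --
// double rate and half the register storage compared to FP32.
// Needs a GPU with native FP16 arithmetic (sm_53, or sm_60+ on Pascal);
// on sm_61 consumer parts the instruction exists but runs at a low rate.
__global__ void fma_half2(const __half2* a, const __half2* b,
                          const __half2* c, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);  // two FP16 FMAs per instruction
}
// Build with: nvcc -arch=sm_60 fp16_fma.cu
```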

Personally I don't see much use for INT8 math in games. HDR output demands at least 10 bits per channel, and FP processing suits HDR pipelines better than integer. INT16 is useful for many compute tasks (2D pixel coordinate math, etc.). INT16 can replace INT32 in most cases (with zero loss of quality).

Similarly, I don't see a need for FP64 in games anytime soon. 3x INT32 position data is sufficient to model the whole earth at millimeter precision. FP32 is enough for view-space (and/or camera-centered) math.
 
So, in a year's time I need to get a Vega 10 or P100? :D
I think Nvidia is lucky that so many PS4s and Xbox Ones have been sold and need supporting; never thought I would say that :)
Otherwise they would have customer fatigue with their products always missing a nice feature.
That is assuming they actually give gaming cards a mixed-precision CUDA core with Volta; with Nvidia, who freaking knows.
By the time the PS4/Xbox One is retired or sidelined by the 1.5 refresh, it will be interesting to see the customer card demographic for Nvidia.
There could still be a fair few Pascal gamers.
Cheers
 
Now we just have GP107 and 108 left.
GP108 is bound to match Sky/Kaby Lake's GT4e, whose premium has been going down if we're to trust Intel's MSRPs.
If GP107 goes down to ~30W on mobile like P11, there's a chance we won't see a GP108.
 
GP108 is bound to match Sky/Kaby Lake's GT4e, whose premium has been going down if we're to trust Intel's MSRPs.
If GP107 goes down to ~30W on mobile like P11, there's a chance we won't see a GP108.

How many laptops with GT4e are available compared to 940M/MX?

We will see a GP108. Performance aside (I do expect it to beat GT4e anyway), there is a marketing aspect to having a dedicated GPU as compared to "Intel Iris Graphics".
 
How many laptops with GT4e are available compared to 940M/MX?

The consumer versions of GT4e chips aren't out in any consumer product except for NUCs, AFAIK.
Intel delayed the availability of GT4e chips a lot, but the difference between the 6700HQ and 6770HQ is only $50.

I don't think the back-to-school and holiday laptops will have <$50/60 discrete GPUs anymore. It wouldn't make much sense given the performance targets for these bottom-of-the-barrel GPUs.
 
We will see a GP108. Performance aside (I do expect it to beat GT4e anyway)
I am not entirely sure. GM108 can't quite keep up with Skylake GT3e when paired with DDR3, although it depends heavily on the benchmark. Of course, in the bandwidth department this is a hugely unfair match - GT3e has twice the memory bandwidth to begin with (when paired with dual-channel memory) and on top of that a 64MB eDRAM LLC...
I think that's the biggest issue GP108 would have: 64-bit DDR3 just isn't going to cut it (improved compression or not), no matter how fast the chip itself may be. So I think they either have to be serious about GDDR5 this time (albeit the non-clamshell mode would "only" give 2GB of memory) or at least adopt DDR4 if GP108 wants to stand a chance against Kaby Lake GT4e.
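Rough peak-bandwidth arithmetic behind that comparison (typical clocks assumed, purely illustrative):

```latex
% Peak bandwidths, typical configurations assumed.
\begin{align*}
\text{64-bit DDR3-2133:} \quad & 2133\,\text{MT/s} \times 8\,\text{B} \approx 17\,\text{GB/s}\\
\text{128-bit DDR4-2133 (GT3e/GT4e, dual channel):} \quad & 2133\,\text{MT/s} \times 16\,\text{B} \approx 34\,\text{GB/s}\ (+\ \text{eDRAM})\\
\text{64-bit GDDR5 at 5\,Gbps (e.g. 940MX):} \quad & 5000\,\text{MT/s} \times 8\,\text{B} \approx 40\,\text{GB/s}
\end{align*}
```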
 
Yeah, Intel market segmentation is pretty messy, but you don't need to buy two different Xeon SKUs to get full performance out of your setup if you're, say, in the oil and coal prospecting business (or whatever).

If that's the argument, then Nvidia isn't doing that either. Tesla, Quadro, Titan, and GeForce are all different product families, not simply different SKUs within the same family.
 
The consumer versions of GT4e chips aren't out in any consumer product except for NUCs, AFAIK.
Intel delayed the availability of GT4e chips a lot, but the difference between the 6700HQ and 6770HQ is only $50.

I don't think the back-to-school and holiday laptops will have <$50/60 discrete GPUs anymore. It wouldn't make much sense given the performance targets for these bottom-of-the-barrel GPUs.

OK, so how many laptops with Broadwell or even Haswell GT3e then?

These "bottom of the barrel" GPUs are usually paired with mid range i5 processors and not the much more expensive gt3e/gt4e variants and they typically perform much better than the HD 5500/HD 520, especially since most laptops are hobbled by single channel RAM. Even compared to GT3e and GT4e the performance is comparable, especially if they are equipped with GDDR5 (eg 940MX). And like I said..do not underestimate the marketing aspect of a discrete GPU.
I am not entirely sure. GM108 can't quite keep up with Skylake GT3e when paired with DDR3, although it depends heavily on the benchmark. Of course, in the bandwidth department this is a hugely unfair match - GT3e has twice the memory bandwidth to begin with (when paired with dual-channel memory) and on top of that a 64MB eDRAM LLC...
I think that's the biggest issue GP108 would have: 64-bit DDR3 just isn't going to cut it (improved compression or not), no matter how fast the chip itself may be. So I think they either have to be serious about GDDR5 this time (albeit the non-clamshell mode would "only" give 2GB of memory) or at least adopt DDR4 if GP108 wants to stand a chance against Kaby Lake GT4e.
Yep, 64-bit DDR3 would not cut it anymore. Even higher-clocked DDR4 (say 3000 MHz) would not be enough IMHO. They'd have to move to GDDR5, and there's evidence they are moving in this direction, as we saw with the 940MX. 4 GB with clamshell should be possible. FWIW I don't expect Kaby Lake to be much of an improvement. We'll find out soon enough though; Kaby Lake is shipping to OEMs already.
 