I agree FPGAs are not a serious risk to GPUs for neural networks. However, ASICs could be; consider that Mobileye currently dominates the ADAS market with a specialised DSP-based solution. As the requirements become more obvious, I would hope vendors make more and more of the functionality fixed-function. The market is already there, so I don't think it's a technical question - it's purely a practical one of whether the right semiconductor vendors will come up with the right solutions at the right time. It might or might not happen; I don't know.
The Next Platform article on SRAM-centric designs is very interesting. In my mind, it is *NOT* only about the power efficiency of the SRAM, but also the area you save from not having to hide as much memory latency. GPUs could be a lot denser if we didn't have to worry about 200-600 cycle memory latencies (depending on target markets, including the memory hierarchy inside the GPU, excluding MMU misses, etc...). The problem is that the amount of latency tolerance required is a step function; either the dataset fits or it doesn't. If you see a gradual (rather than sudden) improvement in latency tolerance (and not only bandwidth - caches are great there for spatial locality!) with larger cache sizes, I believe it is typically because some *timeslice* of the workload/algorithm dataset fits. This is very obvious in 3D graphics where different render passes have very different memory characteristics.
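To put rough numbers on that area cost, here is a quick Little's Law back-of-envelope sketch. The issue rate, memory-op fraction and the "per SM" framing are my own assumptions for illustration, not anything from the article:

```python
# Back-of-envelope: how much thread state a GPU-style core needs just to hide
# memory latency, via Little's Law (in-flight work = issue rate * latency).
# All parameters below are assumptions chosen only to show the trend.

def warps_needed_to_hide_latency(mem_latency_cycles,
                                 issue_rate_warps_per_cycle=1.0,
                                 memory_op_fraction=0.25):
    """Warps that must be resident so the scheduler always finds ready work
    while other warps are waiting on memory."""
    # Memory ops are issued at (issue rate * fraction) per cycle and each one
    # parks its warp for ~mem_latency_cycles, so that many warps are in flight.
    return mem_latency_cycles * issue_rate_warps_per_cycle * memory_op_fraction

for latency in (20, 200, 600):  # on-chip SRAM-ish vs the 200-600 cycle range above
    w = warps_needed_to_hide_latency(latency)
    print(f"~{latency:>3} cycle latency -> ~{w:.0f} resident warps of state")
```

Every one of those resident warps is register file and scheduler state, which is exactly the area an SRAM-resident design gets to spend on something else.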
The other problem is that memory technology densities are also a step function. Typically we go straight from on-chip SRAM to off-chip DRAM, and that jump leaves a massive gap in efficiency (latency/bandwidth/power); if your dataset is too large to fit in SRAM but doesn't need that much DRAM, you're effectively leaving a lot of efficiency on the table. I've always been hoping for eDRAM (or other so-close-but-so-far memory technologies that never work out, e.g. Z-RAM) to become more prevalent to bridge that gap, but at the moment it's still very niche (e.g. IBM POWER8... I am not sure I would describe Intel's L4 as eDRAM personally, although it has the same benefits). HBM helps a bit, although more for bridging the bandwidth gap than the latency gap AFAIK... In the case of DNNs, a large part of the dataset is read-only AFAIK (it only needs to be reflashable), but again there isn't a viable memory technology to benefit from that trade-off today...
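For a sense of how big that SRAM-to-DRAM gap is on the power side, here is a rough sketch using the per-access energies that usually get quoted in the efficient-DNN literature (~5 pJ for a 32-bit on-chip SRAM read vs ~640 pJ for a 32-bit DRAM read at 45nm). Treat the figures as order-of-magnitude only, and the model size is just an example I picked:

```python
# Ballpark of the SRAM-vs-DRAM efficiency gap for streaming DNN weights.
# Per-access energies are the commonly cited ~45nm estimates; real numbers
# depend heavily on process, array size and interface, so this is only a sketch.

SRAM_PJ_PER_32BIT_READ = 5.0     # small on-chip SRAM
DRAM_PJ_PER_32BIT_READ = 640.0   # off-chip DRAM

def weight_fetch_energy_mj(num_params, bytes_per_param=4, from_dram=True):
    """Energy (millijoules) to read a model's weights once."""
    words = num_params * bytes_per_param / 4  # number of 32-bit reads
    pj = words * (DRAM_PJ_PER_32BIT_READ if from_dram else SRAM_PJ_PER_32BIT_READ)
    return pj * 1e-9  # pJ -> mJ

params = 60e6  # e.g. an AlexNet-sized model with ~60M parameters
print(f"DRAM: {weight_fetch_energy_mj(params, from_dram=True):.1f} mJ per pass")
print(f"SRAM: {weight_fetch_energy_mj(params, from_dram=False):.1f} mJ per pass")
# Roughly two orders of magnitude apart, which is why keeping (compressed)
# weights on-chip is so attractive for inference.
```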
The article says that Song Han works under Bill Dally (who is still at Stanford in addition to his Chief Scientist role at NVIDIA). Given how much of his life Bill Dally has spent researching locality and on-chip networks, and given the implications for NVIDIA, I wonder what he personally thinks will happen...