I agree FPGAs are not a serious risk to GPUs for neural networks. However, ASICs could be; consider that Mobileye currently dominates the ADAS market with a specialised DSP-based solution. As the requirements become more obvious, I would hope vendors make more and more of the functionality fixed-function. The market is already there, so I don't think it's a technical question - it's purely a practical one of whether the right semiconductor vendors will come up with the right solutions at the right time. It might or might not happen; I don't know.
The Next Platform article on SRAM-centric designs is very interesting. In my mind, it is *NOT* only about the power efficiency of the SRAM, but also the area you save from not having to hide as much memory latency. GPUs could be a lot denser if we didn't have to worry about 200-600 cycle memory latencies (depending on target markets, including the memory hierarchy inside the GPU, excluding MMU misses, etc...). The problem is that the amount of latency tolerance required is a step function; either the dataset fits or it doesn't. If you see a gradual (rather than sudden) improvement in latency tolerance (and not only bandwidth - caches are great there for spatial locality!) with larger cache sizes, I believe it is typically because some *timeslice* of the workload/algorithm dataset fits. This is very obvious in 3D graphics where different render passes have very different memory characteristics.
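To put rough numbers on that area cost, here is a quick Little's Law back-of-envelope sketch. The issue rate, memory-op fraction and the "per SM" framing are my own assumptions for illustration, not anything from the article:

```python
# Back-of-envelope: how much thread state a GPU-style core needs just to hide
# memory latency, via Little's Law (in-flight work = issue rate * latency).
# All parameters below are assumptions chosen only to show the trend.

def warps_needed_to_hide_latency(mem_latency_cycles,
                                 issue_rate_warps_per_cycle=1.0,
                                 memory_op_fraction=0.25):
    """Warps that must be resident so the scheduler always finds ready work
    while other warps are waiting on memory."""
    # Memory ops are issued at (issue rate * fraction) per cycle and each one
    # parks its warp for ~mem_latency_cycles, so that many warps are in flight.
    return mem_latency_cycles * issue_rate_warps_per_cycle * memory_op_fraction

for latency in (20, 200, 600):  # on-chip SRAM-ish vs the 200-600 cycle range above
    w = warps_needed_to_hide_latency(latency)
    print(f"~{latency:>3} cycle latency -> ~{w:.0f} resident warps of state")
```

Every one of those resident warps is register file and scheduler state, which is exactly the area an SRAM-resident design gets to spend on something else.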
The other problem is that memory technology densities are also a step function. Typically we go straight from on-chip SRAM to off-chip DRAM, and that jump leaves a massive gap in efficiency (latency/bandwidth/power); if your dataset is too large to fit in SRAM but doesn't need that much DRAM, you're effectively leaving a lot of efficiency on the table. I've always been hoping for eDRAM (or other so-close-but-so-far memory technologies that never work out, e.g. Z-RAM) to become more prevalent to bridge that gap, but at the moment it's still very niche (e.g. IBM POWER8... I am not sure I would describe Intel's L4 as eDRAM personally, although it has the same benefits). HBM helps a bit, although more for bridging the bandwidth gap than the latency gap AFAIK... In the case of DNNs, a large part of the dataset is read-only AFAIK (it only needs to be reflashable), but again there isn't a viable memory technology to benefit from that trade-off today...
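For a sense of how big that SRAM-to-DRAM gap is on the power side, here is a rough sketch using the per-access energies that usually get quoted in the efficient-DNN literature (~5 pJ for a 32-bit on-chip SRAM read vs ~640 pJ for a 32-bit DRAM read at 45nm). Treat the figures as order-of-magnitude only, and the model size is just an example I picked:

```python
# Ballpark of the SRAM-vs-DRAM efficiency gap for streaming DNN weights.
# Per-access energies are the commonly cited ~45nm estimates; real numbers
# depend heavily on process, array size and interface, so this is only a sketch.

SRAM_PJ_PER_32BIT_READ = 5.0     # small on-chip SRAM
DRAM_PJ_PER_32BIT_READ = 640.0   # off-chip DRAM

def weight_fetch_energy_mj(num_params, bytes_per_param=4, from_dram=True):
    """Energy (millijoules) to read a model's weights once."""
    words = num_params * bytes_per_param / 4  # number of 32-bit reads
    pj = words * (DRAM_PJ_PER_32BIT_READ if from_dram else SRAM_PJ_PER_32BIT_READ)
    return pj * 1e-9  # pJ -> mJ

params = 60e6  # e.g. an AlexNet-sized model with ~60M parameters
print(f"DRAM: {weight_fetch_energy_mj(params, from_dram=True):.1f} mJ per pass")
print(f"SRAM: {weight_fetch_energy_mj(params, from_dram=False):.1f} mJ per pass")
# Roughly two orders of magnitude apart, which is why keeping (compressed)
# weights on-chip is so attractive for inference.
```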
The article says that Song Han works under Bill Dally (who is still at Stanford in addition to his Chief Scientist role at NVIDIA). Given how much of his life Bill Dally has spent researching locality and on-chip networks, and given the implications for NVIDIA, I wonder what he personally thinks will happen...