Anarchist4000
Veteran
Presumably because they take up space, make scheduling a bit more complicated, and weren't needed with the tensor cores. Gaming performance is a different question, as the vast majority of consumer Pascal parts won't have the capability. So the FP16 vs FP32 rate I've been quoting would be the highest FLOP rate for their respective parts.

FP16 calculations are part of Nvidia's mixed-precision capabilities in Pascal; in other words, the ALUs can do it, and they aren't removing that in Volta. Why would they rip it out of an already finished pipeline and do it only with tensor cores? They already have it, so they don't need to mention it again. They even added it to the Maxwell-based Tegra X1.
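To make the mixed-precision point concrete, here's a minimal sketch of the packed double-rate FP16 path as CUDA exposes it (cuda_fp16.h intrinsics); the kernel and buffer names are placeholders, and it assumes a part that actually runs the half2 path at full rate (GP100 / Tegra X1 class) rather than the throttled consumer-Pascal rate.

```cpp
// Minimal sketch: double-rate FP16 via packed half2 math (cuda_fp16.h).
// Kernel and buffer names are placeholders; assumes hardware with a
// full-rate half2 path, not the throttled consumer-Pascal rate.
#include <cuda_fp16.h>

__global__ void saxpy_fp16x2(int n2, __half2 a, const __half2 *x, __half2 *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // One __hfma2 issues two FP16 multiply-adds per instruction,
        // which is where the 2x FP16:FP32 rate comes from.
        y[i] = __hfma2(a, x[i], y[i]);
    }
}
```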
As for Volta, there is still very little information and what exists is somewhat ambiguous. There's no guarantee consumer parts get double-rate FP16 like Pascal, but they'd be in serious trouble if they didn't, so I'd imagine it's included. That said, all the official literature I've seen only lists FP32/FP64 and tensor ops, with the tensor cores drawn as distinct units, although that seems unlikely to be the case. As I pointed out above, I have a feeling the tensor cores re-purpose the SM hardware and Nvidia just hasn't shown that. It makes far more sense with the parallel INT32 pipeline.
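For reference, the way these tensor ops are supposed to be exposed to software is as a warp-level matrix multiply-accumulate; the sketch below assumes the nvcuda::wmma interface from the CUDA 9 previews, with 16x16x16 FP16 fragments and an FP32 accumulator. Nothing at this level tells you whether the hardware underneath is a distinct unit or re-purposed SM datapaths.

```cpp
// Sketch of one warp-level tensor op, D = A*B + C, using the WMMA
// interface (nvcuda::wmma) from the CUDA 9 previews. Assumes the kernel
// is launched with at least one full warp; tile size is 16x16x16 with
// FP16 inputs and an FP32 accumulator.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void one_tile_mma(const half *a, const half *b,
                             const float *c, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    // acc = a*b + acc; repeated mma_sync calls keep accumulating in
    // registers, matching the "nothing written out until flushed" idea.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```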
Technically they only need 20 operands per clock with some relatively simple broadcasts: sixteen elements repeated sixteen times. The entire operation should only require 2x16 values to be read in and broadcast. With accumulators and forwarding, operand bandwidth shouldn't be a problem at all. Keep in mind an accumulate doesn't actually need an operand, as the result gets forwarded; you just have to wait a cycle before reading it, which is where the single-clock-cycle claim is questionable. The tensor operations likely wouldn't write out any data until flushed, and I'd almost guarantee it's a pipelined operation that alternates accumulators. So in a single clock cycle there are 64 multiplications and 128 partial accumulations from the previous multiplications.

Getting 64 parallel multiplies sourced from 16-element operands, followed by a full-precision accumulate of those multiplies and a third 16-element operand, into a clock cycle efficiently and at the target clocks sounds non-trivial.
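As a sanity check on the arithmetic, here's a plain reference (purely illustrative, not how the hardware does it) for one 4x4x4 step of D = A*B + C. The triple loop performs exactly 4*4*4 = 64 multiplies and 64 accumulations, fed by two 16-element operands (A and B) while C/D stay in the accumulators; any overlap of partial accumulations across cycles would be a pipelining detail on top of this.

```cpp
// Reference for one 4x4x4 tensor op, D = A*B + C, to make the counts
// explicit: 64 multiplies and 64 accumulations per step, sourced from
// two 16-element operands plus the accumulator.
void mma_4x4_ref(const float A[4][4], const float B[4][4],
                 const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];              // accumulator seeded once
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];     // 64 multiply-accumulates total
            D[i][j] = acc;
        }
}
```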
D=A*B+C is the normal FMA operation, but accumulation would be D+=A*B from an operand standpoint.
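Put as code, the difference is just which operands have to be fetched each step: a standalone FMA reads three sources, while a running accumulation forwards its own previous result back in and only needs the two multiplicands. (fmaf here is the standard C fused multiply-add, used purely for illustration.)

```cpp
#include <math.h>

// Standalone FMA: three source operands per operation.
float fma_once(float a, float b, float c)
{
    return fmaf(a, b, c);          // d = a*b + c
}

// Accumulation: the previous result is forwarded, so only the two
// multiplicands are new operands each step.
float accumulate(const float *a, const float *b, int n)
{
    float d = 0.0f;
    for (int i = 0; i < n; ++i)
        d = fmaf(a[i], b[i], d);   // d += a*b from an operand standpoint
    return d;
}
```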