What will it get destroyed by in the server market? Vega is no faster than a Quadro GP100, which has higher memory bandwidth and comparable FP32/FP16 resources (but no DP4A). You are assuming it will have half the hardware idling; we still don't have any concrete information on mixed tensor + traditional pipeline usage. IIRC GV100 is 815mm^2? Vega 10 was estimated to be around 530. A ~54% larger die, something like 8x the FP64 rate, and 5x higher throughput in DL. Compares quite favorably to me.
You have a quote for Vega's tensor FLOPS? Not the FP16 figures, but the throughput when using all the hardware in a pipelined fashion for tensor operations, on an architecture that hasn't been detailed yet? Right now you seem to be comparing apples and oranges.
It is starting to seem like you are deliberately ignoring Nvidia's very clear comments regarding this: they are entirely separate. Tensor cores do not use the existing FP32 units, because it was explicitly stated that full-throughput FP32 saturates only half the dispatch capacity per cycle, and the remaining half can be used for all other instructions/units.
I'm not ignoring them; you just don't understand what they mean. The quotes you've provided say nothing about tensors. In fact, they said the INT32 pipeline was the other half of that dispatch capacity. Or probably dual-issued FP16 where Vec2s aren't required, which is exactly what I proposed for Vega a while ago, where the programmer didn't have to pack anything. It just seems silly to add limited FP32 cores to replace FP32 cores that already exist when, in all likelihood, they won't be running concurrently.
It is abundantly clear this is not so. You have been in such a rush to make these statements that you apparently skipped reading what limited details are available, and mistakenly believed this was doing so-called tensor products.
You'll have to explain this "abundantly clear" part, because the Nvidia statements run counter to what you've been saying. I'm not sure they say what you think they say. They explicitly state one thing, like FP32 and INT32 running concurrently, and you come up with something completely different.
The multiplication of two 4x4 matrices results in a third 4x4 matrix with 16 elements. Each element is built from 4 FMA operations, so 16 elements = 64 operations. The data in matrix C can be loaded into the accumulator from the beginning, and each FMA of the FP16 elements of matrices A and B is fed into that accumulator directly.
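For concreteness, the operation being described is D = A*B + C on 4x4 matrices. A minimal scalar sketch in C++, just to make the operation count explicit (float throughout for simplicity; in hardware A and B would be FP16, and this says nothing about Volta's actual datapath):

```cpp
// D = A*B + C for 4x4 matrices: 16 output elements, each built from
// 4 fused multiply-adds, so 64 FMAs per matrix operation.
void mma4x4(const float A[4][4], const float B[4][4],
            const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // accumulator preloaded from C
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // one FMA per k
            D[i][j] = acc;
        }
    }
}
```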
So you're suggesting one cycle to load values into an accumulator, 4 to process the multiplication thanks to dependencies on the adder, one to write out the value of the accumulator, and then repeating that process? As the products are being accumulated across sequential matrices, it would seem far simpler to stay decomposed and add up the components once all the multiplications finish, completely avoiding the dependencies: one cycle per multiplication, as opposed to the 4-6 cycles you propose, or interleaving operations with more complicated data paths.
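A sketch of the decomposed scheduling being suggested here, viewed per output element (hypothetical illustration, not a claim about real hardware): all the multiplies are independent, and the reduction happens only after they complete, so no multiply ever stalls on the adder.

```cpp
// Per output element: form the four partial products with no
// inter-dependencies, then sum them in a separate reduction step.
float dot4_decomposed(const float a[4], const float b[4], float c) {
    float p[4];
    for (int k = 0; k < 4; ++k)
        p[k] = a[k] * b[k];  // independent multiplies, can all issue in parallel
    // reduction only after all multiplies complete
    return ((p[0] + p[1]) + (p[2] + p[3])) + c;
}
```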
I'm not certain how accumulating four operands (4 pairs of FP16 multiplies) in one cycle would work, as accumulation is usually a one-operand-at-a-time operation. It may well be pipelined; the question is how deep, though the wording suggests this is not the case. I'm also very curious how this is handled in terms of dispatch, as it clearly far exceeds the on-paper capacity of the AWS.
There is no such thing as a four-input adder. At best it's a series of dependent adders hidden in one really long clock cycle. The only equivalent of a multiple-input adder that comes to mind is quantum computing, or analog computation involving op-amps. I don't foresee either of those in Volta.
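To illustrate the point: a "four-input add" decomposes in practice into two-input adds, either a serial chain of three dependent adds or a balanced tree with only two dependent levels. A minimal sketch (both variants compute the same sum; only the dependency depth differs):

```cpp
// Serial chain: three dependent two-input adds.
float sum4_chain(float p0, float p1, float p2, float p3) {
    return ((p0 + p1) + p2) + p3;  // each add waits on the previous one
}

// Balanced tree: still two-input adders, but only two dependent levels.
float sum4_tree(float p0, float p1, float p2, float p3) {
    float s01 = p0 + p1;  // level 1, independent
    float s23 = p2 + p3;  // level 1, independent
    return s01 + s23;     // level 2
}
```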
How is that relevant to this discussion? We're talking about a pretty small amount of area.
Small relative to what, though? The entire chip, or the area dedicated to logic? It's relevant because it should provide a means to a more efficient chip.
At more than 1000mm^2, Fiji's interposer exceeded the reticle limits of today's lithography machines, so they needed a double exposure for the interposer. The core die did not. Why would GV100 be any different?
I've never seen anything about Fiji using a double exposure on the interposer. My understanding was that the interposer was as large as conventionally possible, which defined the chip dimensions. If that weren't the case, Fiji wouldn't have had to make any trade-offs.
But there are good arguments not to do it this way: additional power consumption in the non-tensor FP16 and FP32 cases, simplicity of the design, and ease of adding or removing a tensor core from an SM (or replacing it with an even faster integer equivalent for the inference versions).
What consumes extra power though? The non-tensor cores are there regardless, so you might as well make use of them. What's being proposed seems ridiculous to me: disabling FP32/FP16 cores so that more FP32/FP16 cores can be added. The whole concept of the tensor core, from my view, is that a single tensor operation is executed across all the blocks concurrently: use the FP16 units for the multiplications, forward the results to the FP32+INT cores for the adds/accumulation, then repeat. The only difference is that instead of running 16 threads across 16 hardware lanes, an entire matrix operation is completed in a single cycle using all of them, pipelining sequential operations across the blocks with specialized paths. That's why it seems ridiculous to me to replace hardware that already exists with more hardware.
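A toy two-stage model of the pipelining described above (entirely speculative; TensorPipe and its stage names are made up for illustration, not anything Nvidia has documented): the multiply stage stands in for the FP16 units, the accumulate stage for the existing FP32/INT adders, and with both stages kept busy on back-to-back operations, one result retires per cycle.

```cpp
#include <array>

// Hypothetical two-stage pipeline: stage 1 multiplies (the FP16 units),
// stage 2 accumulates (the existing FP32/INT adders). While stage 2 folds
// one element's products into the accumulator, stage 1 can already be
// multiplying the next element's operands.
struct TensorPipe {
    std::array<float, 4> products{};  // latch between the two stages
    float acc = 0.0f;

    // Stage 1: four independent multiplies for one output element.
    void multiply(const std::array<float, 4>& a,
                  const std::array<float, 4>& b) {
        for (int k = 0; k < 4; ++k)
            products[k] = a[k] * b[k];
    }

    // Stage 2: reduce the latched products into the accumulator.
    void accumulate() {
        acc += (products[0] + products[1]) + (products[2] + products[3]);
    }
};
```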