Look at slide 8 of the Fiji presentation at HotChips 2015 (
https://www.hotchips.org/wp-content...-GPU-Epub/HC27.25.520-Fury-Macri-AMD-GPU2.pdf): "Larger than reticle interposer".
Macri said in a Fiji pre-launch interview that "double exposure was possible but might be cost prohibitive". Maybe Fiji was cost prohibitive? Can you hear Totz' head explode?
I'm not sure how the following should be interpreted:
http://techreport.com/review/28499/amd-radeon-fury-x-architecture-revealed/2
The reason why Fiji isn't any larger, he said, is that AMD was up against a size limitation: the interposer that sits beneath the GPU and the DRAM stacks is fabricated just like a chip, and as a result, the interposer can only be as large as the reticle used in the photolithography process. (Larger interposers might be possible with multiple exposures, but they'd likely not be cost-effective.) In an HBM solution, the GPU has to be small enough to allow space on the interposer for the HBM stacks. Koduri explained that Fiji is very close to its maximum possible size, within something like four square millimeters.
It's worded as if Fiji was sized right up to the limit of the interposer's reticle. I thought AMD might have fudged things by letting the chips on the interposer overhang onto areas not exposed for 65nm patterning.
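As a back-of-the-envelope check of that reading, here's a quick sketch with rough, hedged figures; the exposure field, HBM stack footprint, and interposer outline below are approximations I'm assuming from memory of reported numbers, not official figures:

```python
# Rough sanity check of the reticle-limit reading. All numbers are approximate
# assumptions (typical exposure field, loosely recalled die sizes), not official figures.
reticle_mm2    = 26 * 33       # ~858 mm^2, typical single-exposure field
fiji_die_mm2   = 596           # commonly reported Fiji GPU die area
hbm_stack_mm2  = 5.5 * 7.3     # ~40 mm^2 per HBM1 stack (approximate footprint)
interposer_mm2 = 1011          # reported Fiji interposer outline (approximate)

dies_on_top = fiji_die_mm2 + 4 * hbm_stack_mm2
print(f"GPU + 4 HBM stacks: {dies_on_top:.0f} mm^2 vs. one {reticle_mm2} mm^2 field")
print(f"Interposer outline exceeds a single field by ~{interposer_mm2 - reticle_mm2} mm^2,")
print("which would leave an unpatterned margin for the top dies to overhang onto.")
```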
On one hand, there's the extra power from making a narrow function, a pure FP16 MAD, more general. That generality will always cost you one way or the other.
On the other hand, you should be able to save a considerable amount of logic and power for the pure tensor core case as well. If you know you're always going to add 4 FP16 numbers and, ultimately, are always going to accumulate them into an FP32, there should be plenty of optimizations possible in terms of taking shortcuts with normalization etc. For example, for just a 4-way FP16 adder, you only need one max function across 4 exponents instead of multiple 2-way ones. There's no way you won't have similar optimizations elsewhere.
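As a toy illustration of that shortcut (structural only, not a bit-exact FP16 datapath; the frexp/ldexp decomposition is just standing in for the exponent/mantissa handling):

```python
import math

def add4_single_normalize(xs):
    """Toy 4-way float adder: align everything to one shared maximum exponent,
    sum once, normalize once, instead of three 2-way adds each doing its own
    compare/align/normalize."""
    assert len(xs) == 4
    parts = [math.frexp(x) for x in xs]       # (mantissa, exponent) with |mantissa| in [0.5, 1)
    e_max = max(e for _, e in parts)          # one 4-way max, not a tree of 2-way maxes
    aligned = [m * 2.0 ** (e - e_max) for m, e in parts]   # shift all mantissas to e_max
    return math.ldexp(sum(aligned), e_max)    # single multi-input add, single normalize

print(add4_single_normalize([1.5, 0.25, -3.0, 1024.0]))   # 1022.75
print(1.5 + 0.25 - 3.0 + 1024.0)                          # 1022.75
```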
Some thoughts I had on the customized units: while there is the truism that more silicon driven more slowly is better, more and longer wires consistently are not.
A more accurate accounting needs to take that into account in order to weigh two tightly optimized units against one larger unit, and to judge whether the tradeoff in extra sequencing, signal travel, leakage, and other pitfalls of complexity shifts the balance. Power gating is usually applied at a coarser granularity, such as the SIMD block level, and its effectiveness may be hampered if it has to be integrated at sub-unit granularity on a block that can never fully idle.
Knowing that only a specific sequence of operations will occur in a physical space can remove a lot of mystery as to what wires need to go where. For example, if it's known that the adder phase isn't ever going to forward its results to the adder inputs, the option, its wires, and the multiplexing in the path can be removed.
One item I think might be a win with a dedicated unit is designing it to minimize the impact of data amplification.
If Nvidia's description is accurate and there are 64 multiplications in parallel, each input element is used 4 times--and in this thread there is the claim that it's simpler and more efficient to go from 16x4 to 16x16.
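To make the amplification concrete, a quick counting sketch, assuming the operation is the 4x4x4 matrix FMA Nvidia describes (D = A*B + C on 4x4 matrices); the loop just tallies how often each operand is read:

```python
import itertools
from collections import Counter

# Tally operand reuse in a 4x4x4 matrix multiply-accumulate (assumed shape).
N = 4
uses_a, uses_b, multiplies = Counter(), Counter(), 0
for i, j, k in itertools.product(range(N), repeat=3):
    # product a[i][k] * b[k][j] feeds the dot product that produces d[i][j]
    uses_a[(i, k)] += 1
    uses_b[(k, j)] += 1
    multiplies += 1

print(multiplies)                                   # 64 multiplications
print(set(uses_a.values()), set(uses_b.values()))   # each A/B element read 4 times
```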
The casual use of the word "forwarding" implies crossing the edges of pipeline stages, and if the general units are used, it follows that their pipeline latches, forwarding networks, and any lane-crossing paths stand to grow by up to an order of magnitude.
I would have thought the SMs were already under some pressure from wiring congestion, given that the design is optimized for density and the clock speeds Nvidia's GPUs have been reaching for. Volta's plain single-lane FMAs operate at a 4-cycle latency, which might not have happened if several adders, additional cross-lane permutations, and 4-16x the bypassing were hanging off of them.
Not knowing how many stages could be drive stages, every stage that needs to latch 4-16x more 32-bit values would expand the lane.
It's why I'm leery of using the existing critical path for methods that could generate KBs of extra context and run thousands of extra wires into the existing paths. That's more layers of logic, and I have doubts that the SMs are so free of congestion that they can swallow that many more wires without losing density, adding repeaters or more logic, and possibly losing a significant amount of clock speed.
Potentially, a tensor unit could first calculate whatever shift amounts and control words it can from the 16 input elements, then use bespoke logic that only needs to duplicate and broadcast the operands for this specific operation to the multipliers within a clock cycle.
Since there is no mystery as to where the results must go physically, the 64 outputs from the multipliers could have physically direct and short paths into the adders--whatever form they might take.
I would figure that the operation depth would be reduced with adders that take more than 2 inputs, and by adding things in parallel to the point that there are only 32 or 16 values that need to move to the next pipe stage.
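A small sketch of that reduction, assuming 64 multiplier outputs feeding 16 four-element dot products; the point is just the level count, i.e. one level of 4-input adders versus two levels of 2-input adders and the intermediate values those would have to latch:

```python
import random

products = [random.uniform(-1, 1) for _ in range(64)]  # stand-ins for the 64 multiplier outputs

# One level of 4-input adders: 64 values collapse straight to 16 partial sums.
four_input = [sum(products[i:i + 4]) for i in range(0, 64, 4)]

# The same result with 2-input adders takes two levels: 64 -> 32 -> 16.
level1 = [products[i] + products[i + 1] for i in range(0, 64, 2)]
level2 = [level1[i] + level1[i + 1] for i in range(0, 32, 2)]

print(len(four_input), len(level2))                                  # 16 16
print(all(abs(a - b) < 1e-12 for a, b in zip(four_input, level2)))   # same sums
```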
A less than fully pipelined tensor unit, or some kind of skewed pipelining, might let the tensor unit dispense with intermediate storage. I'm not sure the whole matrix operation could truly fit into a general FP unit's clock cycle, but a dedicated tensor unit might not need to match the FP unit's clock cycle--and the FP unit's clock cycle wouldn't need to accommodate the tensor unit.
Any non-standard timings or behaviors would also be easier to handle in a separate unit with its own dedicated sequencing logic, rather than by expanding the general-purpose pipeline's sequencing, which already has to cover the standard instructions.