DavidGraham
Veteran
VLIW scheduling is done entirely in software. This super-SIMD concept sounds like a combination of software and hardware. What's the difference with VLIW?
SUPER SINGLE INSTRUCTION MULTIPLE DATA (SUPER-SIMD) FOR GRAPHICS PROCESSING UNIT (GPU) COMPUTING
No idea if these are Vega or not, but currently published so running with it.
This isn't flexible scalar, as it's still SIMD. Cascaded SIMDs or Nvidia's Register File Caching would be more representative. The theory I had a while back was one or more flexible scalars handling special instructions and forwarding the result temporally through a SIMD-like structure similar to this. Create a systolic array with the scalar handling the uncommon instructions and forwarding between SIMDs. The difference between this and my theory is that I assumed forwarding between lanes instead of SIMDs, which should be superior to even this in regards to power. It gets tricky finding enough consecutive identical instructions, though.
The front end is fine, they just need to achieve higher average clocks. Super-SIMD could do that, as it should significantly reduce power usage by avoiding the register file and longer data paths. Less power, higher sustainable clock speeds.

AMD's limits are not ALU limits. AMD needs a better front end for workload distribution. The patent could be interesting for ray tracing. Maybe with this patent you also save some latency.
Sounds like it would increase CU complexity. They might have figured more, simpler CUs would be more efficient overall when clocked down than complex ones that take more die space and therefore each need higher clocks.

AFAICT, a super instruction is a sequence of instructions where the results of prior instructions are fed as input to subsequent instructions without being written back to the register file.
Likely to save some register file bandwidth (potentially allowing for more ALUs) and a lot of power. I'm surprised AMD isn't doing this already.
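A back-of-the-envelope sketch of that bandwidth saving (Python; the access-counting model is my own illustration, not from the patent): chaining dependent ops through local temporaries means only the final result touches the register file, and each chained op reads one fewer operand from it.

```python
# Toy model: count register-file accesses for a chain of dependent ops,
# with and without result forwarding. Numbers are illustrative only.

def rf_accesses(num_ops, operands_per_op=2, forwarding=False):
    """Each op reads its operands and writes one result.
    With forwarding, intermediate results stay in local temporaries:
    each op after the first takes one input from the forwarding path
    instead of the register file, and only the last result is written back."""
    if not forwarding:
        reads = num_ops * operands_per_op
        writes = num_ops
    else:
        reads = operands_per_op + (num_ops - 1) * (operands_per_op - 1)
        writes = 1
    return reads + writes

print(rf_accesses(4))                   # 12 accesses without forwarding
print(rf_accesses(4, forwarding=True))  # 6 accesses with forwarding
```

Halving the register-file traffic on a four-op chain is the kind of saving that shows up directly as power, since the register file and its long wires are a big dynamic-power consumer in a GPU.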
Cheers
The only added complexity should be the forwarding network between SIMDs and a bit of creativity with the scheduler. Odds are the four(?) SIMDs are interlaced with corresponding lanes/quads/whatever next to each other to keep distances to a minimum. The scheduler part would just be tracking wave dependencies across SIMD boundaries. Forwarding is not all that different from what's already there.

Sounds like it would increase CU complexity. They might have figured more, simpler CUs would be more efficient overall when clocked down than complex ones that take more die space and therefore each need higher clocks.
So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?

Doesn't NVIDIA already do this since Maxwell with their operand reuse cache?
The different wavefronts part is what I'm not entirely clear on. For current GCN, yes, but S-SIMD might be utilizing a deeper SIMD structure with a similar capacity as the current CU. Applying VLIW to the current design of a 64-lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 possibly next to each other to simplify routing. Not necessarily just latching the results. Similar to the variable SIMD width design in a patent a while back. Not disputing what you said, just that there may be another possibility.

You won't need forwarding between SIMD units because the other SIMDs are running completely different wavefronts.
The SIMD needs to be able to store a few (say, four) results locally. Instructions can then instruct the SIMD to latch the result in one of these temporaries instead of writing it back to the register file. Subsequent instructions can specify source operands from the register file or the temporaries.
The sequence of instructions in a super instruction must not be interrupted/preempted, since you now have implicit temporary state in the SIMD.
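A minimal sketch of those semantics (Python, single lane for clarity; the encoding, the `Lane` class, and the four-temporary count are illustrative assumptions, not from the patent): each instruction names where to read its operands and whether its result latches in a temporary or writes back to the register file.

```python
# Sketch of a SIMD lane with a small bank of result temporaries.
# A destination of kind "t" latches the result locally (no register-file
# write); a source of kind "t" reads a previously latched result.

class Lane:
    def __init__(self):
        self.regs = {}              # register file: name -> value
        self.tmps = [None] * 4      # four local result temporaries

    def read(self, src):
        kind, idx = src
        return self.tmps[idx] if kind == "t" else self.regs[idx]

    def alu(self, op, a, b, dst):
        result = op(self.read(a), self.read(b))
        kind, idx = dst
        if kind == "t":             # latch locally, skip the register file
            self.tmps[idx] = result
        else:
            self.regs[idx] = result

lane = Lane()
lane.regs.update({"v0": 2.0, "v1": 3.0, "v2": 4.0})
# "Super instruction": the mul result is forwarded through t0;
# only the final add writes back to the register file.
lane.alu(lambda x, y: x * y, ("r", "v0"), ("r", "v1"), ("t", 0))
lane.alu(lambda x, y: x + y, ("t", 0), ("r", "v2"), ("r", "v3"))
print(lane.regs["v3"])  # 10.0
```

The no-preemption point above falls out of this model: `tmps` is live state that exists nowhere in the register file, so a context switch in the middle of the chain would lose it.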
Cheers
Combination of both, but reuse is likely a good chunk of it.

So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?
There aren't one or two features that explain why one architecture is more efficient than another.

So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?
The scope of the "super-SIMD" patent is apparently:

The different wavefronts part is what I'm not entirely clear on. For current GCN, yes, but S-SIMD might be utilizing a deeper SIMD structure with a similar capacity as the current CU. Applying VLIW to the current design of a 64-lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 possibly next to each other to simplify routing. Not necessarily just latching the results. Similar to the variable SIMD width design in a patent a while back. Not disputing what you said, just that there may be another possibility.
Tensor Core Equivalent?
http://www.freshpatents.com/-dt20180524ptan20180144435.php