AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

AFAICT, a super instruction is a sequence of instructions where the result of prior instructions are fed as input to subsequent instructions without being written back to the register file.

Likely to save some register file bandwidth (potentially allowing for more ALUs) and a lot of power. I'm surprised AMD isn't doing this already.
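To make the bandwidth argument concrete, here's a toy Python model (all names invented, not from the patent) that counts register-file traffic for a chain of dependent instructions with and without result forwarding:

```python
# Hypothetical sketch: count register-file (RF) traffic for a dependent
# instruction chain. Without forwarding, every intermediate result is
# written back to the RF and re-read by the next instruction; with
# forwarding it stays in a latch between ALU operations.

def rf_traffic(chain, forwarding):
    """chain: list of ops, each consuming the previous op's result plus
    some fresh RF operands. Returns (rf_reads, rf_writes)."""
    reads = writes = 0
    for i, op in enumerate(chain):
        reads += op["rf_operands"]        # operands fetched from the RF
        if i > 0 and not forwarding:
            reads += 1                    # re-read the previous result
        is_last = i == len(chain) - 1
        if is_last or not forwarding:
            writes += 1                   # only the final result hits the RF
    return reads, writes

# e.g. a*b, then +c, then clamp: three dependent ops
chain = [{"rf_operands": 2}, {"rf_operands": 1}, {"rf_operands": 1}]
print(rf_traffic(chain, forwarding=False))  # (6, 3)
print(rf_traffic(chain, forwarding=True))   # (4, 1)
```

Even on this tiny three-op chain the RF writes drop from 3 to 1, which is where the power saving would come from.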

Cheers
 
AMD's limits are not ALU limits. AMD needs a better front end for workload distribution. The patent could be interesting for ray tracing. Maybe this patent also saves some latency.
 
This isn't flexible scalar, as it's still SIMD. Cascaded SIMDs or Nvidia's register file caching would be more representative. The theory I had a while back was one or more flexible scalars handling special instructions and forwarding the result temporally through a SIMD-like structure similar to this. Create a systolic array with the scalar handling the uncommon instructions and forwarding between SIMDs. The difference between this and my theory is that I assumed forwarding between lanes instead of SIMDs, which should be superior to even this in regards to power. It gets tricky finding enough consecutive identical instructions though.

AMD's limits are not ALU limits. AMD needs a better front end for workload distribution. The patent could be interesting for ray tracing. Maybe this patent also saves some latency.
The front end is fine, they just need to achieve higher average clocks. Super-SIMD could do that as it should significantly reduce power usage by avoiding the register file and longer data paths. Less power, higher sustainable clockspeeds.
 
AFAICT, a super instruction is a sequence of instructions where the result of prior instructions are fed as input to subsequent instructions without being written back to the register file.

Likely to save some register file bandwidth (potentially allowing for more ALUs) and a lot of power. I'm surprised AMD isn't doing this already.

Cheers
Sounds like it would increase CU complexity. They might have figured that more, simpler CUs would be more efficient overall when clocked down than complex ones that take more die space and therefore each need higher clocks.
 
Sounds like it would increase CU complexity. They might have figured that more, simpler CUs would be more efficient overall when clocked down than complex ones that take more die space and therefore each need higher clocks.
The only added complexity should be the forwarding network between SIMDs and a bit of creativity with the scheduler. Odds are the four(?) SIMDs are interlaced with corresponding lanes/quads/whatever next to each other to keep distances to a minimum. The scheduler part would just be tracking wave dependencies across SIMD boundaries. Forwarding not all that different from what's already there.
 
The only added complexity should be the forwarding network between SIMDs and a bit of creativity with the scheduler.

You won't need forwarding between SIMD units because the other SIMDs are running completely different wavefronts.

The SIMD needs to be able to store a few (say, four) results locally. Instructions can then instruct the SIMD to latch the result in one of these temporaries instead of writing it back to the register file. Subsequent instructions can specify source operands from the register file or the temporaries.

The sequence of instructions in a super instruction must not be interrupted/preempted, since you now have implicit temporary state in the SIMD.
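A minimal Python sketch of that latching idea (register and opcode names are my own invention, not GCN's): a lane with a register file plus a few local temporaries, where the destination and source names decide whether the RF is touched at all.

```python
# Hedged sketch of the temporary-latch scheme described above: names
# prefixed "t" hit the local latches, names prefixed "v" hit the
# vector register file. Only the final result of a "super instruction"
# needs an RF write.

class Lane:
    def __init__(self):
        self.rf = {}     # vector register file
        self.tmp = {}    # small set of local result latches (say, four)

    def _read(self, name):
        return self.tmp[name] if name.startswith("t") else self.rf[name]

    def execute(self, dst, op, *srcs):
        vals = [self._read(s) for s in srcs]
        result = {"add": sum, "mul": lambda v: v[0] * v[1]}[op](vals)
        (self.tmp if dst.startswith("t") else self.rf)[dst] = result

lane = Lane()
lane.rf.update({"v0": 2, "v1": 3, "v2": 10})
# super instruction: t0 = v0 * v1; v3 = t0 + v2 -- only v3 touches the RF
lane.execute("t0", "mul", "v0", "v1")
lane.execute("v3", "add", "t0", "v2")
print(lane.rf["v3"])  # 16
```

This also illustrates the preemption point: the value in `t0` exists only in the latch, so the two instructions have to issue back to back, or that state is lost.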

Cheers
 
You won't need forwarding between SIMD units because the other SIMDs are running completely different wavefronts.

The SIMD needs to be able to store a few (say, four) results locally. Instructions can then instruct the SIMD to latch the result in one of these temporaries instead of writing it back to the register file. Subsequent instructions can specify source operands from the register file or the temporaries.

The sequence of instructions in a super instruction must not be interrupted/preempted, since you now have implicit temporary state in the SIMD.

Cheers
The different wavefronts part is what I'm not entirely clear on. Current GCN yes, but S-SIMD might be utilizing a deeper SIMD structure with a similar capacity as the current CU. Applying VLIW to the current design of a 64 lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 possibly next to each other to simplify routing. Not necessarily just latching the results. Similar to the variable SIMD width design in a patent a while back. Not disputing what you said, just that there may be another possibility.

So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?
Combination of both, but reuse is likely a good chunk of it.
 
Culling at the front end is really efficient: every polygon that is discarded before becoming pixels, because it isn't needed, immediately saves a lot of power and speeds up the system dramatically.

The tile-based rasterizer is the key feature for more efficiency.
 
It's also worth considering the tiling architecture's behaviour from a cache perspective, since it's designed to reduce bandwidth and power use; Nvidia has also mentioned to the press in the past that tile data stays coherent in L2 rather than being read in and out multiple times during a single rendering pass.
Part of that can be seen with the following two relevant Nvidia patents:
Tiled Cache Invalidation patent: http://www.freepatentsonline.com/y2014/0122812.html
Efficient Cache Management in a Tiled Architecture: http://www.freepatentsonline.com/y2015/0193907.html
 
The different wavefronts part is what I'm not entirely clear on. Current GCN yes, but S-SIMD might be utilizing a deeper SIMD structure with a similar capacity as the current CU. Applying VLIW to the current design of a 64 lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 possibly next to each other to simplify routing. Not necessarily just latching the results. Similar to the variable SIMD width design in a patent a while back. Not disputing what you said, just that there may be another possibility.


Combination of both, but reuse is likely a good chunk of it.
The scope of the "super-SIMD" patent is apparently:

1. improving the existing 4R4W lane RF utilisation by introducing VLIW-2 and a second ALU per lane;
2. introducing an operand cache to optimise the movement & act as an extra read port; and
3. extending the issue logic to be able to select one bundled or two unbundled instructions from different wavefronts per cycle.
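Point 3 can be illustrated with a toy issue-arbiter in Python (my own invented structure, not the patent's logic): each cycle it issues either one pre-bundled VLIW-2 pair from a single wavefront, or two unbundled instructions from different wavefronts, keeping both ALUs per lane busy either way.

```python
# Toy model of the extended issue logic: per cycle, select one bundled
# (VLIW-2) instruction pair or two unbundled instructions from
# different wavefronts. A bundle is represented as a 2-tuple.

def issue(queues):
    """queues: {wave_id: [instr, ...]}. Returns the (wave, instr)
    slots issued this cycle, mutating the queues."""
    ready = [(w, q[0]) for w, q in queues.items() if q]
    # prefer a pre-bundled VLIW-2 pair from one wavefront
    for w, ins in ready:
        if isinstance(ins, tuple):
            queues[w].pop(0)
            return [(w, ins[0]), (w, ins[1])]
    # otherwise co-issue up to two singles from different wavefronts
    picks = ready[:2]
    for w, _ in picks:
        queues[w].pop(0)
    return picks

queues = {0: [("mul", "add"), "sub"], 1: ["mov"]}
print(issue(queues))  # [(0, 'mul'), (0, 'add')]  -- the bundle
print(issue(queues))  # [(0, 'sub'), (1, 'mov')]  -- two unbundled
```

Either way, two instruction slots are filled per cycle, which is the utilisation win the second ALU per lane depends on.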

While we could extrapolate such possibilities by looking at combinations of these patents/papers, it does seem a bit far-fetched in the context of this paper.
 
More stuff they won't be able to exploit, like Primitive Shaders? I'm trolling a little, but I don't really care about new features or patents anymore until I see them working and used in applications. I hope Navi & the future chips (if ...) will make it "simpler" for the software guys.
 
SiSoft leak of 28 CU Vega (2 GB HBM2): https://wccftech.com/amd-fenghuang-apu-3dmark-specs-performance-leak/

A cut-down Vega 12? Assuming a normal Vega 12 is 4-8 GB (single stack) with 32 CUs, i.e. half a Vega 64, then the above looks like the lowest-end bin for such. Half a Vega 56 with even less memory. Could easily fit into a budget APU or mobile device. There's no accurate boost clock, so no telling if the rumored improvements to Vega's clockspeed/12nm are true. Could see it next week though at Computex; AMD already said they'll be showing off new stuff there.
 