DavidGraham
Veteran
VLIW scheduling is done entirely in software. This super-SIMD concept sounds like a combination of software and hardware. What's the difference with VLIW?
SUPER SINGLE INSTRUCTION MULTIPLE DATA (SUPER-SIMD) FOR GRAPHICS PROCESSING UNIT (GPU) COMPUTING
No idea if these are Vega or not, but currently published so running with it.
This isn't flexible scalar, as it's still SIMD. Cascaded SIMDs or Nvidia's Register File Caching would be more representative. The theory I had a while back was one or more flexible scalars handling special instructions and forwarding the result temporally through a SIMD-like structure similar to this. Create a systolic array with the scalar handling the uncommon instructions and forwarding between SIMDs. The difference between this and my theory is that I assumed forwarding between lanes instead of SIMDs, which should be superior to even this in regards to power. It gets tricky finding enough consecutive identical instructions, though.
The front end is fine, they just need to achieve higher average clocks. Super-SIMD could do that, as it should significantly reduce power usage by avoiding the register file and longer data paths. Less power, higher sustainable clock speeds.

AMD's limits are not ALU limits. AMD needs a better front end for workload distribution. The patent could be interesting for ray tracing. Maybe with this patent you also save some latency.
Sounds like it would increase CU complexity. They might have figured more, simpler CUs would be more efficient overall when clocked down than complex ones that take more die space and therefore each need higher clocks.

AFAICT, a super instruction is a sequence of instructions where the results of prior instructions are fed as input to subsequent instructions without being written back to the register file.
Likely to save some register file bandwidth (potentially allowing for more ALUs) and a lot of power. I'm surprised AMD isn't doing this already.
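A back-of-the-envelope sketch of that bandwidth saving (Python; the access-counting model is my own illustration, not from the patent): chaining dependent ops through local temporaries means only the final result touches the register file, and each chained op reads one fewer operand from it.

```python
# Toy model: count register-file accesses for a chain of dependent ops,
# with and without result forwarding. Numbers are illustrative only.

def rf_accesses(num_ops, operands_per_op=2, forwarding=False):
    """Each op reads its operands and writes one result.
    With forwarding, intermediate results stay in local temporaries:
    each op after the first takes one input from the forwarding path
    instead of the register file, and only the last result is written back."""
    if not forwarding:
        reads = num_ops * operands_per_op
        writes = num_ops
    else:
        reads = operands_per_op + (num_ops - 1) * (operands_per_op - 1)
        writes = 1
    return reads + writes

print(rf_accesses(4))                   # 12 accesses without forwarding
print(rf_accesses(4, forwarding=True))  # 6 accesses with forwarding
```

Halving the register-file traffic on a four-op chain is the kind of saving that shows up directly as power, since the register file and its long wires are a big dynamic-power consumer in a GPU.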
Cheers
The only added complexity should be the forwarding network between SIMDs and a bit of creativity with the scheduler. Odds are the four(?) SIMDs are interlaced with corresponding lanes/quads/whatever next to each other to keep distances to a minimum. The scheduler part would just be tracking wave dependencies across SIMD boundaries. Forwarding is not all that different from what's already there.

Sounds like it would increase CU complexity. They might have figured more, simpler CUs would be more efficient overall when clocked down than complex ones that take more die space and therefore each need higher clocks.
So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?

Doesn't NVIDIA already do this since Maxwell with their operand reuse cache?
The different wavefronts part is what I'm not entirely clear on. For current GCN, yes, but S-SIMD might be utilizing a deeper SIMD structure with a similar capacity as the current CU. Applying VLIW to the current design of a 64-lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 possibly next to each other to simplify routing. Not necessarily just latching the results. Similar to the variable SIMD width design in a patent a while back. Not disputing what you said, just that there may be another possibility.

You won't need forwarding between SIMD units because the other SIMDs are running completely different wavefronts.
The SIMD needs to be able to store a few (say, four) results locally. Instructions can then instruct the SIMD to latch the result in one of these temporaries instead of writing it back to the register file. Subsequent instructions can specify source operands from the register file or the temporaries.
The sequence of instructions in a super instruction must not be interrupted/preempted, since you now have implicit temporary state in the SIMD.
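A minimal sketch of those semantics (Python, single lane for clarity; the encoding, the `Lane` class, and the four-temporary count are illustrative assumptions, not from the patent): each instruction names where to read its operands and whether its result latches in a temporary or writes back to the register file.

```python
# Sketch of a SIMD lane with a small bank of result temporaries.
# A destination of kind "t" latches the result locally (no register-file
# write); a source of kind "t" reads a previously latched result.

class Lane:
    def __init__(self):
        self.regs = {}              # register file: name -> value
        self.tmps = [None] * 4      # four local result temporaries

    def read(self, src):
        kind, idx = src
        return self.tmps[idx] if kind == "t" else self.regs[idx]

    def alu(self, op, a, b, dst):
        result = op(self.read(a), self.read(b))
        kind, idx = dst
        if kind == "t":             # latch locally, skip the register file
            self.tmps[idx] = result
        else:
            self.regs[idx] = result

lane = Lane()
lane.regs.update({"v0": 2.0, "v1": 3.0, "v2": 4.0})
# "Super instruction": the mul result is forwarded through t0;
# only the final add writes back to the register file.
lane.alu(lambda x, y: x * y, ("r", "v0"), ("r", "v1"), ("t", 0))
lane.alu(lambda x, y: x + y, ("t", 0), ("r", "v2"), ("r", "v3"))
print(lane.regs["v3"])  # 10.0
```

The no-preemption point above falls out of this model: `tmps` is live state that exists nowhere in the register file, so a context switch in the middle of the chain would lose it.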
Cheers
Combination of both, but reuse is likely a good chunk of it.

So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?
There aren't one or two features that explain why one architecture is more efficient than another.

So, is this why Nvidia's arch is more efficient? Or this plus their tile-based rasterizer?
The scope of the "super-SIMD" patent is apparently:

The different wavefronts part is what I'm not entirely clear on. For current GCN, yes, but S-SIMD might be utilizing a deeper SIMD structure with a similar capacity as the current CU. Applying VLIW to the current design of a 64-lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 possibly next to each other to simplify routing. Not necessarily just latching the results. Similar to the variable SIMD width design in a patent a while back. Not disputing what you said, just that there may be another possibility.
Tensor Core Equivalent?
http://www.freshpatents.com/-dt20180524ptan20180144435.php