AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,750
    Likes Received:
    2,518
    VLIW scheduling is done entirely in software. This super SIMD concept sounds like a combination of software and hardware.
     
    Benetanegia likes this.
  2. Urian

    Regular

    Joined:
    Aug 23, 2003
    Messages:
    621
    Likes Received:
    55
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,521
    Likes Received:
    852
    AFAICT, a super instruction is a sequence of instructions where the results of prior instructions are fed as inputs to subsequent instructions without being written back to the register file.

    Likely to save some register file bandwidth (potentially allowing for more ALUs) and a lot of power. I'm surprised AMD isn't doing this already.
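
    To make the bandwidth argument concrete, here's a toy tally in Python of register file traffic for a two-instruction dependency chain, with and without forwarding the intermediate value. Purely illustrative and my own sketch, not anything from the patent; it just counts RF ports touched per lane.

        # Toy tally of register file (RF) traffic for d = a*b + c, then e = d*f.
        # Assumption (mine, for illustration): every source is an RF read and every
        # destination an RF write unless the value is forwarded/kept locally.

        def rf_traffic(chain, forwarded=frozenset()):
            reads = writes = 0
            for dst, srcs in chain:
                reads += sum(1 for s in srcs if s not in forwarded)
                if dst not in forwarded:
                    writes += 1
            return reads, writes

        chain = [("d", ["a", "b", "c"]),   # d = a*b + c
                 ("e", ["d", "f"])]        # e = d*f

        print(rf_traffic(chain))                   # (5, 2): plain writeback every time
        print(rf_traffic(chain, frozenset("d")))   # (4, 1): 'd' never touches the RF

    Multiply that per-lane saving by 64 lanes per wavefront and the RF port/power saving adds up quickly.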

    Cheers
     
  4. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
    AMD's limits aren't ALU limits; AMD needs a better front end for workload distribution. The patent could be interesting for ray tracing, though, and maybe it also saves some latency.
     
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    This isn't flexible scalar, as it's still SIMD. Cascaded SIMDs or Nvidia's register file caching would be more representative. The theory I had a while back was one or more flexible scalar units handling special instructions and forwarding the result temporally through a SIMD-like structure similar to this: create a systolic array, with the scalar handling the uncommon instructions and forwarding between SIMDs. The difference between this and my theory is that I assumed forwarding between lanes instead of between SIMDs, which should be superior even to this in regards to power. It gets tricky finding enough consecutive identical instructions, though.

    The front end is fine; they just need to achieve higher average clocks. Super-SIMD could do that, as it should significantly reduce power usage by avoiding the register file and longer data paths. Less power, higher sustainable clock speeds.
     
    ImSpartacus likes this.
  6. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    124
    Likes Received:
    108
    Sounds like it would increase CU complexity. They might have figured that more, simpler CUs clocked down would be more efficient overall than complex ones that take more die space and therefore each need higher clocks.
     
    milk likes this.
  7. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The only added complexity should be the forwarding network between SIMDs and a bit of creativity with the scheduler. Odds are the four(?) SIMDs are interlaced, with corresponding lanes/quads/whatever next to each other to keep distances to a minimum. The scheduler part would just be tracking wave dependencies across SIMD boundaries. The forwarding isn't all that different from what's already there.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,521
    Likes Received:
    852
    You won't need forwarding between SIMD units because the other SIMDs are running completely different wavefronts.

    The SIMD needs to be able to store a few (say, four) results locally. Instructions can then tell the SIMD to latch a result in one of these temporaries instead of writing it back to the register file. Subsequent instructions can specify source operands from the register file or from the temporaries.

    The sequence of instructions in a super instruction must not be interrupted/preempted, since you now have implicit temporary state in the SIMD.
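
    That model is easy to sketch. Here's my reading of it as a toy single-lane interpreter in Python (the operand naming and the count of four temporaries are assumptions for illustration, not the patent's encoding): destinations and sources can name either an RF register ('v...') or a lane-local temporary ('t0'-'t3'), and the whole bundle runs back-to-back so the temporaries never have to survive a preemption.

        # Toy single lane: an RF plus four local temporaries that a super
        # instruction can latch results into instead of writing back to the RF.
        import operator

        OPS = {"add": operator.add, "mul": operator.mul}

        def run_bundle(bundle, rf):
            temps = [0.0] * 4                      # implicit state, lost if preempted
            read = lambda n: temps[int(n[1])] if n[0] == "t" else rf[n]
            for op, dst, a, b in bundle:
                result = OPS[op](read(a), read(b))
                if dst[0] == "t":
                    temps[int(dst[1])] = result    # latch locally, no RF write
                else:
                    rf[dst] = result               # normal RF writeback
            return rf

        rf = {"v0": 2.0, "v1": 3.0, "v2": 4.0}
        # t0 = v0*v1; v3 = t0+v2 -> only one RF write for the whole two-op bundle
        print(run_bundle([("mul", "t0", "v0", "v1"),
                          ("add", "v3", "t0", "v2")], rf))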

    Cheers
     
  9. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    124
    Likes Received:
    108
    Hasn't NVIDIA already been doing this since Maxwell with their operand reuse cache?
     
  10. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,444
    Likes Received:
    108
    So, is this why Nvidia's architecture is more efficient? Or this plus their tile-based rasterizer?
     
  11. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    The different-wavefronts part is what I'm not entirely clear on. For current GCN, yes, but S-SIMD might be using a deeper SIMD structure with a capacity similar to the current CU: applying VLIW to the current design of a 64-lane MIMD with forwarding between quads(?). Lanes 0, 16, 32, and 48 would possibly sit next to each other to simplify routing. Not necessarily just latching the results; similar to the variable-SIMD-width design in a patent a while back. Not disputing what you said, just that there may be another possibility.

    A combination of both, but reuse is likely a good chunk of it.
     
  12. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
    It isn't one or two features that explain why one architecture is more efficient than another.
     
    Cat Merc and Putas like this.
  13. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
    Features at the front end are really efficient. Every polygon that gets culled before it reaches pixel work immediately saves a lot of power and speeds up the system dramatically.

    The tile-based rasterizer is the key feature for more efficiency.
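
    For anyone unfamiliar with the binning side of it, here's a very rough toy sketch in Python of the idea (massively simplified; the 32-pixel tile size and the bounding-box test are just placeholders): primitives get bucketed per screen tile first, so a tile's colour/depth data can stay on-chip while every triangle touching it is rasterized, instead of streaming framebuffer data through memory per triangle.

        # Toy tile binning: bucket triangle bounding boxes into 32x32 screen tiles.
        # Vastly simplified - real hardware bins post-transform primitives and keeps
        # each tile's render target data on-chip while that tile's bin is drawn.
        from collections import defaultdict

        TILE = 32

        def bin_triangles(tris, width, height):
            bins = defaultdict(list)              # (tile_x, tile_y) -> triangle ids
            for tid, verts in enumerate(tris):
                xs = [x for x, _ in verts]
                ys = [y for _, y in verts]
                x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
                y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
                for ty in range(y0 // TILE, y1 // TILE + 1):
                    for tx in range(x0 // TILE, x1 // TILE + 1):
                        bins[(tx, ty)].append(tid)
            return dict(bins)

        tris = [[(5, 5), (60, 10), (20, 70)],     # spans several tiles
                [(200, 200), (210, 205), (205, 220)]]
        print(bin_triangles(tris, 256, 256))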
     
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It's also worth considering the tiling architecture's behaviour from a cache perspective: it's designed to reduce bandwidth and power use. Nvidia has also mentioned to the press in the past that tile data stays coherent in L2 rather than being read in and out multiple times during a single rendering pass.
    Part of that can be seen with the following two relevant Nvidia patents:
    Tiled Cache Invalidation patent: http://www.freepatentsonline.com/y2014/0122812.html
    Efficient Cache Management in a Tiled Architecture: http://www.freepatentsonline.com/y2015/0193907.html
     
  15. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    The scope of the "super-SIMD" patent is apparently:

    1. improving the existing 4R4W lane RF utilisation by introducing VLIW-2 and a second ALU per lane;
    2. introducing an operand cache to optimise data movement and act as an extra read port; and
    3. extending the issue logic to be able to select one bundled or two unbundled instructions from different wavefronts per cycle.

    While we could extrapolate such possibilities by looking at combinations of these patents/papers, it does seem a bit far-fetched in the context of this paper.
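
    Point 3 reads to me like the sketch below (my interpretation in Python, not the patent's actual issue logic; the opcode names are made up): each cycle the scheduler either fills both ALUs with one VLIW-2 bundle from a single wavefront, or co-issues two unbundled instructions from two different wavefronts.

        # Toy issue logic for two ALUs per lane: prefer one VLIW-2 bundle from a
        # single wavefront, otherwise co-issue two unbundled instructions taken
        # from different wavefronts. An interpretation, not a spec.
        from collections import deque

        def issue_cycle(wave_queues):
            """wave_queues: dict wave_id -> deque of ('bundle', a, b) or ('single', a)."""
            # 1. A ready VLIW-2 bundle fills both ALU slots by itself.
            for wid, q in wave_queues.items():
                if q and q[0][0] == "bundle":
                    _, op_a, op_b = q.popleft()
                    return [(wid, op_a), (wid, op_b)]
            # 2. Otherwise pair singles, taking at most one per wavefront.
            issued = []
            for wid, q in wave_queues.items():
                if q and q[0][0] == "single":
                    issued.append((wid, q.popleft()[1]))
                    if len(issued) == 2:
                        break
            return issued

        queues = {0: deque([("single", "v_add")]),
                  1: deque([("bundle", "v_mul", "v_mac")]),
                  2: deque([("single", "v_sub")])}
        print(issue_cycle(queues))   # wave 1's bundle takes both ALU slots
        print(issue_cycle(queues))   # waves 0 and 2 co-issue their singles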
     
    pharma likes this.
  16. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
  17. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    967
    Likes Received:
    51
    Location:
    Canada
  18. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    858
    Likes Received:
    260
  19. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,151
    Likes Received:
    571
    Location:
    France
    More stuff they won't be able to exploit, like Primitive Shaders? I'm trolling a little, but I don't really care about new features or patents anymore until I see them working and used in applications. I hope Navi and the future chips (if ...) will make it "simpler" for the software guys.
     
  20. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    324
    Likes Received:
    82
    SiSoft leak of a 28 CU Vega (2 GB HBM2): https://wccftech.com/amd-fenghuang-apu-3dmark-specs-performance-leak/

    A cut-down Vega 12? Assuming a normal Vega 12 is 4-8 GB (single stack) and 32 CUs, i.e. half a Vega 64, then the above looks like the lowest-end bin for such: half a Vega 56 with even less memory. It could easily fit into a budget APU or mobile device. There's no accurate boost clock, so no telling if the rumored improvements to Vega's clockspeed on 12nm are true. We could see it next week at Computex, though; AMD already said they'll be showing off new stuff there.
     
    ImSpartacus likes this.