AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    A ray tracing only architecture could make sense in that timeframe. With all shaders more or less running the same path, a cascaded SIMD setup would make a lot more sense. Back to the variable SIMD designs from a year or so ago.

    With a chiplet design, multiple concurrent architectures could make sense for different tasks.

    True, but fully raytracing a scene gets around the serial nature of rasterization. Packing multiple adapters into a system could be doable. Even 6x low to mid range cards could provide acceptable performance while being cheaper than Nvidia's offerings.
     
    Tkumpathenurpahl likes this.
  2. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    I wonder how well you could implement DXR in a small FPGA? And I also wonder why nobody so far has built a GPU with a small FPGA, which could be used to add/fix some functionality later? At least for professional GPUs the cost of an FPGA could be justified.
     
  3. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    332
    Likes Received:
    85
    Welp, there's the "half or less" performance of RTX in BFV: a 2080 can't even maintain a playable framerate at 4K max settings, and that's for a feature that isn't even apparent unless there's a very smooth, shiny surface onscreen at the moment. I do wonder what exactly the bottleneck is. Turing has its more divergent shading built in, but trying to grab random memory locations could be a major slowdown. Hooray, latency! How would you even design a memory system for random access? I remember some experimental US military(?) project stating that was the goal of a custom supercomputer they were building, but it was announced so recently that I doubt whatever system it's proposed to work in has even been built yet.

    Regardless, the bigger news is that Sony is skipping E3 next year, apparently due to a lack of games to show off, which kind of screams "PS5 2020". I'd sooner bet on Sony sticking with AMD than on MS doing so, considering the success they've had with AMD. But in what fashion? Will the success of the Switch make Sony go that route? Some back-of-the-napkin math for 7nm and Vega's rather nice TDP at low clock speeds shows they could squeeze a PS4 Pro into a 12-watt tablet, maybe 10. Along with more RAM and a much better CPU, that'd make a justifiable "next gen" machine a bit larger and louder than a Switch, yet with triple the raw compute of an Xbox One. Is that good enough, or would they go with something like the rumored Navi mainstream card and get 9/10+ teraflops for a stationary console?
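
    For reference, the public GCN numbers bear that "triple" figure out if you plug them into the usual FP32 formula (CUs x 64 lanes x 2 ops per clock x clock); a quick sketch:

```python
# Rough FP32 throughput check from public GCN specs (back-of-the-napkin only).
def gcn_tflops(cus, clock_ghz):
    # FP32 FLOPS = CUs * 64 lanes * 2 ops/clock (FMA) * clock
    return cus * 64 * 2 * clock_ghz / 1000.0

ps4_pro  = gcn_tflops(36, 0.911)   # ~4.2 TFLOPS
xbox_one = gcn_tflops(12, 0.853)   # ~1.31 TFLOPS
print(f"PS4 Pro ~{ps4_pro:.2f} TF, Xbox One ~{xbox_one:.2f} TF, ratio ~{ps4_pro / xbox_one:.1f}x")
```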

    We already know Microsoft is looking at a dual regular/"cloud" console business model, which seems more Nvidia's thing even though AMD has stated they're working on it too. It's good they have a regular console as a backup in case the cloud thing implodes, though. Which it could: I'm in the Google game streaming beta (it streams AC Odyssey), and it requires a 15 Mbps connection along with a very solid lack of packet drops. On a shared 50 Mbps connection that can be hard to get, and while that's not the max I could get where I live, the 100 Mbps connection is rather pricey. I live in the SF Bay Area, so it's not like broadband options are lacking. While LTE ping and packet loss are great, bandwidth is expensive; I can't imagine many people would be able to "cloud" to their mobile device without racking up a huge bill. There seem to be a lot of video game execs straight up getting off on the idea of game streaming as the future, but MS had to quickly backpedal on so much as requiring an internet connection to activate games with the Xbox One. So requiring a stable, high-bandwidth, low-latency connection to game at all seems a stretch for high adoption in the next two or three years.
     
  4. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Not difficult, but perhaps a bit wasteful considering the simplicity of the network. AMD uses FPGAs (ACE/HWS) for the work scheduling already. For compute, making a few configurable instructions could be useful, but there would be register bandwidth constraints. A better solution would be a slightly more complex forwarding network for all compute units: pass the result along to another ALU instead of writing it back to SRAM.
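
    Purely to illustrate the register-bandwidth argument, here's a toy accounting of a three-op dependent chain with and without forwarding (an invented model, not GCN's actual datapath):

```python
# Toy count of register-file traffic for a dependent chain of ALU ops,
# with and without result forwarding (illustrative model only).
chain_length = 3

# baseline: every op reads 2 operands from the register file and writes 1 result back
baseline = chain_length * 3                         # 9 accesses

# forwarding: each intermediate result is handed straight to the next ALU,
# so it is neither written back to SRAM nor re-read (2 accesses saved per value)
forwarded_values = chain_length - 1                 # 2 intermediate results
with_forwarding = baseline - 2 * forwarded_values   # 5 accesses

print(f"register-file accesses: {baseline} without forwarding, {with_forwarding} with")
```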

    More channels and really tight timings. Which likely means lots of raw bandwidth. Essentially crypto optimizations.
     
  5. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,173
    Likes Received:
    576
    Location:
    France
    Can a lot of cache help too?
     
  6. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    542
    Likes Received:
    171
    That helps a lot when there is time/space locality in your memory accesses, and there is a lot of that in the typical, non-streaming program usage that we normally call "random access". However, the crypto algorithms that have been tuned for random access exhibit "true" random access, where any address is equally likely. For that, caches do basically nothing at all.
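
    A quick toy simulation shows the difference (cache size and access patterns made up, just to show the shape of it):

```python
# Hit rate of a small direct-mapped cache under local vs. uniformly random access.
import random

def hit_rate(addresses, cache_lines=4096, line_size=64):
    cache = {}
    hits = 0
    for addr in addresses:
        tag, index = divmod(addr // line_size, cache_lines)
        if cache.get(index) == tag:
            hits += 1
        else:
            cache[index] = tag
    return hits / len(addresses)

memory_size = 1 << 30          # 1 GiB working set
n = 200_000

# "normal" access: small strides around a moving pointer -> lots of locality
local, addr = [], 0
for _ in range(n):
    addr = (addr + random.choice([4, 8, 64, -4])) % memory_size
    local.append(addr)

# "true" random access: every address equally likely -> no locality at all
uniform = [random.randrange(memory_size) for _ in range(n)]

print(f"local access hit rate:   {hit_rate(local):.1%}")    # close to 100%
print(f"uniform random hit rate: {hit_rate(uniform):.1%}")   # close to 0%
```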
     
    Anarchist4000 and Rootax like this.
  7. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,173
    Likes Received:
    576
    Location:
    France
    Ok, thanks. And sorry if this sounds stupid, but can a kind of "result cache" (like Oracle does in software...) be implemented in hardware and help? In some cases I'm pretty sure the exact same calculation has been done a few ms/ns before and doesn't need to be done again... Eh... as I'm typing this, I realize it may be dumb...
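
    (In software that idea is just memoization; a minimal sketch of it below, with no claim about how, or whether, it could be done in hardware.)

```python
# Minimal software "result cache" (memoization) -- illustrative only.
from functools import lru_cache

calls = 0

@lru_cache(maxsize=65536)
def expensive(x, y):
    global calls
    calls += 1                      # count how often the real computation runs
    return (x * x + y * y) ** 0.5

results = [expensive(i % 100, i % 50) for i in range(10_000)]
print(f"{len(results)} lookups, only {calls} actual computations")  # 10000 lookups, 100 computations
```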
     
  8. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    Are you alluding to this sort of thing?

    https://hal.inria.fr/hal-01193175/document
     
  9. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,173
    Likes Received:
    576
    Location:
    France
  10. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    249
    Likes Received:
    129
    pharma and McHuj like this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The command processor and other front end components had their details discussed when the PS4 was hacked.
    https://fail0verflow.com/media/33c3-slides/#/74
    There's a repository with a disassembler of the ISA, which as noted deals with processing command packets from queues and referencing them with a stored microcode payload. The payload has been referenced at various times with regards to the additional features such as rapid response queues and HSA.
    https://github.com/fail0verflow/radeon-tools/tree/master/f32

    I believe at one point someone found a linkedin entry from an engineer discussing working on an F32 processor for the ACEs or some other front end block a number of years ago.

    An FPGA doesn't seem like it would serve AMD well in this part of the GPU, which is running a loop that is a well-settled algorithm that has been hardware-implemented since forever. Other parts of the GPU's functionality would also not be well-served due to the overhead in hardware and indirection needed to emulate in an FPGA what is again a well-stabilized execution loop. FPGAs work best when there's an algorithm or set of them subject to change or an implementation too niche to justify a physical implementation, or a vendor too limited in resources or needing a quick proof of concept.
     
    BoMbY likes this.
  12. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    For the command processor, sure, but we're talking relatively simple tasks involving complex comparisons that can be updated. A simple PLD might be more reasonable. AMD was able to turn ACEs into HWS with some programmability that appears to occur in parallel. Simple binary comparisons were what was needed, involving only a few bits per comparison. I'd have to dig a bit for a reference from a few years back.

    The reference provided above was also from 2009, I believe, prior to the existence of the ACEs and async compute scheduling. An ASIC would be ideal, but the overkill of a PLD or FPGA makes sense if they foresaw a need to update the logic feeding the command processor. Documentation did show the same hardware being reconfigured based on what the card was doing: ACE, HWS, or some compute thing whose name I can't recall ATM. Same hardware with different dispatch scenarios and microcode. The fact that they were dealing with 3-4 bit or smaller comparisons for queue priority lends itself to programmable logic. A straight microcontroller could be too slow at those tasks.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The tasks they run are updated by changing the contents of the microcode store, and there have been instances where the front ends could not be updated to change their tasks due to a lack of sufficient microcode store for the various packet formats or functionality changes.
    The custom processor can move data or update registers deeper in the GPU based on the microcode software.

    I provided a reference that is more recent and addresses a GCN variant.

    The marketing diagrams were updated to either match the hardware more closely or to match choices made in the microcode and which pipelines ran which payload.
    https://www.phoronix.com/forums/for...ource-amd-linux/856534-amdgpu-questions/page3
    (#25 discusses the ME and MEC blocks, #28 explains how slides changed from ACEs to HWS in marketing for the same front end)
    My interpretation is that the processors themselves were not changed, their execution loop is to load a packet, cross reference with microcode, and execute the microcode.
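
    A caricature of that loop, just to make the shape concrete (the packet types and handlers below are invented; only the load / cross-reference / execute structure comes from the fail0verflow material):

```python
# Caricature of the front-end execution loop: fetch a command packet, look up
# its handler in the microcode store, execute it. Packet types and handlers
# here are invented for illustration; only the loop shape is the point.

microcode_store = {
    "WRITE_REG":        lambda pkt: f"write {pkt['value']:#x} to reg {pkt['reg']:#x}",
    "DISPATCH_COMPUTE": lambda pkt: f"launch {pkt['groups']} workgroups",
}

def front_end(queue):
    for packet in queue:                               # load the next packet
        handler = microcode_store.get(packet["type"])  # cross-reference with microcode
        if handler is None:
            raise ValueError(f"no microcode for packet type {packet['type']!r}")
        print(handler(packet))                         # execute the microcode routine

front_end([
    {"type": "WRITE_REG", "reg": 0x8010, "value": 0x1},
    {"type": "DISPATCH_COMPUTE", "groups": 64},
])
```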

    Too slow to compare a few small values? Even if individually too slow to work through a single-digit number of alternatives, there are 4-8 of them working concurrently.
     
  14. Azhrei

    Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    15
    Likes Received:
    7


    Coreteks has a "source" that cites 96 CUs for Navi. 64 was the limit for GCN, I believe; does anyone believe this? He also mentions a high-end Navi, which he calls Navi+, and which his source says should be out by the end of next year. Hmm.
     
  15. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    There's nothing implausible in there, but it could still be entirely made up.
     
  16. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    166
    Likes Received:
    82
    completely made up ....

    here is another "plausible" contribution from the same youtuber:

     
    Lightman, Bondrewd and eddieobscurant like this.
  17. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,565
    IIRC, 4x 16 CU "engines" was never a limitation in GCN.

    It's just that Fiji was already incredibly wide for its time and AMD thought it would be better to increase clocks + optimize geometry and ALUs + increase L2 cache instead.
    Whether that was the best choice with what they had is up for debate, though.
     
    AstuteCobra likes this.
  18. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    166
    Likes Received:
    82
    It was "never a limitation", blah blah blah..... The truth is that since Hawaii GPU AMD never get past 4 SE design, regardless CU/SE count ratio and three GCN generations of GPU´s
     
  19. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    The current GCN implementation has a limit of 4 SEs, 64 CUs, and 64 ROPs, but I think one AMD engineer at some point confirmed they could put in more SEs (and thus CUs and ROPs) if they revamped the frontend.
     
    Lightman likes this.
  20. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,565
    https://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-64-and-56-review/2
    These are AMD's statements. You may believe them or you may not.
     