AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    AMD has an inter-chip interconnect for CPUs, where all CPUs still very much act as individual agents. Nvidia has NVLink, which does the same for GPUs.

    There is no reason whatsoever to think that AMD knows quite a lot more from a practical point of view.
     
  2. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Exposing the ability to make multiple adapters act as one isn't in itself difficult; doing it efficiently is. The API specifies that geometry be rasterized in the order it is submitted, even though that is seldom required nowadays. It's very rarely needed if using OIT, as blending operations are what cause the issues. Drivers already selectively bypass this with out-of-order rasterization, but technically it's cheating and not to specification. Nvidia has the same issue, but they have higher clocks driving their front end. MCM with the current front ends would be along the lines of the Vega 56 to 64 comparison: added horsepower, but not necessarily improving framerate all that much. Then you need divisible compute tasks that largely stay on their respective chips for efficiency. Not all that difficult, but developers need to code with tiles and a possibly dynamic distribution in mind.

    Current designs have 4, and occasionally 6, shader engines, with geometry binned into quadrants. That number can be increased, but then there are issues balancing the work distribution, and any geometry crossing boundaries can be problematic. The gain is that most geometry doesn't cross boundaries and can be discarded in many cases.

    Still easier to use some form of split-frame rendering in that case, with each chip having its own memory channel and different resources loaded/cached. The different perspectives, as you suggested, would likely benefit heavily from better caching. SFR theoretically doubles your effective cache size to save memory bandwidth, where a scanline-type approach would only double cache bandwidth. I can't immediately think of any scenario where SFR wouldn't be superior. At best there would be a purely computational load where the distribution wouldn't even matter.
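    To make the binning/discard idea a bit more concrete, here is a minimal sketch (plain C++ with an invented region layout and triangle type, not anything taken from AMD's hardware or drivers) of how each chip in a split-frame setup could independently test a triangle's screen-space bounding box against the quadrant it owns and throw everything else away:

    #include <array>
    #include <cstdio>

    // Hypothetical screen-space rectangle, in pixels.
    struct Rect { int x0, y0, x1, y1; };

    // Axis-aligned bounding box of a projected triangle.
    struct TriBounds { int x0, y0, x1, y1; };

    // Does the triangle's bounding box touch the region this chip owns?
    // Geometry that does not overlap can be discarded before rasterization;
    // geometry straddling a boundary must be handled by every owner it touches.
    bool overlaps(const Rect& owned, const TriBounds& tri) {
        return tri.x1 >= owned.x0 && tri.x0 <= owned.x1 &&
               tri.y1 >= owned.y0 && tri.y0 <= owned.y1;
    }

    int main() {
        const int width = 3840, height = 2160;
        // Four chips/shader engines, one screen quadrant each (one possible layout).
        std::array<Rect, 4> quadrants = {{
            {0, 0, width / 2 - 1, height / 2 - 1},
            {width / 2, 0, width - 1, height / 2 - 1},
            {0, height / 2, width / 2 - 1, height - 1},
            {width / 2, height / 2, width - 1, height - 1},
        }};

        TriBounds tri{100, 100, 300, 250};  // falls entirely inside quadrant 0
        for (int chip = 0; chip < 4; ++chip)
            std::printf("chip %d: %s\n", chip,
                        overlaps(quadrants[chip], tri) ? "rasterize" : "discard");
    }

    Most triangles land entirely inside one quadrant and are discarded early by the other chips; only geometry crossing a boundary has to be processed by more than one owner, which is exactly the balancing problem described above.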
     
  3. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
    I don't think it's the same at all, considering that AMD does it inside the chip, while NVLink is basically a wider SLI protocol that still needs a hardware switch to do the job for it. On top of that, it doesn't have a direct link to the CPUs because of QPI's limitations.
     
  4. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    I don't think what was said is the full story. AMD has been working on stuff like this for a long time: https://patents.google.com/patent/US20160253774
     
  5. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,151
    Likes Received:
    571
    Location:
    France
    Yeah, but I don't really trust AMD with this kind of thing after the kind-of fiasco that Vega was. Like, they have very good ideas but fail to execute / implement them. And they have pretty limited resources.
     
    yuri and pharma like this.
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    You are very confused about what NVLink can do.
     
  7. Wynix

    Veteran Regular

    Joined:
    Feb 23, 2013
    Messages:
    1,052
    Likes Received:
    57
    Vega fiasco? Are you talking about the availability problem?
     
  8. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,151
    Likes Received:
    571
    Location:
    France
    I'm talking about performance, which was a letdown after all the hype and can't even beat "old Pascal" in most scenarios. But the real fiasco for me was the Primitive Shaders thing. On paper it was a really good idea, but it went from "we will auto-enable it in the driver" to "OK, you will need a special API to use them", and right now there is no way to use them at all. So we have an overclocked Fiji with more VRAM.

    That's why I said I don't "trust" their good ideas / patents.
     
    A1xLLcqAgt0qc2RyMz0y and yuri like this.
  9. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    225
    Likes Received:
    97
    The Primitive Shader thing is really disappointing. I don't know why AMD don't release a demo, or why they don't show the program where they measured the 17 primitives/clock from their white paper.
     
    #549 Digidi, Jul 21, 2018
    Last edited: Jul 21, 2018
  10. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,491
    Likes Received:
    909
    My guess is that they've just decided to cut their losses on Vega and focus their resources on making it work (probably better) on Navi. At this point, it would do little to increase Vega sales anyway.
     
  11. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,151
    Likes Received:
    571
    Location:
    France
    Or they're not focused on high-end PC gaming graphics anymore, like some articles suggest. It's all about AI/compute/datacenter and consoles: PC graphics in the low end and midrange, but no high end to compete with the top Nvidia cards. We've already been in this situation since Polaris... And maybe it's better for them right now. They don't have the resources to fight Nvidia everywhere.
     
  12. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    This was said in January:

    https://www.anandtech.com/show/1231...n-exclusive-interview-with-dr-lisa-su-amd-ceo
     
    ImSpartacus likes this.
  13. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    In reality they need both: AI and compute GPUs (I mean compute on a large scale: scientific, 3D modeling, rendering, data crunching, etc.) as well as high-end gaming. This is where resources become a problem...
     
  14. iamw

    Newcomer

    Joined:
    Jul 20, 2010
    Messages:
    21
    Likes Received:
    44
  15. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Computerbase did a test with SLI and Crossfire on some newer titles, doubling both computational power and memory bandwidth with the respective setups. They reported that it was actually quite hard for them to find suitable titles that had multi-GPU support in the first place.

    And yet, even though in some cases SLI and Crossfire managed to produce more frames per second, the results were in each and every case inferior to what a gamer would experience with one of the respective cards.
    https://www.computerbase.de/2018-07/crossfire-sli-test-2018/

    They also re-tested the SLI setup with a high-speed SLI bridge, which improved results considerably compared with the standard model. Maybe inter-die bandwidth and intelligent management of that data is more important for games after all?
     
    Silent_Buddha and pharma like this.
  16. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Still more beneficial to avoid the synchronization in the first place. AFR, which Crossfire and SLI will likely attempt, isn't going to work all that well. You really need SFR and a developer coding with it in mind. Not necessarily difficult, you just need to drop some of the guarantees: throw triangles at screen space without regard to submission order or to pixels crossing boundaries, and it should work. The current limitations are a bit silly.
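    To make the AFR/SFR distinction explicit, here is a toy sketch (invented helper names, no real API): AFR hands whole alternating frames to each GPU, so anything frame N+1 needs from frame N sits on the other GPU and has to be synchronized across, while SFR decides ownership per screen region inside every frame, so each GPU keeps reusing its own data frame after frame.

    #include <cstdio>

    constexpr int kGpuCount = 2;

    // AFR: whole frames alternate between GPUs.
    int afrOwner(int frameIndex) { return frameIndex % kGpuCount; }

    // SFR: within one frame, ownership is decided per tile column (a simple
    // left/right split here); every GPU works on its own region every frame.
    int sfrOwner(int tileX, int tilesPerRow) {
        return (tileX < tilesPerRow / 2) ? 0 : 1;
    }

    int main() {
        for (int frame = 0; frame < 4; ++frame)
            std::printf("AFR: frame %d -> GPU %d\n", frame, afrOwner(frame));
        const int tilesPerRow = 8;
        for (int x = 0; x < tilesPerRow; ++x)
            std::printf("SFR: tile column %d -> GPU %d\n", x, sfrOwner(x, tilesPerRow));
    }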
     
  17. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    178
    Likes Received:
    147
    This patent seems interesting. Any conclusions?

    A quick layman's skim: two 'GPUs' with good old 16-lane SIMDs and 4 'geometry engines', etc. Work could be split and done in parallel fairly easily after the GS stage (which seems intuitive). Messing with vertices is harder, and with tessellation included it gets very hard (and slow?)...
     
  18. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Well, if they had a Crossfire solution that worked well, there wouldn't be any reason not to put multiple GPU dies on a single interposer anymore ...
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    From my initial read, the patent appears to outline two scenarios for multiple GPUs working together to render to a screen space that is subdivided into zones each GPU is responsible for.
    Both scenarios involve a generally identical stream of commands being fed to all the GPUs.
    The first and most straightforward scenario uses language similar to other patents and disclosures that divide AMD's graphics pipeline into world-space work (input, transform, tessellation, etc.) and screen-space (pixel) work. Every GPU duplicates the world-space work and passes along to the screen-space portion of the pipeline only those fragments that overlap the area of the screen that GPU is responsible for. This sounds the most like the culling work done by primitive shaders and their methods for doing the same sort of coverage determination for the responsible shader engine and screen-space tiled ROPs within a GPU.
    It is more straightforward and requires a low amount of cross-communication and synchronization, but also does not leverage the extra hardware and resources very well prior to getting to the screen-space part of the process.

    The second scenario seeks to distribute world-space work among the different front ends with a dedicated work distribution facility. Each GPU still receives mostly the same commands, but the input assembly, setup, and geometry portions are farmed out in chunks for each GPU to take on individually. The patent puts forward round-robin distribution of sections of the index buffers and groups of setup primitives as an example.
    A work distributor is found between the primitive setup and geometry/vertex shader phases in the absence of tessellation, or the work distributor is found on the input and output ends of the tessellation unit.

    Each GPU's work distributor internally tracks the global API submission order with incrementing counters, a series of FIFOs for each geometry engine and tessellation block, and communications to and from other work distributors as to the status or ordering of various work items each GPU is responsible for.
    A work distributor will have a counter and FIFOs for each of its local geometry engines and shader launch pipes, as well as a series of FIFOs corresponding to the equivalent hardware belonging to other GPUs. Each distributor will run through the same evaluation process, and then it compares the calculated selection tags to what is available locally.

    The distributors accelerate the distributed work process by semi-independently incrementing the ordering count (each GPU derives its count from effectively the same command stream) and using the same load-balancing rules to rapidly pass data to local engines or discard elements that another GPU (independently making the same calculations) will cover. A lower number of updates related to completion status and the output of setup stages is broadcast from each GPU to all the others so that they have a consistent view of what the ordering is and what is in-progress. Some output data from the geometry engines is broadcast to the FIFOs in the other GPUs, whereas in other cases a stage that expands the amount of data like the tessellation stage might just have the work distributors pass the relevant ordering number to a GPU that will then fire up the selected tessellation unit, which will read in control points and feed the next surface/geometry shader locally.

    This allows the GPUs to provide more resources to hopefully speed up the world-space portion of the process, with a dedicated portion for maintaining ordering guarantees, broadcasting status and outputs, and making accelerated culling decisions about whether the local GPU will be handling a set of inputs or not. While there is a work distributor of sorts mentioned in recent AMD GPUs, the last part concerning culling seems to take over part of the culling duties of primitive shaders that might be part of the first scenario in the patent (and perhaps primitive shaders as we know them), placing that decision-making in this dedicated logic stage.
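    A minimal sketch of the replicated, deterministic distribution idea in that second scenario, as I read it (the names and the round-robin rule are invented; this only models the shared ordering counter and the "run locally or discard" decision, not the FIFOs or status broadcasts):

    #include <cstdio>
    #include <vector>

    // Invented type: a chunk is a contiguous slice of the index buffer.
    struct IndexChunk { int firstIndex, indexCount; };

    // Each GPU runs an identical distributor over the identical command stream.
    // Because all distributors use the same counter and the same round-robin
    // rule, they agree on ownership without asking each other first; only
    // completion/ordering status needs to be broadcast afterwards.
    class WorkDistributor {
    public:
        WorkDistributor(int gpuId, int gpuCount, int enginesPerGpu)
            : gpuId_(gpuId), gpuCount_(gpuCount), enginesPerGpu_(enginesPerGpu) {}

        void submit(const IndexChunk& chunk) {
            const int globalSlot   = orderingCounter_++;              // API order
            const int owningGpu    = (globalSlot / enginesPerGpu_) % gpuCount_;
            const int owningEngine = globalSlot % enginesPerGpu_;
            if (owningGpu == gpuId_) {
                std::printf("GPU %d: chunk @%d (%d indices) -> local engine %d, order %d\n",
                            gpuId_, chunk.firstIndex, chunk.indexCount,
                            owningEngine, globalSlot);
            } else {
                // A remote distributor made the same calculation and will run it;
                // locally we only keep the slot so output ordering is preserved.
                std::printf("GPU %d: chunk @%d skipped (owned by GPU %d), order %d\n",
                            gpuId_, chunk.firstIndex, owningGpu, globalSlot);
            }
        }

    private:
        int gpuId_, gpuCount_, enginesPerGpu_;
        int orderingCounter_ = 0;
    };

    int main() {
        const int gpuCount = 2, enginesPerGpu = 4;
        std::vector<IndexChunk> chunks;
        for (int i = 0; i < 8; ++i) chunks.push_back({i * 384, 384});

        // Every GPU sees the same submissions and reaches the same conclusions.
        for (int gpu = 0; gpu < gpuCount; ++gpu) {
            WorkDistributor wd(gpu, gpuCount, enginesPerGpu);
            for (const auto& c : chunks) wd.submit(c);
        }
    }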
     
    Cat Merc, iroboto and 3dcgi like this.
  20. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Someone added some GFX10 stuff to the settings_gfx9.json in GPUOpen-Drivers/pal:

    ----

    Gfx9UseDcc - "Bitmask of cases where DCC (delta color compression) surfaces will be used":

    New Option:

    Gfx10UseDccUav - "Shader writable surfaces(UAV)"

    ----

    Gfx10UseCompToSingle - "Whether we need to set first pixel of each block that corresponds to 1 byte in DCC memory, and don't need to do a fast clear eliminate based on image type."

    Options:

    Gfx10UseCompToSingleNone: "Use comp-to-reg all image type."
    Gfx10UseCompToSingle2d: "Use comp-to-single for 2d image."
    Gfx10UseCompToSingle2dArray: "Use comp-to-single for 2d array image."
    Gfx10UseCompToSingleMsaa: "Use comp-to-single for MSAA image."
    Gfx10UseCompToSingle3D: "Use comp-to-single for 3d image."
    Gfx10DisableCompToReg: "If set, comp-to-reg rendering will be disabled for images that were cleared with comp-to-single."

    ----

    SdmaPreferCompressedSource - "Affects tiled-to-tiled image copies on the GFX10 SDMA engine where both images are compressed Set to true to leave the source image in a compressed state, set to false to leave the dest image in a compressed state."

    ----

    Gfx10ForceWaveBreakSize - "Forces the size of a wave break; over-rides client-specified value."

    Options:

    Gfx10ForceWaveBreakSizeNone: "No wave breaks by region."
    Gfx10ForceWaveBreakSize8x8: "8x8 region size."
    Gfx10ForceWaveBreakSize16x16: "16x16 region size."
    Gfx10ForceWaveBreakSize32x32: "32x32 region size."
    Gfx10ForceWaveBreakSizeClient: "Use client specified value."
    Gfx10ForceWaveBreakSizeAuto: "Let PAL decide."

    ----

    And then there are some GFX9 options referencing a PAL_BUILD_NAVI10_LITE in addition to a PAL_BUILD_GFX10.

    Not exactly sure what it all could mean.
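    For what it's worth, a guess at what "wave break by region" could mean, as a toy C++ sketch (my interpretation, not PAL code): pixel quads are only packed into the same pixel-shader wave while they stay inside the same NxN screen region, and the forced break size picks N.

    #include <cstdio>
    #include <vector>

    struct Quad { int x, y; };  // top-left pixel of a 2x2 quad

    // Region ID for a pixel, assuming a power-of-two region size (8/16/32).
    int regionId(int x, int y, int regionSize, int screenWidth) {
        const int regionsPerRow = screenWidth / regionSize;
        return (y / regionSize) * regionsPerRow + (x / regionSize);
    }

    int main() {
        const int screenWidth = 256;
        const int breakSize = 16;  // e.g. the 16x16 setting
        std::vector<Quad> quads = {{0, 0}, {8, 0}, {14, 0}, {18, 0}, {40, 0}};

        int currentRegion = -1, waveCount = 0;
        for (const Quad& q : quads) {
            const int r = regionId(q.x, q.y, breakSize, screenWidth);
            if (r != currentRegion) {   // forced break: start a new wave
                ++waveCount;
                currentRegion = r;
            }
            std::printf("quad (%3d,%3d) -> region %d, wave %d\n", q.x, q.y, r, waveCount);
        }
    }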
     