AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

  1. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Best guess is foveated rendering and a larger implicit wave size. Adapting the scheduling to 1x/2x/4x wave sizes shouldn't be all that difficult, and it would reduce pressure on instruction buffers/caches and the scheduler.
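
    A toy sketch of that scheduling idea, assuming a 16-lane physical SIMD beneath a 64-lane implicit wave (the widths, the mnemonic, and the skip-inactive-beat trick are all illustrative assumptions, not anything AMD has disclosed):

```cpp
#include <cstdint>
#include <cstdio>

constexpr int kSubWave = 16;  // assumed physical SIMD width
constexpr int kWave    = 64;  // assumed implicit/logical wave size

struct Instruction { const char* mnemonic; };

// One fetched instruction is replayed over 64/16 = 4 sub-wave "beats", so the
// sequencer touches the instruction buffer once per logical wave, not per beat.
void issue_wave(const Instruction& inst, uint64_t execMask) {
    for (int beat = 0; beat < kWave / kSubWave; ++beat) {
        unsigned subMask = (execMask >> (beat * kSubWave)) & 0xFFFFu;
        if (subMask == 0) continue;  // fully inactive sub-wave: free skip
        std::printf("beat %d: %s mask=%04x\n", beat, inst.mnemonic, subMask);
    }
}

int main() {
    // Upper lanes inactive: only two of the four beats actually issue.
    issue_wave({"v_add_f32"}, 0x00000000FFFF00FFull);
}
```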

    Also possible it implies some form of wave packing and MIMD behavior. Four(?) sequencers shared by all lanes, along with any scalars (lanes with more robust fetching, possibly inclusive). Technically a 64-lane SIMD and 3x scalars could execute simultaneously.

    Remove those ordering guarantees and it becomes a whole lot simpler. TBDR or OIT would allow you to defer the ordering at the possible expense of some culling opportunities and of overdraw storing unnecessary samples. Could probably reduce that expense to cases involving successive geometry, and a compaction process could limit it to cache bandwidth. Defer Z culling to an L2-backed ROP export. Lose some execution efficiency from unculled/masked lanes, but that shouldn't overly affect off-chip memory accesses in most cases.
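
    As a rough CPU-side illustration of "defer the ordering", a minimal capture-then-resolve sketch in the spirit of per-pixel-list OIT; the Fragment/Pixel layout is invented for the example and ignores storage compaction entirely:

```cpp
#include <algorithm>
#include <vector>

struct Fragment { float depth; float color[4]; };  // color[3] is alpha

struct Pixel {
    std::vector<Fragment> frags;  // captured in arrival order, unordered

    void submit(const Fragment& f) { frags.push_back(f); }

    // Resolve: sort back-to-front and blend, so submission order never mattered.
    void resolve(float out[4]) {
        std::sort(frags.begin(), frags.end(),
                  [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });
        for (int c = 0; c < 4; ++c) out[c] = 0.f;
        for (const Fragment& f : frags)
            for (int c = 0; c < 4; ++c)  // simple "over" blend
                out[c] = f.color[c] * f.color[3] + out[c] * (1.f - f.color[3]);
    }
};

int main() {
    Pixel p;
    p.submit({0.9f, {0.f, 0.f, 1.f, 0.5f}});  // far fragment, arrived first
    p.submit({0.5f, {1.f, 0.f, 0.f, 0.5f}});  // near fragment, arrived second
    float out[4];
    p.resolve(out);  // identical result for either arrival order
}
```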

    There is probably some corner case involving successive overlapping geometry where OIT isn't sufficient or edge detection is involved, but that seems remote. You'd need a shader somehow reliant on the prior triangle affecting the outcome within the draw call where OIT was insufficient or overly costly. Perhaps some sort of particle effect operating in screen space or atomics? Even then you could probably composite the entire draw call into its own render target with HBCC and dynamic memory allocation, and then use TBDR to composite everything in order.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    This at least seems to indicate there may be instances where UAV resources are able to use delta color compression.
    Per https://gpuopen.com/dcc-overview/, GCN disables compression for UAVs.

    There is a subset of scenarios where ordering can be relaxed, with GPUOpen listing scenarios where some kind of coverage and/or saturating target applies (non-blending G-buffer setup and depth-only rendering, respectively).
    Elsewhere, API ordering remains important for behavior that is correct, consistent, or tractable for human understanding.
    https://gpuopen.com/unlock-the-rasterizer-with-out-of-order-rasterization/
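
    For concreteness, the article above is describing the VK_AMD_rasterization_order Vulkan extension, so a sketch of opting a pipeline into relaxed ordering looks like this (all other pipeline setup elided):

```cpp
#include <vulkan/vulkan.h>

// Chained into the graphics pipeline's rasterization state; everything else
// about pipeline creation is elided here.
void enable_relaxed_order(VkPipelineRasterizationStateCreateInfo& rasterInfo,
                          VkPipelineRasterizationStateRasterizationOrderAMD& orderInfo) {
    orderInfo = {};
    orderInfo.sType =
        VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_RASTERIZATION_ORDER_AMD;
    // Safe only for the cases the article lists (e.g. depth-tested opaque
    // geometry, non-blending G-buffer fills); blending passes stay STRICT.
    orderInfo.rasterizationOrder = VK_RASTERIZATION_ORDER_RELAXED_AMD;

    rasterInfo.pNext = &orderInfo;
}
```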

    OIT is something AMD specifically cites as using ordering guarantees, which seems to make sense in scenarios where the GPU may discard different primitives from buffers on a per-tile basis.
    I would need clarification on why losing ordering guarantees is beneficial for TBDR, which already has a significant synchronization point built into waiting for all primitives to be submitted before transitioning to the screen-space portion, and how losing ordering guarantees allows tiles to give consistent results for geometry that straddles their boundaries.

    The patent's scenario places a premium on having strong ordering. The distributed processing method used by the work distributors relies on them calculating the same sequencing and target hardware, with the same ordering counts generated and assigned. In the scenarios where out of order rasterization makes sense in existing GPUs, it may devolve into a set of additional barriers between the fully ordered and safely unordered modes (entering and leaving), where the arbiters' counters are partially ignored or possibly frozen at a fixed value.

    The ordering starts to matter early in the pipeline. How index buffers are chunked, which FIFOs are broadcast to and read from, and which units are locally selected or presumed by the distributor to be handled by a different GPU, are based on the sequentially equivalent behavior of the distributor blocks and their arbiters. The chunking of the primitive stream and handling of primitives that span screen space boundaries can be affected by what each GPU calculates is its particular chunk or FIFO tag. If a hull shader's output is broadcast by GPU A to a FIFO and tagged with ordering number 1000, it doesn't help if GPU B was expecting it at 1001.
    Deciding which primitives can be discarded in an overlapping scenario can cause inconsistencies if different tiles do not agree on the order.
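
    A minimal sketch of that consistency requirement, with an invented assignment function (the patent does not spell one out): as long as every distributor computes the same pure function over the same stream, GPU A and GPU B cannot disagree about whether a chunk carries ordering number 1000 or 1001.

```cpp
#include <cstdint>

struct Chunk { uint32_t firstIndex; uint32_t indexCount; };

struct OrderingDecision { uint64_t token; uint32_t targetFifo; };

// Pure function of the stream position, replicated in every distributor, so
// each GPU independently computes identical tokens and FIFO targets.
OrderingDecision assign(uint64_t streamPos, const Chunk&, uint32_t fifoCount) {
    OrderingDecision d;
    d.token      = streamPos;                                     // monotonic ordering count
    d.targetFifo = static_cast<uint32_t>(streamPos % fifoCount);  // round-robin
    return d;
}
```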
     
  3. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Agreed, but in the case of mGPU the performance deltas would be far more substantial. Emphasis on quickly frustum-culling triangles. More than likely entire draw calls could be culled from some sections of screen space, so there would be a need to push ahead. That early culling pass would be clearing a lot of geometry.

    That is still an AMD extension, but should work for everyone easily enough.

    OIT is ordered, but as nothing is discarded, the blending will be deferred until all samples are present. Exception for PixelSync or compression mechanisms discarding the least relevant samples, which are presumed lossy anyway. In application, any error from ordering should fall into the inconsequential category of samples that gets compressed or discarded. With programmable blending it would be up to the developer to decide how to manage it. Frames likely wouldn't be reproducible, but the difference was already determined to be inconsequential. Or all samples could be held and accuracy ensured at a significant performance cost.
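
    A minimal sketch of such a bounded, lossy buffer, loosely in the spirit of PixelSync-class methods (the fixed budget K, the alpha-based "least relevant" heuristic, and replacing rather than merging are all simplifying assumptions):

```cpp
constexpr int K = 4;  // assumed per-pixel fragment budget

struct Frag { float depth; float alpha; };

struct BoundedPixel {
    Frag slots[K];
    int count = 0;

    void insert(const Frag& f) {
        if (count < K) { slots[count++] = f; return; }
        int weakest = 0;  // locate the least relevant stored sample
        for (int i = 1; i < K; ++i)
            if (slots[i].alpha < slots[weakest].alpha) weakest = i;
        if (f.alpha > slots[weakest].alpha) slots[weakest] = f;  // lossy replace
    }
};
```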

    Not beneficial so much as irrelevant, as overdraw should be very limited. TBDR has a sync point, but the execution can be overlapped with other frames and/or compute tasks. Even if stalling at a sync point, utilization should remain high with async compute or rendering tasks.

    The patent was also assuming an ordering requirement as the status quo. Relaxing the restriction should eliminate the need for the patent in the first place.

    Order shouldn't matter given a developer flagging a relaxed state, in which case a FIFO wouldn't need to exist and the front end could operate with parallel pipelines. Given a relaxed state, an arbitrary number of SEs could exist for the purpose of increasing geometry throughput. Not all that applicable for current hardware, as the up-to-4-SE design is rather efficient, but with MCM, mGPU, or >4 SEs you should see close to linear scaling without the dependencies.
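
    A sketch of what a relaxed-order, FIFO-less front end could look like, assuming a static checkerboard of tile ownership (the ownership function and tile size are invented for illustration; nothing here reflects real SE wiring):

```cpp
#include <cstdint>

constexpr int kTile = 32;  // illustrative tile size in pixels

struct Box { int x0, y0, x1, y1; };  // triangle screen-space bounds

// Each SE scans the stream independently; no cross-SE FIFO or arbiter needed,
// because ownership is a pure function of tile coordinates.
bool se_owns_triangle(const Box& b, uint32_t seId, uint32_t seCount) {
    for (int ty = b.y0 / kTile; ty <= b.y1 / kTile; ++ty)
        for (int tx = b.x0 / kTile; tx <= b.x1 / kTile; ++tx)
            if (static_cast<uint32_t>(tx + ty) % seCount == seId)
                return true;  // at least one owned tile: keep the triangle
    return false;             // discard locally, without telling anyone
}
```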
     
  4. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,773
    Likes Received:
    2,560
    Ashraf Eassa on Twitter:
    I just clarified with @AMD about the annual cadence of GPUs: they’re committing to annual products, not necessarily new architectures every year (e.g. RX 480 and RX 580 are different products, but same Polaris architecture)

     
  5. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,028
    Likes Received:
    3,101
    Location:
    Pennsylvania
    Rebrands of existing GPUs for OEM markets are new products as well.
     
    liolio, DavidGraham, xpea and 3 others like this.
  6. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    974
    Likes Received:
    141
    Location:
    Luxembourg
    If they actually did a refresh à la Ryzen 2/Zen+ it would be fine, but just using the same chip re-branded as a continuing strategy is a bit sad.
     
    Cat Merc, jacozz and Ike Turner like this.
  7. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    9,983
    Likes Received:
    1,494
    Wonder if they can just get higher bins of the current Vega chips. I mean, it's been a year now that they have been producing them; hopefully they can get higher clocks / lower power draw.
     
  8. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,495
    Likes Received:
    910
    Probably, but would it matter? It reminds me of this:
    https://www.amd.com/en/products/cpu/fx-9590
     
  9. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    9,983
    Likes Received:
    1,494
    Well, the 56/64 compete with the cards in their price range. If they get higher bins they can travel upwards in pricing and keep some market share. It is worrisome that they have nothing new for 2018. I am hoping they get back on track in 2019; having two strong graphics card makers is important. Just look at what AMD has done in the CPU market.
     
  10. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    178
    Likes Received:
    147
    First they need to ditch GCN. It is awfully outdated - just recall it was designed for the era of games like TES V: Skyrim. The gaming tech has advanced a lot since then... hasn't it?

    They hit their top architecture config back in 2013 with Hawaii. The arch simply couldn't scale well past that point. It's been 5 years dealing with weird products killed by marketing bs (2.8x perf/W, 4GB HBM1, visibly higher IPC).

    Btw where are the promised '2018 Vega Mobile', 'Vega 10' or 'Vega 11' chips? Navi was probably cut in a Bulldozer-like fashion. They need to find their way outta their paper bag...
     
    Heinrich4, pharma, Newguy and 2 others like this.
  11. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    So, then, why don't you take one of the many open positions at AMD and design them a completely new ISA which is better than GCN?
     
    hkultala likes this.
  12. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    Is it? It was released about 3 months before Kepler. From the gamer's standpoint, the only difference between Kepler and the current Pascal generation is the resolved performance drop in DX12 games. The rest is raw performance (clocks, more units) and bandwidth optimisations. GCN is still well suited for current gaming workloads. The only problem is the lack of proper optimisations for clock speed and power consumption (most likely because of the lack of resources during Read's era). This problem can't be solved simply by switching to a different architecture. These optimisations have to be implemented (in GCN or any other new architecture).

    Once you underclock e.g. a GTX 1080 Ti to the level of Vega 64 (similar die size), the resulting performance isn't much different, so performance per square millimeter per clock is similar. The real problem is power consumption and clock speed.
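
    To make "performance per square millimeter per clock" concrete: the die areas below are the public figures for Vega 10 and GP102, but the clock and relative performance numbers are placeholders rather than measurements, so treat this purely as the shape of the calculation:

```cpp
#include <cstdio>

int main() {
    // Die areas are public figures; clocks and relative perf are placeholders.
    struct Gpu { const char* name; double mm2, ghz, perf; };
    const Gpu gpus[] = {
        {"Vega 64 (Vega 10)",   486.0, 1.40, 100.0},
        {"GTX 1080 Ti (GP102)", 471.0, 1.40, 103.0},  // underclocked to match
    };
    for (const Gpu& g : gpus)
        std::printf("%-22s %.4f perf/mm^2/GHz\n", g.name, g.perf / (g.mm2 * g.ghz));
}
```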

    It's possible that it will be simpler for AMD to bring these changes together with a new architecture, but in terms of current gaming workloads and feature set, GCN isn't outdated. I'd even say it was very future-proof.
     
    Lightman likes this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    432
    Location:
    New York
    I don't think clocks are Vega's problem. Its deficits in almost every efficiency metric are very apparent without artificially crippling the competition. It would also fare poorly in a comparison vs an underclocked Ti just as it does vs the 1080.
     
  14. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,169
    Likes Received:
    576
    Location:
    France
    The geometry thing is lacking on Vega; even if AMD made small steps with Polaris, it's nowhere near Nvidia. And, yes, power consumption. And it was like a year late... It performs like an OC Fiji without the 4GB VRAM constraint, despite all the blabla about primitive shaders (not working/available), packed math (not used a lot...), DSBR, etc...

    I don't know if it's outdated, but it seems to need resources they don't have to make it work in an efficient way, with drivers exposing all the functions...
     
    Heinrich4, DavidGraham and Ike Turner like this.
  15. Ike Turner

    Veteran Regular

    Joined:
    Jul 30, 2005
    Messages:
    1,884
    Likes Received:
    1,756
    I personally think that it's a resource and mismanagement thing which in turn resulted in shoddy products (Polaris/Vega etc). Raja's tenure as head of Radeon/RTG was shitty at best from the get-go. Rumors of him shopping the Radeon group to Intel for acquisition, then AMD forming RTG to appease this "rebellion". Trash-tier marketing... but then again Roy Taylor was at the helm, so this was expected; he's always been kind of a slime ball. Dude started his career at Nvidia in 2000, where he created the TWIMTBP program etc., stayed till 2013, then switched to AMD, messed things up for 4 years and bounced when Raja was booted (yup, he arrived at AMD at the same time as Raja when he came back from Apple, and left when Raja left).

    Anyway, the GCN arch is fine and console devs seem pretty happy about it. Things are totally effed up on the PC side because AMD simply doesn't have the resources for great driver development, dev relations, R&D, etc.
    With the right support GCN has no problem outclassing Nvidia's arch... but there's simply barely any support, especially in pro apps, because of CUDA, which is the de facto standard. When things are on an even footing GCN is often faster than the equivalent NV arch of the time.

    TensorFlow 1.3 on Radeon vs TF 1.6 on Nvidia (ROCm port done by AMD; version 1.8 was released last month but I haven't seen benchmarks yet)

    For example, using the latest version of PhotoScan, a Fury X running the OpenCL path is as fast as the one-year-younger GTX 1080 running the CUDA path (but the 4GB limitation on Fiji just makes it unusable in some cases).
     
  16. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,927
    Likes Received:
    1,626
    No link to the review?
     
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    pharma likes this.
  18. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    The front-end / geometry bottleneck was removed with Polaris, and as a result Polaris 10 at 1080p and lower resolutions offers performance that is often pretty close to Fiji's. I really don't understand the requests for higher geometry performance. Geometry performance limits the frame rate at low resolutions, not at high resolutions. Vega's resolution scaling is almost the same as Pascal's. It definitely isn't limited by geometry performance in current workloads.
    DSBR is used as a bandwidth-saving feature ("fetch once") and it works well; otherwise Vega wouldn't perform up to 50 % better than Fiji with its lower bandwidth. It also helped Raven Ridge boost performance in bandwidth-limited situations. What probably isn't enabled is the "shade once" DSBR feature. There were some rumors that it could be dependent on the Primitive Shader, but who knows. Anyway, the Primitive Shader per se wouldn't affect gaming performance by more than 1-2 %. Vega isn't geometry limited. I think it could be fill rate limited.
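
    A toy sketch of the "fetch once" benefit: batching primitives per tile means each tile's render-target block crosses the bus once per batch rather than once per triangle. Nothing below models Vega's actual DSBR:

```cpp
#include <vector>

struct Tri { int id; };
struct TileBin { std::vector<Tri> tris; };  // primitives batched per tile

void render_batch(std::vector<TileBin>& bins) {
    for (TileBin& bin : bins) {
        // 1. Fetch this tile's color/Z block from memory once.
        for (const Tri& t : bin.tris) {
            (void)t;  // 2. Rasterize and blend entirely on chip.
        }
        // 3. Write the block back once: one round trip per batch, not per tri.
    }
}
```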
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Apologies for the delay in replying:

    PixelSync is an example of a class of OIT methods that provide a ceiling on the amount of context that needs to be maintained, providing a measure of consistency where pathological cases can cause significant performance or storage deltas between regions.
    Even without some kind of limit, which inevitably creates a boundary where ordering matters, other methods that do not place a storage bound use ordering to consistently handle fragments determined to be at the same depth.

    The assertion is that there are no inconvenient environments or data combinations where there's inconsistency between dominant contributors to the final blended pixels. It also would not help in the case of errors or faults, which become more intractable to debug if the dynamic behavior of the pipeline is not consistent or loses the context needed to trace down the problem.

    This places responsibility on the developers while robbing them of the means to reason through the behavior of the system, and the alternative is to force a slower and non-representative fallback that wouldn't help them reason about the behavior of the lossy method they've been made responsible for.

    The other tasks have no dependence on the primitives being evaluated for coverage. Out-of-order rasterization concerns itself with keeping back-pressure from fragments that have not reached the export stage from stalling the more serial rasterization and wavefront launch stages upstream. That works in cases where the desired output is indifferent to which specific primitive comes out, or where there is something else that cleans up afterwards, like depth checks.
    The stalling happens because, in cases where the ordering matters, the front end is not certain about the timing of overlapping fragments reaching export. For TBDR, the deferred process eliminates all but the final contributor to a pixel, so by definition there is no timing it needs to worry about.
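
    A small sketch of why the deferred pass removes the timing concern: once all of a tile's fragments are in, the survivor per pixel is a pure function of depth, so arrival order drops out (toy data layout, hypothetical types):

```cpp
#include <limits>
#include <vector>

struct Frag { int pixel; float depth; int prim; };

// After all of a tile's primitives are submitted, the visible primitive per
// pixel is determined by depth alone; there is no export timing to arbitrate.
std::vector<int> resolve_tile(const std::vector<Frag>& frags, int pixelCount) {
    std::vector<float> bestZ(pixelCount, std::numeric_limits<float>::max());
    std::vector<int>   bestPrim(pixelCount, -1);
    for (const Frag& f : frags)  // submission order is irrelevant here
        if (f.depth < bestZ[f.pixel]) {
            bestZ[f.pixel]    = f.depth;
            bestPrim[f.pixel] = f.prim;
        }
    return bestPrim;
}
```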

    What benefit arises from a TBDR that refuses to accurately cull fragments whose coverage it is in a position to determine perfectly is unclear, as the distance until they reach the point where OoO rasterization matters is well beyond the scope of the hardware pipeline the scheme is used for.


    Are the parallel GPUs reverting to the first scenario in the patent, where each GPU fully duplicates the world-space work and so cannot positively scale geometry throughput? Otherwise, what method allows for them to produce a consistent output?

    AMD has some other instances where the ordering matters as part of accelerating culling of in-flight primitives. In certain combinations of depth test and rasterization mode, it can discard all but the most recent in API ordering without duplicating depth checks or having to hold exports until all in-flight fragments reach the export stage.

    http://www.freepatentsonline.com/20180218532.pdf

    What mechanism exists to provide a consensus between all the SEs about the state of the pipeline or for elements that exist in more than one tile?
     
  20. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    37
    Likes Received:
    40
    It's not only geometry, though.
    GCN essentially is a massive parallel monster and Nvidia's Maxwell 1/2/2.2/3 (you know what I mean) is a speed demon.
    AMD traditionally has a horrible front end that chokes the entire card, and even in 2018 we saw things like the new Final Fantasy, where the tessellation setting was locked at x64 for AMD users for some reason and they claim "it was a mistake".
    AMD pretty much doesn't have ANYTHING till the new uarch comes into play (unless they are sandbagging with their ray tracing performance).
     