AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Best guess is foveated rendering and a larger implicit wave size. Adapting the scheduling to 1x/2x/4x wave size shouldn't be all that difficult and would reduce pressure on instruction buffers/caches and the scheduler.

Also possible it implies some form of wave packing and MIMD behavior. Four(?) sequencers shared by all lanes along with any scalars (lanes with more robust fetching, possibly inclusive). Technically a 64-lane SIMD and 3x scalars could execute simultaneously.
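Purely as an illustration of what per-pipeline wave-size selection looks like from the API side, here's a minimal Vulkan sketch using the (later) VK_EXT_subgroup_size_control extension; the 32 vs 64 choice below is just a stand-in for a 1x/2x wave-size split, not a statement about any particular AMD part.

Code:
// Sketch: requesting an explicit wave/subgroup size for a compute stage via
// VK_EXT_subgroup_size_control. The 32 vs 64 value is only an illustration of
// a 1x/2x wave-size split.
#include <vulkan/vulkan.h>

void fillComputeStage(VkShaderModule module,
                      uint32_t waveSize,  // e.g. 32 or 64
                      VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT* subgroupInfo,
                      VkPipelineShaderStageCreateInfo* stage)
{
    *subgroupInfo = {};
    subgroupInfo->sType =
        VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    subgroupInfo->requiredSubgroupSize = waveSize;

    *stage = {};
    stage->sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage->pNext  = subgroupInfo;   // chain the wave-size request onto the stage
    stage->stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage->module = module;
    stage->pName  = "main";
}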

This allows the GPUs to provide more resources to hopefully speed up the world-space portion of the process, with a dedicated portion for maintaining ordering guarantees, broadcasting status and outputs, and making accelerated culling decisions about whether their local GPU will be handling a set of inputs or not. While a work distributor of sorts is mentioned in recent AMD GPUs, the last part concerning culling seems to take over some of the culling duties of the primitive shaders that might be part of the first scenario in the patent (and perhaps of primitive shaders as we know them), placing the decision making in this dedicated logic stage.
Remove those ordering guarantees and it becomes a whole lot simpler. TBDR or OIT would allow you to defer the ordering at the possible expense of some culling opportunities and overdraw storing unnecessary samples. That expense could probably be reduced to cases involving successive geometry, and a compaction process could limit it to cache bandwidth. Defer Z culling to an L2-backed ROP export. You lose some execution efficiency from unculled/masked lanes, but that shouldn't overly affect off-chip memory accesses in most cases.

There is probably some corner case involving successive overlapping geometry where OIT isn't sufficient or edge detection is involved, but that seems remote. You'd need a shader somehow reliant on the prior triangle affecting the outcome within the draw call where OIT was insufficient or overly costly. Perhaps some sort of particle effect operating in screen space, or atomics? Even then you could probably composite the entire draw call into its own render target with HBCC and dynamic memory allocation and then use TBDR to composite everything in order.
 
New Option:

Gfx10UseDccUav - "Shader writable surfaces(UAV)"

This at least seems to indicate there may be instances where UAV resources are able to use delta color compression.
Per https://gpuopen.com/dcc-overview/, GCN disables compression for UAVs.
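Just to spell out what that toggle would imply (the structures and function below are hypothetical, made up for illustration, and are not PAL's actual interface): GCN-level parts would keep refusing DCC for shader-writable surfaces, while a gfx10 part would gate it behind the new setting.

Code:
// Hypothetical sketch only: what a Gfx10UseDccUav-style toggle would imply for
// metadata selection. GpuInfo, Settings and canUseDcc are invented names.
struct GpuInfo  { int gfxLevel; };          // e.g. 9 = Vega/GCN5, 10 = "gfx10"
struct Settings { bool gfx10UseDccUav; };   // mirrors the reported option name

bool canUseDcc(const GpuInfo& gpu, const Settings& s, bool shaderWritable)
{
    if (!shaderWritable)
        return true;                        // plain render targets: DCC as today
    // GCN drops DCC as soon as a surface is shader writable (UAV)...
    if (gpu.gfxLevel < 10)
        return false;
    // ...while the new option suggests gfx10 can keep it, behind a toggle.
    return s.gfx10UseDccUav;
}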

Remove those ordering guarantees and it becomes a whole lot simpler.
There is a subset of scenarios where ordering can be relaxed, with GPUOpen listing scenarios involving some kind of coverage and/or a saturating target (a non-blending G-buffer setup and depth-only rendering, respectively).
Elsewhere, API ordering remains important for behavior that is correct, consistent, or tractable for human understanding.
https://gpuopen.com/unlock-the-rasterizer-with-out-of-order-rasterization/

TBDR or OIT would allow you to defer the ordering at the possible expense of some culling opportunities and overdraw storing unnecessary samples.
OIT is something AMD specifically cites as using ordering guarantees, which seems to make sense in scenarios where the GPU may discard different primitives from buffers on a per-tile basis.
I would need clarification on why losing ordering guarantees is beneficial for TBDR, which already has a significant synchronization point built into waiting for all primitives to be submitted before transitioning to the screen-space portion, and how losing ordering guarantees allows tiles to give consistent results for geometry that straddles their boundaries.

The patent's scenario places a premium on having strong ordering. The distributed processing method used by the work distributors relies on them calculating the same sequencing and target hardware, with the same ordering counts generated and assigned. In the scenarios where out of order rasterization makes sense in existing GPUs, it may devolve into a set of additional barriers between the fully ordered and safely unordered modes (entering and leaving), where the arbiters' counters are partially ignored or possibly frozen at a fixed value.

You'd need a shader somehow reliant on the prior triangle affecting the outcome within the draw call where OIT was insufficient or overly costly.
The ordering starts to matter early in the pipeline. How index buffers are chunked, which FIFOs are broadcast to and read from, and which units are locally selected or presumed by the distributor to be handled by a different GPU, are based on the sequentially equivalent behavior of the distributor blocks and their arbiters. The chunking of the primitive stream and handling of primitives that span screen space boundaries can be affected by what each GPU calculates is its particular chunk or FIFO tag. If a hull shader's output is broadcast by GPU A to a FIFO and tagged with ordering number 1000, it doesn't help if GPU B was expecting it at 1001.
Deciding which primitives can be discarded in an overlapping scenario can cause inconsistencies if different tiles do not agree on the order.
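As a toy illustration (entirely hypothetical, not from the patent) of why this only works if the distributors behave as sequential equivalents: each GPU has to derive the identical ordering token from the same chunk, with no dependence on local timing or arrival order.

Code:
// Toy illustration: each GPU's distributor must compute the same ordering token
// for a given chunk of the primitive stream, or the FIFOs stop lining up.
#include <cassert>
#include <cstdint>

// Deterministic: the token depends only on the draw and the chunk index, so
// every GPU arrives at the same value independently.
uint64_t orderingToken(uint32_t drawId, uint32_t chunkIndex)
{
    return (uint64_t(drawId) << 32) | chunkIndex;
}

int main()
{
    // GPU A broadcasts chunk 1000 of draw 7; GPU B expects the same token.
    uint64_t producedByA = orderingToken(7, 1000);
    uint64_t expectedByB = orderingToken(7, 1000);
    assert(producedByA == expectedByB);   // only holds because both use the same
                                          // pure function of the same input
    return 0;
}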
 
There is a subset of scenarios where ordering can be relaxed, with GPUOpen listing scenarios involving some kind of coverage and/or a saturating target (a non-blending G-buffer setup and depth-only rendering, respectively).
Elsewhere, API ordering remains important for behavior that is correct, consistent, or tractable for human understanding.
https://gpuopen.com/unlock-the-rasterizer-with-out-of-order-rasterization/
AMDGPU said:
What kind of performance can you expect? We’ve seen increases in the 10% range. You won’t see any benefit if the driver had enabled it automatically of course (for instance, depth-only rendering). In nearly all other cases, the driver has to play safe and cannot enable it even though there wouldn’t be any visible artifacts. With this extension, we enable the application to decide on whether relaxed order rendering is sufficient and reap the performance benefits.
Agreed, but in the case of mGPU the performance deltas would be far more substantial, with an emphasis on quickly frustum-culling triangles. More than likely, entire draw calls could be culled from some sections of screen space, so there would be a need to push ahead. That early culling pass would be clearing a lot of geometry.

That is still an AMD extension, but should work for everyone easily enough.
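For reference, the Vulkan side of that extension is VK_AMD_rasterization_order; a minimal sketch of opting a pipeline into relaxed ordering is below (the structure and enum names are the public ones, and the rest of the pipeline setup is omitted).

Code:
// Sketch: opting a graphics pipeline into relaxed rasterization order via the
// VK_AMD_rasterization_order extension. Error handling and the remainder of
// pipeline creation are left out.
#include <vulkan/vulkan.h>

void fillRasterState(VkPipelineRasterizationStateRasterizationOrderAMD* orderInfo,
                     VkPipelineRasterizationStateCreateInfo* raster)
{
    *orderInfo = {};
    orderInfo->sType =
        VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_RASTERIZATION_ORDER_AMD;
    orderInfo->rasterizationOrder = VK_RASTERIZATION_ORDER_RELAXED_AMD;

    *raster = {};
    raster->sType       = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
    raster->pNext       = orderInfo;   // chain the relaxed-order request
    raster->polygonMode = VK_POLYGON_MODE_FILL;
    raster->cullMode    = VK_CULL_MODE_BACK_BIT;
    raster->frontFace   = VK_FRONT_FACE_COUNTER_CLOCKWISE;
    raster->lineWidth   = 1.0f;
}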

OIT is something AMD specifically cites as using ordering guarantees, which seems to make sense in scenarios where the GPU may discard different primitives from buffers on a per-tile basis.
OIT is ordered, but since nothing is discarded, the blending can be deferred until all samples are present. The exception is PixelSync or compression mechanisms that discard the least relevant samples, which are presumed lossy anyway. In practice, any error from ordering should fall into the inconsequential category of samples that get compressed or discarded. With programmable blending it would be up to the developer to decide how to manage it. Frames likely wouldn't be reproducible, but the differences would already have been deemed inconsequential; alternatively, all samples could be held and accuracy ensured at a significant performance cost.
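As a toy illustration of deferring the blend until all samples are present (a CPU-side sketch, not any particular GPU implementation): collect every fragment for a pixel, sort by depth, then blend back to front, so submission order stops mattering.

Code:
// Toy resolve step for order-independent transparency: blending is deferred
// until every fragment for the pixel is available, then order is recovered by
// a depth sort. Submission order no longer matters.
#include <algorithm>
#include <vector>

struct Fragment {
    float depth;     // larger = farther from the camera
    float rgba[4];   // straight-alpha color, blended with the "over" operator
};

void resolvePixel(std::vector<Fragment> frags, float out[4])
{
    // Back-to-front: farthest fragment first.
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });

    out[0] = out[1] = out[2] = out[3] = 0.0f;
    for (const Fragment& f : frags) {
        const float a = f.rgba[3];
        for (int c = 0; c < 3; ++c)
            out[c] = f.rgba[c] * a + out[c] * (1.0f - a);
        out[3] = a + out[3] * (1.0f - a);
    }
}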

I would need clarification on why losing ordering guarantees is beneficial for TBDR, which already has a significant synchronization point built into waiting for all primitives to be submitted before transitioning to the screen-space portion, and how losing ordering guarantees allows tiles to give consistent results for geometry that straddles their boundaries.
Not beneficial so much as irrelevant, as overdraw should be very limited. TBDR has a sync point, but the execution can be overlapped with other frames and/or compute tasks. Even when stalling at a sync point, utilization should remain high with async compute or rendering tasks.

The patent's scenario places a premium on having strong ordering. The distributed processing method used by the work distributors relies on them calculating the same sequencing and target hardware, with the same ordering counts generated and assigned.
The patent was also assuming an ordering requirement as the status quo. Relaxing the restriction should eliminate the need for the patent in the first place.

The ordering starts to matter early in the pipeline. How index buffers are chunked, which FIFOs are broadcast to and read from, and which units are locally selected or presumed by the distributor to be handled by a different GPU, are based on the sequentially equivalent behavior of the distributor blocks and their arbiters. The chunking of the primitive stream and handling of primitives that span screen space boundaries can be affected by what each GPU calculates is its particular chunk or FIFO tag. If a hull shader's output is broadcast by GPU A to a FIFO and tagged with ordering number 1000, it doesn't help if GPU B was expecting it at 1001.
Deciding which primitives can be discarded in an overlapping scenario can cause inconsistencies if different tiles do not agree on the order.
Order shouldn't matter given a developer flagging a relaxed state, in which case a FIFO wouldn't need to exist and the front end could operate with parallel pipelines. Given a relaxed state an arbitrary number of SEs could exist for the purpose of increasing geometry throughput. Not all that applicable to current hardware, as the up-to-4-SE design is rather efficient, but with MCM, mGPU, or more than 4 SEs you should see close to linear scaling without the dependencies.
 
If they would actually do a refresh à la Ryzen2/Zen+ it would be fine, but just re-branding the same chip as a continuing strategy is a bit sad.
I wonder if they can just get higher bins of the current Vega chips. It's been a year now that they have been producing them; hopefully they can get higher clocks / lower power draw.
 
Probably, but would it matter? It reminds me of this:
https://www.amd.com/en/products/cpu/fx-9590

Well, the 56/64 compete with the cards in their price range. If they get higher bins they can move upwards in pricing and keep some market share. It is worrisome that they have nothing new for 2018. I am hoping they get back on track in 2019; having two strong graphics card makers is important, just look at what AMD has done in the CPU market.
 
It is worrisome that they have nothing new for 2018. I am hoping they get back on track in 2019; having two strong graphics card makers is important, just look at what AMD has done in the CPU market.
First they need to ditch GCN. It is awfully outdated; just recall that it was designed for the era of games like TES V: Skyrim. Gaming tech has advanced a lot since then... hasn't it?

They hit their top architecture config back in 2013 with Hawaii. The arch simply couldn't scale well past that point. It's been 5 years dealing with weird products killed by marketing bs (2.8x perf/W, 4GB HBM1, visibly higher IPC).

Btw where are the promised '2018 Vega Mobile', 'Vega 10' or 'Vega 11' chips? Navi was probably cut in a Bulldozer-like fashion. They need to find their way outta their paper bag...
 
So, then, why don't you take one of the many open positions at AMD and design them a completely new ISA that is better than GCN?
 
First they need to ditch GCN. It is awfully outdated
Is it? It was released about 3 months before Kepler. From the gamer's standpoint, the only difference between Kepler and the current Pascal generation is the resolved performance drop in DX12 games. The rest is raw performance (clocks, more units) and bandwidth optimisations. GCN is still well suited for current gaming workloads. The only problem is the lack of proper optimisations for clock speed and power consumption (most likely because of the lack of resources during Read's era). This problem can't be solved simply by switching to a different architecture. These optimisations have to be implemented (in GCN or any other new architecture).

Once you underclock e.g. a GTX 1080 Ti to the level of Vega 64 (similar die size), the resulting performance isn't much different, so performance per square millimeter per clock is similar. The real problem is power consumption and clock speed.

It's possible that it will be simpler for AMD to bring these changes together with a new architecture, but in terms of current gaming workloads and feature set, GCN isn't outdated. I'd even say it was very future-proof.
 
Once you underclock e.g. a GTX 1080 Ti to the level of Vega 64 (similar die size), the resulting performance isn't much different, so performance per square millimeter per clock is similar. The real problem is power consumption and clock speed.

I don't think clocks are Vega's problem. Its deficits in almost every efficiency metric are very apparent without artificially crippling the competition. It would also fare poorly in a comparison vs an underclocked Ti just as it does vs the 1080.
 
The geometry thing is lacking on Vega; even if AMD made small steps with Polaris, it's nowhere near Nvidia. And, yes, power consumption. And it was like a year late... It performs like an OC Fiji without the 4 GB VRAM constraint, despite all the blabla about primitive shaders (not working/available), packed math (not used a lot...), DSBR, etc...

I don't know if it's outdated, but it seems to need resources they don't have to make it work in an efficient way and to get drivers exposing all the functions...
 
The geometry thing is lacking on Vega; even if AMD made small steps with Polaris, it's nowhere near Nvidia. And, yes, power consumption. And it was like a year late... It performs like an OC Fiji without the 4 GB VRAM constraint, despite all the blabla about primitive shaders (not working/available), packed math (not used a lot...), DSBR, etc...

I don't know if it's outdated, but it seems to need resources they don't have to make it work in an efficient way and to get drivers exposing all the functions...

I personally think it's a resource and mismanagement thing, which in turn resulted in shoddy products (Polaris/Vega etc.). Raja's tenure as head of Radeon/RTG was shitty at best from the get-go: rumors of him shopping the Radeon group to Intel for acquisition, then AMD forming RTG to appease this "rebellion". Trash-tier marketing... but then again Roy Taylor was at the helm, so this was expected; he's always been kind of a slimeball. The dude started his career at Nvidia in 2000 and stayed until 2013, where he created the TWIMTBP program etc., then switched to AMD, messed things up for 4 years, and bounced when Raja was booted (yup, he arrived at AMD at the same time as Raja, when Raja came back from Apple, and left when Raja left).

Anyway, the GCN arch is fine and console devs seem pretty happy about it. Things are totally effed up on the PC side because AMD simply doesn't have the resources for great driver development, dev relations, R&D etc.
With the right support, GCN has no problem outclassing Nvidia's arch, but there's simply barely any support, especially in pro apps, because of CUDA, which is the de facto standard. When things are on an even footing, GCN is often faster than the equivalent NV arch of the time.

TensorFlow 1.3 on Radeon vs TF 1.6 on Nvidia (ROCm port done by AMD; version 1.8 was released last month but I haven't seen benchmarks yet):
[attached chart: cifar10_average.png]


For example, using the latest version of PhotoScan, a Fury X running the OpenCL path is as fast as the one-year-younger GTX 1080 running the CUDA path (but the 4 GB limitation on Fiji just makes it unusable in some cases).
 
The geometry thing is lacking on Vega; even if AMD made small steps with Polaris, it's nowhere near Nvidia. And, yes, power consumption. And it was like a year late... It performs like an OC Fiji without the 4 GB VRAM constraint, despite all the blabla about primitive shaders (not working/available), packed math (not used a lot...), DSBR, etc...

I don't know if it's outdated, but it seems to need resources they don't have to make it work in an efficient way and to get drivers exposing all the functions...
The front-end / geometry bottleneck was removed with Polaris, and as a result Polaris 10 at 1080p and lower resolutions offers performance that is often pretty close to Fiji's. I really don't understand the request for higher geometry performance. Geometry performance limits the frame rate at low resolutions, not at high resolutions. Vega's resolution scaling is almost the same as Pascal's. It definitely isn't limited by geometry performance in current workloads.
DSBR is used as a bandwidth-saving feature ("fetch once") and it works well; otherwise Vega wouldn't perform up to 50% better than Fiji with its lower bandwidth. It also helped Raven Ridge boost performance in bandwidth-limited situations. What probably isn't enabled is the "shade once" DSBR feature. There were some rumors that it could be dependent on the Primitive Shader, but who knows. Anyway, the Primitive Shader per se wouldn't affect gaming performance by more than 1-2%. Vega isn't geometry limited. I think it could be fill-rate limited.
 
Apologies for the delay in replying:

OIT is ordered, but since nothing is discarded, the blending can be deferred until all samples are present. The exception is PixelSync or compression mechanisms that discard the least relevant samples, which are presumed lossy anyway.
PixelSync is an example of a class of OIT methods that provide a ceiling on the amount of context that needs to be maintained, providing a measure of consistency where pathological cases can cause significant performance or storage deltas between regions.
Even without some kind of limit, which inevitably creates a boundary where ordering matters, other methods that do not place a storage bound use ordering to consistently handle fragments determined to be at the same depth.
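As a sketch of what a bounded-context ("k-buffer"-style) scheme looks like (purely illustrative, not Intel's actual PixelSync mechanism): keep at most K fragments per pixel and drop the least relevant one once the budget is hit, which is exactly where the lossiness and the ordering boundary appear.

Code:
// Toy bounded-fragment ("k-buffer"-like) insertion: each pixel keeps at most K
// fragments; when full, the least relevant (here: the farthest) one is dropped.
// Illustration of bounded-storage OIT only.
#include <algorithm>
#include <array>

constexpr int K = 4;

struct Frag { float depth; float rgba[4]; };

struct PixelBuffer {
    std::array<Frag, K> frags;
    int count = 0;

    void insert(const Frag& f)
    {
        if (count < K) {                       // room left: just store it
            frags[count++] = f;
            return;
        }
        // Full: find the farthest stored fragment and replace it if the new
        // fragment is closer (i.e. more likely to dominate the blend).
        auto farthest = std::max_element(frags.begin(), frags.begin() + count,
            [](const Frag& a, const Frag& b) { return a.depth < b.depth; });
        if (f.depth < farthest->depth)
            *farthest = f;                     // the dropped sample is the lossy part
    }
};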

In practice, any error from ordering should fall into the inconsequential category of samples that get compressed or discarded.
That assertion assumes there are no inconvenient environments or data combinations where the dominant contributors to the final blended pixels become inconsistent, and it would not help in the case of errors or faults, which become far more intractable to debug if the dynamic behavior of the pipeline is not consistent or loses the context needed to trace down the problem.

With programmable blending it would be up to the developer to decide how to manage it. Frames likely wouldn't be reproducible, but the differences would already have been deemed inconsequential; alternatively, all samples could be held and accuracy ensured at a significant performance cost.
This places responsibility on the developers while robbing them of the means to reason through the behavior of the system, and the alternative is to force a slower and non-representative fallback that wouldn't help them reason about the behavior of the lossy method they've been made responsible for.

Not beneficial so much as irrelevant, as overdraw should be very limited. TBDR has a sync point, but the execution can be overlapped with other frames and/or compute tasks.
The other tasks have no dependence on the primitives being evaluated for coverage. Out-of-order rasterization is about keeping fragments that have not yet reached the export stage from creating back-pressure that stalls the more serial rasterization and wavefront-launch stages upstream. That works in cases where the desired output is indifferent to the specific primitive that comes out, or where there is something else that cleans up afterwards, like depth checks.
The stalling happens because in cases where the ordering matters, the front end is not certain about the timing of overlapping fragments reaching export. For TBDR, the deferred process eliminates all but the final contributor to a pixel, and so by definition there is no timing it needs to worry about.

It is unclear what benefit arises from a TBDR that refuses to accurately cull fragments whose coverage it is in a position to determine perfectly, as the distance until they reach the point where OoO rasterization matters is well beyond the scope of the hardware pipeline the scheme is used for.


The patent was also assuming an ordering requirement as the status quo. Relaxing the restriction should eliminate the need for the patent in the first place.
Are the parallel GPUs reverting to the first scenario in the patent, where each GPU fully duplicates the world-space work and so cannot positively scale geometry throughput? Otherwise, what method allows for them to produce a consistent output?

AMD has some other instances where the ordering matters as part of accelerating culling of in-flight primitives. In certain combinations of depth test and rasterization mode, it can discard all but the most recent in API ordering without duplicating depth checks or having to hold exports until all in-flight fragments reach the export stage.

http://www.freepatentsonline.com/20180218532.pdf

Given a relaxed state an arbitrary number of SEs could exist for the purpose of increasing geometry throughput.
What mechanism exists to provide a consensus between all the SEs about the state of the pipeline or for elements that exist in more than one tile?
 
The geometry thing is lacking on Vega; even if AMD made small steps with Polaris, it's nowhere near Nvidia. And, yes, power consumption. And it was like a year late... It performs like an OC Fiji without the 4 GB VRAM constraint, despite all the blabla about primitive shaders (not working/available), packed math (not used a lot...), DSBR, etc...

I don't know if it's outdated, but it seems to need resources they don't have to make it work in an efficient way and to get drivers exposing all the functions...
It's not only geometry, though.
GCN is essentially a massive parallel monster, and Nvidia's Maxwell 1/2/2.2/3 (you know what I mean) is a speed demon.
AMD traditionally has a horrible front end that chokes the entire card, and even in 2018 we saw things like the new Final Fantasy, where the tessellation setting was locked to x64 for AMD users for some reason and they claimed "it was a mistake".
AMD pretty much doesn't have ANYTHING until the new uarch comes into play (unless they are sandbagging with their ray tracing performance).
 