AMD: Navi Speculation, Rumours and Discussion [2019-2020]

xEx · Jun 10, 2019

It is, they are obviously trying to fool people into thinking the hardware of the Xbox and what enables its RT is Nvidia. AMD should respond.

DavidGraham · Jun 10, 2019

xEx said:
Btw is this even legal?

#RTXOn is simply a contest to win prizes by leaving comments on social media about which game the user wants to see RTX integrated into. It has nothing to do with Xbox or PlayStation or whatever. It's a running theme through all E3 content regardless of vendors.

snarfbot · Jun 10, 2019

xEx said:
You are forgetting that the infamous R9 3xxx series and Fury were her fault. And assuming the Vega situation was inevitable, yeah, after that AMD (appears to have) steered in the right direction.

Xbox is launching at the end of 2020, so it is more probably the next Navi, launched in 2020, that will be the one with RT.

I dunno, look at Pascal ray tracing: it doesn't have any hardware at all and is running it acceptably imo. At least for the methods implemented in actual games, like Tomb Raider or Battlefield.

Also from what I've gathered its async compute performance is what holds it back in demanding scenes.

AMD async compute is much better so should perform better right?

I think it would be a mistake not to support it in some form; even software would be sufficient. Sure they would get destroyed in benchmarks, but they're getting beaten pretty bad anyway, so why not.

xEx · Jun 10, 2019

0.25 a first view.

3dilettante · Jun 10, 2019

snarfbot said:
Also from what I've gathered its async compute performance is what holds it back in demanding scenes.

AMD async compute is much better so should perform better right?

Nvidia's way of handling compute and graphics in parallel had issues in past generations. Nvidia's expectations seemed to have the two types more separate, or there was a conflict in how it tracked the two that forced broader stalls and context switches at first, and also less flexibility in allocating SMs to one type or the other. That improved with each generation, such that by Pascal I'd say feature-wise it was in the ballpark of at least some GCN implementations, though some of the later GCN additions might not have directly corresponding features.
The differences are less stark with modern hardware, and the benefits of GCN's implementation were sometimes debatable because its graphics context processing could bottleneck frame times to the point where an Nvidia GPU could get close or do better running the compute synchronously.

There are some possible caveats with ray-tracing, at least going by initial Turing implementations. Ray-tracing on an SM seemed to place a barrier that restricted it from also launching compute, though it's not clear if that was a long-term architectural or implementation limit or a case of teething issues.

xEx said:
0.25 a first view.

While I won't try to interpret the nature of the various rectangles too much, at a glance it seems like this GPU does move more elements into the center of the die, with the shader core section ringing what might be the command processors and part of the cache hierarchy.
While not a perfect match to other layouts, this may mean the L2 or whatever the upper level cache is came in from the sides of the die. That, and the way the compute section appears to have its blocks flip orientation twice as much as prior GCN arrays, without the same symmetry of clusters across all of the new mid-lines, is another difference. (Not quite what Nvidia does, but it's closer than before.)
How the hardware for compute and graphics is distributed, or what all of the area in the middle strip and around the center portends should be interesting to hear more on.

del42sa · Jun 10, 2019

https://pbs.twimg.com/media/D8tQoMzWsAEAtdM.jpg

https://videocardz.com/80993/amd-radeon-rx-5700-xt-official-gaming-benchmarks-leaked

3dilettante · Jun 10, 2019

If the video is supposed to be of a 40 CU GPU, the arrangement has 20 blocks in the shader array section. Not sure which blocks to interpret as the front end or the L/S and cache blocks. It's possible that since the LLVM changes mention a mode where a workgroup can exist in two CUs that a pair of CUs sharing a front end and possibly other hardware is what was called a workgroup processor.

yuri · Jun 10, 2019

del42sa said:
https://videocardz.com/80993/amd-radeon-rx-5700-xt-official-gaming-benchmarks-leaked

Those 2x8 pins are confirmed now...

Arnold Beckenbauer · Jun 10, 2019

More pins - more power!

Pressure · Jun 10, 2019

So they keep the tradition alive by clocking it out of the sweet spot. VEGA 10 was pretty good at only 1.0V instead of 1.2V.

anexanhume · Jun 10, 2019

del42sa said:
https://pbs.twimg.com/media/D8tQoMzWsAEAtdM.jpg

https://videocardz.com/80993/amd-radeon-rx-5700-xt-official-gaming-benchmarks-leaked

Rough math says they're now at 80-90% of Nvidia in how FLOPs translate to game performance. Much better than Vega.

edit: This seems new.

https://videocardz.com/81012/amd-radeon-rx-5700-xt-and-radeon-rx-5700-final-specs

Radeon Multimedia Engine – Seamless Streaming

Improved Encoding (New HDR/WCG Encode HEVC)

8K Encode (HEVC & VP)

40% encoder speedups

Navi Stats

40 RDNA Compute Units

80 Scalar Proessors

2560 Stream Processors

160 64b bilinear filter units

Multilevel Cache

4MB L2, 512Kb L1

2x V$L0 Load Bandwidth

DCC Everywhere

Streamlined Graphics Engine

Geometry Engine (4 Prisms Shader Out, 8 Prim Shader In)

64 Pixel Units

4 Asynchronous Compute Enginers

Balanced Work Distirbution & Redistribution

Designed for higiher frequences at lower power

New Compute Unit Design
Great Compute Efficiency For Diverse Workloads

2x Instruction Rate (enabed by 2x Scalar Units and 2x Schedulers)

Single Cycle Issue (enabled by Executing Wwave32 on SIMD32)

Dual Mode Execution (Wave 32 and Wave 64 Modes Adapt for Workloads)

Resource Pooling (2 CUs Coordiate as a Work Group Processor)

Jawed · Jun 10, 2019

So this is what I think we might have:

In summary, I think the RF and LDS are shared by two compute units, each with 2x SIMD-32, SALU and TMU.

Each "quarter" is two shader engines, with 2 sets of 2xROP-4s, 64 ROPs in total.

Triskaine · Jun 10, 2019

Triskaine said:
AMD's next "Mid-Range" GPU is a 300 Watt Housefire with a blower fan, the memes write themselves at this point.

Okay, time for a slight revision: AMD's next Mid-Range GPU is a 225 Watt Mini-Housefire with a blower fan. Going by AMD's SOP it's also overvolted as hell to keep the yield up. In terms of process normalized performance per-watt it's still behind Turing by ~50%. With the HBM joker AMD can get to something High-End'ish next year, maybe around the 2080 Ti, but overall nothing that in any way threatens nVidia's dominant position. I'll probably buy one anyway, because I'm a sucker for housefire silicon.

Jawed · Jun 10, 2019

https://videocardz.com/81012/amd-radeon-rx-5700-xt-and-radeon-rx-5700-final-specs
with typos fixed:

40 RDNA Compute Units

80 Scalar Processors

2560 Stream Processors

160 64b bilinear filter units

Multilevel Cache

4MB L2, 512Kb L1

2x V$L0 Load Bandwidth

DCC Everywhere

Streamlined Graphics Engine

Geometry Engine (4 Prims Shader Out, 8 Prims Shader In)

64 Pixel Units

4 Asynchronous Compute Engines

Balanced Work Distribution & Redistribution

Designed for higher frequencies at lower power

New Compute Unit Design
Great Compute Efficiency For Diverse Workloads

2x Instruction Rate (enabled by 2x Scalar Units and 2x Schedulers)

Single Cycle Issue (enabled by Executing Wave32 on SIMD32)

Dual Mode Execution (Wave 32 and Wave 64 Modes Adapt for Workloads)

Resource Pooling (2 CUs Coordinate as a Work Group Processor)

This refers to SIMD32s

The scalar ALUs have been really beefed up.
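
As a quick sanity check on how the list above fits together, here is a minimal Python sketch that derives per-CU ratios from the quoted totals. The dictionary keys and the `per_cu` helper are made up for illustration, and the 32-lane SIMD width is an assumption read off the "SIMD32" name rather than something the slide states outright.

```python
# Totals as quoted in the spec list (40 CU part).
SPEC = {
    "compute_units": 40,
    "scalar_processors": 80,
    "stream_processors": 2560,
    "bilinear_filter_units": 160,
}

CUS_PER_WGP = 2   # "2 CUs Coordinate as a Work Group Processor"
SIMD_WIDTH = 32   # assumption: a SIMD32 has 32 lanes


def per_cu(spec):
    """Divide the chip-wide totals down to per-CU figures (hypothetical helper)."""
    cus = spec["compute_units"]
    sp_per_cu = spec["stream_processors"] // cus
    return {
        "stream_processors_per_cu": sp_per_cu,
        "scalar_processors_per_cu": spec["scalar_processors"] // cus,
        "filter_units_per_cu": spec["bilinear_filter_units"] // cus,
        "simd32_per_cu": sp_per_cu // SIMD_WIDTH,
        "work_group_processors": cus // CUS_PER_WGP,
    }


ratios = per_cu(SPEC)
for name, value in ratios.items():
    print(f"{name}: {value}")
```

Under those assumptions the totals work out to 64 stream processors, 2 scalar processors, 4 filter units and 2 SIMD32s per CU, and 20 work group processors for the chip, which lines up with the 20 blocks counted in the shader array earlier in the thread.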

Per Lindstrom · Jun 10, 2019

Jawed said:
So this is what I think we might have:

In summary, I think the RF and LDS are shared by two compute units, each with 2x SIMD-32, SALU and TMU.

Each "quarter" is two shader engines, with 2 sets of 2xROP-4s, 64 ROPs in total.

Interesting! What about the central area?

mczak · Jun 10, 2019

Does that mean wavefront size can be either 32 or 64 or do I misunderstand this...
Oh, and apparently full speed fp16 texture filtering, catching up with nvidia (since fermi IIRC) there...

Per Lindstrom · Jun 10, 2019

Jawed said:
So this is what I think we might have:

In summary, I think the RF and LDS are shared by two compute units, each with 2x SIMD-32, SALU and TMU.

Each "quarter" is two shader engines, with 2 sets of 2xROP-4s, 64 ROPs in total.

I only see 8 * 4 ROPs, what am I missing?

anexanhume · Jun 10, 2019

mczak said:
Does that mean wavefront size can be either 32 or 64 or do I misunderstand this...
Oh, and apparently full speed fp16 texture filtering, catching up with nvidia (since fermi IIRC) there...

I think so. AMD had a patent on variable wavefront sizing.
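
To make the Wave32/Wave64 question concrete, here is a rough Python sketch of the usual back-of-envelope model: a wavefront is issued in passes of one SIMD width per cycle. The function names are hypothetical, and the model is a simplification, not a description of how the hardware actually schedules anything. The GCN figure (Wave64 on a 16-lane SIMD taking 4 cycles) is the well-known baseline; the Wave64-on-SIMD32 figure is just what this model predicts.

```python
import math


def issue_cycles(wave_size, simd_width):
    """Cycles to push one instruction through a whole wavefront,
    assuming one pass of simd_width lanes per cycle (simplified model)."""
    return math.ceil(wave_size / simd_width)


def wavefronts_needed(work_items, wave_size):
    """Number of wavefronts required to cover a given count of work items."""
    return math.ceil(work_items / wave_size)


# GCN baseline: Wave64 on a 16-lane SIMD.
gcn_cycles = issue_cycles(64, 16)

# Slide claim: Wave32 on SIMD32 issues in a single cycle.
wave32_cycles = issue_cycles(32, 32)

# What the same model predicts for Wave64 mode on SIMD32.
wave64_cycles = issue_cycles(64, 32)

print(gcn_cycles, wave32_cycles, wave64_cycles)

# A 100-item workgroup: more, smaller wavefronts in Wave32 mode.
print(wavefronts_needed(100, 32), wavefronts_needed(100, 64))
```

The trade-off this is meant to show: narrower wavefronts waste fewer lanes on small or divergent workloads, while Wave64 keeps compatibility with code tuned for GCN.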

Jawed · Jun 10, 2019

Per Lindstrom said:
Interesting! What about the central area?

Stuff! Too coarse-grained to say much about those blocks.

mczak said:
Does that mean wavefront size can be either 32 or 64 or do I misunderstand this...

Fascinating that work can be issued in 32-work item hardware threads. I was expecting 64 and 128...

Per Lindstrom said:
I only see 8 * 4 ROPs, what am I missing?

I've added extra yellow blocks to the picture. Hopefully the picture will update soon to show them.

I also added L1, just for the sake of it.

Nemo · Jun 10, 2019

So Navi has 512Kb L1?

Vega10 has 16Kb L1.