AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Are render targets these days typically 16, 32 or 64 bits per pixel? Maybe AMD is planning to “disrupt 4K gaming” by keeping primary buffers on chip and saving a whole load of ROP bandwidth.
 
not "RBE".

That's 16 bytes per pixel at 3840x2160. That's so much you'd be able to see it from space. Godlike. If I were NVidia, I would be scared.

Chip yield should be helped massively, too.

If this is true, then yes, this is a whole new era of graphics performance. This is XB360 on alien technology. This finally settles my doubts about the consoles targeting 120 fps for 4K TVs.
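A quick back-of-the-envelope check of that 16 bytes per pixel figure against the rumored 128 MB (the cache size is just the rumor; the rest is arithmetic):

Code:
# How many bytes per pixel fit in a rumored 128 MB on-chip cache at 4K?
width, height = 3840, 2160
cache_bytes = 128 * 1024 * 1024           # rumored 128 MiB, unconfirmed
pixels = width * height                   # 8,294,400 pixels
budget_per_pixel = cache_bytes / pixels   # ~16.18 bytes/pixel

used = pixels * 16                        # a 16 B/px buffer
print(f"budget: {budget_per_pixel:.2f} B/px, "
      f"16 B/px uses {used / 2**20:.1f} of 128 MiB")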

No need for a 384-/512-bit or HBM memory system.
?
Sorry, I'm not following along.
How do screen-tiled colour/depth units (dramatically) lower the need for bandwidth compared to the traditional RBE?
 
Are render targets these days typically 16, 32 or 64 bits per pixel? Maybe AMD is planning to “disrupt 4K gaming” by keeping primary buffers on chip and saving a whole load of ROP bandwidth.
A lot of lower-precision ones, with a couple of higher-precision ones. Some titles require 50+ different targets per frame.
 
How many of those are concurrently being written?
Hmm. I don't know if it matters. I think what matters is how much fill rate you require to complete the full frame. I don't think games will usually max out the full fill rate for the whole second.
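For a rough sense of scale on the fill-rate side; the 50-target count is from the post above, while the frame rate and peak fill rate are illustrative assumptions, not a specific GPU's spec:

Code:
# Pixels written per second if every render target were full-screen at 4K.
width, height = 3840, 2160
targets_per_frame = 50         # from the post above; many are smaller in practice
fps = 60                       # assumed target frame rate

pixels_written = width * height * targets_per_frame * fps   # ~24.9 Gpx/s
assumed_peak = 100e9           # assumed ~100 Gpixels/s peak fill rate

print(f"~{pixels_written / 1e9:.1f} Gpx/s written, "
      f"~{100 * pixels_written / assumed_peak:.0f}% of an assumed 100 Gpx/s peak")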
 
The latest RedGamingTech video is interesting, because it's based on one of his best sources (sorry, can't share from this device). But I wonder how a big cache will impact temps and power draw...

Edit :
Nice! Since I'm not in a hurry to decide on my future graphics card, as the 1080 is performing okay for now, this info has been the most valuable so far regarding AMD.

I'm still missing a DLSS-like solution, but performance per watt seems to be improved a lot, and given that I want to play all games at 165 fps 1440p native and I have a limited power budget (550W PSU; yeah, I thought about it before buying it and knew what I wanted), things are getting more interesting by the day.
 
Nice! Since I'm not in a hurry to decide on my future graphics card, as the 1080 is performing okay for now, this info has been the most valuable so far regarding AMD.

I'm still missing a DLSS-like solution, but performance per watt seems to be improved a lot, and given that I want to play all games at 165 fps 1440p native and I have a limited power budget (550W PSU; yeah, I thought about it before buying it and knew what I wanted), things are getting more interesting by the day.

Yeah, not in a hurry either; my watercooled Vega FE is doing the job fine for now. I don't play fast FPS games, so 40+ fps at high/max detail for my RPG/adventure games is OK for me, and with a FreeSync monitor it's all right (...and I've bought a new car, and I'm planning on a Pixel 5 and a PS5, so I don't have any money left :eek:).

I just want competition :)
 
You know, the rumor does seem weirder and weirder the more you think about it. The obvious thing staring one in the face is: if 128 MB of cache is so magical, why doesn't the PS5 or XSX use it, and why did the latter go for a giant 320-bit bus if the cache was so useful? And of course the XSX shouldn't need a 320-bit bus if somehow a 20+ teraflop RDNA2 chip only needs a 256-bit one.

I mean, what would you even do with 128 MB, fit the world's thinnest 4K G-buffer?
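If you want to take the quip seriously: a deliberately thin layout does just about squeeze in. The attachments and formats below are purely illustrative, not anything AMD or any engine has described:

Code:
# A hypothetical "thin" 4K G-buffer that fits in ~128 MB.
width, height = 3840, 2160
attachments = {                                  # bytes per pixel, all assumed
    "albedo (RGBA8)": 4,
    "normals (RG16 octahedral)": 4,
    "roughness/metalness/misc (RGBA8)": 4,
    "depth/stencil (D24S8)": 4,
}
bytes_per_pixel = sum(attachments.values())      # 16 B/px
total_mib = width * height * bytes_per_pixel / 2**20
print(f"{bytes_per_pixel} B/px -> {total_mib:.1f} MiB at 4K")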



What are you talking about? There isn't a single overall performance number for this in the whole paper.

Not that that was even the point; I was just showing that other researchers can produce potentially real-time AI upscaling, and that it'd be a smart thing for AMD to do. Just going through their numbers, it looks like it should be possible in under 3 ms on their "unnamed high end GPU" as compared to UNET, aside from a few hiccups they had ideas for solving but never got to.
Doesn't the XSX APU have 76 MB of SRAM that nobody has clarified where it comes from?
 
Vega has a lot of RAM/cache too; it was on a slide, something like 45 MB, but that was the sum of all the memory on the GPU. Maybe RDNA2 has more, hence the 128 MB number, but not as a single L2 cache?
 
If the BVH traversal logic is indeed handled by the CUs.. it's not a given the texture units fetch BVH and triangle data. Perhaps the CUs do it and send everything to the texture units to accelerate intersections.
Unknown at this time, but AMD's hybrid RT patent has the SIMD pass a BVH pointer and ray origin and direction data to the hardware in the texture block. The RT hardware returns a payload containing intersection test results and pointers to the next set of BVH nodes.
The node payload and any filtering of the BVH nodes requested in the current iteration seem to be loaded and calculated by the RT block.
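Rough sketch of that split as I read the patent; every name and field below is a hypothetical stand-in for whatever instruction the texture block actually exposes, with plain Python dicts playing the role of memory so the sketch runs:

Code:
from dataclasses import dataclass, field

@dataclass
class NodeResult:
    # Payload the RT hardware hands back to the shader: leaf/triangle test
    # results plus the child node pointers to visit next. All hypothetical.
    is_leaf: bool
    hit: bool = False
    t: float = float("inf")
    child_ptrs: list = field(default_factory=list)

def rt_block_intersect(nodes, node_ptr, ray_origin, ray_direction):
    # Stand-in for the texture-block RT instruction: real hardware would fetch
    # the BVH node and run the box/triangle intersection tests here.
    return nodes[node_ptr]

def trace_ray(nodes, root_ptr, ray_origin, ray_direction):
    # Shader-side traversal: the CU owns the stack and the traversal order,
    # the fixed-function block only does the fetch + intersection work.
    stack, closest = [root_ptr], None
    while stack:
        result = rt_block_intersect(nodes, stack.pop(), ray_origin, ray_direction)
        if result.is_leaf:
            if result.hit and (closest is None or result.t < closest.t):
                closest = result
        else:
            stack.extend(result.child_ptrs)
    return closest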

128MB huh? Perhaps this is why on this diagram it says "Color/Depth":
Microsoft has used its own particular wording for hardware blocks before, going by its naming convention for compute units in the current gen. The coloration of the color/depth blocks is also green versus the blue cache blocks in the diagram. Seems like an omission to not note a 128MB collection next to the 4MB L2, for example. As great as it might be to have a massive frame buffer on-die, it seems like something could be done to make more use of it than closing the ROP memory loop on-die. For example, the geometry engine and binning rasterizers would likely have a decent idea of how many screen tiles may be reasonably needed in a given time window, and that could leave much of that cache available for something other than ROP exports.
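To put a hedged number on that: with made-up tile dimensions and bin counts, the in-flight ROP working set looks small next to 128 MB:

Code:
# How much of a 128 MB cache would the actively-binned screen tiles need?
tile_w, tile_h = 64, 64        # assumed screen-tile size
bytes_per_pixel = 16           # colour + depth, as discussed above
open_bins = 256                # assumed number of tiles in flight at once

tile_bytes = tile_w * tile_h * bytes_per_pixel        # 64 KiB per tile
working_set_mib = open_bins * tile_bytes / 2**20      # 16 MiB
print(f"~{working_set_mib:.0f} MiB for in-flight ROP tiles, "
      f"~{128 - working_set_mib:.0f} MiB left for everything else")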

The "infinity cache" name may also hint at an arrangement where it's in or near the data fabric, outside of the L2 and outside the scope of the diagram.
However, the probable die area cost might have discouraged the cost-sensitive vendors from adopting something like this.

While the large cache concept has some difficulties that haven't been resolved, let's stipulate that it exists for the sake of continuing the argument for some kind of hybrid memory controller setup or chiplets. One possible way to have the necessary area is to have a cache layer inside of an active interposer containing the memory controllers and a fabric network.
That wouldn't crowd the logic above, and it might be more amenable to the thermal conditions below an active die and the disruption due to TSVs.

Such a GPU would also not be constrained to a single memory controller type, since it would be hosted by the interposer. Software may reference both modes of operation if an interposer can support both, or if the same GPU can be mounted on different interposers. An APU with integrated memory controllers like the consoles wouldn't have a memory controller die with its cache, or possibly the cost of such a thing is why the consoles remain fully integrated.
A high-end GPU might have the margin and lower volumes that don't exceed the capacity of the integration process for mass production.


You know, the rumor does seem weirder and weirder the more you think about it. The obvious thing staring one in the face is: if 128 MB of cache is so magical, why doesn't the PS5 or XSX use it, and why did the latter go for a giant 320-bit bus if the cache was so useful? And of course the XSX shouldn't need a 320-bit bus if somehow a 20+ teraflop RDNA2 chip only needs a 256-bit one.
The context seems to be that the cache is meant to make up for not having an extremely wide GDDR6 bus or HBM. A more modest GPU might have bandwidth needs low enough to be satisfied by a regular GDDR6 bus, without incurring a significant die cost that a console may not be able to justify.
If there were some kind of advanced integration necessary, that might lead to such a thing being ruled out.
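A minimal sketch of that trade-off, treating the big cache purely as a bandwidth amplifier; the hit rates, bus widths and GDDR6 speed are illustrative assumptions:

Code:
# Only the miss fraction of request traffic has to be serviced by DRAM.
gddr6_256bit = 256 / 8 * 16    # GB/s at an assumed 16 Gbps -> 512
gddr6_384bit = 384 / 8 * 16    # 768 GB/s

for hit_rate in (0.0, 0.3, 0.5):
    effective = gddr6_256bit / (1 - hit_rate)
    verdict = "beats" if effective > gddr6_384bit else "trails"
    print(f"hit rate {hit_rate:.0%}: 256-bit sustains ~{effective:.0f} GB/s of requests, "
          f"{verdict} a 384-bit bus ({gddr6_384bit:.0f} GB/s)")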

There's no way DRAM latency is lower than a cache hit since you need to check the cache before going to DRAM.
The comparison I was thinking of was between a more conventional hierarchy and one with a 128MB additive cache layer.
If the RT workload is random enough and more sensitive to round-trip latency, the RT blocks could see limited upside.
GCN's L1 hit latency is ~114 cycles, L2 is ~190, and an L2 miss ~350, per a GDC2018 presentation on graphics optimization.
RDNA has another layer of cache, although the exact latencies aren't clear. A ~10% improvement in overall latency was mooted, but attributed to the overall increase in capacity versus the layers being sped up.
If an L2 hit is on the order of 200 cycles for a 4 MB cache, a cache 32x larger and outside of the L2 could add enough additional latency that, for pointer-chasing RT blocks, the average latency ends up worse than if it weren't there.
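Plugging in the numbers above with an assumed hit latency and lookup penalty for the extra level (both assumptions, not measurements), the break-even hit rate ends up fairly high:

Code:
# Average latency for an L2 miss, with and without an extra 128 MB level.
DRAM = 350        # cycles for an L2 miss straight to DRAM (GDC 2018 figure above)
BIG_HIT = 300     # assumed total latency for a hit in the extra 128 MB level
MISS_ADD = 100    # assumed extra cycles spent checking it before going to DRAM

def avg_l2_miss(hit_rate):
    return hit_rate * BIG_HIT + (1 - hit_rate) * (DRAM + MISS_ADD)

for h in (0.1, 0.4, 0.7):
    verdict = "better" if avg_l2_miss(h) < DRAM else "worse"
    print(f"hit rate {h:.0%}: ~{avg_l2_miss(h):.0f} cycles, {verdict} than {DRAM} without it")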

For AMD, it's not an uncommon case that on-die traffic is as bad or worse than missing to DRAM.
Naughty Dog listed a worst-case scenario for the PS4 CPU being a remote hit to a Jaguar L1, and Zen 1 had noticeable inter-CCX latency on the order of missing to DRAM.
 