AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Another reason AMD should have planned for an earlier announcement. Now it's almost two months of rumors, and the majority seem to be negative.

Not really. What AMD should do is follow their own release schedule and completely ignore all the noise. There's no need for subtle marketing attempts with thinking-face emojis and the like. Let the product do the talking. If it's great, it makes the doubters look like complete idiots. If it's not great, you save face by not having hyped it up with poor marketing attempts.
 
The comparison I was thinking of was between a more conventional hierarchy and one with a 128MB additive cache layer.
If the RT workload is random enough and more sensitive to round-trip latency, the RT blocks could see limited upside.
GCN's L1 hit latency is ~114 cycles, L2 is ~190, and an L2 miss ~350, per a GDC2018 presentation on graphics optimization.
RDNA has another layer of cache, although the exact latencies aren't clear. A ~10% improvement in overall latency was mooted, but it was attributed to the overall increase in capacity rather than the individual layers being sped up.
If an L2 hit is on the order of 200 cycles and the L2 is 4MB, a cache 32x larger sitting outside the L2 could add enough additive latency that, for pointer-chasing RT workloads, the average latency ends up worse than if it weren't there at all.
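To put rough numbers on that concern, here's a minimal back-of-the-envelope sketch in Python. The L1/L2/DRAM cycle counts are the GCN figures quoted above; the 100-cycle lookup cost for the big level and all of the hit rates are my own placeholder assumptions, not anything leaked or measured:

```python
# Back-of-the-envelope latency model for a serially dependent (pointer-chasing) stream.
# Latencies are the rough GCN cycle counts quoted above; the hit rates and the extra
# level's lookup cost are purely hypothetical placeholders, not measured values.

L1_HIT, L2_HIT, DRAM_MISS = 114, 190, 350   # ~cycles
BIG_LOOKUP = 100                            # assumed added cost of checking a 128MB level

def avg_latency(h_l1, h_l2, h_big=None):
    """Expected cycles per access; h_big=None means no extra level exists."""
    miss_l1 = 1.0 - h_l1
    miss_l2 = miss_l1 * (1.0 - h_l2)
    lat = h_l1 * L1_HIT + miss_l1 * h_l2 * L2_HIT
    if h_big is None:
        return lat + miss_l2 * DRAM_MISS
    hit_lat = L2_HIT + BIG_LOOKUP            # L2 miss that hits the big cache
    miss_lat = DRAM_MISS + BIG_LOOKUP        # L2 miss that also misses the big cache
    return lat + miss_l2 * (h_big * hit_lat + (1.0 - h_big) * miss_lat)

base = avg_latency(0.5, 0.5)                 # conventional hierarchy
for h in (0.3, 0.5, 0.8):
    print(f"big-cache hit rate {h:.0%}: {avg_latency(0.5, 0.5, h):.0f} vs baseline {base:.0f} cycles")
```

With these placeholder numbers the extra level needs a hit rate north of ~60% just to break even on latency for this kind of stream, which is the crux of the concern for random RT traversal.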

For AMD, it's not uncommon for on-die traffic to be as bad as or worse than missing to DRAM.
Naughty Dog listed the worst-case scenario for the PS4 CPU as a remote hit to a Jaguar L1, and Zen 1 had noticeable inter-CCX latency on the order of missing to DRAM.

And if it doesn't have a latency advantage I don't see what it could be for. You're still going to get slowed down by a narrow bus any time you read from GDDR with this scheme. Yes, IBM claims their eDRAM cache hits 3 TB/s, which is very fast. But without a cache of 512MB or so you'd still need to read and write GDDR a lot for the current frame's buffers, and the cache would take up enough die space that you could just widen the bus instead. Unless it's some chiplet scheme, but I'd figure the first use of that would be to separate I/O from logic like Intel did with Lakefield.

Overall, a giant eDRAM cache isn't a new idea, but AFAIK IBM is the only one still using it to any great extent.
 
Not really. What AMD should do is follow their own release schedule and completely ignore all the noise. There's no need for subtle marketing attempts with thinking-face emojis and the like. Let the product do the talking. If it's great, it makes the doubters look like complete idiots. If it's not great, you save face by not having hyped it up with poor marketing attempts.
The absence of official information is a problem. I don't know why AMD waits so long to release all their products. An Oct 8th announcement barely puts them in time for holiday sales, and unless they have extremely good availability it will make a lot of people switch companies.
 
The absence of official information is a problem. I don't know why AMD waits so long to release all their products. An Oct 8th announcement barely puts them in time for holiday sales, and unless they have extremely good availability it will make a lot of people switch companies.
They're not "waiting so long to release their products", they're releasing them as soon as they're ready to be released. Can you imagine the shitstorm if they did proper paper release at this point?
 
They're not "waiting so long to release their products", they're releasing them as soon as they're ready to be released. Can you imagine the shitstorm if they did proper paper release at this point?
Wait, so you expect it not to be a paper release in October? I have a feeling Oct 8th is the announcement, with availability in late October or even early November.
 
128MB of L3 cache with "only" a 256 bit GDDR6 bus is plausible. It is hard to say how such an architecture would perform without doing a bunch of simulations of different workloads. The cache replacement policy would be critical to performance since you don't want to cache a bunch of data that is not going to be reused. It would to a large degree be an engineering trade-off between die area and bus width (and the costs associated with each).
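As a toy illustration of why replacement policy and working-set size matter so much here, below is a minimal trace-driven LRU simulator in Python. The 8MB capacity, 64-byte lines, and the synthetic sweep traces are arbitrary, scaled-down assumptions (not a model of any real GPU workload); the point is just the cliff between a working set that fits and one that doesn't.

```python
from collections import OrderedDict

LINE = 64  # bytes per cache line (assumed)

def lru_hit_rate(addresses, capacity_bytes):
    """Trace-driven LRU simulation: fraction of accesses that hit."""
    capacity = capacity_bytes // LINE
    cache, hits = OrderedDict(), 0
    for addr in addresses:
        line = addr // LINE
        if line in cache:
            hits += 1
            cache.move_to_end(line)
        else:
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least-recently-used line
    return hits / len(addresses)

MB = 2**20
cache_size = 8 * MB  # scaled-down stand-in for a big LLC, to keep the toy fast

def sweep(buffer_bytes, passes=4):
    """Repeatedly stream through a buffer, line by line."""
    return [a for _ in range(passes) for a in range(0, buffer_bytes, LINE)]

# Working set just under capacity: nearly every access after the first pass hits.
print(f"7MB buffer, 4 passes: {lru_hit_rate(sweep(7 * MB), cache_size):.1%} hit rate")
# Working set just over capacity: a cyclic sweep makes LRU re-read everything.
print(f"9MB buffer, 4 passes: {lru_hit_rate(sweep(9 * MB), cache_size):.1%} hit rate")
```

With a smarter policy (or streaming hints that bypass the cache for use-once data), the second case wouldn't have to collapse to zero, which is exactly the kind of trade-off that would need simulating across workloads.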
 
Microsoft has used its own particular wording for hardware blocks before, going by its naming convention for compute units in the current gen. The coloration of the color/depth blocks is also green versus the blue cache blocks in the diagram. Seems like an omission to not note a 128MB collection next to the 4MB L2, for example.
As I noted more recently, the labelled XSX die shot has nothing like a massive quantity of RAM. So my comments about these "color/depth" blocks are irrelevant.

As great as it might be to have a massive frame buffer on-die, it seems like something could be done to make more use of it than closing the ROP memory loop on-die.
Xenos benefitted hugely from the daughter die's performance.

Xenos's daughter die didn't have the capacity for all render target formats that could be bound by the GPU, so there were performance uncertainties and cliffs (full speed 4xMSAA was limited to 720p, I think - fuzzy memory alert).

"128MB" is gargantuan in comparison, 16 bytes per pixel at 4K, but I guess there will be cases where delta colour compression or MSAA sample compression fail to satisfy the 16 bytes limit.

For example, the geometry engine and binning rasterizers would likely have a decent idea of how many screen tiles may be reasonably needed in a given time window, and that could leave much of that cache available for something other than ROP exports.
NVidia's tiled rendering (with tile sizes that vary depending upon the count of vertex parameters, pixel format, etc.) is some kind of cache scheme that only seems to "touch" relatively few tiles at any given time.

By comparison the rumoured "128MB Infinity Cache" seems like dumb brute force if it were dedicated solely to ROPs.

I'll be honest I think "128MB Infinity Cache" is a hoax or at the very least a grave misunderstanding.

One possible way to have the necessary area is to have a cache layer inside of an active interposer containing the memory controllers and a fabric network.
That wouldn't crowd the logic above, and it might be more amenable to the thermal conditions below an active die and the disruption due to TSVs.
I like this. I have a fuzzy memory of a previous discussion about an active interposer used this way.

Such a GPU would also not be constrained to a single memory controller type[...]
I'm still haunted by "Nexgen Memory":

[Image: AMD roadmap slide mentioning "Nexgen Memory"]


The context seems to be that the cache is meant to make up for not having an extremely wide GDDR6 bus or HBM. A more modest GPU might have bandwidth demands low enough to be satisfied by a regular GDDR6 bus, avoiding a significant die cost that a console may not be able to justify.
If there were some kind of advanced integration necessary, that might lead to such a thing being ruled out.
Frankly, I don't believe there is such a large "cache". The frame rate targets for XSX/XSS, achieved without an obvious hunk of memory in the die shot (frame rates we should expect from 6800XT/6700XT, it seems), tell me that "Infinity Cache" in a giant amount is not part of the Navi 2x architecture.

I can believe "Infinity Cache" is a property of the architecture, but this magic number of 128MB has been conjured out of thin air by the leakerverse. I can believe that every type of memory on Navi 21 (registers, caches, buffers) adds up to a total of 128MB, but not that there is a cache of that size.
 
Xenos benefitted hugely from the daughter die's performance.
IMO immediate mode GPUs these days are too programmable to pull off the same feat again. Schemes like “tile based rendering” and “DSBR” are basically trying to improve the spatial locality of caches within the current programming model. They don’t change the intrinsic nature of immediate mode, which cannot guarantee that the GPU (or a specific screen-space partition of it) will only touch one rasteriser bin/tile at a time. So the worst case scenario remains chaotic/pathological API submissions that lead to little to no binning in practice, where the partition always has to be ready to deal with any number of tiles simultaneously. In other words, there is no bounded/fixed memory use, should on-chip tile memory be introduced to an immediate mode GPU. That’s unlike TBDR, which sorts and bins all primitives before rasterisation and fragment shading.
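As a toy sketch of that bounded-vs-unbounded difference (the tile size, the window of in-flight primitives, and the synthetic "chaotic" submission below are all arbitrary assumptions, not a model of real hardware):

```python
import random

TILE = 32   # tile size in pixels (assumed)
GRID = 64   # 64x64 tiles ~ a 2048x2048 render target

def tiles_touched(prim):
    """Tiles overlapped by an axis-aligned primitive bounding box (toy model)."""
    x0, y0, x1, y1 = prim
    return {(tx, ty)
            for tx in range(x0 // TILE, x1 // TILE + 1)
            for ty in range(y0 // TILE, y1 // TILE + 1)}

# A "chaotic" submission: small primitives scattered all over the screen in API order.
random.seed(0)
prims = []
for _ in range(10_000):
    x, y = random.randrange(GRID * TILE - 8), random.randrange(GRID * TILE - 8)
    prims.append((x, y, x + 8, y + 8))

# Immediate mode: shade in submission order; count distinct tiles "live" within a
# sliding window of in-flight work (the window size is an arbitrary assumption).
WINDOW = 256
live = max(len(set().union(*(tiles_touched(p) for p in prims[i:i + WINDOW])))
           for i in range(0, len(prims) - WINDOW, WINDOW))
print(f"immediate mode: up to {live} distinct tiles live per {WINDOW}-prim window")

# TBDR: bin every primitive to its tiles up front, then process one tile at a time,
# so on-chip tile memory only ever needs to hold a single tile's worth of data.
bins = {}
for p in prims:
    for t in tiles_touched(p):
        bins.setdefault(t, []).append(p)
print(f"TBDR: {len(bins)} bins, but only one tile's memory resident at a time")
```

The only point being made is that an unsorted, scattered submission can leave an immediate-mode window with hundreds of tiles live at once, while the TBDR-style pre-pass bounds residency to a single tile regardless of submission order.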

Larger caches (not on-chip tile memory) could in theory help, but they would never be as intrinsically effective as TBDR. All immediate mode GPU vendors seemingly have decided so far that the cost outweighed the benefit, or else they would already have stacked up the caches in this age of dark silicon (see GA100’s 40MB L2).
 
And if it doesn't have a latency advantage I don't see what it could be for. You're still going to get slowed down by a narrow bus any time you read from GDDR with this scheme.
The cited benefit was compensating for the lack of a major update to the external memory bus, so bandwidth amplification seems to be the primary motivation.
However, leaving things as-is and introducing another cache layer leaves the question of how much bandwidth this is supposed to amplify, and whether it becomes large enough to reduce the relative effectiveness of the L2's own amplification. The L2's parallelism and bandwidth have seen limited scaling from the GCN generations through Navi. If the aggregate bandwidth supplied by the new cache and memory rises, a cache pipeline that isn't rebalanced could find an L2 with only marginally more bandwidth than prior generations, and potentially no additional capability to avoid bank conflicts, becoming a bottleneck.
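On the "how much amplification" question, the first-order arithmetic is simple: if a fraction h of last-level requests hits the new layer, the external bus only has to carry the misses, so read bandwidth is amplified by roughly 1/(1-h). A quick sketch in Python, with the bus configuration, cache bandwidth, and hit rates all placeholder assumptions rather than anything from the rumours:

```python
# Back-of-the-envelope bandwidth amplification from a last-level cache.
# Bus width/speed, cache bandwidth, and hit rates are placeholder assumptions.

CACHE_BW = 2000  # GB/s the on-die cache itself could serve (assumed)

def effective_read_bw(ext_bw_gbs, hit_rate, cache_bw_gbs=CACHE_BW):
    """Read bandwidth the shader array could consume if `hit_rate` of last-level
    requests is served on-die (ignores write-back traffic and latency effects)."""
    # The external bus only carries the misses, so demand can grow until the miss
    # traffic alone saturates it, capped by what the cache itself can deliver.
    return min(ext_bw_gbs / (1.0 - hit_rate), cache_bw_gbs)

ext = 256 / 8 * 16  # 256-bit bus at 16 Gbps ~= 512 GB/s (assumed GDDR6 config)
for h in (0.0, 0.25, 0.5, 0.75):
    print(f"hit rate {h:.0%}: ~{effective_read_bw(ext, h):.0f} GB/s effective read bandwidth")
```

Which is exactly why the L2 and the rest of the pipeline matter: if the amplified bandwidth is to be consumed, something between the new layer and the CUs has to be sized to pass it through.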

Latency could become a problem, unless workloads that are known to be sensitive can bypass multiple layers of cache. GCN and RDNA have the ability to control this, although it's not clear that there are latency benefits, since it seems to be more about cache invalidation and miss control than avoiding long-latency paths in the pipeline. Additionally, having yet another cache leaves the question of whether it's being used differently or transparently to the GPU, because yet another layer of explicit cache control seems like it's pushing things. The level of architectural detail exposed to software, and the hand-holding the ISA does for the cache hierarchy, already increased with RDNA1.


Overall, a giant eDRAM cache isn't a new idea, but AFAIK IBM is the only one still using it to any great extent.
eDRAM has specific process needs. IBM's use extends to Power9, which is fabricated on the 14nm SOI process IBM sold to GF. IBM's Power10 is to be on Samsung's 7nm node and reverts to SRAM.
Per the following, it seems like the space savings weren't particularly good, and were potentially negative, at the speed and bandwidth level required for an L3 cache, but eDRAM did save on static leakage versus SRAM.
https://www.itjungle.com/2020/08/24/drilling-down-into-the-power10-chip-architecture/
This actually brings up a possible pain point for an AMD large-cache GPU if it's using SRAM in such quantity. Perhaps it's compensated for by reducing the need for a high-speed bus, but if this is supposed to scale down to mobile levels, 128MB of powered SRAM may need some attention paid to standby power.

IMO immediate mode GPUs these days are too programmable to pull off the same feat again. Schemes like “tile based rendering” and “DSBR” are basically trying to improve the spatial locality of caches within the current programming model. They don’t change the intrinsic nature of immediate mode, which cannot guarantee that the GPU (or a specific screen-space partition of it) will only touch one rasteriser bin/tile at a time. So the worst case scenario remains chaotic/pathological API submissions that lead to little to no binning in practice, where the partition always has to be ready to deal with any number of tiles simultaneously.
There may be a limit of sorts to the pathological case, in that we don't know how many batches the DSBR can close and have in flight for any given screen tile. The capacity for tracking API order for primitives in batches exists, or at least we know its opposite exists, where the GPU can be told to ignore API order for cases like depth/coverage passes.
If it's known that the hardware cannot generate an unbounded number of batches in flight, the depth of its queues or backlog could provide decision data for the lifetime of a given screen tile in local storage.
 
Xenos's daughter die didn't have the capacity for all render target formats that could be bound by the GPU, so there were performance uncertainties and cliffs (full speed 4xMSAA was limited to 720p, I think - fuzzy memory alert).

"128MB" is gargantuan in comparison, 16 bytes per pixel at 4K, but I guess there will be cases where delta colour compression or MSAA sample compression fail to satisfy the 16 bytes limit.


NVidia's tiled rendering (with tile sizes that vary depending upon the count of vertex parameters, pixel format, etc.) is some kind of cache scheme that only seems to "touch" relatively few tiles at any given time.

By comparison the rumoured "128MB Infinity Cache" seems like dumb brute force if it were dedicated solely to ROPs.

I'll be honest I think "128MB Infinity Cache" is a hoax or at the very least a grave misunderstanding.

AFAIK 2:1 compression is still the most common case for DCC, so 16 bytes per pixel only gives you a big g-buffer at 4K and not much else, or a non-existent one and little else. You can't evict it back to GDDR since you need it throughout much of the frame, and then you don't have room for all the rest of the frame's data.
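Rough arithmetic to back that up; the g-buffer layout below is a generic illustrative example (not any particular engine), and treating 2:1 as a blanket compression ratio is itself an optimistic simplification since DCC doesn't apply to everything:

```python
# Rough 4K capacity arithmetic for the rumoured cache.

pixels_4k = 3840 * 2160                  # ~8.29 Mpix
cache_mb = 128
print(f"{cache_mb * 2**20 / pixels_4k:.1f} bytes/pixel fit in {cache_mb}MB at 4K")

# Example deferred render targets, bytes per pixel before compression (illustrative):
gbuffer = {"albedo RGBA8": 4, "normal+rough RGBA16F": 8, "material RGBA8": 4,
           "motion RG16F": 4, "depth/stencil": 4, "HDR lighting RGBA16F": 8}
raw_mb = sum(gbuffer.values()) * pixels_4k / 2**20
print(f"raw g-buffer: ~{raw_mb:.0f} MB, with a blanket 2:1 ratio: ~{raw_mb / 2:.0f} MB")
```

So even under that optimistic assumption, a fairly ordinary deferred setup alone roughly fills 128MB, leaving nothing resident for textures, geometry, or intermediate buffers.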

My feeling is someone saw that "leaked" pic, which had an ASIC label for who knows what reason, assumed it was true, and came up with a reason it could be the big card.

Besides, the die area spent on the bus isn't what seems to be holding GPUs back today, at least not more than nominally. Ampere shows how easily you can run out of thermal and power budget. Targeting that would seem a far more obvious goal than maybe saving a bit of die area by going for a smaller bus.
 
Besides, the die area spent on the bus isn't what seems to be holding GPUs back today, at least not more than nominally. Ampere shows how easily you can run out of thermal and power budget. Targeting that would seem a far more obvious goal than maybe saving a bit of die area by going for a smaller bus.

Power efficiency is always the main priority, but it still needs to be balanced against other factors, and the bandwidth required to reach a high performance level is near the top of that list. That's why Nvidia partnered with Micron to develop GDDR6X, with a new signaling spec, higher speed, and better power efficiency.
That's also why we have seen engineers preaching about locality: when you do need to go off-chip, it is going to use more power. See video below.
It is also why we have seen more innovation at the marchitecture level, along with other ways to feed/maintain performance levels with other means of I/O, like DirectStorage, Infinity Fabric, and NVLink.
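To put rough numbers on the off-chip power point, here's a quick sketch. The pJ/bit figures are order-of-magnitude ballparks of the kind quoted in public "locality matters" talks, not specifications for any actual product, and the bandwidth figure is an arbitrary example:

```python
# Why locality is a power argument: rough energy-per-bit comparison.
# These pJ/bit values are order-of-magnitude placeholders, not measured specs.

PJ_PER_BIT = {"on-die SRAM LLC": 1.0, "HBM2 off-chip": 4.0, "GDDR6 off-chip": 7.0}

bandwidth_gbs = 500                      # sustained bandwidth to feed, GB/s (example)
bits_per_s = bandwidth_gbs * 1e9 * 8
for src, pj in PJ_PER_BIT.items():
    print(f"{src:>16}: ~{bits_per_s * pj * 1e-12:.0f} W at {bandwidth_gbs} GB/s")
```

Even with generous error bars, serving a large fraction of traffic on-die instead of over the external bus frees up a meaningful chunk of the board power budget.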

I remember an interview with an AMD engineer around the Fiji launch, talking about how bandwidth requirements in the future were going to scale across the entire system, not just on-chip and/or to the memory. It seems like we are reaching that point.
I can picture a snapshot of the interview, but I couldn't find it trawling through AMD Fiji YouTube videos.

 
RGT is doubling down on a 256-bit bus (with GDDR6). If true, it will be interesting to see the bandwidth efficiency of RDNA2.
 
Besides, the die area spent on the bus isn't what seems to be holding GPUs back today, at least not more than nominally. Ampere shows how easily you can run out of thermal and power budget. Targeting that would seem a far more obvious goal than maybe saving a bit of die area by going for a smaller bus.

Power usage for Ampere is probably higher because it can run more things concurrently. The diagrams show tensor cores, RT cores, and regular shading/compute running at the same time. The memory bandwidth is too limited, though; from those same diagrams it looks like there is barely any left for the tensor cores to use. If you don't have the memory bandwidth then the hardware will idle more and power consumption will go down, though the side effect is a slower end result.
[Images: Nvidia Ampere diagrams showing concurrent RT/tensor/shader execution and memory bandwidth utilization]
 