AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Ooh! That does look low!

But the "arms" on the retention bracket are much longer in the "RedGamingTech" picture than in what I suppose is the RX 5700 XT that you're comparing with.

So the arms are making it look like the GPU is lower than it really is. So, not HBM in my opinion.


Based on that picture, the GPU for the 6900 XT is mounted 4mm lower than on the 5700 XT.

The retention bracket is about 20% bigger: the diametrically opposite screw mountings are 90mm apart vs 76mm on the 5700 XT.

This closely matches the Radeon VII mounting bracket, which was also 90mm IIRC. The Radeon VII had 4 stacks of HBM and an interposer size of approximately 840mm2.

A 500mm2 die with 2 stacks of HBM2 would be close to the interposer size of a Radeon VII.
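Napkin math on that, for what it's worth; only the Vega 20 die size is a real number, the per-stack footprint and the routing margin are my guesses:

#include <cstdio>

// Rough interposer area estimate: a speculated big Navi die plus 2 HBM2 stacks,
// compared against the Radeon VII (Vega 20) layout. The ~92 mm2 stack footprint
// and the routing/keep-out margin are assumptions, not measured values.
int main() {
    const double navi_die_mm2   = 500.0;  // speculated
    const double vega20_die_mm2 = 331.0;  // known Vega 20 die size
    const double hbm2_stack_mm2 = 92.0;   // ~7.75 mm x 11.87 mm package (assumed)
    const double margin_mm2     = 150.0;  // routing / keep-out guess

    printf("big Navi interposer: ~%.0f mm2\n", navi_die_mm2 + 2 * hbm2_stack_mm2 + margin_mm2);   // ~834
    printf("Vega 20 interposer:  ~%.0f mm2\n", vega20_die_mm2 + 4 * hbm2_stack_mm2 + margin_mm2); // ~849
    return 0;
}

Both land in the same ~840mm2 ballpark, which is why the bracket comparison is at least plausible.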

Rogame indicates there are 4 variants of big Navi, 3 of which should be for consumer cards: XTX, XT and XL. Perhaps XTX is 80 CU HBM, XT 72 CU HBM, XL 72 CU GDDR?

By having both HBM and GDDR phy/io, could AMD potentially increase their wafer yields?
 
By having both HBM and GDDR phy/io, could AMD potentially increase their wafer yields?

I mean, yes, but that would be a bit silly; the yields wouldn't be that much higher.

Though I suppose you could support more bins. Some super-performance bin could get HBM and they could charge a mint and a half, while the more common bin gets GDDR and is meant for the mass market.

Still, such a strategy would make much more sense with chiplets. Then you wouldn't waste die space on doubling up the memory interface, you just pair the right logic chiplets with the right memory controllers, like some sort of computer engineering lego.
 
Just watched Linus’ Tiger Lake “review” and he spent some time on the near dearth of AMD’s highly regarded 7nm mobile CPUs. Given that backdrop I don’t see how AMD can produce Navi2x in sufficient quantities unless they’ve been stockpiling for many months.
 
Just watched Linus’ Tiger Lake “review” and he spent some time on the near dearth of AMD’s highly regarded 7nm mobile CPUs. Given that backdrop I don’t see how AMD can produce Navi2x in sufficient quantities unless they’ve been stockpiling for many months.
There have been reports that Huawei bought a ton of short term TSMC capacity, and availability should thus be better going forward for AMD if that is behind the shortages.
 
Just watched Linus’ Tiger Lake “review” and he spent some time on the near dearth of AMD’s highly regarded 7nm mobile CPUs. Given that backdrop I don’t see how AMD can produce Navi2x in sufficient quantities unless they’ve been stockpiling for many months.

I mean, they might be. The contracts for these things are planned years in advance; I don't think they expected their mobile CPUs to do nearly as well as they did, and of course TSMC doesn't have any extra capacity whatsoever for short-notice runs (do they? Everything I've seen says they're nigh overbooked).

So it partially depends on how many RDNA2 cards they bet they'd sell years ago. That and the GDDR shortage causing all these "sold out instantly!" problems.
 
Just watched Linus’ Tiger Lake “review” and he spent some time on the near dearth of AMD’s highly regarded 7nm mobile CPUs. Given that backdrop I don’t see how AMD can produce Navi2x in sufficient quantities unless they’ve been stockpiling for many months.
Console SoCs have likely been taking lots of capacity for a while to build the initial stock. I assume they would have gone down by now as launches approach, so that might give space for discrete GPUs. That and also the freed capacity from some other TSMC clients (e.g. Apple moving to 5nm), which AMD is very likely keen on bidding.
 
Just watched Linus’ Tiger Lake “review” and he spent some time on the near dearth of AMD’s highly regarded 7nm mobile CPUs. Given that backdrop I don’t see how AMD can produce Navi2x in sufficient quantities unless they’ve been stockpiling for many months.
This could also be a result of the pandemic and the subsequent explosion in tech sales, which all major tech companies benefited from. AMD's projections for their mobile chips may not have been sufficient to cover the additional demand, and without the ability to increase production, here we are.

I'm hopeful AMD will have more Navi 2x cards stockpiled than NVIDIA did of Ampere.
 
Hope you guys are right. We should see supply open up on the CPU front first as TSMC capacity is freed up. I have no delusions about getting a Zen 3 chip in Q4 but it would be nice.
 
New AMD patent application to reduce traffic from individual CUs to L2/memory by checking for the data in other CUs' caches first, resulting in many CUs connected via a crossbar.
20200293445 ADAPTIVE CACHE RECONFIGURATION VIA CLUSTERING
(One of the inventors mentioned is AMD fellow Gabriel Loh)
AFAICT this would provide a pseudo-fast memory pool *if* you can ensure that [dataset size] < [number of CUs on task] x [cache per CU]?

Which seems interesting, as it means that the optimal dataset size would be somewhat proportional to the number of CUs on task, which suggests that different RDNA2 cards may have very different performance in similar scenarios.
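As a sketch of what I mean (the CU count and per-CU cache size are made-up parameters, not RDNA2 specs):

#include <cstddef>
#include <cstdio>

// The clustered L1s only behave like one big pooled cache if the working set
// fits into the combined capacity of the CUs on task. All parameters are
// placeholders, not real RDNA2 figures.
bool fits_in_pooled_l1(std::size_t dataset_bytes,
                       std::size_t cus_on_task,
                       std::size_t l1_bytes_per_cu) {
    return dataset_bytes < cus_on_task * l1_bytes_per_cu;
}

int main() {
    // e.g. a 768 KB lookup table vs 8 CUs with 128 KB of L1 each (1 MB pooled)
    printf("fits: %d\n", fits_in_pooled_l1(768 * 1024, 8, 128 * 1024));  // prints 1
    return 0;
}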
 
AFAICT this would provide a pseudo-fast memory pool *if* you can ensure that [dataset size] < [number of CUs on task] x [cache per CU]?

Which seems interesting, as it means that the optimal dataset size would be somewhat proportional to the number of CUs on task, which suggests that different RDNA2 cards may have very different performance in similar scenarios.
Different GPUs already have different sizes of L2 cache. This is because L2 cache slices are allocated to memory channels. Also, RDNA is explicitly designed for variation in the size of L2 per memory channel: from 64KB to 512KB per slice.
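For reference, the total L2 is just channels times slice size; Navi 10 sits in the middle of that range (figures below are as I remember them from the RDNA whitepaper, so treat them as approximate):

#include <cstdio>

// Total L2 scales with memory channels x slice size. Navi 10 numbers
// (16 x 16-bit GDDR6 channels, 256 KB per slice) are approximate, from memory.
int main() {
    const int channels     = 16;   // 256-bit GDDR6 bus = 16 x 16-bit channels
    const int kb_per_slice = 256;  // within the 64 KB - 512 KB per-slice range
    printf("total L2: %d KB (%d MB)\n", channels * kb_per_slice, channels * kb_per_slice / 1024);
    return 0;
}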
 
Not at all. But what it does solve is the latency for a cache line conflict between two CUs, as they can share ownership of the cache line directly rather than sending a request, then waiting for the other CU to write back (to a "dummy communication node" for bouncing), only to reload it later on. Think of it as a fast-path solution to the false sharing problem.

What the patent describes is how CUs are coupled with full replication of their corresponding L1 contents, keeping the CUs in sync with as low a latency as possible.
If the CUs' individual cache miss rates exceed a threshold (indicating that the replication was wasting too much L1 capacity), then the synchronization of the CU cluster is broken up and the contents of the L1 caches are no longer replicated, reinstating the explicit cache line ownership transfer.

So it's actually quite the opposite of what you described: the fast path is activated when [dataset size] <= [cache per CU], in which case writes to L1 are immediately broadcast to all CUs, reducing the latency and bandwidth requirements for coherency to 1 crossbar transaction, down from 3. Tremendously helpful if you have e.g. all CUs performing a reduction operation using atomics, such as building a histogram.
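To make the histogram case concrete, here's a CPU-side analogy in C++ (threads plus atomics, not GPU code): every worker hammers the same small bin array, which is exactly the access pattern where ownership of the hot cache lines keeps bouncing between caches.

#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Many workers performing an atomic reduction into one small, shared histogram.
// Every update touches bins that all workers want to own, so without some form
// of shared/replicated caching the hot cache lines ping-pong between caches.
int main() {
    std::array<std::atomic<std::uint32_t>, 256> hist{};  // zero-initialized bins

    std::vector<std::uint8_t> data(1 << 20);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = static_cast<std::uint8_t>(i * 31);

    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            hist[data[i]].fetch_add(1, std::memory_order_relaxed);  // contended bins
    };

    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back(worker, t * data.size() / n, (t + 1) * data.size() / n);
    for (auto& th : pool) th.join();
    return 0;
}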

And it's not a substitute for GL1 either; that one didn't even need coherency protocols to begin with, since it's read-only.
It doesn't read as full replication at all to me; there are repeated descriptions and references to fine-grained address interleaving across participating caches in clusters to dynamically deliver larger effective cache capacity. Paragraph 62, for example, specifically addresses that the number of CU clusters is to be decreased if lines are experiencing a high level of sharing (i.e., a high replication level), and having fewer CU clusters means a larger pool of caches for interleaving within the cluster.
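The interleaving part, as I read it, is just fine-grained hashing of line addresses across the participating L1s, roughly like this (line size and cluster size are illustrative, not from the patent):

#include <cstdint>
#include <cstdio>

// Fine-grained address interleaving across the L1s of one CU cluster:
// consecutive cache lines go round-robin to the participating caches, so the
// cluster acts like one larger, banked L1 instead of N replicated copies.
constexpr std::uint64_t kLineBytes = 128;  // illustrative line size

std::uint32_t owning_l1(std::uint64_t address, std::uint32_t caches_in_cluster) {
    return static_cast<std::uint32_t>((address / kLineBytes) % caches_in_cluster);
}

int main() {
    // 8 CUs per cluster: lines 0..7 map to caches 0..7, then it wraps around.
    for (std::uint64_t addr = 0; addr < 8 * kLineBytes; addr += kLineBytes)
        printf("line at 0x%04llx -> L1 of CU %u\n",
               static_cast<unsigned long long>(addr), owning_l1(addr, 8));
    return 0;
}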
 
It doesn't read as full replication at all to me; there are repeated descriptions and references to fine-grained address interleaving across participating caches in clusters to dynamically deliver larger effective cache capacity. Paragraph 62, for example, specifically addresses that the number of CU clusters is to be decreased if lines are experiencing a high level of sharing (i.e., a high replication level), and having fewer CU clusters means a larger pool of caches for interleaving within the cluster.
Thanks, should have read the full patent text. My bad.

For a given number of CUs (e.g., N number of CUs 112 of FIG. 1), the number of CU clusters 120 determines a maximum number of cache line replicas at the GPU 104. Generally, increasing the number of CU clusters 120 (such as from two CU clusters in the embodiment of FIG. 1 to three or more) results in a smaller effective L1 cache capacity within each CU cluster and an increase in the number of cache line replicas at the GPU 104. Further, increasing the number of CU clusters 120 increases miss rates to the L1 caches (due to the smaller effective L1 cache capacity at each cluster) but decreases access latency to the L1 caches (due to fewer number of L1 caches at each CU cluster to traverse through for searching to locate a requested cache line). Similarly, decreasing the number of CU clusters 120 results in a decrease in the number of cache line replicas at the GPU 104 and a larger effective L1 cache capacity within each CU cluster that decreases miss rates to the L1 cache at the computational expense of longer L1 access latency. By increasing the effective L1 cache capacity, some applications may increase the L1 cache 116 hit rate, and therefore decrease L2 cache 118 pressure. Further, in some embodiments, the processing system 100 balances competing factors of the L1 cache 116 miss rate and L1 cache 116 access latency to fit a target application profile by dynamically changing the number of clusters.
So it's trading access latency (when clustered and interleaved) against small effective L1 capacity (when not clustered / interleaved).
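Putting that paragraph into numbers (the CU count and per-CU cache size are placeholders, not leaked specs):

#include <cstdio>

// Rough model of the tradeoff from the quoted paragraph: more clusters means
// more potential replicas of a shared line GPU-wide and less effective L1 per
// cluster; fewer clusters means the opposite. Parameters are placeholders.
int main() {
    const int total_cus    = 80;
    const int l1_kb_per_cu = 128;

    for (int clusters : {2, 4, 8, 16}) {
        const int cus_per_cluster = total_cus / clusters;
        const int effective_l1_kb = cus_per_cluster * l1_kb_per_cu;  // pooled per cluster
        printf("%2d clusters: %2d CUs/cluster, ~%4d KB effective L1, up to %2d replicas per line\n",
               clusters, cus_per_cluster, effective_l1_kb, clusters);
    }
    return 0;
}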

Though I'm having trouble seeing where this is applicable. It appears to be aimed primarily at reducing the load on the (shared) L2 cache?
Is the L2 cache on a GPU actually still that much slower than the L1, so that even 3 trips over the crossbar can outweigh an L2 hit?

Rather than utilizing ring interconnects for CU-to-CU communication (such as previously discussed with respect to FIGS. 4-5), the GPU 600 of FIG. 6 positions one or more dummy communication nodes (e.g., dummy communication nodes 604, 606) on a side opposite that of the CUs 112 to receive requests from a CU and forward the requests to other CUs 112. As used herein, the term “dummy communication node” refers to a module or other structure that receive a request/reply from a CU, buffers the request/reply, and then forwards the request/reply to a destination CU.
Especially if you consider that, in the context of the patent, even a store & forward approach was suggested for CU-to-CU communication, so this is potentially quite a few cycles wasted?
Further, GPUs often experience performance limitations due to LLC bandwidth in some workloads. Accordingly, the CU clustering discussed herein reduces pressure on LLC and increases compute performance by improving L1 hit rates.
By the looks of it, yes. (3.90TB/s L1 bandwidth, 1.95TB/s L2 bandwidth on the Radeon RX 5700XT.)

For what it's worth, the author also often spoke only about a small number of CUs in each cluster, and about locality on the chip. So possibly this isn't even aiming at a global crossbar, but actually only at 4-8 CUs maximum in a single cluster?
I suppose 8 CUs with 128kB of L1 each still yield a 1MB memory pool. And you've got to keep in mind the L0 cache still sits above that, so at least some additional latency isn't going to ruin performance.

An interesting question: can this reliably reduce the bandwidth requirements on the L2 far enough that distributing L2 cache slices across dies becomes viable?
A clear "no" to that. Even when increasing the L1 size 8x this way, L1 cache misses are unavoidable in the bad cases. Maybe a 30-50% reduction of L2 hits on average, but not even proportional to the number of CUs participating in each cluster. And nothing changes for write-back bandwidth. Rough estimates, not properly founded. Still not even remotely enough to allow splitting the LLC.

Should still amount to a reasonable performance uplift in the bad cases which had been suffering from excessive L1 miss rates but a good L2 hit ratio before. I could see this pushing the viable size limit for lookup tables by a fair amount.
 
Though I'm having trouble seeing where this is applicable. It appears to be aimed primarily at reducing the load on the (shared) L2 cache?
Is the L2 cache on a GPU actually still that much slower than the L1, so that even 3 trips over the crossbar can outweigh an L2 hit?

Especially if you consider that, in the context of the patent, even a store & forward approach was suggested for CU-to-CU communication, so this is potentially quite a few cycles wasted?

By the looks of it, yes. (3.90TB/s L1 bandwidth, 1.95TB/s L2 bandwidth on the Radeon RX 5700XT.)

For what it's worth, the author also often spoke only about a small number of CUs in each cluster, and about locality on the chip. So possibly this isn't even aiming at a global crossbar, but actually only at 4-8 CUs maximum in a single cluster?
I suppose 8 CUs with 128kB of L1 each still yield a 1MB memory pool. And you've got to keep in mind the L0 cache still sits above that, so at least some additional latency isn't going to ruin performance.
L1 is shared by all CUs in a shader array, 10 in RX 5700 XT for example.

An interesting question: can this reliably reduce the bandwidth requirements on the L2 far enough that distributing L2 cache slices across dies becomes viable?
A clear "no" to that.
ROPs are clients of L1 in Navi. ROP write operations are L1-write-through though, i.e. L1 doesn't support writes per se, so L2 is updated directly by ROP writes.

What proportion of frame-time is taken up with ROP writes? What proportion of VRAM bandwidth is taken up by ROP writes?

Even when increasing the L1 size 8x this way, L1 cache misses are unavoidable in the bad cases. Maybe a 30-50% reduction of L2 hits on average, but not even proportional to the number of CUs participating in each cluster. And nothing changes for write-back bandwidth. Rough estimates, not properly founded. Still not even remotely enough to allow splitting the LLC.
I disagree. L1s need to be able to talk to all L2 slices. That isn't changed by a "chiplet" design where LLC is spread amongst chiplets. 2-4TB/s bandwidth amongst chiplets over an interposer seems pretty easy.
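Back-of-envelope on that, with illustrative link widths and clocks (none of these are leaked Navi numbers):

#include <cstdio>

// Wide-and-slow links are exactly what silicon interposers are good at.
// Widths and transfer rates below are illustrative picks, not actual specs.
int main() {
    struct Link { int width_bits; double gtps; };
    const Link options[] = { {8192, 2.0}, {4096, 4.0}, {16384, 2.0} };
    for (const Link& l : options) {
        const double tb_per_s = (l.width_bits / 8.0) * l.gtps / 1000.0;  // bytes/transfer * GT/s
        printf("%5d-bit link @ %.1f GT/s -> ~%.1f TB/s\n", l.width_bits, l.gtps, tb_per_s);
    }
    return 0;
}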

Should still amount to a reasonable performance uplift in the bad cases which had been suffering from excessive L1 miss rates but a good L2 hit ratio before. I could see this pushing the viable size limit for lookup tables by a fair amount.
The rumours are a shitshow right now, but there has been a rumour of a dramatic increase in L1 size. Nothing to do with the "128MB Infinity Cache" rumour, but of course it could be a component of an "Infinity Cache" system.
 