AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
The whole 'cleverness' of Ampere's setup was that, since INT32 tends to be used less in games, you get more effective utilization of the available cores by letting those INT32 units also do double duty as FP32 when needed.

If Lovelace is simply 'adding more FP32 cores', then that's... nothing. I mean, not nothing, but there's nothing special or interesting about that. It's an extremely normal way of adding performance.
 
In any case, the point about Ada/Ampere differences is still valid.
How is it valid? Separating INT SIMD from FP32 SIMD while still having one 32T/clock scheduler will result in the exact same math throughputs as those on Ampere because said scheduler will still be able to load only two (out of three) SIMDs with work.
Complicating the SM-level scheduling h/w to provide a separate launch port for some execution units (INT+SFU?) would help fill all three of them with work each cycle, but would probably lead to lower overall h/w utilization due to scheduling conflicts (Kepler, anyone?).
Neither of these options seems like an improvement over Ampere, especially since moving the INT load onto separate h/w is essentially a regression to Turing, which would mean more h/w idling inside the GPU while performance gains would likely be rather limited.
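The single-issue-port argument can be shown with a toy simulation. All the numbers here are assumptions for illustration (one scheduler issuing one 32-thread instruction per clock, 16-lane SIMDs so each instruction occupies a pipe for two clocks), not vendor-confirmed figures:

```python
# Toy model of one SM partition: a single scheduler issues one
# 32-thread instruction per clock; each SIMD is 16 lanes wide, so an
# instruction occupies its pipe for 32/16 = 2 clocks. (Widths and
# rates are assumptions for illustration only.)

def busy_pipes(num_pipes, issue_per_clock=1, warp=32, simd_width=16, cycles=1000):
    """Average number of pipes kept busy per clock in steady state."""
    occupancy = warp // simd_width   # clocks one instruction ties up a pipe
    free_at = [0] * num_pipes        # clock at which each pipe becomes free
    busy_clocks = 0
    for clk in range(cycles):
        issued = 0
        for p in range(num_pipes):
            if issued < issue_per_clock and free_at[p] <= clk:
                free_at[p] = clk + occupancy
                issued += 1
        busy_clocks += sum(1 for f in free_at if f > clk)
    return busy_clocks / cycles

# Two SIMDs (Ampere-like): the single issue port keeps both busy.
# Three SIMDs on the same single port: still only ~2 busy per clock,
# so the third SIMD starves. A second launch port (issue_per_clock=2)
# fills all three - which is exactly the "option b" trade-off above.
```

With `busy_pipes(2)` and `busy_pipes(3)` both coming out at roughly 2.0 pipes busy, and `busy_pipes(3, issue_per_clock=2)` at roughly 3.0, the model reproduces the claim that a third SIMD adds nothing without extra scheduling hardware.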
 
How is it valid? Separating INT SIMD from FP32 SIMD while still having one 32T/clock scheduler will result in the exact same math throughputs as those on Ampere because said scheduler will still be able to load only two (out of three) SIMDs with work.

Exactly. Lots of hype over nothing.
 
Exactly. Lots of hype over nothing.
Well, Kopite was posting about a new SM-level scheduler, so maybe it is option "b" here - a separate launch port for non-FP32 instructions?
But it's a bit odd to try to improve Ampere's FP32 throughput, since that is obviously not its weak point.
 
How is it valid? Separating INT SIMD from FP32 SIMD while still having one 32T/clock scheduler will result in the exact same math throughputs as those on Ampere because said scheduler will still be able to load only two (out of three) SIMDs with work.
Complicating the SM-level scheduling h/w to provide a separate launch port for some execution units (INT+SFU?) would help fill all three of them with work each cycle, but would probably lead to lower overall h/w utilization due to scheduling conflicts (Kepler, anyone?).
Neither of these options seems like an improvement over Ampere, especially since moving the INT load onto separate h/w is essentially a regression to Turing, which would mean more h/w idling inside the GPU while performance gains would likely be rather limited.

Ask him, not me; I am not the source of the rumour. One advantage would be improved throughput in some gaming workloads: as I read in Tom's Hardware's original review of the Ampere architecture, still around 30% of instructions in gaming workloads are INT, even if that share is decreasing. I think the cache changes in Ada are more than enough to give a good boost in any case. But, as said, these are rumours and nothing more. The initial point was that Kopite's shader-count figures for Navi 31 are too low, period, as AMD should have long known what to expect from Ada once the process change and the obvious architectural improvements are factored in.
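What that ~30% INT figure implies for delivered FP32 can be sketched with a steady-state rate model. The pipe widths and issue rate are assumptions for illustration (one scheduler issuing 1 instruction/clock, 16-lane pipes each sustaining 0.5 warp instructions/clock), and "ada_a" is purely a hypothetical split-INT configuration, not a confirmed design:

```python
# Back-of-envelope steady-state throughput for one SM partition, in
# warp instructions/clock. f = fraction of INT instructions.

def throughput(f, pipes, issue=1.0, pipe_rate=0.5):
    """Total instruction rate for a mix with INT fraction f, given
    pipes described by capability sets like {"fp"} or {"fp", "int"}."""
    int_cap = pipe_rate * sum("int" in p for p in pipes)
    fp_cap = pipe_rate * sum("fp" in p for p in pipes)
    caps = [issue, pipe_rate * len(pipes)]
    if f > 0:
        caps.append(int_cap / f)        # INT-capable pipes must keep up
    if f < 1:
        caps.append(fp_cap / (1 - f))   # FP-capable pipes must keep up
    return min(caps)

def fp32_rate(f, pipes):
    """Delivered FP32 instructions/clock at INT fraction f."""
    return (1 - f) * throughput(f, pipes)

f = 0.3  # the ~30% INT share quoted above
turing = [{"fp"}, {"int"}]          # dedicated FP32 + dedicated INT
ampere = [{"fp"}, {"fp", "int"}]    # FP32 + shared FP32/INT
ada_a = [{"fp"}, {"fp"}, {"int"}]   # hypothetical: INT split off, same single issue port
```

At f = 0.3 this gives 0.5 FP32/clock for the Turing-style layout, 0.7 for Ampere (a ~40% gain, in line with NVIDIA's own Ampere marketing), and still only 0.7 for the hypothetical split-INT layout - the issue port, not the SIMD count, is the limit.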
 
While I'm messing about on LLVM.org, GCNSubtarget.h has the following lines that indicate RDNA 3 (or later) stuff:

bool hasFlatScratchSVSMode() const { return GFX940Insts || GFX11Insts; }
bool hasPermLane64() const { return getGeneration() >= GFX11; }
bool hasVOP3DPP() const { return getGeneration() >= GFX11; }
bool hasLdsDirect() const { return getGeneration() >= GFX11; }
bool hasVALUPartialForwardingHazard() const { return getGeneration() >= GFX11; }
bool hasVALUTransUseHazard() const { return getGeneration() >= GFX11; }
bool hasSPackHL() const { return GFX11Insts; }
bool hasCompressedExport() const { return !GFX11Insts; }
bool hasNullExportTarget() const { return !GFX11Insts; }
bool hasDelayAlu() const { return GFX11Insts; }
bool hasLegacyGeometry() const { return getGeneration() < GFX11; }
bool shouldClusterStores() const { return getGeneration() >= GFX11; }

I want to highlight the "hazard" functions, hasVALUPartialForwardingHazard and hasVALUTransUseHazard. So there's some scheduling mischief there. I can't discern much else, though.

hasDelayAlu seems ominous. Again, more scheduling mischief.
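Hazard predicates like these usually mean the backend has to keep dependent instructions a minimum distance apart. A purely illustrative sketch of that idea - the hazard window, the instruction format, and the padding scheme here are invented for the example, not taken from the actual AMDGPU backend:

```python
# Purely illustrative: what a hazard predicate like
# hasVALUTransUseHazard typically implies for the compiler. If an
# instruction reads a result produced fewer than WINDOW instructions
# earlier, the scheduler pads the gap (on real h/w, with s_nop or an
# s_delay_alu-style wait). WINDOW and the encoding are made up here.

WINDOW = 2  # assumed hazard window, in instructions

def pad_hazards(program):
    """program: list of (dest, srcs) register tuples. Returns a new
    list where None stands in for an inserted delay slot."""
    out = []
    last_def = {}  # register -> index of its producer in `out`
    for dest, srcs in program:
        for src in srcs:
            if src in last_def:
                gap = len(out) - last_def[src] - 1
                out.extend([None] * max(0, WINDOW - gap))
        out.append((dest, srcs))
        last_def[dest] = len(out) - 1
    return out

# A fully dependent chain forces WINDOW delay slots before each use:
prog = [("v0", []), ("v1", ["v0"]), ("v2", ["v1"])]
padded = pad_hazards(prog)
```

The point of the sketch: the more such hazards the hardware exposes, the more of the real program gets diluted with waits unless the scheduler can find independent work to fill the gaps - hence "scheduling mischief".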

Other stuff there I can't comment on. We know about VOP3 DPP already, but the rest is mysterious.
 
I’ve seen the “true double FP32” claim in multiple places (forums, Reddit, Twitter) with no explanation of how it would be fed alongside INT.

Well, it was not my intention to hype anything; it was merely a comment on Ada having an improved architecture with increased performance, which is clearly expected by the competition.
 
Well, it was not my intention to hype anything; it was merely a comment on Ada having an improved architecture with increased performance, which is clearly expected by the competition.

No worries. I didn’t mean to imply that you were doing the hyping. If Ada splits the INT and FP32 pipelines, the motivation could be lower FP32 latency: FP32 instructions would retire in one cycle instead of two (as on Pascal). That would still mean bubbles in the FP32 pipeline when issuing INT, though.
 
FeatureTrue16BitInsts - my theory is that this is the basis of VOPD. This also makes SDWA redundant as a concept.
Instead of being related to VOPD, I would rather argue that the removal of SDWA is more an inevitable cleanup, now that FP16 packed math has been around for multiple generations.

SDWA was born to enable packed F16 in the registers without packed-math ALUs in GCN3/4, so you at least get the benefit of reduced register pressure. Now that packed math (F16x2 ALU) has been a thing for three generations since Vega, SDWA support does not seem to have any significant value aside from backward compatibility. F16 SIMD types in kernels/shaders would be better mapped to packed-math instructions for the "double F32" rate.
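The register layout being discussed can be shown with a tiny sketch. This only illustrates the F16x2 packing (Python's struct module happens to support half-precision via the 'e' format); it is not any AMD API:

```python
import struct

# Two FP16 values packed into one 32-bit register slot ("F16x2") - the
# layout packed math operates on, and the sub-dword lanes that SDWA
# used to select individually.
lo, hi = 1.5, -2.0
packed = struct.pack("<2e", lo, hi)       # 4 bytes = one 32-bit VGPR
assert len(packed) == 4
word = int.from_bytes(packed, "little")   # raw 32-bit register contents
# A packed-math ALU processes both halves in one instruction instead
# of two SDWA-selected sub-dword operations.
a, b = struct.unpack("<2e", word.to_bytes(4, "little"))
```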
 
Fewer CUs than the 256 CU rumour, but I'd think 196 CUs on a single GCD is, in theory, better than the same count spread over 2x GCDs.

Still MCM, just not the way people have been hoping, with 2 GCDs. It means the 2.5x 6900 XT claims are no longer valid with just one compute chiplet.

196 CUs would still mean 2.45x more CUs than Navi 21. Granted, that doesn't mean it translates to 2.5x gaming performance, but at the same clocks it is roughly 2.5x the "compute" (theoretical FP32, anyway).
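The arithmetic behind that, assuming standard RDNA CUs (64 FP32 lanes, 2 FLOPs per lane per clock via FMA) and the 6900 XT's 2250 MHz boost clock:

```python
# Theoretical FP32 rate; 64 lanes/CU and 2 FLOPs/lane/clock are the
# usual RDNA figures, the clock is the 6900 XT's boost clock.
def tflops(cus, clock_ghz, lanes_per_cu=64, flops_per_lane=2):
    return cus * lanes_per_cu * flops_per_lane * clock_ghz / 1000

navi21 = tflops(80, 2.25)    # 6900 XT: 23.04 TFLOPs
navi31 = tflops(196, 2.25)   # rumoured CU count at the same clock
ratio = navi31 / navi21      # 196/80 = 2.45, the figure quoted above
```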
 
Standing firm in the camp 2xGCDs. I stan an AMD patent over the Twitter rumor mill. :mrgreen:
But TBF, one big monolithic GCD surrounded by satellite MCDs (with memory-side LLC and DRAM I/O), linked by Elevated Fanout Bridge, sounds like a plausible option as well.
 
196 CUs would still mean 2.45x more CUs than Navi 21. Granted, that doesn't mean it translates to 2.5x gaming performance, but at the same clocks it is roughly 2.5x the "compute" (theoretical FP32, anyway).

You mean 192 CUs right? Yeah that would be better than 2 GCDs with 96 CUs each and should be a reasonable size on 5nm especially with memory and cache broken off into separate chiplets. It would be a much lower risk option for the first chiplet arch.
 
Might as well speculate this for Navi 31:
  • 3x GCDs of 16x WGPs and 4096 VALU lanes, each about 100mm² +
  • 3x MCDs of 128-bit GDDR and 128MB infinity cache, each around 150mm² +
  • I/O die of 150mm²
  • 24GB GDDR
That's 900mm² for 7900XT.

Then this for Navi 32, re-using Navi 31 chiplets:
  • 2x GCDs of 16x WGPs and 4096 VALU lanes, each about 100mm² +
  • 2x MCDs of 128-bit GDDR and 128MB infinity cache, each around 150mm² +
  • I/O die of 150mm²
  • 16GB GDDR
That's 650mm² for 7800XT.

Then Navi 33, 7700XT is approaching 400mm² as a monolithic GPU with 16x WGPs, 128MB infinity cache, 128-bit 8GB GDDR.
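The area sums above check out; all die sizes are the guesses listed, not known figures:

```python
# Package area from the guessed die sizes above (mm^2):
# ~100 mm^2 per GCD, ~150 mm^2 per MCD, 150 mm^2 I/O die.
def package_area(n_gcd, n_mcd, gcd=100, mcd=150, io=150):
    return n_gcd * gcd + n_mcd * mcd + io

navi31 = package_area(3, 3)   # 900 mm^2 for the 7900XT config
navi32 = package_area(2, 2)   # 650 mm^2 for the 7800XT config
```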

With the same chiplets used by both Navi 31 and 32, any broken WGPs would end up in the salvage SKUs:
  • 7900 with 44 WGPs = 2x full GCDs + 1x GCD with 4x broken WGPs
  • 7800 with 24 WGPs = 2x GCDs each with 4x broken WGPs
In the end, if AMD has learnt from Ryzen, then small compute chiplets with lots of SKUs derived from partly-faulty premium chiplets seem likely. Zen 3 chiplets are 81mm² and Zen 2 chiplets are 74mm², both on 7nm, which was relatively expensive in 2019 when Zen 2 launched.

You have to argue against this kind of SKU lineup built mostly from ~100mm² GCDs:
  • 7900XT = 48 WGPs
  • 7900 = 44 WGPs
  • 7800XT = 32 WGPs
  • 7800 = 24 WGPs
  • 7700XT = 16 WGPs
  • 7700 = ...
which maximises the return on 5nm wafers, before dismissing the "multi-GCD" rumours.
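For reference, every WGP count in that lineup falls out of 16-WGP GCDs with either zero or four WGPs fused off, per the salvage scheme sketched earlier (SKU names are, of course, the post's guesses):

```python
# SKU WGP totals from full (16-WGP) and salvage (12-WGP) GCDs,
# per the speculated lineup above.
FULL, SALVAGE = 16, 12
skus = {
    "7900XT": 3 * FULL,            # 48
    "7900":   2 * FULL + SALVAGE,  # 44
    "7800XT": 2 * FULL,            # 32
    "7800":   2 * SALVAGE,         # 24
    "7700XT": 1 * FULL,            # 16
}
```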
 