AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
6 months typically from tape out to release?
Usually tapeout is when the design is completed and you upload the files to the foundry FTP/portal/cloud.
Then the foundry generates the photomasks and tooling to build the silicon. For a 7-die chiplet design, with extra time needed for interconnect and packaging, I expect 10 to 12 weeks before the customer gets the silicon back from the packaging house (Amkor, ASE, etc.).
Then qualification will take at least another 6 months.
I think one year from tape out to HVM is realistic if you take into account one minor revision (metal mask).

Edit: Lately, Nvidia has been extremely efficient and fast from tape out to HVM. Most of their first silicon revisions (A0) went into HVM.
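
To make the arithmetic behind that one-year estimate explicit, here is a rough sketch in Python. The 10 to 12 weeks and the 6 months come from the post; treating 6 months as 26 weeks and one year as 52 weeks is my own rounding, and the "respin budget" is simply whatever is left over, not a known figure.

```python
# Back-of-envelope tape-out to HVM timeline, in weeks.
FIRST_SILICON_WEEKS = (10, 12)  # from the post: tape out to packaged silicon
QUALIFICATION_WEEKS = 26        # "at least another 6 months" (assumed ~26 weeks)
TARGET_WEEKS = 52               # "one year from tape out to HVM"

def total_without_respin(first_silicon_weeks: int) -> int:
    """Weeks from tape out to HVM if first silicon needs no revision."""
    return first_silicon_weeks + QUALIFICATION_WEEKS

best_case = total_without_respin(FIRST_SILICON_WEEKS[0])
worst_case = total_without_respin(FIRST_SILICON_WEEKS[1])

# Time left inside the one-year target for a metal-mask revision.
respin_budget = (TARGET_WEEKS - worst_case, TARGET_WEEKS - best_case)

print(best_case, worst_case, respin_budget)
```

Under those assumptions the no-respin path lands at roughly 36 to 38 weeks, which leaves about 14 to 16 weeks of the year for one metal revision.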
 
Yes, plus other things that they don't disclose publicly. The obvious one is the use of their Selene supercomputer for intensive algorithm/RTL/floor plan verification. Nvidia spends much more time in simulation than before, and their silicon verification lab is now a huge department, all in order to shorten the time from tape out to HVM. For small dies like GA106/GA107, tape out to HVM took less than 5 months, and it should be even less for Ada.
 
They talk like AMD (or Nvidia for that matter) can change the hardware after tape out. These leakers should learn a bit about how an ASIC is designed and when a uarch is frozen. It will help them improve their lies :rolleyes:
That's just a daft assessment. Most of these leaks come from firmware, software, or vendor sources, and the information is only as good as what AMD gave out at that point in time. What leaked a month ago might be year-old information and already out of date. As release approaches, more concrete things come out.
 

6 SEs and double-wide CUs claimed for Navi 31. This would put a single GCD at 15360 FP32 per clock.

I like how both AD102 and Navi 31 “got stronger” in a matter of days. They probably ate their spinach.

Actually that would be 30720 FP32. Maybe it’s 3 SEs per GCD and 6 SEs per package. Big numbers.
 
Four SIMDs sharing a single TMU/RA implies AMD thinks ray acceleration is fast enough and ray tracing performance is now down solely to traversal and hit/miss shader throughput.

Well, that assumes RA hasn't been changed and that traversal is still in software...
 

More SIMDs isn’t a solution for divergence though. If AMD continues to run traversal on the SIMDs they must not be serious about RT performance or they’re also doing some funky sorting in software.
 
+1. There haven't been any notable patents published about traversal hardware for RDNA in the past 2 years.

Nothing new from Nvidia either aside from the AI stuff.

From AMD's patent on the design of the ray accelerator: "One purpose of using a merged data path unit is to reduce the amount of silicon area that is used by only a single type of instruction, because doing so reduces the total amount of silicon for a chip. This particular merged data path unit is capable of outputting results for four box tests per cycle or one triangle test per cycle."

So the RA isn't just re-using the TMU L1 memory pipeline. It's also sharing silicon between the box and triangle intersection logic. Very elegant and area efficient but one triangle per clock probably isn't going to cut it for RDNA 3.
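
A quick sketch of what those patent rates mean for one ray, in Python. The 4 box tests and 1 triangle test per cycle come from the quote; the ray itself (24 four-wide box nodes visited, 6 triangles tested) is a made-up example, and I'm assuming tests issue back to back with no memory latency, which real hardware won't manage.

```python
import math

BOX_TESTS_PER_CYCLE = 4  # from the patent quote
TRI_TESTS_PER_CYCLE = 1  # from the patent quote

def ra_cycles(box_tests: int, tri_tests: int) -> int:
    """Issue cycles on the merged data path, ignoring memory latency (assumption)."""
    box_cycles = math.ceil(box_tests / BOX_TESTS_PER_CYCLE)
    tri_cycles = math.ceil(tri_tests / TRI_TESTS_PER_CYCLE)
    return box_cycles + tri_cycles

# Hypothetical ray: 24 four-wide box nodes visited, 6 triangles tested.
box_tests, tri_tests = 24 * 4, 6
total = ra_cycles(box_tests, tri_tests)
tri_share = tri_tests / total  # fraction of issue cycles spent on triangles

print(total, f"{tri_share:.0%}")  # 30 20%
```

In this toy case the triangles are 6 of 102 tests but a fifth of the issue cycles, which is why the one-triangle-per-clock rate is the part that looks tight.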
 
More SIMDs isn’t a solution for divergence though.
Divergence literally lengthens the wall-clock time of a shader. If a WGP (or CU, as the rumour suggests CUs are still a thing in RDNA 3) provides more shader cycles per RA cycle, then that reduces the wall-clock time of divergent shaders.

This is similar to ALU:TEX ratio. Over time that ratio has increased.

If AMD continues to run traversal on the SIMDs they must not be serious about RT performance or they’re also doing some funky sorting in software.
So if AMD is doubling-down on software traversal it's reasonable to expect that more compute is matched by better local resources to increase the "traversal work-items per clock", such as a bigger LDS, bigger register file, higher-capacity crossbars (or ring-bus?).

The side-effect of "increased traversal WIPC" is that it helps with hit/miss shader WIPC too, since those shaders are also the victims of divergence. All GPUs have this problem and sorting is part of the solution.

Of course AMD might do nothing in terms of local resources to help with RT WIPC - such changes might be years off. The fact that patent documents for hardware traversal haven't appeared as yet seems to indicate it's off the cards for a very long time.
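
To illustrate why sorting helps hit/miss shader WIPC, here is a toy model in Python. The assumptions are mine: wave32, and a wave that contains lanes wanting N different hit shaders has to run N serialized passes. The shader IDs are invented.

```python
def shader_passes(shader_ids, wave_size=32):
    """Total serialized shader passes when work items are packed into waves in order.

    Assumption: each wave runs one pass per distinct shader among its lanes.
    """
    passes = 0
    for start in range(0, len(shader_ids), wave_size):
        wave = shader_ids[start:start + wave_size]
        passes += len(set(wave))
    return passes

# Hypothetical workload: 128 rays hitting 4 different materials, interleaved.
hits = [i % 4 for i in range(128)]

unsorted_passes = shader_passes(hits)        # every wave sees all 4 shaders
sorted_passes = shader_passes(sorted(hits))  # every wave sees exactly 1 shader

print(unsorted_passes, sorted_passes)  # 16 4
```

Same work, a quarter of the passes once the rays are binned by shader. The sort itself isn't free, of course, and the model ignores that cost.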
 
Divergence literally lengthens the wall-clock time of a shader. If a WGP (or CU, as the rumour suggests CUs are still a thing in RDNA 3) provides more shader cycles per RA cycle, then that reduces the wall-clock time of divergent shaders.

Sure, but in the best case it reduces that time by 2x. The worst case for divergence is 32x.
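
Rough numbers for the 2x versus 32x point, as a Python sketch. Assumptions, all mine: wave32, a traversal loop where the wave stays resident until its slowest lane finishes, and doubled SIMD throughput modelled as a flat 2x speedup. The step counts are invented.

```python
WAVE_SIZE = 32  # assumption: wave32

def divergence_penalty(lane_steps):
    """Lane-steps occupied divided by lane-steps of useful work.

    The wave runs for max(lane_steps) iterations across all WAVE_SIZE lanes,
    but only sum(lane_steps) of those lane-steps do real traversal work.
    """
    useful = sum(lane_steps)
    if useful == 0:
        return 1.0
    return WAVE_SIZE * max(lane_steps) / useful

coherent = [10] * WAVE_SIZE              # every lane needs 10 traversal steps
worst = [10] + [0] * (WAVE_SIZE - 1)     # one lane working, 31 idle

simd_speedup = 2.0  # best case from doubling shader cycles per RA cycle

print(divergence_penalty(coherent))              # 1.0
print(divergence_penalty(worst))                 # 32.0
print(divergence_penalty(worst) / simd_speedup)  # 16.0
```

So even with the full 2x from wider SIMDs, the fully divergent wave in this model is still 16x off a coherent one.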

So if AMD is doubling-down on software traversal it's reasonable to expect that more compute is matched by better local resources to increase the "traversal work-items per clock", such as a bigger LDS, bigger register file, higher-capacity crossbars (or ring-bus?).

The side-effect of "increased traversal WIPC" is that it helps with hit/miss shader WIPC too, since those shaders are also the victims of divergence. All GPUs have this problem and sorting is part of the solution.

Of course AMD might do nothing in terms of local resources to help with RT WIPC - such changes might be years off. The fact that patent documents for hardware traversal haven't appeared as yet seems to indicate it's off the cards for a very long time.

Right, AMD needs to solve for both traversal divergence and shading divergence. Intel and Nvidia are tackling the former by avoiding SIMD altogether. It will be interesting to see how AMD's gamble works out on next-generation RT workloads. They don't have enough control over the market to slow down RT adoption, so hopefully that's not their game plan.
 