AMD: RDNA 3 Speculation, Rumours and Discussion

The same talk was going around 10 years ago: APUs are going to eat up the market, dGPUs will disappear, etc., etc. None of it has materialized in any shape or form. I see the same hogwash here. If cost is going to make $300 GPUs disappear, then APUs are hopelessly screwed.
 
I see the same hogwash here
You can't see anything because you can't even read some semiwiki/semieng shit or idk.
You don't need 6D brains to understand the basic gist of "shit's fucked, yo" and "only the enfattened segments which can tolerate the costs will survive".
Or at least I hope you don't.
then APUs are hopelessly screwed.
They have the volumes and some very, utterly loyal OEM slaves to survive.
Your laptop is still gonna be more expensive.
 
If we are looking at a normally priced GPU market, the best APUs are still only comparable to a $50 GPU. I don't see this changing significantly in the future. APUs are as expensive as GPUs/CPUs to manufacture and have to be sold at lower margins due to performance constraints.


If someone is trying to make a cheap build, then the used market would most likely fill the gap between APU performance and $300ish. The jump from an APU to a dedicated sub-$300 GPU isn't that large.

Also, most issues on the APU side are with memory bandwidth. So what happens when you start getting APUs with Infinity Cache?
 
If someone is trying to make a cheap build, then the used market would most likely fill the gap between APU performance and $300ish. The jump from an APU to a dedicated sub-$300 GPU isn't that large.

Also, most issues on the APU side are with memory bandwidth. So what happens when you start getting APUs with Infinity Cache?
More importantly, with AM5 every CPU has a GPU, which gives even less incentive to fill that low-end niche
 
The same talk was going around 10 years ago: APUs are going to eat up the market, dGPUs will disappear, etc., etc. None of it has materialized in any shape or form. I see the same hogwash here. If cost is going to make $300 GPUs disappear, then APUs are hopelessly screwed.
I can't comment on APUs eating up the market. They are more economical to produce (amortized packaging costs, no exotic DRAM) but vendor-OEM calculus is outside of my pay grade.

What I do know is that Moore's law's ability to reduce $/transistor is over (at least at the pace we're used to). There's still some power scaling but that's been tapering off too. Chips are all wires now and wires don't scale. So we're left with transistor density scaling, but you have to pay for the transistors in $, W and heat dissipation (which means more $). And bandwidth to feed those transistors, which is its own story.

In theory it's not all doom and gloom -- algorithmic and architectural innovations may spark off a new innovation cadence. But the industry is still adapting, with development cadences still grasping on to the last vestiges of Moore's law. It's all still a "moar cores" mindset. It's going to take a while to change, and until then we'll feel the costs.

I hope I'm proven wrong.
 
It's all still a "moar cores" mindset
GPUs are inherently that.
We've been at like a 550-ish mm² average high-end GPU die size since 2006 or so.
algorithmic and architectural innovations may spark off a new innovation cadence
Well that's just bullshit; all that stuff needs them sweet sweet xtors, and man, are xtors not coming cheap these days.
I hope I'm proven wrong.
"We're lowkey fucked" is an industry-wide observation with like a bazillion articles written about it to day.
 
CPUs show us that the stuff that keeps the core working, instead of waiting, is what brings better efficiency. I believe RDNA 3 is about yet more of this, rather than more cores.

Sure, there's the rumoured triple-chiplet monstrosity that is Navi 31 (and 32) which seemingly goes for more cache as well as more cores, but I think AMD is aiming to implement fine-grained kernel-spawning for hand-off and function-calling using task pooling and queues within each WGP:

HARDWARE ACCELERATED DYNAMIC WORK CREATION ON A GRAPHICS PROCESSING UNIT - ADVANCED MICRO DEVICES, INC. (freepatentsonline.com)

AGGREGATED DOORBELLS FOR UNMAPPED QUEUES IN A GRAPHICS PROCESSING UNIT - ADVANCED MICRO DEVICES, INC. (freepatentsonline.com)

Register saving for function calling - Advanced Micro Devices, Inc. (freepatentsonline.com)

TECHNIQUES FOR IMPROVING OPERAND CACHING - Advanced Micro Devices, Inc. (freepatentsonline.com)
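
For a feel of what GPU-side work creation looks like in code today, CUDA's dynamic parallelism is the closest public analogue I can think of. To be clear, this is not AMD's mechanism, just a hypothetical illustration with made-up kernel names:

[CODE]
// Hypothetical illustration only: GPU-side work creation via CUDA dynamic
// parallelism. Needs compute capability >= 3.5 and nvcc flags -rdc=true -lcudadevrt.
__global__ void childKernel(float *workItems, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        workItems[i] *= 2.0f;    // stand-in for real per-item work
}

__global__ void parentKernel(float *workItems, const int *workCounts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n = workCounts[tid];     // how much follow-up work this thread discovered
    if (n > 0) {
        // The GPU spawns a child grid sized to the work it just found,
        // without a round trip to the host.
        childKernel<<<(n + 255) / 256, 256>>>(workItems, n);
    }
}
[/CODE]

The patents read to me like a much finer-grained, in-hardware version of that, with the queues living inside the WGP rather than full grid launches.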

I think a major motivation here is ray tracing, because the function hierarchy, even in real time gaming graphics, is really tricky. This ties in with the continuation-passing style of sub-function control that I briefly referenced here:

Intel Xe Ray Tracing | Beyond3D Forum

where we can see that Intel is implementing the same concepts.

So that appears specific to ray tracing. I think it can be used more widely, for conditional routing techniques that reduce the impact of control flow divergence. It could be said that this last part is a nice-to-have, because nesting/looping rapidly makes a mess.

So I think coarse-grained control flow resulting from ray traversal (miss? hit? material? spawn-ray?) looks amenable to intra-WGP task pooling and scheduling and it might have wider usage.

I do wonder if the next iteration of D3D12 includes some fine-grained shader-calling-shader functionality.
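
To make the "conditional routing" bit concrete, here's a toy sketch of the kind of intra-workgroup task pooling I have in mind: sort rays by traversal outcome into per-state queues, then shade one state at a time so the lanes in a wave stay coherent. It's written in CUDA only because that's easy to put in a post; all names are mine and it says nothing about how AMD would actually implement it.

[CODE]
// Toy sketch: bin rays by traversal outcome into per-state queues in shared
// memory, then drain one queue at a time so a wave's lanes follow the same
// control-flow path. Launch with 256 threads per block.
enum RayState { RAY_MISS = 0, RAY_HIT_OPAQUE = 1, RAY_HIT_ALPHA = 2, RAY_STATE_COUNT = 3 };

#define RAYS_PER_BLOCK 256

__device__ RayState classifyRay(int rayIndex)
{
    // Stand-in for the real traversal result (miss? hit? which material?).
    return static_cast<RayState>(rayIndex % RAY_STATE_COUNT);
}
__device__ void shadeMiss(int /*rayIndex*/)        { }  // stand-in miss shader
__device__ void shadeOpaque(int /*rayIndex*/)      { }  // stand-in opaque-hit shader
__device__ void shadeAlphaTested(int /*rayIndex*/) { }  // stand-in alpha-tested shader

__global__ void routeAndShade(int rayBase)
{
    __shared__ int queues[RAY_STATE_COUNT][RAYS_PER_BLOCK];
    __shared__ int queueSize[RAY_STATE_COUNT];

    if (threadIdx.x < RAY_STATE_COUNT)
        queueSize[threadIdx.x] = 0;
    __syncthreads();

    // 1) Every thread classifies its ray and pushes it onto the matching queue.
    int ray = rayBase + blockIdx.x * blockDim.x + threadIdx.x;
    RayState s = classifyRay(ray);
    int slot = atomicAdd(&queueSize[s], 1);
    queues[s][slot] = ray;
    __syncthreads();

    // 2) Drain each queue with the whole block: coherent control flow per state.
    for (int state = 0; state < RAY_STATE_COUNT; ++state) {
        for (int i = threadIdx.x; i < queueSize[state]; i += blockDim.x) {
            int r = queues[state][i];
            if      (state == RAY_MISS)       shadeMiss(r);
            else if (state == RAY_HIT_OPAQUE) shadeOpaque(r);
            else                              shadeAlphaTested(r);
        }
        __syncthreads();
    }
}
[/CODE]

If the patents pan out, that pooling and hand-off would presumably be done by hardware within each WGP rather than by hand-written shader code like this.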
 
CPUs show us that the stuff that keeps the core working, instead of waiting, is what brings better efficiency
CPUs aren't GPUs.
is what brings better efficiency
Well yes, but also no, see DC chip power creep and correlate it to the ever-increasing CC.
Everyone wants more compute and whatever remains of Moore's only has so much to give.
I believe RDNA 3 is about yet more of this, rather than more cores.
Yes but it's also way way more stuff on the higher end!
Sure, there's the rumoured triple-chiplet
Triple?
It's a fuckton of tiles.
Two GCDs, yes.
 
Well that's just bullshit; all that stuff needs them sweet sweet xtors, and man, are xtors not coming cheap these days.
Yeah well it's not quite the golden goose that keeps on giving like Moore's law did. But for many workloads there's up to 10x upside without more xtors. That's for the kernels, and then there's a ton of inefficiency in the software stacks stitching those kernels together. This is all pure research, I don't know whether or in what shape any of this will make its way into the market. I also don't know where we go once we've mined out all this inefficiency.
 
I feel it's worth reminding a few folks: Moore's Observation (aka "Law") was not about $ per transistor, nor overall transistor density, nor any power or performance or compute capability metric either. Rather, his observation was about the total number of transistors in an integrated circuit roughly doubling every two years. It's not your fault if you weren't aware of this; a LOT of news outlets and bloggers and vloggers and forum participants echo the same misinformation about it being performance or price or compute capability or some combination of the three. Despite those three things being resultant from decades of silicon lithography evolution, precisely none of them were part of Moore's original observation.

We can continue cramming more transistors into a singular IC even to this day, yet most of the doubling is coming from physical layout: we're capable of building physically larger chips now, and we have some work being done on silicon stacking; both achieve the same result of more transistors in a single IC. Unfortunately the cost of doing so is now increasing significantly, and the power / performance / sizing gain is paltry at best, bordering on non-existent now.

Moore's Law is Transistors per IC, not transistors per inch, not transistors per dollar, not instructions per cycle, not cycles per second, not cycles per unit of power, not instructions per unit of power.
 
Not sure that this can be seen as something Moore was talking about.
This shouldn't be some obscure conjecture on whether Moore could've reasonably foreseen ICs being constructed in more than a two-dimensional plane. Rather, we get into some pedantry around drawing a line to declare where a single integrated circuit ends and a new one begins: can a singular IC only exist in a flat 2D plane? What happens in a possible future where we can legitimately build an integrated circuit in all three dimensions, say with some future 3D printing technology?

I get that irregularly stacked chips, like multiple discrete chiplets stacked (mounted?) on a singular underlying substrate, are different. How about when TSVs are involved, where the multiple layers of silicon interoperate and are not able to be made functional when standing alone as a singular layer? A singular chiplet can be made functional outside of the MCM substrate.
 
Moore's Law is Transistors per IC, not transistors per inch, not transistors per dollar, not instructions per cycle, not cycles per second, not cycles per unit of power, not instructions per unit of power.
[Attached image: upload_2021-9-28_10-30-9.png (the cost graph referenced below)]
Moore was originally looking at optimum cost of transistors for a process and the cost benefits it brings as a smaller process matures. Notice the clear "Cost" part of the graph?
 

A couple of new patent applications related to RT optimizations

PARTIALLY RESIDENT BOUNDING VOLUME HIERARCHY
Techniques for performing ray tracing for a ray are provided. The techniques include, based on first traversal of a bounding volume hierarchy, identifying a first memory page that is classified as resident, obtaining a first portion of the bounding volume hierarchy associated with the first memory page, traversing the first portion of the bounding volume hierarchy according to a ray intersection test, based on second traversal of the bounding volume hierarchy, identifying a second memory page that is classified as valid and non-resident, and in response to the second memory page being classified as valid and non-resident, determining that a miss occurs for each node of the bounding volume hierarchy within the second memory page.
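
Read literally, the claimed traversal seems to boil down to something like this toy sketch (every structure and function name here is my own, just to illustrate the abstract):

[CODE]
// Toy illustration: BVH nodes live on virtual-memory pages; if the page backing
// a node is valid but not resident, traversal reports a miss for that subtree
// instead of stalling on the fetch.
struct BvhNode {
    float bboxMin[3], bboxMax[3];
    int   leftChild, rightChild;   // -1 means leaf
};

enum PageStatus { PAGE_RESIDENT, PAGE_VALID_NOT_RESIDENT, PAGE_INVALID };

__device__ PageStatus pageStatusOfNode(int nodeIndex)
{
    // Stand-in: the real thing would derive the page from the node's address
    // and consult the page tables, not look at the node itself.
    return (nodeIndex % 7 == 0) ? PAGE_VALID_NOT_RESIDENT : PAGE_RESIDENT;
}

__device__ bool rayIntersectsBox(const BvhNode & /*node*/, int /*rayIndex*/)
{
    return true;                   // stand-in for the real slab test
}

__device__ bool traversePartiallyResident(const BvhNode *nodes, int rayIndex)
{
    int  stack[64];
    int  top = 0;
    bool anyHit = false;
    stack[top++] = 0;              // start at the root

    while (top > 0) {
        int nodeIndex = stack[--top];

        if (pageStatusOfNode(nodeIndex) != PAGE_RESIDENT) {
            // Valid but non-resident page: count the whole subtree as a miss
            // (and presumably kick off a fetch) rather than stalling on a fault.
            continue;
        }
        const BvhNode &node = nodes[nodeIndex];
        if (!rayIntersectsBox(node, rayIndex))
            continue;
        if (node.leftChild < 0) {  // leaf: ray-triangle tests would go here
            anyHit = true;
            continue;
        }
        stack[top++] = node.leftChild;
        stack[top++] = node.rightChild;
    }
    return anyHit;
}
[/CODE]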

RAY-TRACING MULTI-SAMPLE ANTI-ALIASING
A technique for performing a ray tracing operation for a ray is provided. The method includes performing one or more ray-box intersection tests for the ray against one or more bounding boxes of a bounding volume hierarchy to eliminate one or more nodes of the bounding volume hierarchy from consideration, for one or more triangles of the bounding volume hierarchy that are not eliminated by the one or more ray-box intersection tests, performing one or more ray-triangle intersection tests utilizing samples displaced from a centroid position of the ray, and invoking one or more shaders of a ray tracing pipeline for the samples based on results of the ray-triangle intersection tests.
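
Mechanically, the second one reads to me like the following, though I'm guessing at the intent (names and offsets are made up):

[CODE]
// Toy illustration: once a triangle survives the ray-box tests, run the
// ray-triangle test for several samples displaced from the ray's centroid and
// invoke shading only for the samples that hit.
#define MSAA_SAMPLES 4

__device__ bool rayTriangleHit(int rayIndex, int triIndex, float dx, float dy)
{
    // Stand-in for a real ray-triangle test with the ray displaced by (dx, dy)
    // in sub-pixel space.
    return ((rayIndex + triIndex) & 1) != 0;
}

__device__ void invokeHitShader(int /*rayIndex*/, int /*triIndex*/, int /*sample*/)
{
    // Stand-in for kicking off the hit shader for one covered sample.
}

__device__ void shadeTriangleSamples(int rayIndex, int triIndex)
{
    // Sub-pixel offsets around the ray centroid (a typical rotated 4x pattern).
    const float offsets[MSAA_SAMPLES][2] = {
        { -0.125f, -0.375f }, { 0.375f, -0.125f },
        { -0.375f,  0.125f }, { 0.125f,  0.375f },
    };
    for (int s = 0; s < MSAA_SAMPLES; ++s) {
        if (rayTriangleHit(rayIndex, triIndex, offsets[s][0], offsets[s][1]))
            invokeHitShader(rayIndex, triIndex, s);  // one invocation per covered sample
    }
}
[/CODE]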

The first patent seems to aim at traversing the BVH tree even when part of the BVH data is missing from cache, in the hope of getting a hit within the resident data, at least while the missing data is still being fetched.
Any idea what the second patent is doing?
 
Moore was originally looking at optimum cost of transistors for a process and the cost benefits it brings as a smaller process matures. Notice the clear "Cost" part of the graph?
That's all fine and nice.

His observation was very specifically transistors per integrated circuit. You can use his observation to infer a lot of other things which were true at that time and yet are not true now, even though his observation of transistors per singular IC is still tracking reasonably well today.
 