L1 and L2 cache latency can have an impact on ray tracing. The top-level nodes of your global space decomposition structure live in your caches, because they are hit all the time. The bottom levels almost always miss the caches and go to main memory (ray coherency notwithstanding).
On a CPU, a first-order approximation of cost is to treat traversing the cached top levels as free, but if the latency is as high as detailed above, that's certainly not the case for GCN.
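As a rough illustration of the access pattern I mean (a minimal sketch, assuming a simple array-of-nodes layout; BvhNode and traverse are made-up names, not any particular engine's format):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node layout: internal nodes store child indices,
// leaves store a primitive range. Not any real engine's format.
struct BvhNode {
    float    bounds[6];    // AABB min/max
    uint32_t left, right;  // child indices (or a primitive range for leaves)
    bool     leaf;
};

// Depth-first traversal. The handful of nodes near the root are touched by
// nearly every ray, so they tend to stay resident in L1/L2. The deep interior
// nodes and the leaves are spread over a much larger footprint, so (absent
// ray coherence) each visit is likely a cache miss, and the full memory
// latency is paid before the next child index is even known.
void traverse(const std::vector<BvhNode>& nodes, uint32_t rootIndex) {
    uint32_t stack[64];
    int top = 0;
    stack[top++] = rootIndex;
    while (top > 0) {
        const BvhNode& node = nodes[stack[--top]]; // dependent load on the critical path
        if (node.leaf) {
            // intersect the primitives in [node.left, node.right) here
            continue;
        }
        // A real tracer would only push children whose AABBs the ray actually hits.
        stack[top++] = node.left;
        stack[top++] = node.right;
    }
}
```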
Cheers
There was a GDC 2018 optimization hot lap from AMD that gave ~114, ~190, ~350 cycles respectively for L1 hit, L2 hit, and L2 miss.
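Just to put those figures against the "cached top levels are free" approximation above, a back-of-the-envelope example (the depth split is invented purely for illustration):

```cpp
#include <cstdio>

int main() {
    // Approximate figures quoted from the GDC 2018 optimization hot lap.
    const int l1Hit = 114, l2Hit = 190, l2Miss = 350;

    // Invented example: a 20-level traversal where the top 4 levels hit L1,
    // the next 4 hit L2, and the remaining 12 go out to memory.
    const int cycles = 4 * l1Hit + 4 * l2Hit + 12 * l2Miss;
    std::printf("~%d cycles of serialized load latency per ray\n", cycles);

    // Even the "cheap" cached top levels contribute hundreds of cycles,
    // so treating them as free, as one might on a CPU, doesn't hold on GCN.
    return 0;
}
```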
AMD indicated elsewhere a roughly 10% latency improvement with Navi, although I didn't see it give specific values, as opposed to an overall improvement from the additional cache capacity throughout. The most recent Navi architecture slides indicate there's a lower-latency path for loads that bypass the sampling hardware, but give no concrete figures.
Perhaps a BVH block would see latency closer to the direct load path, whatever that value is. If it's not at least an order of magnitude better, that might explain why AMD's method has the BVH hardware defer to the SIMDs after each node evaluation. The CU's register file and LDS might be necessary to buffer sufficient context for a traversal method with such a long-latency critical loop. Perhaps it's counting on the CU's larger context storage to keep more rays in flight, or on its register file and LDS having more reasonable latency for frequently hit node data.
Reverse engineering of Nvidia's L1 in Turing shows it's on the order of 32 cycles, which, while vastly better than GCN, is sloth-like compared to CPUs. Less clear is which cache level Nvidia's RT cores interface with. It seems probable they're closely linked to the SM's memory pipeline, but some of Nvidia's patents might have them hooked up outside the L1 and reliant on the L2. The L2 is about as slow as GCN's, which might be problematic if that's how it's implemented. However, there were indications that the RT hardware has its own storage at presumably reasonable latency, and some hints of storage management done by the RT core for memory not clearly associated with the L1 or L2.
links:
GCN memory:
https://gpuopen.com/gdc-2018-presentation-links/, optimization hot lap
Turing's memory:
https://arxiv.org/pdf/1903.07486.pdf
edit: Navi reference
https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf
What are the possibilities of having a hybrid-access memory area that both the GPU and CPU can feed on "simultaneously" or sequentially, without the read and write penalty, for some kind of hybrid RT?
I'm not an expert on this, but some of the problems I've read about with a hybrid RT solution were exactly that. On a shared-memory system this could be a much simpler solution if both had direct access.
I'm unclear on the use of the term hybrid ray tracing. That's often used to describe a rendering engine that combines rasterization and ray tracing, but that's usually still on the GPU.
When you bring up a read/write penalty in the context of CPU and GPU cooperation, it sounds like you might mean the heavier synchronization barriers between them. Those generally exist because the GPU's memory model is much weaker than the CPU's, and the GPU's overall execution loop is vastly longer in latency and unpredictable in length.
How closely you think the CPU and GPU would be cooperating might make a difference. Directly reading and writing the same cache lines would be either brutally slow or error-prone. Trading intermediate buffers between certain synchronization points seems possible, but I'm not sure how much of a change that would be from some of the mechanisms already available.
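A purely illustrative sketch of that buffer-trading idea, with a std::thread standing in for the GPU and a std::promise playing the role of a fence (none of this is a real GPU API; it's just the shape of the handoff):

```cpp
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

// Stand-in for ray/hit buffers in memory both sides can see.
struct SharedBuffers {
    std::vector<float> rays; // written by the "CPU" before the handoff
    std::vector<float> hits; // written by the "GPU" before the fence signals
};

int main() {
    SharedBuffers buffers;
    buffers.rays.assign(1024, 1.0f);          // CPU fills the ray queries
    buffers.hits.resize(buffers.rays.size());

    std::promise<void> fence;                 // plays the role of a GPU fence
    std::future<void> fenceSignaled = fence.get_future();

    // The "GPU": owns the buffers between submission and the fence signal.
    std::thread gpu([&buffers, &fence] {
        for (size_t i = 0; i < buffers.rays.size(); ++i)
            buffers.hits[i] = buffers.rays[i] * 0.5f; // pretend traversal result
        fence.set_value();                    // results are now safe to read
    });

    // The CPU does other work here rather than touching the shared buffers;
    // poking at the same cache lines mid-flight is the slow/error-prone case.

    fenceSignaled.wait();                     // the synchronization point
    std::printf("first hit value: %f\n", buffers.hits[0]);
    gpu.join();
    return 0;
}
```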
A more integrated approach likely means the GPU's memory and cache pipeline would be very different from what has been described so far, and I'm not sure disrupting a critical area like CPU memory handling is worth the risk.
We know that Navi has been in the works for a long time, but...
Well, how much will "Navi 1.2" change, really? Beyond just some tweaks?
We're throwing around version numbers like they mean something. We don't have a good way to define how much a given change should increment a counter, or whether AMD would care if we did. AMD resisted applying a number to GCN generations for quite some time, with sites like Anandtech going with terms like GCN 1.2 for the generation after Sea Islands to try to describe changes in an architecture AMD treated as an amorphous blob. Eventually AMD relented and labelled things GCN1 (Southern Islands), GCN2 (Sea Islands and the consoles), GCN3 ("GCN 1.2", Fiji/Tonga/Polaris), etc., then reverted to calling the next version the Vega ISA.
In that context, the consoles were already modifications of a GCN 1.x baseline, one which AMD decided to give a whole-number increment above the original hardware.
As for what is considered significant enough, hardware outside the CU array has often been updated relatively flexibly. The mid-gen consoles took on the delta color compression found in the Polaris and Vega products, and instructions for packed FP16 math from Vega showed up in the PS4 Pro. However, I think there's evidence that significant architectural changes like scalar memory writes from GCN3 did not show up, so outside of the additional hardware they were very close to the GCN2 baseline.
The transition from GCN to GCN2 may be a comparison point for the move from RDNA1 to RDNA+?. One significant change to the ISA was the addition of a new instruction group for flat addressing, alongside a modest number of new instructions being added and some being deprecated. Whether a new instruction or instructions for BVH node evaluation rises to the level of a whole addressing mode may be up to the observer.
How much time does that give devs with the new hardware before launch? Are you suggesting pushing back the launch a year to 2021, or is RDNA2 a zero-effort advance over RDNA, requiring no changes to existing code?
I may have missed confirmation of the details on the earliest dev kits for the current gen. I remember the rumor was PCs using GPUs like Tahiti.
The earliest Sea Islands GPU to be released was Bonaire in spring 2013, and Hawaii launched a little before the consoles did.
Early silicon for those GPUs would seem to be the absolute earliest point at which developers could have done anything with non-console hardware carrying Sea Islands features.