Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

iroboto · Dec 1, 2020

Globalisateur said:
Beyond3D -> random Portuguese site -> RedGamingTech -> Beyond3D

I don't mind RGT. If he's reading this, I wish he'd write out what he wants to say, edit it a little further, and be more on the signal and a little less on the noise.

Moore'sLawIsDead: that guy has nothing to offer to B3D.

function · Dec 1, 2020

mr magoo said:
RedGamingTech about PS5 unified l3 cache information found in recently discovered patent.

That patent does not show the cache arrangement used in Zen 3. RGT has been saying that PS5 uses the Zen 3 L3 arrangement. This is not that.

It could be using what's shown in the patent, or it could be using what's in Zen 3, but you need to pick only one. RGT has been saying it's using the Zen 3 arrangement up to this point.

Personally, I think the odds are that it's using a Renoir / Xbox style arrangement. Release the ~~Kraken~~ PS5 die shots already, Chipworks!

Shortbread · Dec 1, 2020

iroboto said:
I don't mind RGT. If he's reading this, I wish he'd write out what he wants to say, edit it a little further, and be more on the signal and a little less on the noise.

Moore'sLawIsDead: that guy has nothing to offer to B3D.

I believe he does... but wings it slightly on camera. Plus, his monotone speech pattern and accent (Eastern European?) somewhat distort his messaging.

Deleted member 90741 · Dec 1, 2020

Shortbread said:
Plus, his monotone speech pattern and accent (Eastern European?)

He is about as Eastern European as the Queen Of England/United Kingdom/Great Britain herself.

Shortbread · Dec 1, 2020

ethernity said:
He is about as Eastern European as the Queen Of England/United Kingdom/Great Britain herself.

He is definitely not American or Canadian native that's for sure. His accent and tone reminded me of a Ukrainian businessman that I once knew.

PSman1700 · Dec 1, 2020

It's all about compromises in the end. In consoles, they couldn't go fast and wide (6800XT etc). It was either narrow/fast or wide/slow to achieve 10/12TF. The narrow/fast approach obviously has the early advantages, while the wider/slower one will shine more later on. The end result will be close anyway: close enough to not bother the average console player, but enough to keep DF busy.

Deleted member 86764 · Dec 2, 2020

Shortbread said:
He is definitely not American or Canadian native that's for sure. His accent and tone reminded me of a Ukrainian businessman that I once knew.

There are two people, one sounds like he's of Eastern European descent and lived in the UK for a few years, and there's Paul, who is definitely English - seems like a nice chap, but could do with some elocution. Says "basically" a lot.

j^aws · Dec 2, 2020

3dilettante said:
There was some weirdness about the results that may indicate the relationship isn't that straightforward. Raw pixel throughput might not be so cleanly doubled, since it was claimed some numbers were closer to the more traditional 64 pixels at the front end, versus broader sample throughput at the back end. Hence some references to Fermi's smaller fragment throughput versus ROP throughput. Modes where the same fragment could generate more samples may see more clear scaling.
If the geometry engine itself only outputs 4 non-culled primitives, that would cap the number of rasterizers that could be addressed per cycle.

For Navi21, my explanation for this weirdness lies with the 2 Scan Converters working on 1 triangle: a coarse and a fine scan converter. We don't have much detail on their operation besides converting 1-32 fragments for each triangle.

Each Raster Unit has 2 scan converters, and from the driver leak, there is 1 scan converter per Shader Array. If triangles are mostly large, with coarse rasterisation, the Packers send off 32 fragments per Raster Unit (RU) to the Shader Arrays (SAs): 4 RUs x 32 is a peak of 128 fragments per cycle.

When triangles are mostly small, with fine rasterisation, the Packers send off 16 fragments per RU to the SAs: 4 RUs x 16 is a peak of 64 fragments per cycle.

So, at any given time, you'll get a mix of 64-128 fragments per cycle in flight.
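A rough sketch of that arithmetic in Python. The figures (4 Raster Units, 32 fragments per RU for coarse rasterisation, 16 for fine) are the ones quoted above; the function name and the linear blend by "share of small triangles" are assumptions made only for illustration, not a description of how the hardware actually schedules work.

```python
RASTER_UNITS = 4          # Navi21 figure quoted in the post
COARSE_FRAGS_PER_RU = 32  # mostly large triangles, coarse rasterisation
FINE_FRAGS_PER_RU = 16    # mostly small triangles, fine rasterisation


def peak_fragments_per_cycle(small_triangle_share: float) -> float:
    """Blend the two peak rates by the share of small triangles (0.0 to 1.0).

    The linear blend is an illustrative assumption, not a hardware model.
    """
    if not 0.0 <= small_triangle_share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    per_ru = ((1.0 - small_triangle_share) * COARSE_FRAGS_PER_RU
              + small_triangle_share * FINE_FRAGS_PER_RU)
    return RASTER_UNITS * per_ru


for share in (0.0, 0.5, 1.0):
    rate = peak_fragments_per_cycle(share)
    print(f"{share:.0%} small triangles: {rate:.0f} fragments/clk")
```

The two endpoints reproduce the 128 and 64 fragments per cycle above; anything in between is just the assumed mix.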

3dilettante said:
The claim here is that an architecture with 16-wide rasterizers is less efficient at rasterizing small triangles than one with width 32?

No, rather that 2 scan converters working on 1 triangle with coarse and fine rasterisation is more efficient than 1 scan converter working on 1 triangle.

3dilettante said:
The big source of inefficiency for rasterizing small triangles comes from the width of the hardware, be it the rasterizer, the SIMD, or RBE. I wonder if the scan converter count means the single RDNA2 rasterizer block is less unified internally.

These inefficiencies remain. What the 2 Scan Converters are doing is making fragments more granular.

3dilettante said:
I'm not sure that the PS5 would drop down to 2 triangles.

I was thinking Navi22 first, and more likely to do this. The driver leak suggests it's a Navi21 sliced in half. Then a possibility that the PS5 may be a cousin.

3dilettante said:
That's a substantial drop in raw capability for geometry, and if sharing RDNA2's organization would cut the number of shader engines--which cuts things like wavefront launch rate as well.

Well, for PS5:

2 Triangles per cycle x 2.23GHz = 4.46 Gtri/sec

That's still substantial. And Geometry/ Command Processor would be tweaked accordingly.
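For reference, the triangle-rate arithmetic can be sketched as below. The 2 triangles per cycle figure for PS5 is the speculation in this post, the 4 triangles per cycle alternative is the figure attributed to the github leak, and the helper name is hypothetical.

```python
def triangle_rate_gtri_per_sec(tris_per_clock: float, clock_ghz: float) -> float:
    """Peak triangle rate in billions of triangles per second."""
    return tris_per_clock * clock_ghz


PS5_CLOCK_GHZ = 2.23  # peak GPU clock quoted in the post

# Speculated native rate (2 tri/clk) versus the leaked legacy figure (4 tri/clk).
ps5_two_per_clock = triangle_rate_gtri_per_sec(2, PS5_CLOCK_GHZ)
ps5_four_per_clock = triangle_rate_gtri_per_sec(4, PS5_CLOCK_GHZ)

print(f"PS5 at 2 tri/clk: {ps5_two_per_clock:.2f} Gtri/sec")
print(f"PS5 at 4 tri/clk: {ps5_four_per_clock:.2f} Gtri/sec")
```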

3dilettante said:
The github leak indicated there's 4 triangles per cycle, at least for the types of geometry legacy titles would have. The legacy point would likely matter for backwards compatibility reasons, since it would be more difficult to fake the existence of two missing shader engines for PS4 Pro titles.

Yeah, Github was about backwards compatibility testing. Are you referring to the table below from Globalisateur? I can't be certain if native is 2 triangles per cycle, and legacy is 4 triangles per cycle.

3dilettante said:
Other figures may also point to ~64 pixels/clk for pixel wavefront launch.

Yes, looks like the table below.

3dilettante said:
Triangle rate by AMD's terms seems to be defined by the capabilities of the Geometry Processor, or whatever shader type might be controlled by it if that's how it works.

Pinch point would be triangles rasterised, so that particular rate would apply to a unified shader GPU and a measure for this rate. They've used that in white papers as well.

j^aws · Dec 2, 2020

iroboto said:
Sorry you are correct, I got confused through calculations.
Though not your fault, AMD kept doubling stuff, my brain is swelling from SE, to SA, to dual compute units. Easy to lose track.
The diagrams don't separate what is a shader array in RDNA 2, so it can be easy to forget you're looking at shader engines instead and not shader arrays.

Yes, you can easily get lost in all the nomenclature.

iroboto said:
I wanted to point out that the number of RBs was halved from RDNA 1 to 2. Which I guess wasn't the point you were trying to make; that instead there was 1 primitive unit and 1 raster per SA rather than per SE, therefore the setup is like RDNA 1 and not RDNA 2. Well, I'm not sure if that is correct, because the setup is not likely to define what makes it RDNA 2. Block diagrams aren't necessarily as spot on as listed; how things are connected may not necessarily be as they appear. So it could be 2 rasterizers and 2 primitive units per shader engine, whereas RDNA 2 is 1 rasterizer and 1 primitive unit per shader engine.
If the distinction is that RDNA 1 binds those units to shader arrays rather than engines, then I would question why that would be a differentiation point or whether that even matters in terms of how it could affect performance. If there isn't something that highlights the difference, I'm not going to think there is any real difference.

I've clarified this before and will reiterate. The driver leak, Hotchips and Navi21 block diagrams make this clear. Don't get lost in all that nomenclature as above. It's really simple:

RDNA1:
1 Raster Unit = 1 Scan Converter working on 1 triangle, producing up to 16 fragments

RDNA2:
1 Raster Unit = 2 Scan Converters working on 1 triangle, producing 1-32 fragments

The block diagrams show a simplistic layout.

iroboto said:
In terms of CUs as discussed earlier:

I'm going to go out on a limb and say both consoles are 1.1 according to this diagram.

Where did you get the data for RDNA1.1 and RDNA2?

iroboto said:
So as you see, it's no different from RDNA 2. I haven't found any other documentation that anything changed from RDNA 1 to 2 except for mixed precision support and support down to int 4.
So essentially, custom CUs of RDNA 1 became the default for RDNA 2. Of which custom CUs are listed in the whitepaper anyway.

Yes, as I mentioned previously, I don't expect much of a difference between RDNA1 and RDNA2.

iroboto said:
So aside from some speed differentiation in cache transfer (which may not necessarily be optimal for 2 SE setups), we're not seeing a big difference.
That really just makes RB+ the definitive architectural change between RDNA 1 and 2. And that's really just around double pumping and support for VRS.

So as per this claim:
XSX
Front-End: RDNA 1
Render-Back-Ends: RDNA 2
Compute Units: RDNA1
RT: RDNA

Front End - makes no difference - someone thinking they're smart with a block diagram
RBs = RDNA 2
CUs = RDNA 2 or 1.1 which are effectively the same
RT is RDNA 2
So I would disagree with the claim.

RB+ is designed to work effectively with rasterisation, especially the newer capabilities of coarse and fine rasterisation with the upgraded RDNA2 units. You aren't making sense by disagreeing because you are effectively saying RDNA2 Raster Units have the same capabilities as RDNA1 Raster Units, and making the below equivalent:

RDNA1:
1 Raster Unit = 1 Scan Converter working on 1 triangle, producing up to 16 fragments

RDNA2:
1 Raster Unit = 2 Scan Converters working on 1 triangle, producing 1-32 fragments

The above aren't equivalent, so the XSX frontend, specifically Raster Units, isn't equivalent to RDNA2 Raster Units. Doesn't mean the entire frontend is RDNA1, rather just that stage.

iroboto said:
edit: sorry I had to bounce around to see your claim here. There's a lot to take in at once, I apologize. It is easy to get lost, I wrote something and had to delete it. the doubling is taking an effect on me.

Yes, it can get complicated, but I simplified the difference in Raster Units above between RDNA1 and RDNA2.

iroboto said:
So basically it comes down to whether PS5 is based off Navi 22 in your opinion. And XSX is based on Navi 21 lite.

I don't have as much data for PS5 compared to XSX, so can't be that certain, but that is my thinking.

iroboto said:
That is a very interesting proposition to consider. So you're basically trading off triangle performance to have much higher efficiency at micro-geometry. And because you have fewer triangles to output per cycle, you must rely very heavily on triangle discard.

Yes, Sony have a preference for geometry in their systems. And I would say Microsoft have a preference for texturing in their systems when it comes to representing detail in rendering.

iroboto said:
Interesting proposition to say the least.

Two different solutions given the constraints.

iroboto said:
Okay well we are about 6 months out now so here are some possible explanations for launch game performance as this is getting stale and probably no longer applicable (I would hope things have changed)

Geometry Shader (GS)
• Although offchip GS was supported on Xbox One, it’s not supported on Project Scarlett.
• We’re planning to move to Next-Generation Graphics (NGG) GS in a future release of the GDK.
• Performance for GS in the June 2020 release isn’t indicative of the final expected performance and will be improved in future releases.
• GS adjacency isn’t fully supported in the June 2020 release, but we’ll provide full support in a future release.
• You might encounter some issues with GS primitive ID with certain configurations. Please let us know if you do and the configuration you used so that we can further investigate the issue and provide a fix.
• When using multiple waves in a thread group, you might encounter graphics corruption in some GSs. If you do, please let us know so that we can try to help unblock you.

So typically IIRC, geometry shaders are very low performance and I believe most people get around them using compute shaders as the workaround. But perhaps they don't suck using NGG. And that might explain why the RDNA 2 cards are performing very well in certain games? Just my thought here. It is likely outperforming a compute shader. Or just NGG in itself has been a big part of RDNA 2 success so far and more so with the titles that are supporting it. It's not exactly clear to me, nor the extent to which XSX can leverage the NGG path.

NGG path is the Primitive Shader path? Both Sony and Microsoft are promoting this, so AMD doing the same isn't surprising.

iroboto said:
This is the convo:

Yes, saw that conversation in the other thread. Seems like a faster path.

Globalisateur said:
What do you think about some of the github PS5 data in regards to the NGG vertex and primitive performance?

Github was mostly about PS5 backwards compatibility from what I recall. There are suffixes that don't make entries completely clear. These entries are of interest with the discussion around rasterisation:

- peak prim legacy = 4 prim/clk (fixed-function)
- peak NGG legacy = 8 prim/clk (primitive shader)
- peak NGG fast = 3.3 prim/clk (weird, non-integer)
- peak NGG fast / scan conv = 4 prim/clk (native?, primitive shader)

There is an entry missing:
- peak Prim Fast = native?, fixed-function

Suffix legacy is obviously for backwards compatibility. The NGG is the Primitive Shader path.

We can do some deductions where primitive shader is faster:
- NGG legacy is twice as fast as prim legacy

Why is native primitive shader slower than legacy primitive shader, i.e. why is NGG fast / scan conv (native?) half as fast as NGG legacy?

It follows for fixed-function that native is half as fast as legacy:

Prim Fast is half as fast as prim legacy (4 prim/clk).

Therefore Prim Fast (native?) = 2 prim/clk or 2 triangles per cycle.

This is half of Navi21 and XSX, and suggests RDNA2 Raster Units, but cannot be sure without more information.
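The deduction can be written out as a small Python sketch. The four known rates are the github entries listed above; the step that carries the native-to-legacy ratio from the NGG path over to the fixed-function path is the inference made in this post, not a documented figure, and the dictionary keys are just labels.

```python
# Peak primitive rates (prim/clk) as listed from the github data.
known_rates = {
    "prim legacy": 4.0,           # fixed-function, backwards compatibility
    "NGG legacy": 8.0,            # primitive shader, backwards compatibility
    "NGG fast": 3.3,              # unexplained non-integer entry
    "NGG fast / scan conv": 4.0,  # native?, primitive shader
}

# Observation 1: the primitive shader path is twice as fast as fixed-function.
ngg_speedup = known_rates["NGG legacy"] / known_rates["prim legacy"]

# Observation 2: native NGG appears to run at half the legacy NGG rate.
native_to_legacy = known_rates["NGG fast / scan conv"] / known_rates["NGG legacy"]

# Assumption (the post's inference): the same ratio applies to fixed-function,
# which gives the missing "peak Prim Fast" entry.
prim_fast_estimate = known_rates["prim legacy"] * native_to_legacy

print(f"NGG legacy vs prim legacy: {ngg_speedup:.1f}x")
print(f"native / legacy ratio: {native_to_legacy:.1f}")
print(f"estimated Prim Fast: {prim_fast_estimate:.1f} prim/clk")
```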

Tkumpathenurple · Dec 2, 2020

ThePissartist said:
There are two people, one sounds like he's of Eastern European descent and lived in the UK for a few years, and there's Paul, who is definitely English - seems like a nice chap, but could do with some elocution. Says "basically" a lot.

And he's one of vose people vat says fings wrongly by substituting "f" and "v" for "th."

Me can't be like doin wid dat init when me be like such a sick English speaker. Aaaaaaye!

But genuinely, it does annoy me a bit, and I can't listen to him for long because of it.

rabidrabbit · Dec 2, 2020

To me, a non-native English speaker, he sounds quite "posh"

iroboto · Dec 2, 2020

j^aws said:
Well, for PS5:

2 Triangles per cycle x 2.23GHz = 4.46 Gtri/sec

That's still substantial. And Geometry/ Command Processor would be tweaked accordingly.

I would suspect that depends largely on how many render targets the engines use, which could demand more or less. Something like Doom Eternal can use well over 50 render targets per frame. That, I believe, is why triangle and fill rate numbers have to rise to meet the requirements from developers.

j^aws said:
RB+ is designed to work effectively with rasterisation, especially the newer capabilities of coarse and fine rasterisation with the upgraded RDNA2 units. You aren't making sense by disagreeing because you are effectively saying RDNA2 Raster Units have the same capabilities as RDNA1 Raster Units, and making the below equivalent:

RDNA1:
1 Raster Unit = 1 Scan Converter working on 1 triangle, producing up to 16 fragments

RDNA2:
1 Raster Unit = 2 Scan Converters working on 1 triangle, producing 1-32 fragments

The above aren't equivalent, so the XSX frontend, specifically Raster Units, isn't equivalent to RDNA2 Raster Units. Doesn't mean the entire frontend is RDNA1, rather just that stage.

Interesting; okay, so I was thinking micro geometry - as in triangles that are < 16 pixels in size, so 8, 4 and 1 px explicitly and sub pixel triangles. These types of triangles tend to outright destroy the rasterization system entirely.
So you're basically saying there is an advantage here with larger triangles essentially. Or effectively, lower fidelity geometry, etc.
Yea that's a possibility of a small win here for RDNA 2 devices. Having hardware for actual micro-geometry (1 pixel triangles) where no one else did, I suspect would put the fidelity and performance well above the gap we see today. Typically you'll just stall the pipeline at resolutions above 1080p at least looking at some older GPU architectures.

Part of the reason I've been looking for this is that, IIRC, Demon's Souls uses very fine triangles, as mentioned in that developer diary. So I'm not exactly certain on their sizing just yet.

j^aws said:
Yes, as I mentioned previously, I don't expect much of a difference between RDNA1 and RDNA2.

I was on my home PC when I posted that, but I think it was a slide dump from a typical news vendor when RDNA 2 slides were released. I'll check around when I get a chance.

j^aws said:
NGG path is the Primitive Shader path? Both Sony and Microsoft are promoting this, so AMD doing the same isn't surprising.

Indeed, and this is where I thought PS5 and the 6800 and 5700 families were generating their gains relative to their competition. And with the documentation indicating it wasn't necessarily ready for XSX in the June GDK, I was looking at this as a possible vector to explain some of the differences we see.

I'm not sure how large an advantage you get from 1-32 fragment triangles, when really you're looking at 16-32 fragment triangles for keeping optimal performance. So basically, as long as the triangles are 16 to 32 pixels in size, you're keeping optimal triangles per second. But if you have to lose 2x the triangle output to gain performance on triangles 2x the size, I'm not sure what the payoff becomes. Once you drop below 16 pixels, you're going to see an exponential drop-off in triangle performance as it approaches sub-pixel. So I'm not sure if we are seeing a micro-geometry gap. But it's possible we are looking at a large-triangle gap. Might explain why the 6800 series is doing so well on Borderlands over Nvidia. Might. That's an interesting one. Borderlands seems extremely rasterization heavy. I was surprised to see XSX keep up with PS5 in that DF face-off.
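A toy model of that trade-off, in Python. It assumes a raster unit that emits at most `width` fragments per cycle for a single triangle and ignores everything else (2x2 quad shading, partial coverage, culling, packing across triangles), so it only illustrates the shape of the argument; the function names are made up for this sketch.

```python
import math


def cycles_for_triangle(pixels: int, width: int) -> int:
    """Cycles a `width`-wide raster unit needs for one triangle of `pixels` fragments."""
    if pixels < 1 or width < 1:
        raise ValueError("pixels and width must be positive")
    return math.ceil(pixels / width)


def utilisation(pixels: int, width: int) -> float:
    """Fraction of the unit's fragment slots that carry useful work."""
    return pixels / (cycles_for_triangle(pixels, width) * width)


print("pixels  16-wide  32-wide")
for px in (1, 4, 8, 16, 32):
    print(f"{px:>6}  {utilisation(px, 16):>7.3f}  {utilisation(px, 32):>7.3f}")
```

In this simplified model a 32-wide unit only pulls ahead once triangles exceed 16 pixels (one cycle instead of two for a 32-pixel triangle); below that both widths spend one cycle per triangle and the wider unit simply leaves more slots idle.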

thicc_gaf · Dec 2, 2020

iroboto said:
Sorry you are correct, I got confused through calculations.
Though not your fault, AMD kept doubling stuff, my brain is swelling from SE, to SA, to dual compute units. Easy to lose track.
The diagrams don't separate what is a shader array in RDNA 2, so it can be easy to forget you're looking at shader engines instead and not shader arrays.

I wanted to point out that the number of RBs was halved from RDNA 1 to 2. Which I guess wasn't the point you were trying to make; that instead there was 1 primitive unit and 1 raster per SA rather than per SE, therefore the setup is like RDNA 1 and not RDNA 2. Well, I'm not sure if that is correct, because the setup is not likely to define what makes it RDNA 2. Block diagrams aren't necessarily as spot on as listed; how things are connected may not necessarily be as they appear. So it could be 2 rasterizers and 2 primitive units per shader engine, whereas RDNA 2 is 1 rasterizer and 1 primitive unit per shader engine.
If the distinction is that RDNA 1 binds those units to shader arrays rather than engines, then I would question why that would be a differentiation point or whether that even matters in terms of how it could affect performance. If there isn't something that highlights the difference, I'm not going to think there is any real difference.

In terms of CUs as discussed earlier:

I'm going to go out on a limb and say both consoles are 1.1 according to this diagram.
So as you see, it's no different from RDNA 2. I haven't found any other documentation that anything changed from RDNA 1 to 2 except for mixed precision support and support down to int 4.
So essentially, custom CUs of RDNA 1 became the default for RDNA 2. Of which custom CUs are listed in the whitepaper anyway.

So aside from some speed differentiation in cache transfer (which may not necessarily be optimal for 2 SE setups), we're not seeing a big difference.
That really just makes RB+ the definitive architectural change between RDNA 1 and 2. And that's really just around double pumping and support for VRS.

So as per this claim:
XSX
Front-End: RDNA 1
Render-Back-Ends: RDNA 2
Compute Units: RDNA1
RT: RDNA

Front End - makes no difference - someone thinking they're smart with a block diagram
RBs = RDNA 2
CUs = RDNA 2 or 1.1 which are effectively the same
RT is RDNA 2

So I would disagree with the claim.

edit: sorry I had to bounce around to see your claim here. There's a lot to take in at once, I apologize. It is easy to get lost, I wrote something and had to delete it. the doubling is taking an effect on me.

So basically it comes down to whether PS5 is based off Navi 22 in your opinion. And XSX is based on Navi 21 lite.

That is a very interesting proposition to consider. So you're basically trading off triangle performance to have much higher efficiency at micro-geometry. And because you have fewer triangles to output per cycle, you must rely very heavily on triangle discard.

That could explain things for sure. I need the games to come into my house to test, if the power output is that low, might explain why. It's basically slowing down with micro-geometry and nothing is really happening.

Interesting proposition to say the least.
Okay well we are about 6 months out now so here are some possible explanations for launch game performance as this is getting stale and probably no longer applicable (I would hope things have changed)

Geometry Shader (GS)
• Although offchip GS was supported on Xbox One, it’s not supported on Project Scarlett.
• We’re planning to move to Next-Generation Graphics (NGG) GS in a future release of the GDK.
• Performance for GS in the June 2020 release isn’t indicative of the final expected performance and will be improved in future releases.
• GS adjacency isn’t fully supported in the June 2020 release, but we’ll provide full support in a future release.
• You might encounter some issues with GS primitive ID with certain configurations. Please let us know if you do and the configuration you used so that we can further investigate the issue and provide a fix.
• When using multiple waves in a thread group, you might encounter graphics corruption in some GSs. If you do, please let us know so that we can try to help unblock you.

So typically IIRC, geometry shaders are very low performance and I believe most people get around them using compute shaders as the workaround. But perhaps they don't suck using NGG. And that might explain why the RDNA 2 cards are performing very well in certain games? Just my thought here. It is likely outperforming a compute shader. Or just NGG in itself has been a big part of RDNA 2 success so far and more so with the titles that are supporting it. It's not exactly clear to me, nor the extent to which XSX can leverage the NGG path.

This is the convo:

https://twitter.com/x/status/1329260531736309760

https://twitter.com/x/status/1329259816162824194

AH! There it is! That's the graphic I was referring to earlier when discussing that Twitter leak regarding Series X. I keep forgetting what the user's name is, but while they could technically be right about their leak, they could've also been loose with it. RDNA 1 could mean RDNA 1.1, or maybe it doesn't. But given other customized work MS's put into their design I really doubt they'd take the bare 1.0 frontend and leave it at that.

In fact, it's pretty much 100% guaranteed that if they are using RDNA1, it's 1.1, since 1.0 doesn't have Int 8/Int 4 support, which we already know the Series systems do. Plus there's the addition of RT hardware in the CU; that may not be in 1.1, but it's absolutely absent in 1.0, and again, we've known the Series systems have that in their CUs.

I think we can safely put that Twitter leak to rest now; if frontend and CUs are RDNA "1" then that doesn't make any difference to RDNA 2 outside of no RT hardware in 1-based cards, which we know both Series systems and PS5 have anyway. Always thought it was a "glass half empty" leak anyway; similar to the Github stuff, it wasn't telling the full story, though that doesn't mean it was invalid.

The controversy it was generating, though, that can probably be left in the past. Can't believe some people looked at that leak and therefore implied MS (and by proxy AMD) were lying about the system being RDNA2. Although I'm pretty sure neither system has Infinity Cache and if that's a prerequisite for being "full RDNA2", then I guess we can hold them guilty in court on a fringe technicality.

cwjs · Dec 2, 2020

iroboto said:
I would suspect that depends largely on how many render targets the engines use, which could demand more or less. Something like Doom Eternal can use well over 50 render targets per frame. That, I believe, is why triangle and fill rate numbers have to rise to meet the requirements from developers.

Can you explain this assumption and what it means to somebody new to understanding this stuff (especially the hardware side)?

my (limited) understanding of the doom renderer is that it uses compute shaders to rasterize in clusters for lighting, and uses a lot of culling techniques to send a fairly normal or even small amount of triangles to the rasterizer (i am under the impression that clustered forward renderers are better for fillrate than deferred). What are the render targets and why do they put pressure on high triangle and fill rate? Are they from shadow maps or something?

iroboto · Dec 2, 2020

cwjs said:
Can you explain this assumption and what it means to somebody new to understanding this stuff (especially the hardware side)?

my (limited) understanding of the doom renderer is that it uses compute shaders to rasterize in clusters for lighting, and uses a lot of culling techniques to send a fairly normal or even small amount of triangles to the rasterizer (i am under the impression that clustered forward renderers are better for fillrate than deferred). What are the render targets and why do they put pressure on high triangle and fill rate? Are they from shadow maps or something?

I'm not the best person for this as I only worked indie (for a brief moment in my career), and even then I worked mainly with quads and not triangles. This is probably not going to be accurate as I have never needed to understand GPUs even when I was coding games; no indie game I was working on was capable of really stressing the GPUs, so memory management was the largest bottleneck for performance. I just sort of didn't care so much about performance back then. But this should give you a ballpark idea despite likely being wrong.

At a high level, pixels are drawn onto the screen using the traditional renderer path, often described as the 3D Pipeline; the legacy term would be fixed function pipeline. Basically we used to have silicon for each shader type, and that would create all sorts of bottlenecks until we unified all the shader types into 1 piece of silicon, known as the unified shader pipeline. This is the basis of what CUs do.
The 3D Pipeline is controlled by the Graphics Command Processor, and it schedules work through the various steps of the 3D Pipeline, starting from geometry, through the unified shader hardware (the CUs), and ending with rasterization. When all the shading is done, it outputs to the ROPs to write the final colour into a memory buffer. The RBs are responsible for this process, and their throughput is known as fill rate.

Compute shaders skip all these steps (they are scheduled by the Compute scheduler, also known as ACE units). You go from the CUs and write directly into a memory buffer.

So the high level idea here is that the final image that you see is often composed of multiple render targets. As the image gets more complex, you need to run through more stages to perform more complex graphics. So each stage is a render target, or multiple render targets. Once you have all your render targets, you need to blend them to create a final image. This is where things get fuzzy for me, but I believe this is where blending happens. If you think about generating a photo from film, you go through the process of adding colours onto the photo 1 colour at a time, each one blending with the last. This idea is similar in that respect.

So now you know what render targets are, the question is how render targets are generated.
The 3D Pipeline deals with triangles and no other shape; to render anything is to render a triangle. It can try to render things like quads if it accepts them, likely at a penalty. Anyway, that doesn't matter. What matters is that triangles are generated and culled, then shaded and then rendered. The rendering portion is controlled by the RBs. Most RBs are typically 4 pixels wide in today's architecture. So 4 RBs per rasterizer means it can fill a 16 pixel triangle in a single clock cycle. I do not know if you can use RBs to fill pixels without a triangle; this part is outside my knowledge. Most people calculate fill rate without triangle rate, but I think they are joined. So every time you make a render target, you are drawing the entire screen in triangles, every single time.

A 4K screen is 3840 * 2160 pixels, or about 8.29 million pixels. That means a 4K render target is also about 8.29 million pixels. The fill rate for, say, XSX is 116 Gpixels per second (1825 MHz * 64). So 8 million pixels is nothing really. Except if you have to draw 50 of them per frame. Now you're at about 415 million drawn pixels per frame, and this assumes you are getting _the absolute maximum fillrate_. Which is not realistic either, because I believe this is bound by triangle rate and also by memory bandwidth. So developers need to work around these limitations, i.e. using 4K buffers only when you need them and lower resolution buffers for things you can get away with, choosing not to use 32-bit floats for each render target, and so on, as this also takes up a lot of memory. None of this includes the remaining time to actually do calculations, which has nothing to do with everything we've spoken about here. So if rasterization takes up like 4 ms, you have 4.5ms left for calculations and post processing. Not a lot of time to make things look good.
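A back-of-the-envelope version of that budget in Python. Note that 3840 x 2160 works out to roughly 8.29 million pixels. The 64 pixels per clock, 1825 MHz clock and 50 render targets are the figures from the post; treating every render target as a full 4K buffer written exactly once at peak fill rate is a simplifying assumption, and the variable names are just for this sketch.

```python
WIDTH, HEIGHT = 3840, 2160      # 4K render target
PIXELS_PER_CLOCK = 64           # XSX ROP throughput quoted in the post
CLOCK_HZ = 1_825_000_000        # 1825 MHz
RENDER_TARGETS_PER_FRAME = 50   # "well over 50" in the Doom Eternal example

pixels_per_target = WIDTH * HEIGHT
peak_fill_rate = PIXELS_PER_CLOCK * CLOCK_HZ  # pixels per second

# Assumption: every target is full 4K and written once, at peak rate.
pixels_per_frame = pixels_per_target * RENDER_TARGETS_PER_FRAME
fill_time_ms = pixels_per_frame / peak_fill_rate * 1000

print(f"pixels per 4K target: {pixels_per_target:,}")
print(f"peak fill rate: {peak_fill_rate / 1e9:.1f} Gpixels/s")
print(f"pixels per frame: {pixels_per_frame:,}")
print(f"best-case fill time: {fill_time_ms:.2f} ms")
```

Under these assumptions the best-case fill time lands at roughly 3.5 ms per frame, before triangle rate, blending or memory bandwidth limits are taken into account.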

I haven't even brought the metric of blending into play which is something that needs to be mentioned. Fill rate is one aspect, but blending is not at the same speeds as fillrate.

cwjs · Dec 2, 2020

iroboto said:
So if rasterization takes up like 4 ms, you have 4.5ms left for calculations and post processing. Not a lot of time to make things look good.

I haven't even brought the metric of blending into play which is something that needs to be mentioned. Fill rate is one aspect, but blending is not at the same speeds as fillrate.

I really appreciate you taking the time to write that out. I'm familiar with the broad strokes of most of what you wrote already (I'm a technical artist, also indie.)

What I don't understand is how a particular renderer's workload is analyzed to see where it's bound on particular hardware. So: my naive understanding here is that fill rate is basically "how many pixels you shade how many times", and Doom should be fairly light since it's forward rendered (with no depth prepass or any similar gbuffer info, I don't think) and lighting is culled aggressively by a compute rasterizer of some kind, so there shouldn't be a lot of passes per light, and raster (I'm fuzzier here) is "how many triangles are prepared for rendering, how many times", which also ought to be light since it's a clustered renderer with aggressive culling...

In the post i quoted, it was speculated that there was a direct connection between render targets (I'm fuzzy here too -- these are any images/buffers that come out of the gpu? or are they only from the standard graphics pipeline (not compute)? or something else?) and fill rate/raster and I'm not clear on that connection (or exactly what render target means).

Is the assumption here, judging by your last paragraph, that each render target is (close to) a full resolution buffer of pixels? Is that a guess, or something we can know based on what "render target" means or some secondary source? My current operating assumption is that a lot of these render targets are little slices of the screen from the tiled part of the renderer or compositing shading from different clusters, things like that.

iroboto · Dec 2, 2020

cwjs said:
I really appreciate you taking the time to write that out. I'm familiar with the broad strokes of most of what you wrote already (I'm a technical artist, also indie.)

What I don't understand is how a particular renderer's workload is analyzed to see where it's bound on particular hardware. So: my naive understanding here is that fill rate is basically "how many pixels you shade how many times", and Doom should be fairly light since it's forward rendered (with no depth prepass or any similar gbuffer info, I don't think) and lighting is culled aggressively by a compute rasterizer of some kind so there shouldn't be a lot of passes per light, and raster (I'm fuzzier here) is "how many triangles are prepared for rendering, how many times", which also ought to be light since it's a clustered renderer with aggressive culling...

In the post I quoted, it was speculated that there was a direct connection between render targets (I'm fuzzy here too -- are these any images/buffers that come out of the GPU? Or are they only from the standard graphics pipeline (not compute)? Or something else?) and fill rate/raster, and I'm not clear on that connection (or exactly what render target means).

Is the assumption here, judging by your last paragraph, that each render target is (close to) a full resolution buffer of pixels? Is that a guess, or something we can know based on what "render target" means or some secondary source? My current operating assumption is that a lot of these render targets are little slices of the screen from the tiled part of the renderer or compositing shading from different clusters, things like that.

This may help explain some of those concepts, though not entirely. A senior member here would be able to provide better guidance than I, lol, I am tapped out. But for the journal below, I've been using the word target interchangeably with buffer.

There doesn't necessarily have to be a direct connection to fillrate, since you can bypass that with compute shaders. But typically in the indie space, we'd stick with the 3D pipeline.

http://www.adriancourreges.com/blog/2016/09/09/doom-2016-graphics-study/

cwjs · Dec 2, 2020

iroboto said:
This may help explain some of those concepts, though not entirely. A senior member here would be able to provide better guidance than I, lol, I am tapped out. But for the journal below, I've been using the word target interchangeably with buffer.

There doesn't necessarily have to be a direct connection to fillrate, since you can bypass that with compute shaders. But typically in the indie space, we'd stick with the 3D pipeline.

http://www.adriancourreges.com/blog/2016/09/09/doom-2016-graphics-study/

No problem, thanks for sharing the info. Reading through that again, it looks like more changed between the two Dooms than I thought.

You might be interested in this breakdown of Doom Eternal's renderer too: https://simoncoenen.com/blog/programming/graphics/DoomEternalStudy.html -- they did away with the g-buffer and even render things like the screen space reflections in the forward shader now. The discrepancies between the Doom 2016 mostly-forward, mostly-clustered renderer and Doom Eternal's new renderer account for most of my confusion.

(Additional SIGGRAPH slides: https://advances.realtimerendering.com/s2020/RenderingDoomEternal.pdf)

3dilettante · Dec 2, 2020

j^aws said:
For Navi21, my explanation for this weirdness lies with the 2 Scan Converters working on 1 triangle: a coarse and a fine scan converter, whose operations we don't have much detail on besides converting 1-32 fragments for each triangle.

Can you provide a reference again to the coarse/fine scan converter discussion?
The usage of the term for coarse seems a little out of line from what the term coarse rasterization would imply. Coarse rasterization's typical use case would yield no coverage information usable for wavefront launch.

j^aws said:
Each Raster Unit has 2 scan converters, and from the driver leak, there is 1 scan converter per Shader Array. If triangles are mostly large, with coarse rasterisation, the Packers send off 32 fragments per Raster Unit to the Shader Arrays, 4x RU x 32 is peak 128 fragments per cycle.

Coarse rasterization would help determine which tiles or regions of screen space may have coverage by a primitive. In hardware or primitive shader, it may help determine which shader engines are passed a primitive for additional processing. I wonder if some coarse rasterization checks can be handled by the geometry processor.
However, knowing a general region may be touched by a triangle isn't sufficient to generate coverage information for a pixel shader, so subsequent rasterization at finer granularity would be needed.
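To make the coarse/fine distinction concrete, here is a minimal software sketch in Python. It is not how the hardware does it: the bounding-box tile test, the 8-pixel tile size and the function names are all assumptions chosen for illustration. The coarse pass only says which tiles might be touched; the fine pass produces the per-pixel coverage a pixel shader launch would need.

```python
import math

def edge(ax, ay, bx, by, px, py):
    """Edge function: sign tells which side of the line a->b the point p is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def coarse_tiles(tri, tile, width, height):
    """Coarse pass (conservative): tiles overlapped by the triangle's bounding box."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0 = max(math.floor(min(xs)) // tile, 0)
    y0 = max(math.floor(min(ys)) // tile, 0)
    x1 = min(math.floor(max(xs)) // tile, (width - 1) // tile)
    y1 = min(math.floor(max(ys)) // tile, (height - 1) // tile)
    return [(tx, ty) for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]

def fine_coverage(tri, tx, ty, tile):
    """Fine pass: pixels in one tile whose centres lie inside the triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    covered = []
    for y in range(ty * tile, (ty + 1) * tile):
        for x in range(tx * tile, (tx + 1) * tile):
            px, py = x + 0.5, y + 0.5
            w0 = edge(ax, ay, bx, by, px, py)
            w1 = edge(bx, by, cx, cy, px, py)
            w2 = edge(cx, cy, ax, ay, px, py)
            # Accept either winding order: all three edge values share a sign.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.append((x, y))
    return covered

tri = [(1.0, 1.0), (14.0, 2.0), (3.0, 13.0)]   # hypothetical triangle on a 16x16 screen
tiles = coarse_tiles(tri, tile=8, width=16, height=16)
for tx, ty in tiles:
    print((tx, ty), len(fine_coverage(tri, tx, ty, tile=8)), "pixels covered")
```

In this example the coarse pass flags all four tiles, but the fine pass finds no covered pixels in tile (1, 1), which is the point: a coarse hit alone is not coverage information.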

Using two scan converters to cover the same triangle may not pair well with their usage model with RDNA1. A wavefront only references one packer, which wouldn't align with more than one scan converter being involved in a wavefront's launch since packers are per-SC. However, a shader engine so far has hardware that works on launching one wavefront at a time.

j^aws said:
2 Triangles per cycle x 2.23GHz = 4.46 Gtri/sec

It would become a problem in instances where one of the backwards compatibility modes is invoked, and iso-clock the PS5 would lose to the prior console it's trying to emulate.

j^aws said:
Yeah, Github was about backwards compatibility testing. Are you referring to the table below from Globalisateur? I can't be certain if native is 2 triangles per cycle, and legacy is 4 triangles per cycle.

Legacy would give an indication of some of the hardware paths that govern both. Some of the NGG figures may include the extra culling capabilities of the new path, versus legacy hardware that could only evaluate 4 primitives per cycle regardless of final culling status. The post-cull figure for NGG aligns with the 4 primitives/cycle for legacy, and post-cull is what is relevant to the RBEs.

j^aws said:
NGG path is the Primitive Shader path? Both Sony and Microsoft are promoting this, so AMD doing the same isn't surprising.

NGG is a reorganization of the internal shaders used for geometry processing, which seems to include hardware and compiler changes. Primitive shaders are a part of it, but there were other changes like the merging of several internal shader types that were discussed separately--at least for prior GPU generations.

j^aws said:
- peak prim legacy = 4 prim/clk (fixed-function)
- peak NGG legacy = 8 prim/clk (primitive shader)
- peak NGG fast = 3.3 prim/clk (weird, non-integer)
- peak NGG fast / scan conv = 4 prim/clk (native?, primitive shader)

Legacy hardware for the PS4 generation wouldn't have the ability to cull 2 primitives per cycle, so the 4 to 8 jump could come just from the marketed 8 pre-cull figure for the geometry processor in RDNA.
The specific tests have different settings, which may impact what is being measured. The specific format being processed and culling settings can change behavior, and 3.3 primitives from a 10-vertex peak rate makes sense assuming 3 vertices/triangle.
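As a quick sanity check on those numbers, here is a small Python sketch. The 10 vertices/clk peak, the 3 vertices per triangle and the 2.23 GHz clock all come from figures quoted in this thread; treating the tests as directly comparable is an assumption made only for the arithmetic, and the function name is made up for the example.

```python
# Relating per-clock primitive rates to per-second rates, using figures
# quoted in the thread. Assumes a plain triangle list with no vertex reuse.
VERTICES_PER_CLOCK = 10      # peak vertex rate cited for the "NGG fast" case
VERTICES_PER_TRIANGLE = 3
CLOCK_GHZ = 2.23             # PS5 GPU clock quoted earlier in the thread

def gprims_per_second(prims_per_clock, clock_ghz=CLOCK_GHZ):
    """Convert a per-clock primitive rate into billions of primitives per second."""
    return prims_per_clock * clock_ghz

ngg_fast = VERTICES_PER_CLOCK / VERTICES_PER_TRIANGLE   # about 3.33 prim/clk

rates = {
    "peak prim legacy": 4,
    "peak NGG legacy": 8,
    "peak NGG fast": ngg_fast,
    "peak NGG fast / scan conv": 4,
}
for name, per_clk in rates.items():
    print(f"{name:28s} {per_clk:5.2f} prim/clk -> {gprims_per_second(per_clk):5.2f} Gprim/s")
```

So the non-integer 3.3 falls straight out of 10 / 3, and the 4.46 Gtri/sec figure corresponds to 2 prim/clk at 2.23 GHz.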

j^aws said:
Why is NGG fast / scan conv (native?) half as fast as NGG legacy?

It may be a different mode, or it's not measuring the same facet of the front end as the others.

j^aws said:
Therefore Prim Fast (native?) = 2 prim/clk or 2 triangles per cycle.

Given that the tests may not be comparable to one another, why assume a different value than what is given? Which entries say 2 primitives/clk?

Globalisateur · Dec 2, 2020

3dilettante said:
Can you provide a reference again to the coarse/fine scan converter discussion?
The usage of the term for coarse seems a little out of line from what the term coarse rasterization would imply. Coarse rasterization's typical use case would yield no coverage information usable for wavefront launch.

Coarse rasterization would help determine which tiles or regions of screen space may have coverage by a primitive. In hardware or primitive shader, it may help determine which shader engines are passed a primitive for additional processing. I wonder if some coarse rasterization checks can be handled by the geometry processor.
However, knowing a general region may be touched by a triangle isn't sufficient to generate coverage information for a pixel shader, so subsequent rasterization at finer granularity would be needed.

Using two scan converters to cover the same triangle may not pair well with their usage model with RDNA1. A wavefront only references one packer, which wouldn't align with more than one scan converter being involved in a wavefront's launch since packers are per-SC. However, a shader engine so far has hardware that works on launching one wavefront at a time.

It would become a problem in instances where one of the backwards compatibility modes is invoked, and iso-clock the PS5 would lose to the prior console it's trying to emulate.

Legacy would give an indication of some of the hardware paths that govern both. Some of the NGG figures may include the extra culling capabilities of the new path, versus legacy hardware that could only evaluate 4 primitives per cycle regardless of final culling status. The post-cull figure for NGG aligns with the 4 primitives/cycle for legacy, and post-cull is what is relevant to the RBEs.

NGG is a reorganization of the internal shaders used for geometry processing, which seems to include hardware and compiler changes. Primitive shaders are a part of it, but there were other changes like the merging of several internal shader types that were discussed separately--at least for prior GPU generations.

Legacy hardware for the PS4 generation wouldn't have the ability to cull 2 primitives per cycle, so the 4 to 8 jump could come just from the marketed 8 pre-cull figure for the geometry processor in RDNA.
The specific tests have different settings, which may impact what is being measured. The specific format being processed and culling settings can change behavior, and 3.3 primitives from a 10-vertex peak rate makes sense assuming 3 vertices/triangle.

It may be a different mode, or it's not measuring the same facet of the front end as the others.

Given that the tests may not be comparable to one another, why assume a different value than what is given? Which entries say 2 primitives/clk?

From my memory of that GitHub data, the 2 NGG legacy values are there for Pro legacy, not PS4.
