Understanding game performance through profiler trace graphs *spawn

davis.anthony · Aug 19, 2024

I love this topic, my die hard knowledge of this stuff is limited but I do enjoy reading it and may ask the odd clever question.

One thing that immediately struck me, if GPU's aren't using all of their flops at any given time, wouldn't it not be better for AMD/Nvidia to look at making the required changes (the cache and memory it looks like) to increase GPU efficiency rather then to keep doing what they are? - (Maybe not worded great)

I genuinely thought a few years ago that all the GPU's would be using HBM2 by now for gobs of bandwidth at reduced latency which it seems RT requires, but no sign of it.

Or maybe a hybrid of 2GB HBM2 for BVH work and then a pool of GDDR6/7 as normal.

Albuquerque · Aug 19, 2024

Who says AMD and NVIDIA aren't doing exactly what you describe, re: making changes?

The reality is, just like for x86 compute, GPU compute is not homogenous. There are millions of unique workloads which can come down the pike, and some optimizations are only useful for some workload types. As a handwaving exercise, everything benefits from more bandwidth, less latency, higher throughput, and more compute cycles. But not everything benefits equally, and no matter what is "added", there will always be bubbles in the instruction pipelines.

We just had a parallel version of this conversation about hyperthreading a few days ago in another thread. Basically, it's functionally impossible to guarantee 100% utilization of any compute pipeline while generating useful output. There will always be unused resources on the table, and there's no magic wand that guarantees more usage of any one resource results in a specifically better outcome. This is the same reason why stack traces of this nature aren't a big conversation point about measuring CPU performance either -- as it turns out, your CPU has gobs of functional units which are completely idle during even the most intensive workloads, just like your GPU does. Trying to fill them all is pointless, because ultimately the ability to fully pack a compute pipeline becomes harder than just throwing more bandwidth, less latency, and more compute cycles at it.

Even with infinite time to plan an execution strategy for arranging instructions flowing into a compute pipeline, you still cannot always fill it. That's not how real world workloads work.

An easy example to understand is why "Multicore rendering" still cannot fully consume all CPUs available, even to this day. Turns out, your GPU is a massive multicore rendering unit, and sometimes some of those resources can't be filled. Is looking at the graph and tilting at that windmill going to fix it? Turns out, developers don't hate users, and in fact generally want their game to perform well so they can sell more of the game, and for that matter, sell more future games without being labeled as lazy or shitty. Yet as I described, sometimes the answer is simpler to throw hardware at it rather than try to iron out the random bubbles across eighty different GPU parts floating around in the wild.

DavidGraham · Aug 19, 2024

Dark silicon is a real thing in GPUs, if GPUs fired every circuit in every second power consumption would triple in an instant, and no amount of cooling would be enough to handle it. GPU vendors rely heavily on dark silicon when they design their GPUs.

raytracingfan · Aug 19, 2024

Frenetic Pony said:
The different design of current L2 caches on Nvidia/AMD offer a good contrast. Nvidia in 4XXX has gone with a giant L2 cache that can keep a decent amount of a BVH pretty close to the RT/CUs, so odds are decent that asking for a given part of a BVH will be quick. Meanwhile AMD with RDNA3 has an absolutely enormous bandwidth L2 (almost double the bw as Nvidias) but it's also a much much smaller one than Nvidias, meaning if the data isn't in there you're going to wait longer for it.

If there's a wide array of potential data you need bits and pieces of quickly then Nvidia's design clearly wins. If you need a smaller array of data but want as much of it as possible then AMD's design wins.

The cost, however, of Nvidia's design is it's much bigger. In cache designs the larger the cache the higher latency it has and the more the cost goes up. This is because caches store a smaller range of data than the full range of memory the GPU has access to, to compress this memory range and keep track of what is where at the same time takes book keeping and that book keeping costs memory and time itself. So it's a tradeoff of size versus cost in terms of both latency and $$$. The larger the cache, the more book keeping, but the more likely the data you want is in that cache.

How much does the L3 Infinity Cache help with ray tracing?

Albuquerque · Aug 19, 2024

A larger cache means cache-coherency of rays can potentially go up -- there's more in the cache to be coherent with. At the same time, non-coherent rays are absolutely a thing, necessary for specific types of effects such as reflections and refractions and certain types of material diffusion. At some level, a large enouch cache could possibly include those rays, but then again, you're approaching the point where the cache is larger than the rendered scene itself. At that point, it's just VRAM again, and faster, smaller caches will exist to accelerate smaller, more tightly-bundled datasets. And now we're back to simply adding more bandwidth and less latency and more clock cycles.

The actual answer, as you expect, is "it depends."

Shifty Geezer · Aug 19, 2024

I guess the limiting factor is speed is inversely proportional to size. Without a way to increase one without affecting the other, you'll always be constrained.

Albuquerque · Aug 19, 2024

Indeed they are, however time and the steady evolution of technology also allow both to generally increase. Imagine the VRAM speeds and sizes we have today, versus the L1 / L2 cache speeds and sizes we had a decade ago. Although to be fair, the latency hasn't changed much in absolute wall-clock time; much of a GPU scheduler's workload is hiding memory latency and will always and forever be so. Cache will always have the latency advantage, if for no other reason than the pure physics of being being wired physically closer to the compute units.

cwjs · Aug 19, 2024

trinibwoy said:
Makes sense to me.

I don’t share the same concerns about weaponizing B3D discussions or death threats but I’m sure you guys have your reasons. It’s pretty unlikely that we can fix human behavior on the internet by censoring ourselves at B3D.

I for one am not as worried about weaponizing anything, and I’m all for more people learning to dig into this stuff (I’m doing it myself!) — my caution is along these lines:

1. Your conclusions are going to be wrong. Without signal on what the engine code is doing, how the hardware works, how the shader compiler works, or how the content is set up, plus a lot of intuition, most captures of complex content are basically noise, and

2. If you want to dive deeper until you start to have pretty good guesses you’re going to end up a graphics programmer by the end of the journey.

Regardless, it’s fun to speculate, but you’re going to have professionals throwing cold water on your conclusions and they’re going to (I can say from experience, frustratingly) shrug their shoulders when you ask them to elaborate unless the capture shows a very classic case or it is their area of their engine (and it’s appropriate for them to share publicly)

davis.anthony said:
One thing that immediately struck me, if GPU's aren't using all of their flops at any given time, wouldn't it not be better for AMD/Nvidia to look at making the required changes (the cache and memory it looks like) to increase GPU efficiency rather then to keep doing what they are? - (Maybe not worded great)

There’s a ton of this already, it’s part of what makes this so complicated — between the shader code, the engine architecture, the driver, and the hardware, there’s a lot going on to hide latency (the most simple example: texture reads are high latency, so unless a shader explicitly (or accidentally!) requires texture data before doing other math, shaders will be re-arranged to front load all independent work while waiting on texture samples) and improve cache behavior, whether that means more cache friendly algorithms, using different types of memory for different data (both programmer-specified and opaque), etc. of course, because there’s so much complexity and so many clever tricks, the cache friendly behavior on one gpu might turn out to be the catastrophic slow case on another, even depending on what *other* work is in flight, sometimes even within the same gpu families.

Albuquerque · Aug 19, 2024

To make matters worse, the operating system and even discrete driver versions can and will affect these profiler data. There's nothing here so simple as "well the engine isn't doing something" because whatever that "something" is can be fully external to the application's sphere of control. Why do we see "Game Ready Drivers" for new games? Why didn't they just work? Why didn't the application teams just do it right the first time so the driver update wasn't required? Because that's not how the ecosystem actually works.

The plots are interesting; they're even more interesting to compare and contrast. They're especially interesting when running A-B comparisons when an individual element of the ecosystem is adjusted (new driver version before vs after, or new application patch before vs after, or perhaps a new setting in the app configuration before vs after...) Just splashing proflier traces of a single instance in time from a single app is functionally pointless.

Shifty Geezer · Aug 19, 2024

Albuquerque said:
The plots are interesting; they're even more interesting to compare and contrast. They're especially interesting when running A-B comparisons when an individual element of the ecosystem is adjusted (new driver version before vs after, or new application patch before vs after, or perhaps a new setting in the app configuration before vs after...) Just splashing proflier traces of a single instance in time from a single app is functionally pointless.

I think this. If we're going to do profile comparisons, we need to ensure they are A/B tests. It'd be really interesting to see the difference on, say, a new UE version to see how it's changed, and then profile the same game using different features, or two hardwares running the same benchmark, without being judgemental on the outcomes or claim a deficit or failure. Comparisons would be "this game is not getting the same utilisation as that game on this hardware" and "this GPU gets more use at that point than this other GPU" without blaming anyone and taking to the internet to say someone somewhere should be doing better. Those conclusiosn need to be left to those who really know what they are talking about.

We can't stop other corners of the internet taking such findings and making up their own s*** about it, but that's true of all information. We can just ensure we're above it and provide a place for those genuinely curious to explore the inner workings more deeply than we have before.

trinibwoy · Aug 19, 2024

cwjs said:
Your conclusions are going to be wrong. Without signal on what the engine code is doing, how the hardware works, how the shader compiler works, or how the content is set up, plus a lot of intuition, most captures of complex content are basically noise.

No different to the current state of the industry where “journalists” draw all sorts of conclusions based only on frame rates every day. How many times have we heard people declare that XYZ engine or game is better optimized for a given architecture based on just average fps?

cwjs said:
Regardless, it’s fun to speculate, but you’re going to have professionals throwing cold water on your conclusions and they’re going to (I can say from experience, frustratingly) shrug their shoulders when you ask them to elaborate unless the capture shows a very classic case or it is their area of their engine (and it’s appropriate for them to share publicly)

Great, that means people are actually talking. Experts pouring cold water on layman speculation is par for the course. I find this whole topic bizarre to be honest. I can’t recall any time in the past where people here were so nervous about sharing data or cared what the rest of the internet thinks.

Shifty Geezer · Aug 19, 2024

trinibwoy said:
Great, that means people are actually talking. Experts pouring cold water on layman speculation is par for the course. I find this whole topic bizarre to be honest. I can’t recall any time in the past where people here were so nervous about sharing data or cared what the rest of the internet thinks.

The forums had never been terminated from fanboy warring before... Many of the new mods are industry folk invited to keep B3D alive, and they have reservations based on, quite frankly, decades of wrestling with noise on these boards. So many opinions presented with only a view to attack those who don't agree, and so often data has been weaponised.

Those who survived the Console Wars of the noughties are loathe to ever end up back there. Lest We Forget. Some were left thinking total Profile Disarmament is the only way to be safe.

Andrew Lauritzen · Aug 19, 2024

trinibwoy said:
It’s pretty unlikely that we can fix human behavior on the internet by censoring ourselves at B3D.

Indeed, but it's about maintaining an atmosphere that is conducive to having friendly discussions that are less likely to get blown up, taken out of context and so on. For instance, digging deep into rendering algorithms and performance across a bunch of games and tech demos and comparing and tweaking how things run and so on is great, and unlikely to cause problems. Posting captures and speculation specifically about the hottest new benchmark to drop when the media and consumers are looking for headlines (and sometimes blood) is a very different risk profile, I hope for obvious reasons. If it were purely about understanding hardware and algorithms there are *tons* of games and workloads we could already run and profile to that end; I think people are perhaps fooling themselves if they don't admit that part of wanting to jump on the latest talked-about thing is because they want to engage with the broader consumer audience and get attention via the conclusions drawn.

Now of course the internet will be the internet with or without B3D's help. But for the post part engineers can't/don't engage on sites that are too tied to media headlines, for good reasons. I enjoy engaging here at B3D and selfishly want to keep it as a place that I can continue to participate in. Selfishness aside, I believe this sentiment is shared by the owners and moderators here because it's largely the main thing that makes B3D unique. There are many places where you can go to just chat with other consumers about tech and benchmarks and so on.

Andrew Lauritzen · Aug 19, 2024

davis.anthony said:
One thing that immediately struck me, if GPU's aren't using all of their flops at any given time, wouldn't it not be better for AMD/Nvidia to look at making the required changes (the cache and memory it looks like) to increase GPU efficiency rather then to keep doing what they are? - (Maybe not worded great)

These things are constantly being adjusted, every single generation. Different SKUs within the stack even adjust ratios to better suit the workloads that you tend to run on lower end vs. higher end GPUs, etc.

But this sort of "rate thinking" misses a key factor - it's not that there's "memory work" and "math work" and "raytracing work" and "rasterization work" that you can just send off to asynchronous units and then balance how many of each you have. It's the big mess of dependencies between all of these things (from coarse grained barrier stuff all the way down to fine grained instruction stuff) and getting data to the right places at the right times that is what makes this hard. Different passes and parts of the frame will have very different bottlenecks.

For example, while a GPU's ALUs may only be 20% active over the total frame, it is very common that the majority of that use falls into a couple really FLOPS-heavy passes. Thus if you lower the amount of ALUs, you still reduce throughput (i.e. performance), even if you theoretically have 5x as many as you need over the full frame. "So why not just figure out a way to run all those things in parallel so that you spread the ALU work out evenly"? Well, maybe the entire rest of the frame depends on getting that 20% heavy ALU work done as quickly as possible up front? This is not hypothetical - this is the very nature of GPU programming. We have various tools like pipelining and interleaving and async compute and ubershaders and so on to try and spread the work more evenly and keep more units active, but all of them still have to obey the underlying dependencies. The only way to break those dependencies in many cases is to overlap larger sections of work (even multiple frames) with different sections from other frames but this is generally undesirable at the coarse scale because it both increases memory pressure (now you need two or more copies of all the transient stuff), but also latency.

Anyways this is just scratching the surface, but hopefully this gives some intuitive about why it is both expected and reasonable for units to appear under-utilized.

Andrew Lauritzen · Aug 19, 2024

cwjs said:
Regardless, it’s fun to speculate, but you’re going to have professionals throwing cold water on your conclusions and they’re going to (I can say from experience, frustratingly) shrug their shoulders when you ask them to elaborate unless the capture shows a very classic case or it is their area of their engine (and it’s appropriate for them to share publicly)

Right and let me be clear about why we often do this... it's not because we can't make our own guesses about what's going on or whatever. It's because we frequently see folks (even occasionally other graphics engineers) posting speculation about stuff that we are very familiar with and it's often completely wrong. I could easily - but for similar reasons won't - give plenty of examples of even top of the field graphics engineers posting stuff that is completely untrue on twitter. No one is immune to this, but generally the more experienced you are the more you understand how many degrees of freedom there really are.

We all have our moments, but I want to avoid being "that guy" as much as possible. Even if the broader internet doesn't ever get the full story, I don't want my colleagues and other graphics engineers shaking their head at random uninformed speculation that I post as if its fact. And to be clear, that's what it would be if it was based purely on a hardware trace with no markers or source.

trinibwoy · Aug 19, 2024

Shifty Geezer said:
The forums had never been terminated from fanboy warring before... Many of the new mods are industry folk invited to keep B3D alive, and they have reservations based on, quite frankly, decades of wrestling with noise on these boards. So many opinions presented with only a view to attack those who don't agree, and so often data has been weaponised.

Those who survived the Console Wars of the noughties are loathe to ever end up back there. Lest We Forget. Some were left thinking total Profile Disarmament is the only way to be safe.

I’m not sure this is the right response to those fears though. At least in this specific case with the Wukong traces that initiated this debate. We had multiple traces from a 3090 and a 4080. They showed much higher RT unit utilization on the 4080 in the same workload. Hopefully this is a non controversial fact. We can speculate that this is due to the larger L2 on Ada. At no point did anyone claim UE5 sucks or Wukong devs suck based on those traces which seems to be the main concern here. It’s a pre-emptive defense of an attack that may never have materialized.

Andrew Lauritzen said:
Indeed, but it's about maintaining an atmosphere that is conducive to having friendly discussions that are less likely to get blown up, taken out of context and so on. For instance, digging deep into rendering algorithms and performance across a bunch of games and tech demos and comparing and tweaking how things run and so on is great, and unlikely to cause problems. Posting captures and speculation specifically about the hottest new benchmark to drop when the media and consumers are looking for headlines (and sometimes blood) is a very different risk profile, I hope for obvious reasons.

Is it reasonable to expect B3D members to spend dozens of hours profiling a bunch of games in order to post here? Clearly no one is going to do that and no one has done it in the past. Not even the content creators who get paid for their time are going to do that. We need to be realistic.

Andrew Lauritzen said:
If it were purely about understanding hardware and algorithms there are *tons* of games and workloads we could already run and profile to that end; I think people are perhaps fooling themselves if they don't admit that part of wanting to jump on the latest talked-about thing is because they want to engage with the broader consumer audience and get attention via the conclusions drawn.

Hopefully that’s not surprising. The B3D community isn’t immune to the appeal of new games and content and discussing hot new games isn’t somehow nefarious.

cwjs · Aug 19, 2024

trinibwoy said:
Is it reasonable to expect B3D members to spend dozens of hours profiling a bunch of games in order to post here?.

Post here no, but “not get shot down immediately making guesses about a random nsight screenshot” Well, yeah, honestly, it might be? Some things are just require lots of time to be able to have a conversation about — same way you might be expected to read a dense difficult book before showing up at a book club meeting. I can’t speak to whether anyone is trying to *stop* you but your guesses are likely to be dismissed out of hand.

Re: your points here and earlier about journalists — sure, lots of people make unqualified guesses, if they post here I’ll say the same thing!

Albuquerque · Aug 19, 2024

I think @cwjs summarized it perfectly. Lots of dumb in the world which we're not going to fix. Anyone can post all kinds of things on B3D, however this doesn't mean their conjectures are assumed accurate and nor should they be considered "protected". And with enough evidence to the contrary of their diatribe, someone who continually flaunts bad posting decisions may eventually find their post content culled or their posting abilities limited.

I very much doubt it would come to such a pass with the current participants of this thread; I'm simply laying out what should be a rational expectation for enforcing community standards here at B3D. We can't be held responsible for content outside of our forums and control; we certainly can and will be responsible for the content within.

trinibwoy · Aug 20, 2024

Where is this expectation coming from that people should be posting dissertations on GPU architecture else they be “shot down immediately”? This forum thrives on poorly informed speculation otherwise it would be a graveyard. Unless I’ve missed the reams of highly accurate posts with insider info that you guys are holding up as the gold standard for participating here.

Maybe that’s the dream but it’s certainly not the reality. I’ve gotta assume there’s some context here that I don’t have otherwise this convo makes zero sense.

Albuquerque said:
Anyone can post all kinds of things on B3D, however this doesn't mean their conjectures are assumed accurate and nor should they be considered "protected".

That’s what I thought too.

trinibwoy · Aug 20, 2024

cwjs said:
Post here no, but “not get shot down immediately making guesses about a random nsight screenshot” Well, yeah, honestly, it might be?

Since we’re just going in circles now can you be more specific as to which guesses you found particularly offensive and why?

Understanding game performance through profiler trace graphs *spawn

davis.anthony

Albuquerque

Red-headed step child

DavidGraham

raytracingfan

Albuquerque

Red-headed step child

Shifty Geezer

uber-Troll!

Albuquerque

Red-headed step child

cwjs

Albuquerque

Red-headed step child

Shifty Geezer

uber-Troll!

trinibwoy

Meh

Shifty Geezer

uber-Troll!

Andrew Lauritzen

Moderator

Andrew Lauritzen

Moderator

Andrew Lauritzen

Moderator

trinibwoy

Meh

cwjs

Albuquerque

Red-headed step child

trinibwoy

Meh

trinibwoy

Meh

Similar threads