Understanding game performance through profiler trace graphs *spawn

Someone has to bring issues to the forefront and I hardly think the people responsible are inclined to do so. It's great that individuals outside game development can provide analytic traces and open discussion on potential causes and issues without waiting for developers to "kick the ball". If a game does have an issue, hopefully the developers are working to resolve it rather than paying attention to "novice" trace-analysis discussion among the general public.

Can you imagine if the shader compilation issue had been left up to developers "to get the ball rolling", without any outside interest from the general public and media?
 
It's uninteresting to say that an opaque workload is using the GPU "efficiently", in the same way that busy-waiting a CPU at 100% is or is not an "efficient" use
I remember very well an incident that demonstrated how such an analysis can lead to incorrect conclusions: the Chips and Cheese article that analyzed Starfield's launch performance. They profiled the game using the standard tools and came to the conclusion that AMD's overperformance in this title (at launch) was mostly due to better occupancy on AMD GPUs versus worse occupancy on NVIDIA GPUs.

Fast forward two months and the game received multiple patches that boosted NVIDIA's performance by 40% to 50%, restoring the natural order of performance among GPUs; AMD GPUs were no longer overperforming versus comparable NVIDIA GPUs.

Worse yet, the site never revisited the patched game to "learn" from the changes; they presented their inaccurate conclusions and forgot about it forever.

 
Someone has to bring issues to the forefront and I hardly think the people responsible are inclined to do so. It's great that individuals outside game development can provide analytic traces and open discussion on potential causes and issues without waiting for developers to "kick the ball". If a game does have an issue, hopefully the developers are working to resolve it rather than paying attention to "novice" trace-analysis discussion among the general public.

Can you imagine if the shader compilation issue had been left up to developers "to get the ball rolling", without any outside interest from the general public and media?

That’s why I love what chipsandcheese is doing even if they don’t get it 100% right. Literally nobody else is trying to pull back the curtain on GPU performance. At least not in public. Everyone else is just throwing up the same boring frame rate graphs.
 
That’s why I love what chipsandcheese is doing even if they don’t get it 100% right. Literally nobody else is trying to pull back the curtain on GPU performance. At least not in public. Everyone else is just throwing up the same boring frame rate graphs.
It's certainly difficult to find info you can learn from.
 
Fast forward two months and the game received multiple patches that boosted NVIDIA's performance by 40% to 50%, restoring the natural order of performance among GPUs; AMD GPUs were no longer overperforming versus comparable NVIDIA GPUs.

Worse yet, the site never revisited the patched game to "learn" from the changes; they presented their inaccurate conclusions and forgot about it forever.

Was it really inaccurate? It sounds to me like occupancy was probably the key difference, and NVIDIA just managed to improve their compiler to either improve occupancy (by reducing register pressure etc.) or improve instruction/memory level parallelism.

NVIDIA does have fewer registers per FMA, so they are naturally more dependent on the compiler doing a good job for this kind of workload.

It’d be great if they had the time to revisit it and analyse why it changed, whether it is what I’m suggesting or whether their initial analysis was just wrong, but given all the other great articles they’ve published in the meantime I can’t really blame them too much for not prioritising this.
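
To put a rough number on why register pressure matters so much here, a back-of-the-envelope sketch (my own illustrative figures for a hypothetical SM, not anything from the article or the game): the register file is a fixed size, so every extra register a shader needs per thread directly reduces how many warps can be resident at once.

```c
/* Back-of-the-envelope occupancy sketch. The limits below are assumptions
 * for a hypothetical SM, not any specific GPU or anything from the article. */
#include <stdio.h>

#define REGISTERS_PER_SM 65536  /* 32-bit registers in the SM register file (assumed) */
#define MAX_WARPS_PER_SM 48     /* hardware warp-slot limit (assumed) */
#define WARP_SIZE        32

static int max_resident_warps(int regs_per_thread)
{
    /* Warps that fit in the register file, capped by the warp-slot limit. */
    int warps_by_regs = REGISTERS_PER_SM / (regs_per_thread * WARP_SIZE);
    return warps_by_regs < MAX_WARPS_PER_SM ? warps_by_regs : MAX_WARPS_PER_SM;
}

int main(void)
{
    const int regs[] = { 32, 64, 96, 128 };
    for (int i = 0; i < 4; ++i) {
        int warps = max_resident_warps(regs[i]);
        printf("%3d regs/thread -> %2d resident warps (%.0f%% occupancy)\n",
               regs[i], warps, 100.0 * warps / MAX_WARPS_PER_SM);
    }
    return 0;
}
```

With those made-up limits, going from 128 to 64 registers per thread doubles the number of resident warps, which is exactly the kind of headroom a compiler update could claw back.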
 
That’s fair, though I am a bit puzzled why we’re so skittish about interpreting data in an industry where nearly all conclusions are based on a few high-level metrics (frame rate and frame times, if we’re lucky).

What would you guys suggest as an appropriate methodology for evaluating game performance and efficiency?
I for one think looking at and trying to interpret the data is good, I just want to caution that it is harder than it looks. IMO profiling cpu is fairly "easy" (although for an engine like ue the tradeoffs that lead to cpu perf issues might be complex), but gpu profiling, especially comparatively between different hardware, demands quite a lot of detective work and knowledge of hardware and connecting the dots. I suspect Andrew is the only one in this thread qualified enough to make good guesses, but even for an expert, and even for an engine you *don't* have a professional responsibility for, guessing publicly can be an iffy proposition.


I'll offer some unqualified suggestions though:
I don't know anything about RT hardware perf, probably less than you do, so I can't even really offer a confident direction, but I find the nsight capture you screenshotted a little confusing at first glance. Like you said, utilization is fairly low, and it's low for quite a long portion of the frame, but memory throughput is quite low too -- at first glance I wouldn't think we're obviously gpu memory bound there. The extent of pcie throughput is also kinda scary, there's a (relatively) high amount there compared to the rest of the frame (except for the very start) -- are we waiting on system ram? Are we spending time reading or writing here? Is something terribly wrong, rather than just a heavy gpu workload? All not clear to me. Maybe this is just how raytracing workloads look, or maybe the specific subject we're tracing is very sparse and a large % of hit tests have to wait for a moment on data, or some raytracing specific caches have been invalidated and this is the slow path.

Speculating is fun, but the odds of drawing any correct conclusions from a random isolated frame are fairly low. The easiest way to start to build understanding imo is profiling something running fine, at a good 60fps, and then starting to account for why certain parts of the scene are different -- maybe you run at 60fps but have slight rendering stutters to ~55fps at one part of the scene; try profiling that transition and see if anything stands out. Once you start to have a foothold on what's going on ("these vfx shaders introduce a new bottleneck") then you can start tweaking settings or comparing to cases with similar but different results. (This is much easier to do with your own content where you can straight up add or remove things from the scene, and further easier to do with your own engine/code.)
 
When we say memory is a big bottleneck, is it primarily bandwidth or latency? If the latter, how do we fix that? Is my understanding correct that latency increases with cache size?
 
I for one think looking at and trying to interpret the data is good, I just want to caution that it is harder than it looks. IMO profiling cpu is fairly "easy" (although for an engine like ue the tradeoffs that lead to cpu perf issues might be complex), but gpu profiling, especially comparatively between different hardware, demands quite a lot of detective work and knowledge of hardware and connecting the dots. I suspect Andrew is the only one in this thread qualified enough to make good guesses, but even for an expert, and even for an engine you *don't* have a professional responsibility for, guessing publicly can be an iffy proposition.

I get that but how are guesses based on profiling any more dangerous than guesses based on frame rates? I just find it strange that we’re so sensitive about drawing imperfect conclusions when that’s what the entire gaming review scene is based on.

I find the nsight capture you screenshotted a little confusing at first glance. Like you said, utilization is fairly low, and it's low for quite a long portion of the frame, but memory throughput is quite low too -- at first glance I wouldn't think we're obviously gpu memory bound there. The extent of pcie throughput is also kinda scary, there's a (relatively) high amount there compared to the rest of the frame (except for the very start) -- are we bottlenecked on system ram? Are we spending time reading or writing here? Is something terribly wrong, rather than just a heavy gpu workload? All not clear to me. Maybe this is just how raytracing workloads look, or maybe the specific subject we're tracing is very sparse, or some raytracing specific caches have been invalidated and this is the slow path.

Yeah it’s hard to draw any definitive conclusions as to the root cause of the low utilization. However we can take away the fact that utilization is low. There are also other RT passes in the same frame with much higher utilization accompanied by much higher L2 hit rates. So clearly workloads with better cache locality or more coherent accesses are good for utilization. Not a surprising observation.


Speculating is fun, but the odds of drawing any correct conclusions from a random isolated frame are fairly low.

True but as I mentioned earlier I’m not really getting why we need this high bar of perfect conclusions. The data itself is interesting and we need more of it not less. This fear of arriving at imperfect answers is baffling to me given we’re living in the dark anyway.
 
When we say memory is a big bottleneck, is it primarily bandwidth or latency? If the latter, how do we fix that? Is my understanding correct that latency increases with cache size?

Yes!

Ok, the standard dumb joke aside, you can get throttled by both. There are compute units waiting for a response either because they have to go out to main memory for a piece of data, or because they're churning through data faster than it can be provided.

Raytracing, for example, is currently often latency-bound because only a small part of a BVH might be needed by any given unit, but which RT/CU needs what can feel random, so there's a lot of waiting while that data is found. Meanwhile, if you're doing something with the g-buffer you can end up bandwidth bound: you know exactly what data each CU needs, and it's a heck of a lot.

The different designs of the current L2 caches on Nvidia/AMD offer a good contrast. Nvidia in the 4xxx series has gone with a giant L2 cache that can keep a decent amount of a BVH pretty close to the RT/CUs, so the odds are decent that asking for a given part of a BVH will be quick. Meanwhile AMD with RDNA3 has an enormously high-bandwidth L2 (almost double the bandwidth of Nvidia's), but it's also a much, much smaller one than Nvidia's, meaning that if the data isn't in there you're going to wait longer for it.

If there's a wide array of potential data you need bits and pieces of quickly, then Nvidia's design clearly wins. If you need a smaller array of data but want as much of it as possible, then AMD's design wins.

The cost, however, of Nvidia's design is that it's much bigger. In cache designs, the larger the cache, the higher its latency and the more the cost goes up. This is because caches store a smaller range of data than the full range of memory the GPU has access to; compressing this memory range and keeping track of what is where takes bookkeeping, and that bookkeeping costs memory and time itself. So it's a tradeoff of size versus cost, in terms of both latency and $$$. The larger the cache, the more bookkeeping, but the more likely the data you want is in that cache.
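
If it helps make the latency-vs-bandwidth split concrete, here's a toy CPU-side sketch of my own (an analogy, not anything measured from the captures in this thread): the same amount of data read as a dependent pointer chase, where each load has to wait for the previous one (the BVH-ish, latency-bound case), versus a straight streaming pass the hardware prefetchers can hide (the g-buffer-ish, bandwidth-bound case). The chase typically runs far slower even though it touches the same number of elements.

```c
/* Toy latency-vs-bandwidth demo (CPU-side analogy, illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)  /* ~16M elements */

/* Crude wide-range random helper; good enough for shuffling a demo array. */
static size_t rnd(size_t bound)
{
    return (((size_t)rand() << 16) ^ (size_t)rand()) % bound;
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);   /* pointer-chase permutation */
    float  *data = malloc(N * sizeof *data);   /* streaming buffer */
    if (!next || !data) return 1;

    for (size_t i = 0; i < N; ++i) { next[i] = i; data[i] = 1.0f; }

    /* Sattolo's shuffle: builds a single random cycle, so every load in the
     * chase below depends on the result of the previous one (latency-bound). */
    for (size_t i = N - 1; i > 0; --i) {
        size_t j = rnd(i);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    size_t idx = 0;
    for (size_t i = 0; i < N; ++i) idx = next[idx];   /* serial, cache-hostile */
    clock_t t1 = clock();

    float sum = 0.0f;
    for (size_t i = 0; i < N; ++i) sum += data[i];    /* prefetchable stream */
    clock_t t2 = clock();

    printf("pointer chase: %.2fs   stream: %.2fs   (idx=%zu, sum=%.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, idx, sum);

    free(next);
    free(data);
    return 0;
}
```

A GPU hides some of that chase latency by swapping in other warps, but only if occupancy and the caches give it something else to do, which is where the L2 size/bandwidth tradeoff above comes in.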
 
True but as I mentioned earlier I’m not really getting why we need this high bar of perfect conclusions. The data itself is interesting and we need more of it not less. This fear of arriving at imperfect answers is baffling to me given we’re living in the dark anyway.
Well, with framerate we have a very clear signal on something important: we aren't hitting the target frame budget. With gpu utilization graphs in isolation it's not even obvious that anything is wrong -- there simply might be a lot of sparse work required to get the desired result. Or there might not be anything "more efficient" ready that the gpu could be doing. Etc. My mindset with data about something fairly opaque is that data which offers clear conclusions is best, followed by no conclusion, followed by wrong conclusions.
 
Well, with framerate we have a very clear signal on something important: we aren't hitting the target frame budget.

Yep, we’ve known those fps numbers for decades. It’s not enough to satisfy our curiosity IMO or provide any insight into how our hardware works.

With gpu utilization graphs in isolation it's not even obvious that anything is wrong -- there simply might be a lot of sparse work required to get the desired result. Or there might not be anything "more efficient" ready that the gpu could be doing. Etc. My mindset with data about something fairly opaque is that data which offers clear conclusions is best, followed by no conclusion, followed by wrong conclusions.

That’s funny. I think we have the exact opposite priorities. More data even with imperfect understanding is far better than sitting in the dark. I don’t know that we need to protect engine developers from inaccurate conclusions about their work. The more transparency and discussion the better. That will raise the level of understanding across the board and ultimately enable people to better appreciate the complexity of the work and expertise required to ship a working game.
 
This is a common (mis)conception about having "scientific" conversations in public. Namely, the idea that "more information" somehow makes the conversation more legitimate, or the outcome more understandable, or the discourse more civil.

This is the same misconception that has led to vaccine denial, flat earth, the shittiness of the current US political system, and a number of other "truther" movements, where the root problem was never actually a lack of information but a lack of interest in having a fair and literally, actually balanced conversation, or even support for nuance. Do vaccines end up with some of the same side effects as the thing they're trying to prevent? Yes, but the misunderstanding there is... DID YOU JUST HEAR THAT BULLSHIT? VACCINES CAUSE THE SAME PROBLEMS AND DONT HELP AT ALL AND THIS SCIENTIST LITERALLY JUST AGREED. No, that's not it at all, vaccine side effects are usually incredibly mild HE JUST SAID THERE ARE SIDE EFFECTS DO YOU WANT THIS DANGEROUS SHIT IN YOUR VEINS YOU SHEEEPLE??!!?!?

No, more information is not the solution to an uneducated mass who only wants to place blame and hoist pitchforks. A knowledge gap isn't the problem; it's a problem with people who want a short, simple answer and refuse to believe there isn't one, even in the face of incredible complexity.

After having seen what "more information" does to society, namely the amazing and frankly infuriating tidal wave of post-truth pseudo-science bullshit, I refuse to believe "more information" is the solve at all. Either people need to learn to actually think critically, or else they need to be ignored with their full-farce false-intellectualism on their favorite social media platform.

Those who actually want to understand the nuance and the depth of the situation will be smart enough to dig. Blasting partial truths and "this is hard to understand, please wait a minute while we help you distill" results in "HOLY SHIT DID YOU SEE HOW FUCKED THIS PLATFORM IS WHAT GARBAGE I'M GOING TO REPOST THIS A THOUSAND TIMES BECAUSE OF HOW BROKEN AND UNACCEPTABLE THIS THING IS THAT I CAN'T UNDERSTAND AND INSTEAD DEMAND I GET THE CLICKS I DESERVE TO OBTAIN MY INFLUENCER STATUS."

No. I'm really not interested in someone trying to tell me that actually hard-to-understand information (and yes, this is HARD to understand) will somehow solve any problems for the purposefully and willfully ignorant. And in this world of social media bullshit, anyone with a smattering of intelligence understands precisely how what-could-be-useful information gets distorted and twisted to serve some influencer's grinding wheel.
 
Was it really inaccurate? It sounds to me like occupancy was probably the key difference, and NVIDIA just managed to improve their compiler to either improve occupancy (by reducing register pressure etc.) or improve instruction/memory level parallelism.

NVIDIA does have fewer registers per FMA, so they are naturally more dependent on the compiler doing a good job for this kind of workload.

It’d be great if they had the time to revisit it and analyse why it changed, whether it is what I’m suggesting or whether their initial analysis was just wrong, but given all the other great articles they’ve published in the meantime I can’t really blame them too much for not prioritising this.

Given the context of the situation when that article was written and how it was received, it's more so the type of conclusion they drew from the data that's the problem. Remember also that with Starfield there was a lot of controversy around vendor optimization due to cross-marketing partnerships.

Specifically here -

However, there’s really nothing wrong with Nvidia’s performance in this game, as some comments around the internet might suggest.

You have to remember that at the time the debate was in part about whether the relative performance was more a result of the relative optimization focus on each vendor or of the inherent strengths/weaknesses of each vendor's hardware and software stack. With how the article chose to phrase its conclusion, it was latched onto in discussion circles as meaning that Starfield's performance just inherently played to one vendor's strengths, as opposed to being a matter of optimization priority.

Regardless of the motivation here, sometimes it can just be the phrasing of one line that greatly affects how something is presented and what people draw from it.
 
I get that but how are guesses based on profiling any more dangerous than guesses based on frame rates? I just find it strange that we’re so sensitive about drawing imperfect conclusions when that’s what the entire gaming review scene is based on.
Because people will believe that with more numbers the interpretation is more accurate, whereas with more data there are more possible reasons that'll just go ignored, and people will end up misinformed. Urban myths are remarkably pervasive!

With framerate, you measure the outcome. You can swap CPU and GPU and see what affects that and draw direct, meaningful conclusions that this CPU performs worse or that GPU performs better. There's no particular understanding of why and no real means to determine why, but you are comparing like for like on the same software to derive meaningful conclusions.

When you get into profiles, you are moving from "What is the fastest hardware?" to a completely different question, "Why is it the fastest hardware?", and that why is incredibly complicated. Whys always are. There is always so much more nuance and so many more variables; pretty much every time someone comes up with a data-based explanation, they are missing half (more like 95%) of the data. Five years later, with more data, thinking changes. And five years after that, it changes again.

So when measuring the framerate and IQ, you are measuring the outcome and have an exact thing to compare it against - higher framerate is better. When measuring the GPU, the GPU isn't really meant to be 100% full, and the goal isn't really to fill it 100%. The goal is to get pretties on screen while working around many complications and caveats. Less at this point might result in more at that point, whereas a naive reading might think "this profile is down there where it should be up." You'd also need to profile the entire system, including the CPU, to see what that is doing, and RAM, and again, the system won't be using all of those resources all of the time. Man, can you imagine the power draw if it did?!?!

I don’t know that we need to protect engine developers from inaccurate conclusions about their work.
Why not? Social media is quick to jump on any information and 'act' on it without waiting for validation. One or two badly interpreted gaming profiles could lead to a tidal wave of echo-chamber idea reinforcement, tanking sales, online abuse and maybe even death threats, so bonkers is the internet society. It's not just a case of informed folk being wrong and getting around to correcting themselves after a round of polite discussion.

Andrew is also in a tricky spot here because he very much wants these types of discussions, but he's also an outward voice at Epic, and if shit starts hitting fans he'll have to pull out of conversations, as sadly so many of our insiders have. So between the ideal of free information and the reality of accepting that some places are more trouble than they are worth, we need to find something that works for everyone. Or we don't care about some views and those people can like it or lump it. ;)
The more transparency and discussion the better. That will raise the level of understanding across the board and ultimately enable people to better appreciate the complexity of the work and expertise required to ship a working game.
Only if the people involved want that level of discussion and learning. Very often, as Andrew Lauritzen alludes, it's instead weaponised information in support of a brand or belief, and those views very rarely get changed by civil reasoning. It's beyond exceptional for someone to have an interpretation, express their viewpoint, and then turn it around via discussion and accept they are wrong when the data points that way, instead of doubling down on their views and shifting the arguments into ad hominems about bias, fanboyism, etc. We're lucky when we get a discussion to defuse down to just 'agree to disagree'!!

We would LOVE B3D to be able to have engineering-level discussions on game tech. That's our raison d'être! It's worth noting a lot of the Old Guard here came in as rookies with wonky, prejudiced ideas and did learn and did become balanced in their understanding, some even going on to jobs in the industry! It would be great if we could bring in deeper discussion on game rendering and the experts felt comfortable sharing what's going on and why. Maybe one day we'll get there? Maybe we could try and work on a trajectory to facilitate that?

The concern among those in the industry is that's not happening and so it won't happen in future. Instead, colourful block diagrams fuelling jumped-to conclusions will poison the general understanding, rather than be the basis of a healthy education, and just be another weapon in the fanboy wars.

Note no-one has said 'don't', and all that's being asked is caution. Profiles are fine. I guess trying to interpret them, and sharing those interpretations as fact, less so. If one knows one's limits, and talks with curiosity and an openness to learn rather than to share a belief and try to convince everyone else of it, there won't be a problem. Maybe.
 
Note no-one has said 'don't', and all that's being asked is caution. Profiles are fine. I guess trying to interpret them, and sharing those interpretations as fact, less so. If one knows one's limits, and talks with curiosity and an openness to learn rather than to share a belief and try to convince everyone else of it, there won't be a problem. Maybe.

Makes sense to me.

I don’t share the same concerns about weaponizing B3D discussions or death threats but I’m sure you guys have your reasons. It’s pretty unlikely that we can fix human behavior on the internet by censoring ourselves at B3D.
 
We can, however, not feed the misinformation troll-farms by censoring ourselves at B3D.

That is the difference.
 
If that discussion doesn't happen here, it'll happen somewhere else. I'd rather we establish a workable methodology for such discussions as a counter to the wider pointless noise of the Internet, so that people finding the talk here find reality.
 