Understanding game performance through profiler trace graphs *spawn

Andrew Lauritzen

Before we get *too* into the weeds - it's great to see folks getting some actual data but for people not familiar with low level GPU optimization, please try and avoid your initial reactions to things. It might seem crazy to see something like "this pass only uses 20% of the theoretical ALU throughput" but that's actually completely normal, and always has been for GPUs. Each pass will tend to hit different parts of the GPU harder, and there will always be a bottleneck somewhere.

Moreover bottlenecks are not as simple as "memory" vs "ALU". Even in cases where things are stuck waiting on memory, it's usually not as simple as adding more memory bandwidth. As GPU rendering becomes more complex there are more and more cases where performance is a complex function of cache hierarchies, latency hiding mechanisms, register file sizes and banking and so on. There's a reason why huge parts of these chips are devoted to register files and increasingly caches rather than just laying down more raw compute to pump those marketing numbers :)
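As a back-of-the-envelope illustration of why a number like "20% of theoretical ALU throughput" can be completely fine, here's a rough roofline-style calculation. Every figure in it is invented for the example - it's not from any real GPU or trace:

```python
# Illustrative only: a rough roofline check for a single pass.
# All numbers are made up for the example, not taken from a real profile.

peak_flops = 30e12          # assumed peak FP32 throughput, FLOP/s
peak_bandwidth = 700e9      # assumed peak DRAM bandwidth, bytes/s

# Suppose the pass does ~8 FLOPs of math per byte it actually pulls from DRAM.
arithmetic_intensity = 8.0  # FLOP/byte (assumption)

# If the pass is bandwidth-bound, the most math it can possibly sustain is:
achievable_flops = min(peak_flops, peak_bandwidth * arithmetic_intensity)

alu_utilization = achievable_flops / peak_flops
print(f"Best-case ALU utilization for this pass: {alu_utilization:.0%}")
# -> ~19%: even a "perfect" implementation of this pass cannot show high ALU
#    throughput, because the bottleneck is elsewhere (here, DRAM bandwidth).
```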

So yes, it's all well and good to look at some profiles but unless you are experienced in looking at these things frequently please avoid generalizing what you're seeing to any statements about what constitutes "normal" or "efficient" use of a GPU for a given task.

Regarding that: How much does rendering VSMs consume, and how much does tracing the shadowmap consume? A cheaper soft shadow map solution like the current CoD one might save a ms or two on lower end hw. But for rendering VSMs themselves, all I can think of is that "constant time vsm" trick, where VSM gets a constant time budget and only renders the highest mips it can within that budget. Sure the next frame could see a pop to higher res in some tiles, but it seemed to work well enough from what I saw of it.
You're going to hate this but... "it depends" a lot. There are a lot of factors that affect VSM performance, from how much cache invalidation there is to how much non-Nanite there is, and so on. There are quite a lot of console variables and tools to adjust how they function for a given game, some amount of which does help you target specific budgets. The "constant time" thing is not really feasible to do in a strict sense since 1) it's not possible to perfectly predict how long something will take to render beforehand, especially for non-Nanite geometry, and 2) the pops you note can be pretty significant if it is not handled carefully. There are hysteresis tools though that can adjust resolution smoothly as you approach the page pool allocation size and similar. More importantly though, as VSMs try and match their resolution to the sampling resolution of the screen, they actually scale better with the primary dynamic res adjustments, unlike conventional shadow maps. Again, non-Nanite geometry is a bit of a wildcard as it's impossible to control the cost of it on the backend, but that's yet more reason to make everything Nanite.
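For what it's worth, here is a minimal sketch of that hysteresis idea in isolation: bias shadow resolution down smoothly as page pool usage approaches capacity instead of letting it pop. The function name, thresholds and page counts are all invented for illustration - this is not how the engine actually implements it:

```python
# Minimal sketch (not Unreal's implementation): smoothly bias VSM resolution
# down as page-pool usage approaches capacity, rather than popping.

def update_resolution_bias(prev_bias, pages_used, pages_capacity,
                           start_pressure=0.75, full_pressure=0.95,
                           max_bias=2.0, relax_rate=0.05):
    pressure = pages_used / pages_capacity
    if pressure > start_pressure:
        # Map pressure in [start, full] to a target mip bias in [0, max_bias].
        t = min(1.0, (pressure - start_pressure) / (full_pressure - start_pressure))
        # Increase the bias (drop resolution) immediately when under pressure.
        return max(prev_bias, t * max_bias)
    # Relax slowly back toward full resolution to avoid visible pops.
    return max(0.0, prev_bias - relax_rate)

bias = 0.0
for used in (6000, 7800, 9200, 9000, 7000, 5000):  # pages used per frame (made up)
    bias = update_resolution_bias(bias, used, pages_capacity=10000)
    print(f"pages={used:5d}  resolution mip bias={bias:.2f}")
```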
 
Before we get *too* into the weeds - it's great to see folks getting some actual data but for people not familiar with low level GPU optimization, please try and avoid your initial reactions to things. It might seem crazy to see something like "this pass only uses 20% of the theoretical ALU throughput" but that's actually completely normal, and always has been for GPUs.
Big highlight here. When you are reading such graphs, what are you comparing them to in order to know if something is abnormal or not? If you aren't comparing against lots of other games, you are only comparing against expectations, which might not (and almost certainly won't!) match what actually happens on hardware.

To properly evaluate this or any other title, you need a whole bunch of performance graphs from different games on the same rig. Best case scenario, you get enough similar titles with fairly isolated variables to be able to see how one game compares to the baseline for similar workloads.

Unless you set out to make this your job and create a website full of profiles, you probably won't get enough to make reasonable comparisons. It's the sort of thing only experienced engine or hardware developers profiling lots of games will know. If you know any such folk, it's worth listening to what they have to say on the subject. ;)

Edit: One alternative, more meaningful use of such profiles is running the same game/benchmark on different hardware to see how it behaves differently.
 
Before we get *too* into the weeds - it's great to see folks getting some actual data but for people not familiar with low level GPU optimization, please try and avoid your initial reactions to things. It might seem crazy to see something like "this pass only uses 20% of the theoretical ALU throughput" but that's actually completely normal, and always has been for GPUs. Each pass will tend to hit different parts of the GPU harder, and there will always be a bottleneck somewhere.

Moreover bottlenecks are not as simple as "memory" vs "ALU". Even in cases where things are stuck waiting on memory, it's usually not as simple as adding more memory bandwidth. As GPU rendering becomes more complex there are more and more cases where performance is a complex function of cache hierarchies, latency hiding mechanisms, register file sizes and banking and so on. There's a reason why huge parts of these chips are devoted to register files and increasingly caches rather than just laying down more raw compute to pump those marketing numbers :)

So yes, it's all well and good to look at some profiles but unless you are experienced in looking at these things frequently please avoid generalizing what you're seeing to any statements about what constitutes "normal" or "efficient" use of a GPU for a given task.

How would you characterize those traces? Nvidia has some high level docs on their methodology for evaluating trace results. It basically boils down to this - if every major pipeline in the GPU is simultaneously underutilized then the hardware is not being used efficiently. The profiler also spits out clear hotspots that are holding things up - long scoreboard waits on mem loads etc. You can argue that any piece of software is doing the best it can but it doesn’t change those basic observations from the hardware’s point of view.
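For illustration, that reasoning boils down to something like the sketch below. The unit names, percentages and the 60% threshold are all made up, and this is not NVIDIA's actual methodology, just the shape of the argument:

```python
# Rough sketch of the reasoning above: if one unit is near its peak throughput
# the workload is "limited by" that unit; if *every* unit is far from peak,
# look for latency/occupancy problems (e.g. long scoreboard stalls) instead.

def characterize(throughput_pct, threshold=60.0):
    top_unit, top_pct = max(throughput_pct.items(), key=lambda kv: kv[1])
    if top_pct >= threshold:
        return f"{top_unit}-limited ({top_pct:.0f}% of peak)"
    return (f"no unit near peak (top is {top_unit} at {top_pct:.0f}%) -> "
            "likely latency-bound; check stall reasons such as long scoreboard waits")

print(characterize({"SM (ALU)": 22.0, "L2": 35.0, "DRAM": 81.0, "RT": 10.0}))
print(characterize({"SM (ALU)": 18.0, "L2": 30.0, "DRAM": 25.0, "RT": 12.0}))
```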
 
Big highlight here. When you are reading such graphs, what are you comparing them to in order to know if something is abnormal or not? If you aren't comparing against lots of other games, you are only comparing against expectations, which might not (and almost certainly won't!) match what actually happens on hardware.

I’ve profiled quite a few games (not hundreds) and definitely don’t do it for a living but one thing is clear. Normal flops utilization across games and engines is generally “low”. Shipping games don’t usually have performance markers enabled so the profiling isn’t as detailed as it could be but it gives a pretty good high level picture.

I don’t think it matters whether it’s normal or abnormal or meets some arbitrary expectation. It’s just the facts. Would love to see any data that points to a different conclusion.
 
Where in the benchmark is that trace taken? Here it's much better on a 4080 Super, at the beginning of the benchmark.
Can you provide a comparison between Pathtracing and Lumen in 720p? Performance is equal between both, so there has to be a bottleneck with Lumen on Lovelace. Thx.
 
Can you provide a comparison between Pathtracing and Lumen in 720p? Performance is equal between both, so there has to be a bottleneck with Lumen on Lovelace. Thx.

The medium setting for “full raytracing” enables RT shadows and reflections. I’m not sure what the very high setting adds on top of that. The cost for medium RT on my 3090 is reasonable (25%) compared to other heavy RT titles. I interpret that to mean the non-RT shadow and reflections system is also pretty heavy.

Lumen doesn’t handle shadows though so not sure you can get a direct comparison.
 
I don’t think it matters whether it’s normal or abnormal or meets some arbitrary expectation. It’s just the facts. Would love to see any data that points to a different conclusion.
Yeah, I like that. I like seeing the profiles, just so long as people aren't trying to read too much into them in isolation. Some of the best profile images are from dev talks where they show their optimisation and before/after graphs!
 
Can you provide a comparison between Pathtracing and Lumen in 720p? Performance is equal between both, so there has to be a bottleneck with Lumen on Lovelace. Thx.
I guess it is just that lighting takes a smaller percentage of frame time at lower resolution, so the difference between ray tracing and Lumen is less important to overall frame time. Lumen is a little faster at 1440p/DLSS Performance.

RayTracing:
rayTracing.png
Lumen:
lumen.png
 
Before we get *too* into the weeds - it's great to see folks getting some actual data but for people not familiar with low level GPU optimization, please try and avoid your initial reactions to things. It might seem crazy to see something like "this pass only uses 20% of the theoretical ALU throughput" but that's actually completely normal, and always has been for GPUs. Each pass will tend to hit different parts of the GPU harder, and there will always be a bottleneck somewhere.

Moreover bottlenecks are not as simple as "memory" vs "ALU". Even in cases where things are stuck waiting on memory, it's usually not as simple as adding more memory bandwidth. As GPU rendering becomes more complex there are more and more cases where performance is a complex function of cache hierarchies, latency hiding mechanisms, register file sizes and banking and so on. There's a reason why huge parts of these chips are devoted to register files and increasingly caches rather than just laying down more raw compute to pump those marketing numbers :)

So yes, it's all well and good to look at some profiles but unless you are experienced in looking at these things frequently please avoid generalizing what you're seeing to any statements about what constitutes "normal" or "efficient" use of a GPU for a given task.


You're going to hate this but... "it depends" a lot. There are a lot of factors that affect VSM performance, from how much cache invalidation there is to how much non-Nanite there is, and so on. There are quite a lot of console variables and tools to adjust how they function for a given game, some amount of which does help you target specific budgets. The "constant time" thing is not really feasible to do in a strict sense since 1) it's not possible to perfectly predict how long something will take to render beforehand, especially for non-Nanite geometry, and 2) the pops you note can be pretty significant if it is not handled carefully. There are hysteresis tools though that can adjust resolution smoothly as you approach the page pool allocation size and similar. More importantly though, as VSMs try and match their resolution to the sampling resolution of the screen, they actually scale better with the primary dynamic res adjustments, unlike conventional shadow maps. Again, non-Nanite geometry is a bit of a wildcard as it's impossible to control the cost of it on the backend, but that's yet more reason to make everything Nanite.
What is the most common bottleneck(s) for GPUs? Does it differ much between AMD and Nvidia?
 
What is the most common bottleneck(s) for GPUs? Does it differ much between AMD and Nvidia?

Entirely workload dependent! You're going to get limited by FurMark entirely differently than by Alan Wake II on max settings.

But as stated, it tends to be hard to fill idle time on compute and RT units etc. in modern games today; the same holds true for the CPU in a lot of workloads as well. So there's always a desire for more cache, lower latency cache, higher bandwidth cache, all of which can move data closer and get those compute units active again. The reason for this is that SRAM, which makes up almost all cache today, hasn't scaled well with new silicon nodes for years now. So adding compute tends to be cheaper than adding the SRAM cache needed to feed it, which is why you keep running into cache issues of one kind or another. Finding a replacement for SRAM that's faster/smaller/cheaper/etc. has become a mini holy grail for silicon manufacturing, but so far no one has convincingly demonstrated one, though there are several recent contenders claiming to be viable.
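A toy latency-hiding calculation makes the "hard to fill idle time" point concrete; all the cycle counts and warp counts below are invented for illustration:

```python
# Toy latency-hiding math (numbers invented). A SIMT core hides memory latency
# by switching between resident warps; if it can't keep enough independent
# work resident, the ALUs sit idle waiting on memory.

mem_latency_cycles = 600      # assumed average miss latency to DRAM
compute_cycles_per_warp = 40  # assumed math issued between dependent loads

warps_needed = mem_latency_cycles / compute_cycles_per_warp
print(f"warps of independent work needed to hide the latency: {warps_needed:.0f}")

# If register pressure only allows, say, 8 resident warps, ALU busy time for
# this hypothetical shader is capped well below peak:
resident_warps = 8
print(f"upper bound on ALU busy fraction: {resident_warps / warps_needed:.0%}")
# A cache hit that cuts the latency to 150 cycles changes the picture entirely,
# which is one reason die area keeps going to caches and register files.
```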
 
New season of Fortnite is UE 5.5. Can't remember if the last one was. Will probably be impossible to dissect anything that's new under the hood.
 

Ridiculous advantage for Ada with RT enabled. The 4070 is 5% faster than the 3080 without RT and 50% faster with it enabled.

The 4080 and 7900 xtx comparison is even sillier. 8% advantage with RT off turns into a 218% advantage with it enabled. Crazy.
 
How would you characterize those traces?
In isolation? You can't. It has to be stressed that hardware utilization and optimization is all in the service of producing a given set of visuals as quickly as possible. It's uninteresting to say that an opaque workload is using the GPU "efficiently" in the same way as busy waiting a CPU at 100% is or is not an "efficient" use. It is only in the comparison and understanding of alternative ways to accomplish the goals that these metrics become interesting.
You can argue that any piece of software is doing the best it can but it doesn’t change those basic observations from the hardware’s point of view.
Sure, but the basic observations are meaningless without that context. I won't infer your personal motivations, but the issue is that in the past when consumers get involved in analysis at this level they are rarely trying to actually learn about rendering algorithms. Usually this is just a pretext to posting about how some game is "unoptimized" and presenting the data to folks who see a graph and think it must mean whatever the poster says it does; grab the pitchforks! I really want to avoid sounding like a gatekeeper - especially at Beyond3D - but to properly interpret things like low level GPU profilers you probably need to have some level of experience as a rendering engineer or equivalent, and even then it's complicated.

I hope B3D still remains primarily a home for people interested in learning and discussing and not just grabbing gotcha data to impress more ignorant folks on the internet, but I've seen the latter happen several times recently so I admit I'm on guard.
Ridiculous advantage for Ada with RT enabled. The 4070 is 5% faster than the 3080 without RT and 50% faster with it enabled.
To your original point, this is a pretty relevant example of why apparent low hardware utilization is not necessarily an indicator of specific blame. Is what the game is doing "unoptimized", or do the 3xxx series cards have some architectural limitations around efficiently executing it? Dynamic branching used to be really slow on NVIDIA cards when it was new... were games "unoptimized" for wanting to do branches in shaders, or was the hardware just not very good at it yet? Hell, raytracing itself has always been widely considered to be something that is "inefficient" on GPUs, and for good reason. The question is not whether a GPU would run more "efficiently" if you are just drawing texture mapped polygons, but rather what is the most efficient way to get a specific visual result. Sometimes you need to branch. Sometimes you need to do a series of nasty dependent memory lookups in low occupancy shaders. Sometimes you need to shoot incoherent rays.
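To make the branching example concrete, here's a toy model of SIMD divergence. It isn't modelling any specific GPU; the wave size and cycle counts are purely illustrative:

```python
# Toy model of SIMD branch divergence (illustrative only). A wave executes
# both sides of a branch whenever its lanes disagree, so the cost is roughly
# the sum of both paths rather than a weighted average.

WAVE_SIZE = 32

def wave_cost(lanes_taking_branch, cost_if_taken, cost_if_not_taken):
    cost = 0
    if lanes_taking_branch > 0:          # at least one lane takes the branch
        cost += cost_if_taken
    if lanes_taking_branch < WAVE_SIZE:  # at least one lane skips it
        cost += cost_if_not_taken
    return cost

# Coherent wave: every lane takes the cheap path.
print(wave_cost(0, cost_if_taken=200, cost_if_not_taken=20))   # 20 cycles
# Divergent wave: a single lane takes the expensive path and the whole wave
# pays for both sides.
print(wave_cost(1, cost_if_taken=200, cost_if_not_taken=20))   # 220 cycles
```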

(Aside: that's likely the difference in bottlenecks of the different RT passes you are seeing in the traces; some will be shooting very coherent shadow or reflections rays that hit caches nicely as they traverse the acceleration structure. Others will be shooting much less coherent GI/AO/glossy reflections/very soft shadow rays that are just not very efficient to trace. Such is life - you need both.)

Just remember that neither of these two worlds (software, GPUs) work in isolation. We are constantly evolving *both* to work more efficiently on the stuff that we need to do to advance rendering quality and performance over time.

New season of Fortnite is UE 5.5. Can't remember if the last one was. Will probably be impossible to dissect anything that's new under the hood.
My usual PSA... in this context "5.5" just means "some changelist post 5.4". It obviously can't include everything that's going to be in 5.5 because 5.5 isn't done yet.

I guess it is just that lighting takes a smaller percentage of frame time at lower resolution, so the difference between ray tracing and Lumen is less important to overall frame time. Lumen is a little faster at 1440p/DLSS Performance.
Raytracing itself tends to scale strongly with resolution of course. At very low resolutions fixed overheads dominate, that's nothing new or interesting.
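With made-up numbers the effect looks something like this - the fixed cost and per-megapixel costs below are purely illustrative, not measurements:

```python
# Illustrative only: if part of the frame doesn't scale with resolution, the
# gap between two lighting paths shrinks as the internal resolution drops.

def frame_ms(fixed_ms, per_mpix_ms, megapixels):
    return fixed_ms + per_mpix_ms * megapixels

for res_name, mpix in (("4K", 8.3), ("1440p", 3.7), ("720p", 0.9)):
    lumen = frame_ms(fixed_ms=6.0, per_mpix_ms=2.0, megapixels=mpix)
    rt    = frame_ms(fixed_ms=6.0, per_mpix_ms=2.6, megapixels=mpix)
    print(f"{res_name:>5}: Lumen {lumen:5.1f} ms  RT {rt:5.1f} ms  "
          f"gap {100 * (rt / lumen - 1):4.1f}%")
```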
 
In isolation? You can't. It has to be stressed that hardware utilization and optimization is all in the service of producing a given set of visuals as quickly as possible. It's uninteresting to say that an opaque workload is using the GPU "efficiently" in the same way as busy waiting a CPU at 100% is or is not an "efficient" use. It is only in the comparison and understanding of alternative ways to accomplish the goals that these metrics become interesting.

I just realized that the flops convo I replied to started with someone claiming "UE5 is unoptimized". I agree that discussing software optimization in isolation is pointless unless you can find other software that does the same thing more efficiently. My intent with sharing that trace was to support the idea that flops comparisons aren't that useful since flops typically aren't the primary bottleneck in most GPU workloads within a frame. However hardware utilization is interesting in isolation. If the most efficient software at a given task only uses 25% of hardware capacity that's still worth discussing.

Sure, but the basic observations are meaningless without that context. I won't infer your personal motivations, but the issue is that in the past when consumers get involved in analysis at this level they are rarely trying to actually learn about rendering algorithms. Usually this is just a pretext to posting about how some game is "unoptimized" and presenting the data to folks who see a graph and think it must mean whatever the poster says it does; grab the pitchforks! I really want to avoid sounding like a gatekeeper - especially at Beyond3D - but to properly interpret things like low level GPU profilers you probably need to have some level of experience as a rendering engineer or equivalent, and even then it's complicated.

I hope B3D still remains primarily a home for people interested in learning and discussing and not just grabbing gotcha data to impress more ignorant folks on the internet, but I've seen the latter happen several times recently so I admit I'm on guard.

I wasn't making any observations with respect to UE5 but I understand your sensitivity given the context. No, we should not discourage people from sharing data from their own hardware; that would be silly to say the least. I don't quite agree with you that you need to be a rendering engineer to have any understanding of this stuff. There are tons of smart technical people in the world who are capable of a basic understanding of how GPUs work even if they don't work with GPUs for a living. If you have concerns with average Joes misinterpreting data, that's what this community should be about - more knowledgeable folks sharing their experience and insight to shine light on more opaque topics.

There were no pitchforks and no-one mentioned UE5 efficiency in response to those traces so I do think you're overreacting a bit.
 
There were no pitchforks and no-one mentioned UE5 efficiency in response to those traces so I do think you're overreacting a bit.
Andrew's response was as much pre-emptive based on historical misuse of information - the reaction isn't an overreaction to these few posts but an experienced reaction to the long-term way The Internet operates. It's more a caution than a criticism.
 
No, we should not discourage people from sharing data from their own hardware; that would be silly to say the least. I don't quite agree with you that you need to be a rendering engineer to have any understanding of this stuff. There are tons of smart technical people in the world who are capable of a basic understanding of how GPUs work even if they don't work with GPUs for a living.
At the level I'm talking about, it's less "general smartness" and more just lacking the experience to understand what they are looking at. It's one thing to have a vague understanding of what sorts of bottlenecks can occur in GPU architectures, and it's quite another to iterate on code that moves them around and get a feel for the fundamental axes of flexibility and the various performance cliffs in a given architecture.

But to be clear, my comment was made in the context of people trying to draw general conclusions from looking at such data. i.e. this algorithm is doing this well or poorly, etc. I have no issues with anyone looking at data, it's just important to have the self-awareness to know when you don't have the experience to generalize what you are seeing.

If you have concerns with average Joes misinterpreting data, that's what this community should be about - more knowledgeable folks sharing their experience and insight to shine light on more opaque topics.

Amen and that's what I mean by I hope that B3D is the atmosphere to foster that. My cautionary notes are because we've seen what I described several times in this very thread, where someone takes some of the data or discussion here and just posts it elsewhere out of context to a bunch of enthusiasts who write articles/posts that draw ridiculously misguided conclusions. I realize that no matter how much I tell people to not do that, some portion of the people reading here are doing it precisely for that reason (and hopefully have enough self-awareness to at least know that), but I think it's still better to give the benefit of the doubt even though I will not be surprised where this ends up.

There were no pitchforks and no-one mentioned UE5 efficiency in response to those traces so I do think you're overreacting a bit.
I mean it was mentioned several times with respect to this exact benchmark already. As Shifty said though, I hope it was clear from the start of my post that it was meant primarily as "this is great, but let's all put on our learning/humble hats and avoid these pitfalls if we're going this route please".

But hey, I certainly hope you're right and my concern is unwarranted this time :)
 
but to properly interpret things like low level GPU profilers you probably need to have some level of experience as a rendering engineer or equivalent, and even then it's complicated.
Just to throw a non-expert opinion into the mix, as somebody who's profiled extensively to improve performance at work and also written a simple hobby renderer, the point Andrew Lauritzen makes here about how challenging it is to interpret a GPU capture, even with source code, a simple renderer, and some partial expert knowledge, shouldn't go unnoticed. These Wukong traces are certainly interesting to share, and maybe somebody with a lot of experience could make some guesses and inferences here, but it's really hard to draw final conclusions from noting low or high utilization here or there in the trace. At least for me, after tracing my small and (supposedly) well-understood renderer on different GPUs in my household, I came away with less understanding of what's going on under the hood on those GPUs than I started with. :)
 
Just to throw a non-expert opinion into the mix, as somebody who's profiled extensively to improve performance at work and also written a simple hobby renderer, the point Andrew Lauritzen makes here about how challenging it is to interpret a GPU capture, even with source code, a simple renderer, and some partial expert knowledge, shouldn't go unnoticed. These Wukong traces are certainly interesting to share, and maybe somebody with a lot of experience could make some guesses and inferences here, but it's really hard to draw final conclusions from noting low or high utilization here or there in the trace. At least for me, after tracing my small and (supposedly) well-understood renderer on different GPUs in my household, I came away with less understanding of what's going on under the hood on those GPUs than I started with. :)

That's fair, though I am a bit puzzled why we're so skittish about interpreting data in an industry where nearly all conclusions are based on a few high level metrics (frame rate and frame times if we're lucky).

What would you guys suggest as an appropriate methodology for evaluating game performance and efficiency?
 