Digital Foundry Article Technical Discussion [2022]

New DF Direct is out:


00:00:00 Introductions
00:00:55 News 01: Kingdom Hearts 4 revealed
00:07:50 DF Supporter Q: What games do you hope are real from the Nvidia leak?
00:09:33 News 02: Next-gen Witcher 3 delayed and reorganized
00:16:24 News 03: Rumors of Criterion NFS in November
00:19:40 DF Supporter Q: Are there any new ideas or techniques you'd like to see games try with the greater bandwidth?
00:25:24 News 04: LG C2 TVs have issues
00:34:59 DF Content Discussion: Unreal Engine 5
00:53:00 DF Content Discussion: Motorstorm and PS3 emulation
01:04:08 DF Supporter Q1: What is your opinion on the Deck's high-pitched fan whine?
01:05:40 DF Supporter Q2: Would the Steam Deck (or other platforms) benefit from games having a native resolution UI with the 3D graphics at a lower resolution scaled up?
01:05:56 DF Supporter Q3: What are your thoughts on the upcoming upsampling technologies, FSR 2.0 or Unreal Engine's TAA solution, with regard to the Steam Deck's resolution?
01:08:36 DF Supporter Q4: Why didn't UE5 release with DirectStorage support to begin with?
01:11:08 DF Supporter Q5: How do you guys think the increased focus on CPU power in current-gen consoles will affect game design trends going forward?
01:19:36 DF Supporter Q6: Do you think we’ll ever see AI-driven FPS optimization built into game engines, perhaps to push a 30fps title to an AI-interpolated 60fps without a noticeable sacrifice to quality and latency?
01:24:05 DF Supporter Q7: What exactly does the PS5 have in hardware to support ray tracing?
01:25:33 DF Supporter Q8: Which games would you consider technological highlights on the "OG-Xbox"?
 
The issue cannot be compute, there's no such thing as programming things for wider compute arrays

I don't understand this -- I'm no graphics programmer (hardware utilization is a big part of why I'm not), so I could be mistaken here, but every technical talk I see on graphics programming is about choosing how much unique work to dispatch and checking concurrency, timing, etc. in the debugger, based on the target hardware. See for example: https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-usage-large-thread-groups/

Nothing is ever going to be 100% saturated, but dispatching work that's shaped closer to the hardware gets you less downtime waiting on other waves/threadgroups?
 
I'm also not a graphics programmer; my experience is only with compute. My understanding is that it depends on how the work is submitted, whether through the compute pipeline or the 3D pipeline, but typically each CU will run the same program from beginning to completion. Having more CUs means more of that program can run in parallel: if you have work that can be divided across 50 CUs and you have a 1-CU GPU, that CU has to run 50 times to get through all of that data; if you have 50 CUs, you only need one run per CU.

So when it comes to larger workloads, having more CUs is absolutely beneficial, and we see this behaviour in most GPUs: the larger the GPU, the easier a time it has handling 4K and so on, at least if you are compute bound.
However, not all workloads will fit a large number of CUs perfectly, particularly if the remaining work is small. Say the work left to dispatch only has enough for 36 CUs but you have 52 CUs: it can only be dispatched to 36 of the 52, so in essence the remaining 16 are wasted. That is CU scaling, which is different from saturation.

Saturation is about threads, waves and the number of registers. The longer your program is, the fewer registers are available for each thread, so when your shader hits a stall there won't be many threads left to switch to, because each thread used up so many registers. On the other hand, having a lot of smaller programs that finish faster lets a lot of threads run concurrently, and the hardware can switch more readily if required. That gets us into a discussion of thread/wave dispatching, which is a bit different from CU scaling.
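To make that concrete, here's a rough CUDA-flavoured sketch of that 36-of-52 situation, treating an SM as loosely analogous to a CU; the sizes (36 blocks of 256 threads) are made up purely for illustration:

// Sketch only: a launch with fewer blocks than the GPU has SMs leaves the
// extra SMs with nothing to run for that launch ("CU scaling" waste).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallJob(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f;   // trivial per-element work
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int n = 36 * 256;                 // only enough work for 36 blocks of 256 threads
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    int blocks = (n + 255) / 256;           // = 36
    smallJob<<<blocks, 256>>>(d, n);
    cudaDeviceSynchronize();

    printf("SMs: %d, blocks launched: %d, idle SMs (at best): %d\n",
           prop.multiProcessorCount, blocks,
           prop.multiProcessorCount > blocks ? prop.multiProcessorCount - blocks : 0);

    cudaFree(d);
    return 0;
}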

The only thing I don't know or remember is whether, on the 3D pipeline, the scheduler can put different jobs on any empty, unused CUs. I don't believe so: I believe the CUs all must run the same program, and that's a characteristic of the compute pipeline too. Async compute jobs can only be inserted once workloads have moved off the CUs to the ROPs or other hardware.

I am not too familiar with GCN terminology, but since having to dabble in CUDA from time to time, this is what I also assume GCN does:

SM = CU
Block = Wave IIRC

Resource Assignment to Blocks
Execution resources are assigned to threads per block. Resources are organized into Streaming Multiprocessors (SM). Multiple blocks of threads can be assigned to a single SM. The number varies with CUDA device. For example, a CUDA device may allow up to 8 thread blocks to be assigned to an SM. This is the upper limit, and it is not necessary that for any configuration of threads, a SM will run 8 blocks.

For example, if the resources of a SM are not sufficient to run 8 blocks of threads, then the number of blocks that are assigned to it is dynamically reduced by the CUDA runtime. Reduction is done on block granularity. To reduce the amount of threads assigned to a SM, the number of threads is reduced by a block.

In recent CUDA devices, a SM can accommodate up to 1536 threads. The configuration depends upon the programmer. This can be in the form of 3 blocks of 512 threads each, 6 blocks of 256 threads each or 12 blocks of 128 threads each. The upper limit is on the number of threads, and not on the number of blocks.

Thus, the number of threads that can run parallel on a CUDA device is simply the number of SM multiplied by the maximum number of threads each SM can support. In this case, the value comes out to be SM x 1536.

***
If you look at the quoted text above, I guess if one wanted to use 512 threads per block (a very tiny shader), then having more SMs/CUs would be hugely beneficial because of how much work can be done in parallel.
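As a rough way to sanity-check those numbers, here's a small CUDA sketch that asks the runtime how many of those 512/256/128-thread blocks actually stay resident per SM. The kernel is just a stand-in; real occupancy depends on each shader's register and shared-memory usage.

// Sketch only: compare resident threads per SM for different block sizes.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void standInShader(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSizes[] = { 512, 256, 128 };
    for (int bs : blockSizes)
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, standInShader, bs, 0);

        int threadsPerSM = blocksPerSM * bs;    // already capped by the per-SM thread limit
        printf("block=%4d -> %2d resident blocks/SM, %4d threads/SM (HW cap %d), "
               "%d SMs => %d threads in flight\n",
               bs, blocksPerSM, threadsPerSM, prop.maxThreadsPerMultiProcessor,
               prop.multiProcessorCount, threadsPerSM * prop.multiProcessorCount);
    }
    return 0;
}

In the real case, the register pressure described above would show up here as the runtime reporting fewer resident blocks for a fatter shader.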

I'm not sure if this is the case with Lego. Seems painful, but worth discussing further. I'm not actually sure how it's done; I would be grateful for any insight here.
 

Thanks for this thorough post. My understanding of this is hazy in places because, like I said, I don't actually write very much graphics code. My understanding is that in a non-CUDA context you dispatch compute shaders by setting the number of threads to be used; the driver splits that work up over the available CUs, and you run into whatever register, bandwidth and memory limitations exist per CU. If you dispatch a compute shader with a thread count that's mismatched to the number of CUs, yeah, you'll end up leaving large parts of the hardware idle. So it's a matter of finding work for compute to do and writing it such that it runs efficiently across all of the threads you want, within the limitations of your hardware.

There is plenty of work for compute (async or regular) to do: rasterizing pixels to a visibility buffer, particle physics simulations, binning and culling lights in a clustered renderer, post-processing operations, generating (various kinds of) shadow maps... and all of that can be broken up further into smaller chunks to dispatch separately, by dividing things up by screen space, world space, etc. Given that console developers have good profiling tools, a fixed piece of hardware in front of them and a lot of access to how memory is handled, how work is dispatched and so on, they are equipped to significantly overhaul how the renderer works to best use the resources available, if the investment is worth it.
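As a loose illustration of the "divide the work up by screen space" idea, and with the caveat that this is CUDA rather than a graphics API and the image size and per-pixel operation are arbitrary placeholders, a full-screen pass dispatched as 8x8 tiles might look something like this:

// Sketch only: a post-process-style pass dispatched as a 2D grid of 8x8 thread groups.
#include <cuda_runtime.h>

__global__ void tonemapTile(float* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    int idx = y * width + x;
    float c = pixels[idx];
    pixels[idx] = c / (1.0f + c);           // simple Reinhard-style curve as filler work
}

int main()
{
    const int width = 1920, height = 1080;
    float* d = nullptr;
    cudaMalloc(&d, width * height * sizeof(float));

    dim3 block(8, 8);                       // one 8x8 tile per thread group
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    tonemapTile<<<grid, block>>>(d, width, height);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}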

As for compute not running during other operations, I'm not sure that's meaningfully correct. It may be technically true on some or all hardware (I'm a little out of my depth on all of the latency-hiding things GPUs do), but I think for practical purposes async compute runs simultaneously with other operations -- the async work runs at the "same time", and it's either being squeezed into the built-in latency while the CUs are doing their other work or being run on certain CUs separately. Either way, capacity has a meaningful effect on how much (and what kind of) work can be done in the same real-time window. This blog post seems helpful, although with your CUDA experience it may be less enlightening to you than it is to me: https://www.linkedin.com/pulse/dire...synchronous-compute-nvidia-amd-dennis-mungai/
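For what it's worth, the closest analogue I can sketch in CUDA terms is submitting independent work on separate streams and letting the scheduler overlap it where there's spare capacity; whether it genuinely runs concurrently is up to the hardware and driver, which is exactly the question here. Kernel names and sizes below are placeholders.

// Sketch only: two independent kernels on separate streams may overlap.
#include <cuda_runtime.h>

__global__ void graphicsLikeWork(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = sqrtf(a[i] + 1.0f);
}

__global__ void asyncComputeWork(float* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = b[i] * 0.25f;
}

int main()
{
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent work in different streams: the hardware is free to fill
    // idle SMs or latency gaps with the second kernel, but isn't obliged to.
    graphicsLikeWork<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    asyncComputeWork<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}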

I do agree that I doubt any of this is the important part with Lego, though.
 
Good article you posted here. I think this is the part I wasn't sure about -- whether it actually happens -- but it seems idle CUs can get leveraged:
On Nvidia's Maxwell architecture, what would happen is that Task A is assigned to 8 SMs such that execution time is 1.25ms and the FFU does not stall the SMs at all. Simple, right? However we now have 20% of our SMs going unused.
So we assign task B to those 2 SMs which will complete it in 1.5ms, in parallel with Task A's execution on the other 8 SMs.
Here is the problem; when Task A completes, Task B will still have 0.25ms to go, and on Maxwell there's no way of reassigning those 8 SMs before Task B completes. Partitioning of resources is static (unchanging) and happens at the draw call boundary, controlled by the driver.

Nvidia's Pascal architecture solves this problem with 'dynamic load balancing'; the 8 SMs assigned to Task A can be reassigned to other tasks while Task B is still running, thus saturating the SMs and improving utilization.

So you can assign remaining work to fill up unused CUs such that all 10 SMs are used. I was under the assumption that if you dispatched enough work and it split over 8 of 10, the last 2 were unusable until the first program finished, but this suggests that's not true.

Hmm, quite curious really, then. Having more CUs should, in theory, be flat-out better in nearly all cases.
 

This, of course, assumes there are enough parallel tasks without dependencies that you can continuously retask SMs that have finished with their current work, as well as outstanding tasks that can fit into the SMs that were just freed up. I.e. if 3 SMs were just freed up but the only available task requires 5 SMs, it may have to wait for an additional 2 SMs to finish.

Regards,
SB
 
Df Article @ https://www.eurogamer.net/digitalfo...ands-on-the-cost-of-next-generation-rendering

Unreal Engine 5 hands-on: the cost of next generation rendering
Nanite and Lumen are incredible - but the performance implications are daunting.

Unreal Engine 5 recently emerged from early access, with a full version now available to games creators. Simultaneously, the 'city sample' portion from the brilliant The Matrix Awakens demo was also released, giving users a chance to get to grips with MetaHuman crowds and large-scale AI in a vast open world, with buildings, roads and more created via procedural generation. In short, Epic is opening up a staggering wealth of new technologies to all and UE5 is, effectively, the first paradigm shift in games development seen since the arrival of the new consoles. So what have we learned from this release? Put simply: it's demanding. Very demanding.

...
 
Having more CUs should, in theory, be flat-out better in nearly all cases.
In practice, going really wide without the commensurate increase in cache size and bandwidth can create issues in some scenarios. This is a new challenge for GPU microarchitectures, but actually a decades-old challenge in server design.

E.g. 40 CUs with 4MB of cache and 400GB/s of bandwidth should theoretically always be better than 30 CUs with 4MB of cache and 400GB/s of bandwidth, but trying to do too much with too little cache risks thrashing the cache, and these architectures never seem to scale entirely linearly. Increasing CUs is relatively straightforward in terms of transistor budget; cache is very expensive on a per-transistor basis, and increasing bandwidth to a block of cache is a whole different level of complexity.

CPUs, CUs, cache, RAM: shoving in too much compute without scaling up the rest of the supporting architecture never yields the kind of performance improvement you would expect.
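A back-of-the-envelope way to see it (all numbers below are illustrative, not any real GPU): hold bandwidth constant and watch how much higher the arithmetic intensity of a kernel has to be before the extra CUs pay off.

// Sketch only: more CUs at fixed bandwidth just raise the FLOP/byte ratio
// a kernel needs to be compute-bound rather than bandwidth-bound.
#include <cstdio>

int main()
{
    const double clockGHz    = 1.8;     // assumed shader clock
    const double bandwidthGB = 400.0;   // assumed GB/s, held constant in both cases

    const int cuCounts[] = { 30, 40 };
    for (int cus : cuCounts)
    {
        // GCN/RDNA-style CU: 64 lanes, 1 FMA (2 FLOPs) per lane per clock.
        double gflops = cus * 64.0 * 2.0 * clockGHz;
        double ridge  = gflops / bandwidthGB;   // FLOPs per byte needed to stay compute-bound

        printf("%d CUs @ %.1f GHz: %.0f GFLOPS, %.0f GB/s -> need ~%.1f FLOP/byte "
               "to be compute-bound\n", cus, clockGHz, gflops, bandwidthGB, ridge);
    }
    // A kernel below the higher ratio sees no benefit from the extra 10 CUs:
    // it is still waiting on the same 400 GB/s.
    return 0;
}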
 
Ouch. Seeing how heavy UE5 is on the CPU and how badly it scales with core count, maybe the massive exodus to UE5 by so many studios is a little premature.
 

I disagree. They can still use all of the same subsystems as in UE4; they are not required to use Nanite or Lumen with UE5, and they would still get all of the other vast improvements in UE5.
 
In practice, going really wide without the commensurate increase in cache size and bandwidth can create issues in some scenarios.
Definitely. I made a generic assumption about increasing CUs; you'd have to beef up all areas of memory to support the additional compute. ALU alone is not sufficient.
 
Df Article @ https://www.eurogamer.net/digitalfo...ands-on-the-cost-of-next-generation-rendering

Unreal Engine 5 hands-on: the cost of next generation rendering
Nanite and Lumen are incredible - but the performance implications are daunting.
That is still quite... disappointing for a new engine to be that single-core limited. So the bad performance on the current-gen consoles is mostly a CPU-related problem. Maybe the city build runs a bit worse on Xbox because the PlayStation API has always handled CPU resources better than the Xbox one, but once that is ironed out, the consoles should at least be equal in the CPU department. Also, since the engine is heavily single-core bound, that might still leave the PS5 the full power budget for the GPU, as the other cores are more or less idling.
They definitely need to work on that shader caching as well, but that was already a problem with the previous version of the engine.
 
Df Article @ https://www.eurogamer.net/digitalfo...ands-on-the-cost-of-next-generation-rendering

Unreal Engine 5 hands-on: the cost of next generation rendering
Nanite and Lumen are incredible - but the performance implications are daunting.

Ouch, that is a relatively massive performance hit for enabling hardware RT (32% in their testing) when the GPU (an RTX 3090) isn't even really being taxed that heavily. I hadn't realized that hardware RT was so heavy on the CPU.

Regards,
SB
 
Hardware RT has had an enormous CPU penalty from the beginning. When I got my 3080 I tested out Control and was shocked by how heavy the CPU cost was.

https://forum.beyond3d.com/threads/dxr-performance-cpu-cost.62177/

I've had a bit of an axe to grind with the major tech review sites for really under-valuing how important CPUs still are for gaming on PC. You'll always see benchmarks suggesting that low- to mid-range CPUs are good enough, because they're always run under conditions that flatten results by hitting GPU limits. Then ray tracing, or some other new CPU-heavy game, comes along and that "low value" high-end CPU suddenly looks a lot better. I'm glad to start seeing more sites testing at 720p high/ultra to really show how CPU performance differs, but there's still a big lack of ray tracing in benchmarks.

For ray tracing to become pervasive we need a much better software stack, or some ground-breaking new way to store scene information that is more CPU-friendly. Consoles have some advantages on the CPU side, as well as the consumer expectation that 60 or even 30fps (yikes) is good enough. Buying i7, i9, R7 or R9 CPUs is very expensive.
 