The custom unit fallacy - modern GPU workloads more dependent on memory than computational power? *spawn

Lurkmass

Veteran
So Cerny's "custom hardware for machine learning" is just some tweak to the CUs? I'd consider that false advertising.
There was never any magic solution to RT or AI other than "more memory performance". More fixed function or other specialized HW logic alone won't net you major gains ...
 
There was never any magic solution to RT or AI other than "more memory performance".
Not true at all. Memory has never been the sole or primary solution for any of the mentioned workloads, and there are countless ways to optimize performance through specialized logic or better-designed software. This can be achieved by using more compact data structures, brute-forcing cache sizes, compressing data, quantizing data, performing multi-token or speculative predictions for LLMs, batching work, fusing kernels, and more. There are myriad optimizations available that can dramatically improve performance in both RT and AI without relying on faster memory. If memory were the only factor, it would be the component costing thousands of dollars, not GPUs.
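To put a rough number on just the quantization point, here is a minimal sketch with illustrative figures of my own (a 7B-parameter model and ~1 TB/s of bandwidth, not anything from this thread) showing how shrinking the weights alone lifts the bandwidth-bound decoding ceiling without a faster memory system:

```python
# Minimal sketch, assuming an illustrative 7B-parameter model and ~1 TB/s of
# off-chip bandwidth (my numbers, not from the thread): quantizing weights cuts
# the bytes streamed per decoded token and raises the tokens/s ceiling.
PARAMS = 7e9          # assumed model size in parameters
BANDWIDTH = 1.0e12    # assumed off-chip bandwidth in bytes/s (~1 TB/s)

for fmt, bytes_per_weight in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    bytes_per_token = PARAMS * bytes_per_weight      # weights streamed once per token
    ceiling = BANDWIDTH / bytes_per_token            # bandwidth-bound decode rate
    print(f"{fmt}: {bytes_per_token/1e9:.1f} GB/token -> ~{ceiling:.0f} tokens/s ceiling")
```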
 
Not true at all. Memory has never been the sole or primary solution for any of the mentioned workloads, and there are countless ways to optimize performance through specialized logic or better-designed software. This can be achieved by using more compact data structures, brute-forcing cache sizes, compressing data, quantizing data, performing multi-token or speculative predictions for LLMs, batching work, fusing kernels, and more. There are myriad optimizations available that can dramatically improve performance in both RT and AI without relying on faster memory. If memory were the only factor, it would be the component costing thousands of dollars, not GPUs.
@Bold Then what, after exhaustively applying all sorts of data compression/compaction schemes? At that point you just have an economically expensive race to see who will be first to integrate a vertically stacked memory architecture whilst dedicating ever more master compute die area to caches or register files ...

All RT or AI mostly exposes is who will reach the limits of their memory system first, which isn't interesting, so all the best wishes to consoles banking on something like HBM becoming mainstream/widespread at semi-reasonable prices ...
 
Then what, after exhaustively applying all sorts of data compression/compaction schemes? At that point you just have an economically expensive race to see who will be first to integrate a vertically stacked memory architecture whilst dedicating ever more master compute die area to caches or register files ...
For the past 20+ years, GPUs and software have consistently evolved toward higher math density algorithms. It is no different with RT and AI this time. As an example, per-pixel ReSTIR is more computationally intensive than brute-force global illumination due to the additional calculations done per pixel, yet it achieves significantly higher quality results with the same number of samples. On-device inference for neural compressed textures takes time as well, but it compresses textures another 8–10 times more than what is possible with simple block compression.

When it comes to LLMs, the idea is similar - maximizing data reuse in caches and registers is key to achieving higher math density. This can be achieved by batching multiple prompts and processing them in parallel, or by performing multi-token or speculative predictions at the single prompt granularity, leveraging additional parallelism within a single request. In the case of the new class of models — Large Reasoning Models — there is effectively unbounded parallelism, as they can launch an arbitrary number of prompts over the same weights at each reasoning step, enabling them to evaluate a large number of hypotheses in a single step.
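As a rough sketch of why batching raises math density (my own toy numbers, not a claim about any specific GPU): for a single weight matrix streamed from DRAM during decoding, each additional token in the batch adds two ops per weight while the weight bytes are loaded only once, so arithmetic intensity grows roughly linearly with batch size until the kernel crosses the compute/bandwidth ridge.

```python
# Rough sketch, assuming FP16 weights and an assumed ~300 ops/byte ridge point:
# arithmetic intensity of one weight matrix during decoding vs batch size.
BYTES_PER_WEIGHT = 2.0   # FP16 weights (assumption)
RIDGE = 300.0            # assumed GPU compute/bandwidth ratio in ops/byte

for batch in [1, 8, 64, 256, 512]:
    intensity = 2.0 * batch / BYTES_PER_WEIGHT   # 2 ops (mul + add) per weight per token
    bound = "compute-bound" if intensity >= RIDGE else "memory-bound"
    print(f"batch={batch:4d}: ~{intensity:6.0f} ops/byte -> {bound}")
```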
 
For the past 20+ years, GPUs and software have consistently evolved toward higher math density algorithms. It is no different with RT and AI this time. As an example, per-pixel ReSTIR is more computationally intensive than brute-force global illumination due to the additional calculations done per pixel, yet it achieves significantly higher quality results with the same number of samples. On-device inference for neural compressed textures takes time as well, but it compresses textures another 8–10 times more than what is possible with simple block compression.
Nvidia would think otherwise since their most optimized implementation of ReSTIR involves using SER to spill arguments to their L2 cache to reorder the threads. There's absolutely no math involved in that part of the process as it's ALL MEMORY operations!

"Consistently evolved" towards higher math density yet the industry proceeds to keep using their deferred renderers and composite many more rendering passes for it and there's no sign of them either moving to tile-based rendering architectures or make use of D3D12's optional render pass API hence the disastrous results observed on Snapdragon Windows PCs!
When it comes to LLMs, the idea is similar - maximizing data reuse in caches and registers is key to achieving higher math density. This can be achieved by batching multiple prompts and processing them in parallel, or by performing multi-token or speculative predictions at the single prompt granularity, leveraging additional parallelism within a single request. In the case of the new class of models — Large Reasoning Models — there is effectively unbounded parallelism, as they can launch an arbitrary number of prompts over the same weights at each reasoning step, enabling them to evaluate a large number of hypotheses in a single step.
All of this is pure drivel coming from you since we still can't run the simplest of LLMs on many NPUs. Most of these applicable optimizations don't whitewash away the underlying fact that we have a memory problem ...
 
Nvidia would think otherwise since their most optimized implementation of ReSTIR involves using SER to spill arguments to their L2 cache to reorder the threads. There's absolutely no math involved in that part of the process as it's ALL MEMORY operations!
What does SER have to do with ReSTIR? SER reorders threads to sort them by material ID prior to shading in hit shaders, but how is this related to the math in ReSTIR?

"Consistently evolved" towards higher math density yet the industry proceeds to keep using their deferred renderers and composite many more rendering passes for it and there's no sign of them either moving to tile-based rendering architectures or make use of D3D12's optional render pass API hence the disastrous results observed on Snapdragon Windows PCs!
Rasterization performance still scales well without moving to tile-based or other exotic architectures, which confirms what I've already said — "more memory performance" is not the sole or primary solution for any of the mentioned workloads, even in mature rasterization.

All of this is pure drivel coming from you since we still can't run the simplest of LLMs on many NPUs. Most of these applicable optimizations don't whitewash away the underlying fact that we have a memory problem ...
The claim that you can't run even the simplest LLM on NPUs just because they don't have enough bandwidth is the drivel. You can run LLMs on many 40+ TOPS NPUs, but it would be a slow and painful experience due to the immature software of many NPUs. Of course, it will also be much slower compared to running the same task on a 1400 TOPS 4090. It's ridiculous how you've reduced something like this to a single thesis. Apparently, in your world, it's only memory performance that prevents integrated GPUs from reaching the level of much beefier discrete graphics.
 
What does SER have to do with ReSTIR? SER reorders threads to sort them by material ID prior to shading in hit shaders, but how is this related to the math in ReSTIR?
It'd be helpful for you to look at the code more often instead of asking for responses ...
Rasterization performance still scales well without moving to tile-based or other exotic architectures, which confirms what I've already said — "more memory performance" is not the sole or primary solution for any of the mentioned workloads, even in mature rasterization.
Your lack of understanding of deferred rendering and the role of a G-buffer in relation to render pass APIs shows itself when you mention "rasterization performance", which can only mean that you have no grasp of the problem statement ...
The claim that you can't run even the simplest LLM on NPUs just because they don't have enough bandwidth is the drivel. You can run LLMs on many 40+ TOPS NPUs, but it would be a slow and painful experience due to the immature software of many NPUs. Of course, it will also be much slower compared to running the same task on a 1400 TOPS 4090. It's ridiculous how you've reduced something like this to a single thesis. Apparently, in your world, it's only memory performance that prevents integrated GPUs from reaching the level of much beefier discrete graphics.
You can't run ANY LLMs at all on many NPUs (TinyLlama w/ 1.1B parameters is absolutely NOT an LLM) in comparison to GPUs, so your claim that they're "much slower" is mostly erroneous. LLMs ARE memory and memory-performance bound, and there's no argument to be had there. The only step in an LLM that is even remotely compute bound is the "prefill phase", where the model analyzes the user prompt, but that pales in comparison when the standard performance metric for AI models is tokens/s, which is memory bound in every way possible ...
 
It'd be helpful for you to look at the code more often instead of asking for responses ...
Why would I need to look at this code when there is a whitepaper and guide available that explain what SER does, where to use it, and how? In the code you linked, the integrate_indirect_pass is responsible for computing indirect lighting. It handles sorting out materials before shading them, as I mentioned earlier. So, I’ll ask one more time - how does this relate to ReSTIR, which is a completely unrelated pass?

Your lack of understanding of deferred rendering and the role of a G-buffer in relation to render pass APIs shows itself when you mention "rasterization performance", which can only mean that you have no grasp of the problem statement ...
Nobody asked you for these links, and the discussion was never about tile-based GPUs in the first place. I am perfectly aware of how tile-based GPUs reduce memory accesses by using tiling, which has been used for decades, not only in GPUs but also as a general software optimization. I doubt HW tiling is used anywhere today besides particle rendering on modern GPUs, because you still need to store the buffer with all the vertices prior to tiling the screen, and unless the buffer fits into the cache, it's not feasible. The more vertices you need to store, the worse it gets. Even if tilers had any advantages for G-Buffer rendering in modern applications, which I sincerely doubt, they would be outweighed by the minimal time spent in G-Buffer passes in modern games. You can’t achieve a 10x speedup by accelerating a 2-3 ms fraction of a frame. You don't even need the heavy machinery that comes with a complex TBDR, as modern games are not primarily limited by memory-bound passes. That’s why, as I mentioned earlier, even classic rasterization (and, for god's sake, by rasterization, I meant the entire pipeline, not just the G-Buffer/Depth or shadowmap passes) is not generally limited by memory bandwidth.
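To spell out the back-of-envelope behind that last point, here is a minimal Amdahl's-law sketch, assuming an illustrative 16.7 ms frame (my number, not from the thread): even a tiler that made the G-Buffer pass completely free would barely move the whole-frame time.

```python
# Minimal Amdahl's-law sketch (assumed 16.7 ms total frame, i.e. 60 fps):
# best-case whole-frame speedup if the G-Buffer pass were reduced to zero cost.
FRAME_MS = 16.7   # assumed total frame time

for gbuffer_ms in (2.0, 3.0):
    best_case = FRAME_MS / (FRAME_MS - gbuffer_ms)   # the pass accelerated infinitely
    print(f"G-Buffer {gbuffer_ms:.0f} ms of {FRAME_MS} ms -> at most {best_case:.2f}x overall")
```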

The only step in an LLM that is even remotely compute bound is the "prefill phase", where the model analyzes the user prompt, but that pales in comparison when the standard performance metric for AI models is tokens/s, which is memory bound in every way possible ...
It seems your lack of understanding of LLM architecture prevents you from grasping a simple concept - LLMs are bandwidth-limited because the attention phase doesn’t have enough parallelism to saturate a GPU. This is why batching and speculative decoding help improve GPU utilization. See, I don’t need a thousand irrelevant links to show how wrong you are - there are numerous benchmarks with different batch sizes that clearly demonstrate you have no idea of LLM architectures or their bottlenecks.
 
Why would I need to look at this code when there is a whitepaper and guide available that explain what SER does, where to use it, and how? In the code you linked, the integrate_indirect_pass is responsible for computing indirect lighting. It handles sorting out materials before shading them, as I mentioned earlier. So, I’ll ask one more time - how does this relate to ReSTIR, which is a completely unrelated pass?
If you'd even tried to look at the included header file, you would have immediately realized that the integration pass is a part of RTXDI's (ReSTIR) algorithm, but you obviously don't know any better ...
Nobody asked you for these links, and the discussion was never about tile-based GPUs in the first place. I am perfectly aware of how tile-based GPUs reduce memory accesses by using tiling, which has been used for decades, not only in GPUs but also as a general software optimization. I doubt HW tiling is used anywhere today besides particle rendering on modern GPUs, because you still need to store the buffer with all the vertices prior to tiling the screen, and unless the buffer fits into the cache, it's not feasible. The more vertices you need to store, the worse it gets. Even if tilers had any advantages for G-Buffer rendering in modern applications, which I sincerely doubt, they would be outweighed by the minimal time spent in G-Buffer passes in modern games. You can’t achieve a 10x speedup by accelerating a 2-3 ms fraction of a frame. You don't even need the heavy machinery that comes with a complex TBDR, as modern games are not primarily limited by memory-bound passes. That’s why, as I mentioned earlier, even classic rasterization (and, for god's sake, by rasterization, I meant the entire pipeline, not just the G-Buffer/Depth or shadowmap passes) is not generally limited by memory bandwidth.
The snarky reply with TBR architectures was to demonstrate that merging/fusing render passes can be a performance win due to the reduced memory traffic of having fewer rendering passes. I find your claim that the industry is "evolving towards higher math density" to be extremely contentious now that the most popular AAA PC & console game engine has added yet ANOTHER rendering pass that performs 64-bit atomic r/m/w memory operations to render geometry into a visibility buffer, while other hardware vendors are now scrambling to implement a memory traffic compression scheme for this as well, and the results show handheld PCs churning really hard to attain low performance. The industry is also looking to implement/use persistent threads or Work Graphs to reduce the amount of GPU work starvation that happens with cache flushes ...

You clearly have no idea what hardware vendors have to do behind the scenes to optimize their memory system to enable high-end modern rendering ...
It seems your lack of understanding of LLM architecture prevents you from grasping a simple concept - LLMs are bandwidth-limited because the attention phase doesn’t have enough parallelism to saturate a GPU. This is why batching and speculative decoding help improve GPU utilization. See, I don’t need a thousand irrelevant links to show how wrong you are - there are numerous benchmarks with different batch sizes that clearly demonstrate you have no idea of LLM architectures or their bottlenecks.
I guess Intel's advice must be "irrelevant and wrong too" according to you since they seem to think that the prefill phase is peanuts compared to the token phase (which EVERYONE uses as the benchmark for AI models in general) ...
 
If you'd even tried to look at the included header file, you would have immediately realized that the integration pass is a part of RTXDI's (ReSTIR) algorithm, but you obviously don't know any better ...
You’ve clearly gotten lost in the sources you linked. SER is used to reorder hits before shading them for GI, and the integrators do exactly what their names suggest - add either the direct or indirect lighting contribution. Feel free to keep linking sources without getting a grasp of what's happening in them.

I find your claim that the industry is "evolving towards higher math density" to be extremely contentious now that the most popular
This is not a contentious issue. If you plot a graph of flops-per-byte ratios over the years, you'll quickly get an idea of where the industry is heading. Again, I never said memory performance doesn't matter or anything along those lines. I said the statement - There was never any magic solution to RT or AI other than "more memory performance" - is false for too many reasons.
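For illustration, here is a quick sketch of that flops-per-byte trend using approximate launch spec-sheet numbers that I'm supplying from memory; treat the exact values as ballpark figures rather than authoritative data.

```python
# Rough sketch of the flops-per-byte trend (approximate FP32 TFLOPS and GB/s
# launch figures supplied by me for illustration; exact values are ballpark).
gpus = [
    ("GTX 980 (2014)",     4.6,  224),
    ("GTX 1080 Ti (2017)", 11.3, 484),
    ("RTX 2080 Ti (2018)", 13.4, 616),
    ("RTX 3090 (2020)",    35.6, 936),
    ("RTX 4090 (2022)",    82.6, 1008),
]

for name, tflops, gbps in gpus:
    ratio = (tflops * 1e12) / (gbps * 1e9)   # FP32 flops available per byte of DRAM traffic
    print(f"{name:20s} ~{ratio:5.1f} flops/byte")
```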

You clearly have no idea what hardware vendors have to do behind the scenes to optimize their memory system to enable high-end modern rendering
Pretty sure I do know it better than you do.

I guess Intel's advice must be "irrelevant and wrong too" according to you since they seem to think that the prefill phase is peanuts compared to the token phase (which EVERYONE uses as the benchmark for AI models in general) ...
It's matrix multiplications in the attention layers that have always dominated in LLMs; it's common knowledge, not secret Intel knowledge. This simply means there may not be enough tokens to fill the GPU with work and keep the machine busy until the next portion of weights arrives. Since weight matrices dominate memory requests, you need a large number of tokens to keep the machine occupied with matrix multiplications. For this, you need batching/multi-token prediction/speculative prediction, and more compact weight matrices.
 
You’ve clearly gotten lost in the sources you linked. SER is used to reorder hits before shading them for GI, and the integrators do exactly what their names suggest - add either the direct or indirect lighting contribution. Feel free to keep linking sources without getting a grasp of what's happening in them.
That's not it ...

SER can be used to accelerate an importance sampling technique known as next event estimation. Nvidia specifically extended their original ReSTIR algorithm with this technique over the many other importance sampling methods they could've settled on, but they chose NEE because you can reuse those same paths in a future frame for the neighboring set of pixels, which makes it easy for their latest hardware to sort these rays ...
This is not a contentious issue. If you plot a graph of flops-per-byte ratios over the years, you'll quickly get an idea of where the industry is heading. Again, I never said memory performance doesn't matter or anything along those lines. I said the statement - There was never any magic solution to RT or AI other than "more memory performance" - is false for too many reasons.
Despite the increasing ratios of compute with respect to memory performance, the industry has mostly doubled down on the current paradigm and made it far worse (fatter per-pixel G-buffers, more rendering passes, virtual textures/geometry/shadows, overusing barriers, and especially ray reordering) ...

Making good use of a GPU's 'parallelism' involves splitting up the rendering pipeline into many more fine-grained rendering passes that can overlap with each other to enable more async compute or higher occupancy, which puts hardware vendors under even more pressure to implement a more robust memory system ...
Pretty sure I do know it better than you do.
Your attempts to demonstrate this knowledge have been inadequate so far ...
It's matrix multiplications in the attention layers that have always dominated in LLMs; it's common knowledge, not secret Intel knowledge. This simply means there may not be enough tokens to fill the GPU with work and keep the machine busy until the next portion of weights arrives. Since weight matrices dominate memory requests, you need a large number of tokens to keep the machine occupied with matrix multiplications. For this, you need batching/multi-token prediction/speculative prediction, and more compact weight matrices.
Something tells me you've never attended a class/read a book on algorithms, but I'll give you a quick rundown from the authors of the FlashAttention paper. From the standard attention algorithm implementation you should arrive at a formula of 8N^2 + 8Nd (bytes) for total memory movement and 4(N^2)d + 3N^2 (ops) for total compute operations, where N = sequence length and d = dimension of an attention head ...

If you calculate the compute/memory ratios for Llama 3 w/ 70B parameters (N = 8192 & d = 128) or GPT 3 w/ 175B parameters (N = 2048 & d = 128), you get just over 63 ops/byte and 60 ops/byte respectively. An H100 NVL GPU can deliver a compute/memory ratio well over 800 ops/byte, which blows well past the ratios of both AI models ...

Batching only applies to the prefill phase, when the model is doing user input prompt analysis. Even if you set the number of input prompt tokens (8192 in the case of Llama 3) equal to the maximum sequence length of the model (extremely unrealistic), your total prefill time, defined as (tokens*parameters/compute perf), comes out to 144ms on the fastest GPU today, and this step of the model output generation process only happens ONCE. Generating a completion token (parameters/bandwidth) during the last phase of model inferencing takes just under 18ms on the same hardware, and a model can take multiple completion tokens to finish generation. As little as 8 completion tokens can overtake the maximum possible time spent on the prefill phase of a Llama 3 model w/ 70B parameters ...
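For anyone who wants to check the arithmetic, here is a small sketch that reproduces these figures. The attention formulas are the ones quoted above; the hardware numbers (~3958 TOPS and ~3.9 TB/s for an H100 NVL) and the 1-byte-per-parameter assumption are mine, so treat the outputs as estimates.

```python
# Rough sketch reproducing the back-of-envelope numbers above. Formulas come
# from the post; the H100 NVL specs (~3958 TOPS, ~3.9 TB/s) and 1 byte per
# parameter are my assumptions, so the results land near the quoted ~144 ms
# prefill and ~18 ms per token only to within rounding of those specs.
def attention_ops_per_byte(N, d):
    ops = 4 * N**2 * d + 3 * N**2        # total compute operations
    bytes_moved = 8 * N**2 + 8 * N * d   # total memory movement
    return ops / bytes_moved

print(f"Llama 3 70B (N=8192, d=128): {attention_ops_per_byte(8192, 128):.1f} ops/byte")
print(f"GPT-3 175B  (N=2048, d=128): {attention_ops_per_byte(2048, 128):.1f} ops/byte")

COMPUTE = 3958e12     # assumed peak ops/s
BANDWIDTH = 3.9e12    # assumed bytes/s
PARAMS = 70e9         # Llama 3 70B, assumed 1 byte per parameter

prefill_s = 8192 * PARAMS / COMPUTE   # post's tokens * parameters / compute estimate
decode_s = PARAMS / BANDWIDTH         # post's parameters / bandwidth estimate
print(f"prefill (8192 tokens): ~{prefill_s*1e3:.0f} ms, happens once")
print(f"per completion token:  ~{decode_s*1e3:.0f} ms, happens every token")
```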

Full stop, the numbers have proven that your claim of LLMs being compute bound is entirely wrong no matter how far you proceed into your argument ...
 
Making good use of a GPU's 'parallelism' involves splitting up the rendering pipeline into many more fine-grained rendering passes that can overlap with each other to enable more async compute or higher occupancy, which puts hardware vendors under even more pressure to implement a more robust memory system ...

This makes perfect sense, though I haven’t really seen much evidence of saturated compute or off-chip bandwidth, at least not in Nvidia’s profiling tools. I don’t know if that’s evidence that developers are struggling to generate a lot of overlapping work or that Nvidia’s hardware/drivers are struggling to schedule that work efficiently.
 
This makes perfect sense, though I haven’t really seen much evidence of saturated compute or off-chip bandwidth, at least not in Nvidia’s profiling tools. I don’t know if that’s evidence that developers are struggling to generate a lot of overlapping work or that Nvidia’s hardware/drivers are struggling to schedule that work efficiently.
It was stated in the hardware sub forum that instruction issue is one of the bigger weak points on NV GPUs. Maybe that will be a focus for Blackwell.
 
SER can be used to accelerate an importance sampling technique known as next event estimation. Nvidia specifically extended their original ReSTIR algorithm with this technique over the many other importance sampling methods they could've settled on, but they chose NEE because you can reuse those same paths in a future frame for the neighboring set of pixels, which makes it easy for their latest hardware to sort these rays ...
Glad you can google things. However, sorting has nothing to do with the math in ReSTIR; they are orthogonal. The sole purpose of sorting in SER is to improve memory and computation coherence. Obviously, by sorting rays, you are trading time for sorting in exchange for faster processing later on in hit shaders/TTU search/etc. As long as sorting doesn't spill data outside the chip, it is ultimately a memory optimization, even if the sorting itself is memory bound. That trades on-chip bandwidth and computation to save off-chip memory bandwidth later down the pipe, in the same way that the ReSTIR algorithm does relative to brute-force sampling.
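For what it's worth, the reordering idea itself is simple enough to sketch in a few lines. This is just the concept of binning hits by a coherence key with toy data of my own, not NVIDIA's actual SER API or implementation:

```python
# Minimal sketch of the reordering concept (not NVIDIA's SER API): group ray
# hits by a coherence key such as material ID so that neighboring threads run
# the same hit shader and touch the same textures/constants.
from collections import defaultdict

hits = [  # (ray_id, material_id) - toy data
    (0, 7), (1, 2), (2, 7), (3, 2), (4, 5), (5, 7),
]

buckets = defaultdict(list)
for ray_id, material_id in hits:
    buckets[material_id].append(ray_id)   # the binning/sorting step

for material_id in sorted(buckets):
    # all rays hitting this material are shaded back-to-back -> coherent
    # instruction streams and coherent memory fetches
    print(f"material {material_id}: shade rays {buckets[material_id]}")
```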

the industry has mostly doubled down on the current paradigm and made it far worse (fatter per-pixel G-buffers, more rendering passes, virtual textures/geometry/shadows, overusing barriers, and especially ray reordering)
As always, you're mixing a lot of things here. Especially funny to see ray reordering included, which is actually a memory optimization. This isn't helpful.

I'd say the whole arguing about whether rasterization, RT or AI are dominated by memory performance is goofy because even the on-shelf products, like the 4080 and 7900 XTX, prove they aren't.

Your attempts to demonstrate this knowledge have been inadequate so far ...
I have yet to hear anything sensible from you either.

Something tells me you've never attended a class/read a book on algorithms, but I'll give you a quick rundown from the authors of the FlashAttention paper.
For someone who presumably attended algorithm classes, you're using too many words to explain that vector-by-matrix multiplications have quadratic complexity, while the rest of the optimization would include using tiling via on-chip SRAM in an attempt to reduce the off-chip memory traffic to linear complexity. I don't see anything potentially novel in the paper, as tiling has been used in GEMM for ages, and it's common to see up to 70-90% utilization in GEMM kernels.
What you're trying to prove is that the O(n²) complexity of on-chip computations is insufficient to mask the linear complexity of memory accesses (with the tiling optimizations), which is not true, as people would never have achieved 70-90% utilization in GEMM kernels otherwise. Your calculations are likely incorrect, along with the other assumptions. Moreover, prefill is unrelated to FlashAttention and concerns caching already processed elements, as transformers are autoregressive all-to-all models that perform many repetitive computations that can be cached. Prefill is the stage where the cache is populated. This caching is the source of the additional memory requirements and the lack of parallelism, as it stores the cache in memory and avoids recalculating already processed elements. This is why you need a large number of tokens to keep the GPU busy, as this optimization reduces computational complexity but increases memory demands. This is also why the number of tokens, which depends on prompt length or batching, has such a significant impact on the performance of LLMs.
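To give a feel for those additional memory requirements, here is a minimal sketch of KV-cache growth. The formula is the standard one; the Llama-3-70B-like configuration (80 layers, 8 KV heads via GQA, head_dim 128, FP16 cache) is my assumption for illustration.

```python
# Minimal sketch of KV-cache size, assuming a Llama-3-70B-like config
# (80 layers, 8 KV heads via GQA, head_dim 128) and an FP16 cache.
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2

def kv_cache_bytes(tokens):
    # keys + values for every layer and KV head, for every cached token
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * tokens

for tokens in [1024, 8192, 32768]:
    print(f"{tokens:6d} cached tokens -> {kv_cache_bytes(tokens)/2**30:.2f} GiB of KV cache")
```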
 
Glad you can google things. However, sorting has nothing to do with the math in ReSTIR; they are orthogonal. The sole purpose of sorting in SER is to improve memory and computation coherence. Obviously, by sorting rays, you are trading time for sorting in exchange for faster processing later on in hit shaders/TTU search/etc. As long as sorting doesn't spill data outside the chip, it is ultimately a memory optimization, even if the sorting itself is memory bound. That trades on-chip bandwidth and computation to save off-chip memory bandwidth later down the pipe, in the same way that the ReSTIR algorithm does relative to brute-force sampling.
It's not my problem that you can't see the value of an importance sampling technique to a ReSTIR implementation. Ray sorting is absolutely useful for estimating direct lighting. Spatio-temporal reservoir sample reuse is neither the complete story nor the complete algorithm, as you insistently suggest at every turn ...

Extracting nearby ray/sample coherence out of a totally random estimator like Monte Carlo is very hard compared to doing so with particular estimators such as NEE, where you only sample the paths between a surface and the light source directly lighting it for an estimate. If we know that all paths lead to a light source, then we can sort the neighboring pixel samples by ray direction, which should be very similar to each other ... (take it from Peter Shirley, who works at Nvidia as a graphics expert)

The most optimized ReSTIR implementation is NOT the original algorithm featuring a Monte Carlo estimator but a modified implementation which currently uses SER to accelerate NEE for estimating the direct lighting integral ...
As always, you're mixing a lot of things here. Especially funny to see ray reordering included, which is actually a memory optimization. This isn't helpful.

I'd say the whole arguing about whether rasterization, RT or AI are dominated by memory performance is goofy because even the on-shelf products, like the 4080 and 7900 XTX, prove they aren't.
I wouldn't describe any sorting method as a "memory optimization", since you're explicitly expending memory performance to reach higher occupancy in the case of ray reordering ...
For someone who presumably attended algorithm classes, you're using too many words to explain that vector-by-matrix multiplications have quadratic complexity, while the rest of the optimization would include using tiling via on-chip SRAM in an attempt to reduce the off-chip memory traffic to linear complexity. I don't see anything potentially novel in the paper, as tiling has been used in GEMM for ages, and it's common to see up to 70-90% utilization in GEMM kernels.
What you're trying to prove is that the O(n²) complexity of on-chip computations is insufficient to mask the linear complexity of memory accesses (with the tiling optimizations), which is not true, as people would never have achieved 70-90% utilization in GEMM kernels otherwise. Your calculations are likely incorrect, along with the other assumptions. Moreover, prefill is unrelated to FlashAttention and concerns caching already processed elements, as transformers are autoregressive all-to-all models that perform many repetitive computations that can be cached. Prefill is the stage where the cache is populated. This caching is the source of the additional memory requirements and the lack of parallelism, as it stores the cache in memory and avoids recalculating already processed elements. This is why you need a large number of tokens to keep the GPU busy, as this optimization reduces computational complexity but increases memory demands. This is also why the number of tokens, which depends on prompt length or batching, has such a significant impact on the performance of LLMs.
You have some really poor misconceptions as to what happens during both the prefill and autoregressive (token) phases of a transformer architecture in LLMs. Populating the KV cache during the prefill phase can be compute intensive, since you can batch multiple input prompt tokens. All of that goes out the window in the next (token) phase, when your populated KV cache effectively becomes a single prompt token (or a completion token, as many would term it) beyond the 1st inference (prefill), while the model proceeds to keep reloading all of the parameters (billions of them) on subsequent inferences to perform whatever residual computation is left, so there's no batching to be had here by the nature of the design ...
 
LLMs are irrelevant to image processing filters; depending on the architecture, you can run those batched. So XDNA added to the processor would be very nice.
 