You’ve clearly gotten lost in the sources you linked. SER is used to reorder hits before shading them for GI, and the integrators do exactly what their names suggest - add either the direct or indirect lighting contribution. Feel free to keep linking sources without getting a grasp of what's happening in them.
That's not it ...
SER can be used to accelerate an importance sampling technique known as next event estimation. Nvidia specifically extended their original ReSTIR algorithm with this technique over the many other importance sampling methods they could've settled on; they chose NEE because you can reuse those same paths in a future frame for the neighboring set of pixels, which makes it easy for their latest hardware to sort these rays ...
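For anyone who hasn't run into it, next event estimation just means explicitly sampling a light at every path vertex instead of hoping a bounce ray stumbles into one. A minimal, self-contained Python toy (made-up scene constants, nothing to do with Nvidia's actual ReSTIR/SER code) that shows the idea:

```python
import math, random

# Toy scene: a Lambertian point at the origin (surface normal +z) lit by a square
# area light of half-width A_HALF in the plane z = H, emitting radiance LE downward.
# All constants are made up purely for illustration.
ALBEDO, LE, H, A_HALF = 0.7, 5.0, 2.0, 0.5
BRDF = ALBEDO / math.pi               # Lambertian BRDF value
LIGHT_AREA = (2.0 * A_HALF) ** 2

def estimate_nee() -> float:
    """One next-event-estimation sample: explicitly pick a point on the light."""
    lx = random.uniform(-A_HALF, A_HALF)
    ly = random.uniform(-A_HALF, A_HALF)
    r2 = lx * lx + ly * ly + H * H    # squared distance to the light sample
    r = math.sqrt(r2)
    cos_surf = H / r                  # cosine at the shading point (normal +z)
    cos_light = H / r                 # cosine at the light (normal -z, facing down)
    # pdf of the sample w.r.t. area = 1 / LIGHT_AREA, hence the * LIGHT_AREA
    return BRDF * LE * cos_surf * cos_light / r2 * LIGHT_AREA

def estimate_blind() -> float:
    """One 'blind' sample: shoot a uniform hemisphere ray and hope it hits the light."""
    cos_t = random.random()                       # uniform hemisphere: z ~ U(0, 1)
    sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
    phi = 2.0 * math.pi * random.random()
    dx, dy, dz = sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t
    if dz <= 1e-6:
        return 0.0
    t = H / dz                                    # intersect the z = H plane
    if abs(dx * t) > A_HALF or abs(dy * t) > A_HALF:
        return 0.0                                # ray missed the light
    # pdf w.r.t. solid angle = 1 / (2*pi), hence the * 2*pi
    return BRDF * LE * cos_t * 2.0 * math.pi

if __name__ == "__main__":
    n = 200_000
    print("NEE  :", sum(estimate_nee() for _ in range(n)) / n)
    print("blind:", sum(estimate_blind() for _ in range(n)) / n)
```

Both estimators converge to the same direct-lighting value; the NEE one just needs far fewer samples, because every sample actually lands on the light while most blind rays miss it entirely.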
This is not a contentious issue. If you plot a graph of flops-per-byte ratios over the years, you'll quickly get an idea of where the industry is heading. Again, I never said memory performance doesn't matter or anything along those lines. I said that the statement "There was never any magic solution to RT or AI other than 'more memory performance'" is false for too many reasons.
Despite the increasing ratio of compute to memory performance, the industry has mostly doubled down on the current paradigm and made it far worse (fatter per-pixel G-buffers, more rendering passes, virtual textures/geometry/shadows, overuse of barriers, and especially ray reordering) ...
Making good use of a GPU's 'parallelism' involves splitting up the rendering pipeline into many more fine-grained rendering passes that can overlap with each other to enable more async compute or higher occupancy, which puts hardware vendors under even more pressure to implement a more robust memory system ...
Pretty sure I do know it better than you do.
Your attempts to demonstrate this knowledge have been inadequate so far ...
It's matrix multiplications in the attention layers that have always dominated in LLMs; that's common knowledge, not some secret knowledge of Intel's. This simply means there may not be enough tokens to fill the GPU with work and keep the machine busy until the next portion of weights arrives. Since weight matrices dominate memory requests, you need a large number of tokens to keep the machine occupied with matrix multiplications. For this, you need batching, multi-token prediction, speculative decoding, and more compact weight matrices.
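(To make that intensity argument concrete, here's a rough sketch with a hypothetical fp16 weight matrix, activation traffic ignored: each weight byte pulled from HBM only gets one multiply-add per token in the batch, so the FLOPs-per-byte you can extract scales with how many tokens you batch together.)

```python
# Toy arithmetic-intensity estimate for applying one weight matrix to a batch of tokens.
# Assumes fp16 weights (2 bytes each); each weight contributes one multiply-add
# (2 FLOPs) per token in the batch, and activation traffic is ignored.
def weight_matmul_intensity(batch_tokens: int, bytes_per_weight: int = 2) -> float:
    flops_per_weight = 2 * batch_tokens
    return flops_per_weight / bytes_per_weight

for b in (1, 8, 64, 512):
    print(f"batch = {b:3d} tokens -> ~{weight_matmul_intensity(b):6.1f} FLOPs per weight byte")
```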
Something tells me you've never attended a class/read a book on algorithms, but I'll give you a quick rundown from the authors of the FlashAttention paper. From the standard attention algorithm implementation you should arrive at a formula of 8N^2 + 8Nd (bytes) for total memory movement and 4N^2·d + 3N^2 (ops) for total compute operations, where N = sequence length and d = dimension of an attention head ...
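In case the accounting isn't obvious, here's roughly where those two formulas come from (my own breakdown, assuming fp16 at 2 bytes per element and the unfused three-pass attention the paper analyzes):

```python
def standard_attention_cost(N: int, d: int):
    """Approximate per-head cost of unfused attention in fp16."""
    elem = 2  # bytes per fp16 value
    # HBM traffic:
    #   read Q and K                  -> 2*N*d elements
    #   write S = Q @ K^T             -> N*N
    #   read S, write P = softmax(S)  -> 2*N*N
    #   read P and V, write O = P @ V -> N*N + 2*N*d
    mem_bytes = elem * (4 * N * d + 4 * N * N)   # = 8*N*d + 8*N^2 bytes
    # Compute:
    #   Q @ K^T -> 2*N^2*d FLOPs, softmax -> ~3*N^2, P @ V -> 2*N^2*d
    flops = 4 * N * N * d + 3 * N * N
    return flops, mem_bytes
```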
If you calculate the compute/memory ratios for Llama 3 w/ 70B parameters (N = 8192 & d = 128) or GPT-3 w/ 175B parameters (N = 2048 & d = 128), you get just over 63 ops/byte and just over 60 ops/byte respectively. An H100 NVL GPU can deliver a compute/memory ratio well over 800 ops/byte, which blows well past the ratios of both AI models ...
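Plugging the numbers into those same formulas as a quick sanity check:

```python
# Sanity check of the ratios above, using the same formulas for one attention head.
def attention_ops_per_byte(N: int, d: int) -> float:
    ops = 4 * N * N * d + 3 * N * N      # QK^T + softmax + PV
    mem = 8 * N * N + 8 * N * d          # fp16 HBM traffic in bytes
    return ops / mem

for name, N, d in [("Llama 3 70B", 8192, 128), ("GPT-3 175B", 2048, 128)]:
    print(f"{name}: {attention_ops_per_byte(N, d):.1f} ops/byte")
# -> about 63.4 and 60.6 ops/byte, nowhere near the 800+ ops/byte figure for an H100 NVL
```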
Batching only applies to the prefill phase, when the model is analyzing the user's input prompt. Even if you set the number of input prompt tokens (8192 in the case of Llama 3) equal to the maximum sequence length of the model (extremely unrealistic), your total prefill time, defined as tokens * parameters / compute perf, comes out to 144ms on the fastest GPU today, and this step of the output generation process only happens ONCE. Generating a completion token (parameters / bandwidth) during the last phase of model inference only takes just under 18ms on the same hardware, and a model can take multiple completion tokens to finish generating. As few as 8 completion tokens can overtake the maximum possible time spent on the prefill phase for a Llama 3 model w/ 70B parameters ...
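Same back-of-the-envelope math for the timings, using the simplified cost models above (my assumptions for the peak figures: roughly 3.96e15 ops/s and 3.9 TB/s for an H100 NVL, with weights stored at one byte per parameter):

```python
# Reproduces the 144 ms / 18 ms figures above for Llama 3 70B on an H100 NVL.
# Peak numbers and the 1-byte-per-parameter assumption are mine; the cost models
# (prefill ~ tokens * params ops, decode ~ one full pass over the weights per token)
# are the ones stated above.
PARAMS      = 70e9       # Llama 3 70B
PROMPT_TOKS = 8192       # worst case: prompt fills the whole context window
PEAK_OPS    = 3.958e15   # ops/s (assumed peak tensor throughput)
PEAK_BW     = 3.9e12     # bytes/s (assumed HBM bandwidth)

prefill_s   = PROMPT_TOKS * PARAMS / PEAK_OPS   # compute-bound, happens once
per_token_s = PARAMS / PEAK_BW                  # bandwidth-bound, per output token

print(f"prefill (once)         : {prefill_s * 1e3:6.1f} ms")      # ~145 ms
print(f"per completion token   : {per_token_s * 1e3:6.1f} ms")    # ~18 ms
print(f"tokens to match prefill: {prefill_s / per_token_s:4.1f}")  # ~8
```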
Full stop, the numbers have proven that your claim of LLMs being compute bound is entirely wrong no matter how far you proceed into your argument ...