How is my description of Apple closer to AMD when they don't seem to have any unified/flexible on-chip memory? Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted as physically separate.

It seems to me that the scheme you describe is more akin to what AMD is doing (as described by the patent here: Register Compaction with Early Release). If Apple indeed operated that way, there would be no change in occupancy on "heavy" shaders and one would still need to do dispatch fine-tuning — but M3 behavior seems to be very different. Some pieces of evidence are the dramatic performance improvements on complex shaders (e.g. Blender — and that's before the hardware RT kicks in) and the Blender patches (where it is mentioned that no dispatch-group fine-tuning is needed, as the system will take care of occupancy automatically).
I imagine that Apple absolutely can launch as many waves as they want at the start, but I think they "starve certain sources of memory by priority" by carving out other pools of on-chip memory before they spill to device memory, so it still fits with their claim that 'occupancy' isn't dependent on worst-case register allocation.

All this does raise an interesting question — I don't think we have a very good idea how these things are managed. Getting through multiple layers of marketing BS to the actual technical details can be incredibly tricky. At least my understanding of what Apple does is that they virtualize everything, even register access. So registers are fundamentally backed by system memory and can be cached closer to the SIMDs to improve performance (here is the relevant patent: CACHE CONTROL TO PRESERVE REGISTER DATA). This seems to suggest that you can in principle launch as many waves as you want, just that in some instances the performance will suck since you'll run out of cache. I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not clear at all what AMD does; it doesn't seem like their system is fully automatic.
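To make the watchdog idea concrete, here is a minimal host-side sketch of such a feedback loop. It is purely speculative; CacheStats, kThrash and kHeadroom are invented names and thresholds, not anything Apple has documented:

Code:
// Hypothetical occupancy watchdog: suspend waves when the register-backing
// cache thrashes, admit more when there is headroom. All values invented.
struct CacheStats { float missRate; };  // fraction of register reads missing the cache

struct OccupancyWatchdog {
    int activeWaves  = 1;
    int pendingWaves = 0;
    static constexpr float kThrash   = 0.15f;  // guess: above this, spilling hurts more
    static constexpr float kHeadroom = 0.05f;  // guess: below this, the cache has room

    void tick(const CacheStats& s) {
        if (s.missRate > kThrash && activeWaves > 1) {
            --activeWaves;        // park a wave; its registers age out of the cache
        } else if (s.missRate < kHeadroom && pendingWaves > 0) {
            ++activeWaves;        // admit another wave from the queue
            --pendingWaves;
        }                         // otherwise: steady state, leave occupancy alone
    }
};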
> I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not clear at all what AMD does.

Most likely both require assistance from the shader compiler to flag the lower part of the stack as "preemptable": rather than a hard register count, you then have a peak working-set size, plus a total register count for the deepest part of the call tree. Worst case, it requires hoisting a couple of registers from a lower page to an upper page on call, to actually minimize the hot working set by compaction of active registers.
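A toy calculation of those two numbers, with made-up per-frame register counts, just to show how they diverge:

Code:
// Toy numbers for one call chain (main -> f -> g). If the lower stack
// frames are preemptable, only the hot frame must stay resident.
#include <algorithm>
#include <cstdio>

int main() {
    int liveRegsPerFrame[] = { 48, 32, 24 };  // invented counts
    int total = 0, peakWorkingSet = 0;
    for (int live : liveRegsPerFrame) {
        total += live;                                    // classic worst case: 104
        peakWorkingSet = std::max(peakWorkingSet, live);  // hot frame only: 48
    }
    printf("total register count (deepest path): %d\n", total);
    printf("peak working set:                    %d\n", peakWorkingSet);
    return 0;
}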
> Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted as physically separate ...

That slide might very well be accurate. But even in the worst case, nothing speaks against memory management performed by a memory-management kernel that operates outside the "normal" visibility constraints. You don't necessarily have only the instructions available that the ISA docs expose as "user visible" either, and it's almost safe to assume that software interrupt routines with elevated privileges have existed for a while now.
> Also if you have some "race condition" in your program where you're continuously modifying certain memory locations (threadgroup or tile) throughout the lifetime of the program, isn't that dangerous (crash/incorrect results) to allocate registers from other pools of on-chip memory?

No. You should rarely even be able to access a register that's outside the active working set / the assigned part of the register file if the shader compiler did its job properly. When you do, or when you hit something that has truly been preempted, you will most likely land in some sort of trap state for the wave. And you can only resume once the dependency is resolved, so you were never able to see anything stale. Since this is a rare case, resolving that dependency may very well happen in pure software, in an ordinary interrupt routine.
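Sketched as code, the flow might look like the following. This is an invented illustration of the trap-and-resume idea, not any vendor's documented mechanism:

Code:
// A wave faulting on a preempted register page parks in a trap state;
// a privileged interrupt routine restores the page, then the wave
// resumes, so it can never observe stale data.
enum class WaveState { Running, Trapped };
struct Wave { WaveState state = WaveState::Running; };

// hypothetical helper: copy the register page back into on-chip storage
void restorePageFromBackingStore(int page) { (void)page; }

void onRegisterFault(Wave& w, int faultingPage) {  // elevated-privilege interrupt routine
    w.state = WaveState::Trapped;                  // parked: the wave observes nothing
    restorePageFromBackingStore(faultingPage);     // rare path, so software is fine here
    w.state = WaveState::Running;                  // resume only once the dependency resolves
}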
> There is only the cache hierarchy backed by system memory, and "registers" and "threadgroup memory" are special ways of addressing, using narrow "local" addresses instead of 64-bit (48/52 bits used) virtual addresses. Everything is backed by system memory, but ideally almost all "register" accesses come out of cache.

You can't assume that it's tied into the "regular" memory hierarchy just because it's mapped into a single logical address space for the sake of unified instruction encoding. Backing different parts of the logical address space with different fixed functions or hardware pools is not that uncommon, and neither is assigning different retention and mapping strategies based on virtual address range: e.g. direct addressing for control registers, basic offset addressing for registers in the register file with a linear layout, page-table-based addressing for main memory. And not all those hardware pools need to be tied into a hierarchy by direct interconnects in order to enable preemption or reloading; interrupts can handle all the rare cases where a virtual address from any of those distinct ranges is currently unmapped.
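The range-based decode being described might look roughly like this in software. The ranges and pool names are invented for illustration; real hardware would do this in the load/store path:

Code:
// Invented address map: different virtual ranges hit different pools
// with different addressing schemes, without sharing one hierarchy.
#include <cstdint>

enum class Pool { ControlRegs, RegisterFile, Threadgroup, MainMemory };

Pool decode(uint64_t va) {
    if (va < 0x1000)   return Pool::ControlRegs;   // direct addressing
    if (va < 0x10000)  return Pool::RegisterFile;  // base + offset into SRAM, linear layout
    if (va < 0x20000)  return Pool::Threadgroup;   // on-chip scratch
    return Pool::MainMemory;                       // page-table walk, can fault/interrupt
}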
> You can't assume that it's tied into the "regular" memory hierarchy just because it's mapped into a single logical address space for the sake of unified instruction encoding.

Sorry for the confusion and for going somewhat off-topic, but I was referring to what Apple has been doing since the M3, not AMD. And that's not about "a single logical address space for the sake of unified instruction encoding", but explicitly about dynamically sharing the on-chip memory between registers, threadgroup memory, and cache. They can still be separate logical address spaces.
> ...explicitly about dynamically sharing the on-chip memory between registers, threadgroup memory, and cache.

Huh, curious that this approach ended up being efficient at all. It implies that the M3 is backing the register file with a cache that has some degree of associativity rather than a simple adder+mux. There's definitely been a trade-off there between better occupancy and increased complexity in by far the hottest path, considering that you usually need multiple registers per instruction and thread. I can't see that paying off in terms of required transistors per cache line. Maybe in terms of efficiency, though, as most of the extra silicon required to give the L0 a sufficient number of ports is inactive on average, as opposed to needing more SRAM cells, especially when assuming limited associativity.
Do we have a time yet for the review embargo?
> That's more than I expected.

Not really. The XT will end up faster than the 5070 Ti in games which favor AMD h/w. This is a result where I'd expect the average w/o RT to land.
> Huh, curious that this approach ended up being efficient at all. It implies that the M3 is backing the register file with a cache that has some degree of associativity rather than a simple adder+mux. [...]

I'd imagine that there would be a small "register L0"/operand-reuse cache to cover the majority of register accesses. So a small number of comparators + a mux for an SRAM block that's significantly smaller than the traditional GPU register file. And, unlike accesses to global or threadgroup memory, register indexing is uniform across a simdgroup.
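A rough software model of such a "register L0", with made-up sizes, to show why uniform indexing keeps the comparator count small (one tag check per line serves the whole simdgroup):

Code:
// Tiny illustrative model: a handful of tag comparators in front of a
// small SRAM. Sizes are invented; hardware would check tags in parallel.
#include <cstdint>

struct RegisterL0 {
    static constexpr int kLines = 8;   // cached register rows
    static constexpr int kLanes = 32;  // one row spans the simdgroup
    uint16_t tag[kLines]   = {};       // which register each line holds
    bool     valid[kLines] = {};
    uint32_t data[kLines][kLanes];

    // Indexing is uniform across the simdgroup, so a single comparison
    // per line resolves the access for all 32 lanes at once.
    uint32_t* lookup(uint16_t regIndex) {
        for (int i = 0; i < kLines; ++i)
            if (valid[i] && tag[i] == regIndex)
                return data[i];
        return nullptr;  // miss: fetch the row from the backing cache
    }
};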
> Not really. The XT will end up faster than the 5070 Ti in games which favor AMD h/w. This is a result where I'd expect the average w/o RT to land.

I meant the TDP.
> I meant the TDP.

Ah, yeah, power is still a sore point for AMD. The "downclocked"/power-limited 9070 does okay, though.
"3.3.3. Dynamic VGPR Allocation & Deallocation" provides some meat for the dynamic register allocation discussion above.RDNA4 Instruction Set Architecture Reference Guide was posted over at Anandtech's forum.
My apologies if it has already been posted in the thread.
Compare setmaxnreg, a PTX instruction which Nvidia added in Hopper and which is available on all Blackwell GPUs; the corresponding SASS instruction is USETMAXREG (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set).

This feature is stated to support only wave32. For those who have better knowledge of AMD/RDNA's architecture than I do: when they say it doesn't apply to graphics, does that include raytracing? Or is that "compute", and is this expected to be aggressively used in raytracing? And do they definitely not support this for graphics shaders at all, or might that be redacted from the public ISA because they consider it too limited/complicated to be exposed publicly?
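For reference, this is how setmaxnreg is typically reached from CUDA C++ via inline PTX (the pattern CUTLASS-style Hopper kernels use). The wrapper names, register counts and kernel are illustrative; the instruction itself requires sm_90a or newer and counts that are multiples of 8 in [24, 256]:

Code:
// Warp-specialized register rebalancing: producer warps shrink their
// per-thread register budget, consumer warps grow theirs.
// Compile with: nvcc -arch=sm_90a ...; launch with 256 threads per block.
#include <cuda_runtime.h>

template <int RegCount>
__device__ void warpgroup_reg_dealloc() {  // give registers back to the pool
    asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" :: "n"(RegCount));
}

template <int RegCount>
__device__ void warpgroup_reg_alloc() {    // claim extra registers from the pool
    asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" :: "n"(RegCount));
}

__global__ void producer_consumer_kernel() {
    int warpgroup = threadIdx.x / 128;     // all warps of a warpgroup must participate
    if (warpgroup == 0)
        warpgroup_reg_dealloc<40>();       // producer: mostly issues async copies
    else
        warpgroup_reg_alloc<232>();        // consumer: runs the math, wants registers
    // ... producer/consumer work would go here ...
}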