The high-level description of RTX implies a more autonomous RT core versus the AMD patent. A BVH instruction passes pointers and ray data across a bus to the texture unit, and the memory and filtering path loads and filters data that can be passed back to the SIMD hardware and/or the intersection engine. That engine can perform intersection tests on bounding boxes or triangles, depending on what type of node is being evaluated.

Seems very similar to RTX, as far as I can imagine how the latter works.
What seems different here is that the hybrid texture/RT block only accelerates the evaluation of one node at a time, whereas Nvidia's described RT core functionality keeps on traversing and testing until it can return a hit/miss result.
The AMD method uses a state machine that takes the results of the intersection tests, any additional child nodes, and data indicating how the traversal stack should be updated, and passes them back to the SIMD hardware. The SIMD hardware then evaluates or implements what was passed back. Successive nodes would involve executing another BVH instruction with arguments based on the most recently updated context and stack data.
The SIMD hardware and the register and memory resources it has available host much of the context, although I am unclear on how exposed the back-and-forth between the texture block+state machine and the SIMD would be to the programmer. The programmable SIMD could, in principle, make more flexible decisions about the next traversal step, although what the patent describes could also be implemented on programmable hardware running internal programs or microcode loops that won't release the wavefront back to programmer control until they are done.
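To make that division of labor concrete, here is a minimal C++-style sketch of what the traversal loop might look like from the wavefront's point of view, based on my reading of the patent. The names (bvh_intersect_node, NodeResult, the 4-wide child list) and the fixed stack depth are my assumptions, not AMD's actual instruction set.

```cpp
// Conceptual sketch of the loop the patent seems to describe: the fixed-function
// unit evaluates ONE node per "BVH instruction", and SIMD code decides what to
// push, what to pop, and when to stop.  All names here are hypothetical.
#include <cstdint>

struct Ray { float origin[3], dir[3], t_max; };

struct NodeResult {                  // what the intersection engine hands back
    uint32_t child_nodes[4];         // children whose bounding boxes the ray hit
    int      num_children;           // how many of those entries are valid
    bool     hit_triangle;           // true if a leaf triangle was intersected
    float    hit_t;                  // distance of that triangle hit
};

// Stand-in for the hardware BVH instruction: evaluates a single node against a
// ray.  Placeholder body so the sketch compiles; the real work would happen in
// the fixed-function intersection engine behind the texture path.
NodeResult bvh_intersect_node(uint32_t /*node_ptr*/, const Ray& /*ray*/)
{
    return NodeResult{{0, 0, 0, 0}, 0, false, 0.0f};
}

constexpr int kStackDepth = 32;      // assumed upper bound on traversal depth

float trace_closest_hit(uint32_t root, const Ray& ray)
{
    uint32_t stack[kStackDepth];     // lives in vector registers (or LDS/memory)
    int      sp = 0;
    stack[sp++] = root;

    float closest = ray.t_max;
    while (sp > 0) {
        uint32_t   node = stack[--sp];                    // SIMD code manages the stack
        NodeResult r    = bvh_intersect_node(node, ray);  // one node per instruction

        if (r.hit_triangle && r.hit_t < closest)
            closest = r.hit_t;                            // keep the nearer hit

        for (int i = 0; i < r.num_children && sp < kStackDepth; ++i)
            stack[sp++] = r.child_nodes[i];               // push surviving children
    }
    return closest;                                       // closest hit, or t_max on miss
}
```

The point of contrast with RTX as Nvidia describes it is that every iteration of this while loop is visible to the SIMD hardware, whereas an RT core would keep iterating internally until it has a hit or miss to report.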
The impact of the AMD method on the overall CU appears to be more significant, versus Nvidia's claim that its RT core can leave the SM mostly free to do other things.
Other Nvidia claims, like the RT core's built-in execution loop saving a lot of instruction fetch bandwidth for the SM, would put AMD's method at a potential half-way point. There's a subset of operations that have an internal loop, but whatever additional steps go back to the SIMD hardware may incur significant instruction traffic--albeit not as much as a fully software solution.
I haven't found a description at this level of detail for the RTX elements of Nvidia's architecture for comparison. One possible point for future review is how the two methods compare if RTX hardware is using a custom intersection shader, which Nvidia generally recommends against for performance reasons. That might inject some of the back-and-forth communication between RT core and SIMD hardware that AMD's method defaults to.
The vector register file seems to be the first choice for hosting the stack, but since this is back in the programmable domain there could be a fallback to LDS and memory.

I really wonder how they manage a stack per ray. Can one assume an upper bound on how large this stack has to be? And even then, that's a lot of memory and bandwidth.
Personally I've always used a stackless approach on GPU. I must be missing something here...
It seems hardware vendors like how stack-based methods tend to yield more compact BVH structures in memory, don't repeat traversals of nodes as often as many stackless methods do, and generate accesses that may play better with cache hierarchies. Being able to play in the same conceptual space as many CPU methods may also be a bonus.
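On the stack-size question: one common way to bound the per-ray cost (the patent doesn't spell this out, so this is just an illustration) is a fixed-depth short stack kept in registers, with a fallback such as restarting from the root or spilling to LDS/memory when it overflows.

```cpp
#include <cstdint>

// Illustrative short stack: its register footprint is fixed at compile time
// regardless of scene size, and an overflow flag records that some subtrees
// were dropped and must be revisited (restart from the root, or spill the
// oldest entries to LDS/memory instead).
constexpr int kShortStackDepth = 16;

struct ShortStack {
    uint32_t entries[kShortStackDepth];
    int      sp         = 0;
    bool     overflowed = false;

    void push(uint32_t node) {
        if (sp < kShortStackDepth) entries[sp++] = node;
        else overflowed = true;      // dropped an entry; a restart will revisit it
    }

    bool pop(uint32_t& node) {
        if (sp == 0) return false;   // empty: done, or time to restart if overflowed
        node = entries[--sp];
        return true;
    }
};
```

The worst case still grows with tree depth, but in practice a small fixed depth plus an occasional restart keeps the per-ray footprint predictable, which is presumably what makes a register-file-first arrangement with LDS/memory fallback workable.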
"Fixed function ray intersection engine in a texture processor" - question: is this texture processor, in AMD nomenclature, a TMU or some sort of CU?

It's generally in the area where the vector cache, L/S units, and texture filtering units are, but the engine might be a hardware block sitting next to them. The texture path already has a lot of buffers, ALUs, and sequencing capability, so what gets reused versus re-implemented isn't clear.
PRIMITIVE LEVEL PREEMPTION USING DISCRETE NON-REAL-TIME AND REAL TIME PIPELINES
http://www.freepatentsonline.com/y2019/0164328.html

This may have some relation to the existence of more than one graphics ring in the recent Navi driver commits. It allows for preemption at the granularity of a primitive by creating duplicate pipeline, register, and data storage for the main graphics pipeline and a real-time pipeline. This duplication extends from the command processor through the geometry processor. Whether there's an explicitly separate command processor or processor block per pipeline, or some form of multi-threading, isn't clear.
The big change is that a context switch and drain of the fixed-function pipeline doesn't happen in this form of preemption, because the command processor and front end duplicate storage, and various blocks like input assembly and tessellation are not shared with the real-time pipeline. The non-real-time path would presumably be the high-performance standard graphics path, while the real-time path avoids stepping on its toes by emulating various stages in software rather than risk flushing them.
This would apply to workloads that are very latency-sensitive but don't lean much on the stages that get emulated.
The shader back-end is generally agnostic of the front end, so its changes appear minimal.
(edit: Mentions a scheduling processor that may align with the MES controller added with Navi, which would match the priority tunneling in AMD's slides. Might also relate to having a central geometry processor.)
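As a toy illustration of the arbitration (nothing here is a real API, and the model is mine, not the patent's), duplicated front-end state means the real-time stream can win at every primitive boundary without anything in the normal stream being drained or context-switched:

```cpp
#include <cstdio>

// Each stream owns its own duplicated front-end state in this toy model, so
// switching between them is just an arbitration decision at a primitive
// boundary rather than a pipeline drain or context switch.
struct DrawStream {
    const char* name;
    int primitives_left;
};

void issue_primitives(DrawStream& normal, DrawStream& realtime)
{
    while (normal.primitives_left > 0 || realtime.primitives_left > 0) {
        // Real-time work, when present, wins arbitration at the next boundary.
        DrawStream& next = (realtime.primitives_left > 0) ? realtime : normal;
        --next.primitives_left;
        std::printf("front end issues a primitive from the %s pipeline\n", next.name);
    }
}

int main()
{
    DrawStream normal{"non-real-time", 6};
    DrawStream realtime{"real-time", 2};
    issue_primitives(normal, realtime);   // the two real-time primitives go first
}
```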
This may not be strictly related to Navi or hardware. A skim of it makes me think it describes a change in how the compiler handles static instruction scheduling, in terms of deciding how to compile or optimize individual sections of a shader program.
The supposed original way of doing this was to have the compiler walk through every block of a shader, record the number of registers it needs, and then indicate that the shader as a whole will need an allocation matching the consumption of the block that needs the most registers.
This serial process can lead to sub-optimal results if blocks evaluated earlier are compiled to use a certain number of registers, and then a later block needs a large allocation.
It may be that, had the earlier evaluations known of this, those blocks could have been compiled with more generous register constraints for better performance.
Alternately, it may be the case that one block needing a lot of registers creates occupancy problems. If that block is just a small part of a shader whose other sections have modest register needs, it might be better to compile the big block less optimally for performance if doing so gives the overall shader better occupancy.
The patent describes evaluating blocks with multiple scheduling algorithms in parallel, taking the accumulated results, and selecting the version of each block that it thinks leads to a better overall result.
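Here is a rough sketch of how that selection step could work. The cost model and the greedy re-picking are my own illustration, not the patent's algorithm; only the "choose one variant per block to balance speed against whole-shader register pressure" framing comes from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct BlockVariant {
    int    regs_needed;   // registers this compiled version of the block uses
    double est_cycles;    // estimated execution time of this version
};

// The whole-shader register allocation is set by the hungriest block, so the
// occupancy penalty in this toy cost model is driven by the max, while the
// runtime estimate is the sum over blocks.
double shader_cost(const std::vector<BlockVariant>& chosen)
{
    int    peak_regs = 1;
    double cycles    = 0.0;
    for (const auto& v : chosen) {
        peak_regs = std::max(peak_regs, v.regs_needed);
        cycles   += v.est_cycles;
    }
    double waves = std::max(1, 256 / peak_regs);   // toy occupancy estimate
    return cycles * (10.0 / waves);                // fewer waves -> worse latency hiding
}

// Greedy selection: start from the fastest variant of every block, then re-pick
// any block whose register demand hurts the overall result.  Assumes each block
// has at least one compiled variant.
std::vector<BlockVariant> select(const std::vector<std::vector<BlockVariant>>& per_block)
{
    std::vector<BlockVariant> chosen;
    for (const auto& variants : per_block)
        chosen.push_back(*std::min_element(variants.begin(), variants.end(),
            [](const BlockVariant& a, const BlockVariant& b) {
                return a.est_cycles < b.est_cycles;
            }));

    for (std::size_t i = 0; i < per_block.size(); ++i)
        for (const auto& v : per_block[i]) {       // try the other variants of block i
            auto trial = chosen;
            trial[i]   = v;
            if (shader_cost(trial) < shader_cost(chosen))
                chosen = trial;
        }
    return chosen;
}
```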
EXTREME-BANDWIDTH SCALABLE PERFORMANCE-PER-WATT GPU ARCHITECTURE
http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=11&f=G&l=50&co1=AND&d=PG01&s1="Advanced+Micro+Devices".AANM.&OS=AANM/"Advanced+Micro+Devices"&RS=AANM/"Advanced+Micro+Devices"

This sets up a vertically-stacked system where SIMD hardware can access the section of the DRAM directly above it as if it were "local", supplied at lower latency and apparently at data bus widths closer to the internal data paths of the DRAM, widths that external interfaces normally whittle down. This local connection also drills directly down to the SIMD hardware and its register file. The cache hierarchy that exists in this case is for accessing other parts of the HBM that are non-local (above another, more distant SIMD).
The rules for this type of access appear to be different from those for more traditional memory accesses, which may also need to stay consistent with the CPU or other clients. Accesses are even aware of lane predication, so this seems to be treated like an extension of a data share or local buffer. There's also consideration for load-balancing between local and remote access, and for the power consumption from much higher DRAM array activity.
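Purely as a sketch of the split being described (none of these helper names exist anywhere): an access would be steered straight down the local path when the address maps to the DRAM directly above the issuing SIMD, with the wavefront's execution mask governing which lanes drive bank activity, and through the ordinary cache hierarchy otherwise.

```cpp
#include <cstdint>

// Hypothetical helpers with placeholder bodies so the sketch compiles; the
// address mapping is an arbitrary toy choice.
static bool address_is_local(uint64_t addr, int simd_id)
{
    return int(addr >> 32) == simd_id;             // toy mapping: high bits name the owner SIMD
}
static uint32_t load_via_local_tsv(uint64_t, uint64_t)  { return 0; }   // placeholder
static uint32_t load_via_cache_hierarchy(uint64_t)      { return 0; }   // placeholder

uint32_t load_word(uint64_t addr, int simd_id, uint64_t exec_mask)
{
    if (address_is_local(addr, simd_id))
        // Local path: straight down through the stack to the banks above this
        // SIMD; only the lanes enabled in exec_mask would drive bank activity.
        return load_via_local_tsv(addr, exec_mask);

    // Remote path: the address belongs to DRAM above another SIMD, so it is
    // serviced through the normal cache hierarchy instead.
    return load_via_cache_hierarchy(addr);
}
```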
edit: Of note, this talks about SIMD16 hardware, and it's another one of those DOE patents, which, like the variable-width SIMD and the raft of near-threshold, per-ALU voltage regulation, or asynchronous processing patents from similar programs, seem to have little correlation with any AMD products.
Seems like a method to stack an APU or GPU with HBM using TSVs.

I keep reminding myself of that patent that mentioned a method to dissipate the heat of a chip across the PCB with copper tubes. Stacking HBM with an APU would be great to reduce costs, but the problem of dissipating the heat between the stacks should prevent it from happening. Maybe this way they could do it.

It's possible that an APU like that wouldn't need the big heatsink, because there's no way it can draw enough current when the base is also a heatsink and the HBM dies are in the way.