Ext3h
Regular
Okay, so the patent exactly matches what you'd naively assume, after one man-hour of brainstorming, a cache in that spot would do. That being patented is extremely bad news, as working around the patent appears to be quite hard, and that guiding effect is a basic necessity to make the cache efficient... Patents simply suck.
"Discarding immediately sounds wasteful, particularly if you're doing any sort of coherency sorting. Multiple rays/wavefronts will likely need the same data (e.g. because you have a no-hit scenario with coherent rays as the worst case)."

Yes and no. They will need the same hits, which satisfies all the any-hit use cases; they do not need to re-scan each group in that case. But you might of course use your generic L2 data cache to cater for the use cases where you do need to re-sweep.
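To make that distinction concrete, here is a tiny sketch in plain C++ (all names are my own invention, nothing here is from the patent or any driver): an any-hit query is usually answered by testing the recorded candidates, only a nearest-hit query forces a re-sweep of the full group.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hits already recorded for an AS node group by earlier rays of the same
// coherency-sorted batch. (Hypothetical structure, purely for illustration.)
struct RecordedHits {
    std::vector<uint64_t> primitiveIds;
};

// For an any-hit query, testing just these recorded candidates against the
// current ray is usually enough to terminate; only a nearest-hit query (or a
// miss against all candidates) forces a re-sweep of the full group, i.e. a
// fresh fetch of the node data through the generic memory hierarchy.
inline const std::vector<uint64_t>* anyHitCandidates(
    uint64_t nodeId, const std::unordered_map<uint64_t, RecordedHits>& recorded) {
    auto it = recorded.find(nodeId);
    return it != recorded.end() ? &it->second.primitiveIds : nullptr;
}
```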
EDIT: No, you don't need to cache the (full) AS for the no-hit scenario either. At the AS level, you only need to ensure that empty space is also filled with referable nodes, so the no-hit is effectively cacheable too, but with a bias in the guidance so that any-hit has priority over no-hit. Only on the bottom-most triangle-soup level do you actually need to re-sweep, and there you would therefore benefit significantly from "preferred" caching of the entire section.
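Roughly what I mean, as a minimal sketch: a small direct-mapped node cache where "empty" regions are represented by explicit, referable nodes, and where the guidance bias keeps any-hit entries over no-hit entries. All names are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Node-level cache entry: "empty" space is represented by real, referable
// nodes, so a no-hit result is cacheable exactly like an any-hit.
enum class NodeKind : uint8_t { AnyHit, NoHit };

struct CachedNode {
    uint64_t nodeId = 0;   // reference into the acceleration structure
    NodeKind kind   = NodeKind::NoHit;
    bool     valid  = false;
};

// Tiny direct-mapped cache with the bias described above: an any-hit entry is
// never evicted in favour of a no-hit entry mapping to the same slot.
struct BiasedNodeCache {
    std::array<CachedNode, 64> slots{};

    void insert(uint64_t nodeId, NodeKind kind) {
        CachedNode& slot = slots[nodeId % slots.size()];
        if (slot.valid && slot.kind == NodeKind::AnyHit && kind == NodeKind::NoHit)
            return;                              // keep the any-hit entry (priority bias)
        slot = CachedNode{nodeId, kind, true};
    }

    std::optional<NodeKind> lookup(uint64_t nodeId) const {
        const CachedNode& slot = slots[nodeId % slots.size()];
        if (slot.valid && slot.nodeId == nodeId) return slot.kind;
        return std::nullopt;                     // not cached: fall back to a memory fetch
    }
};
```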
EDIT2: What the patent doesn't mention is the feedback channel from an accepted nearest-hit back to the cache, so that previous near-hits are also tried first for coherent rays, which massively speeds up ray truncation. The implementation hint about fulfillment order is misleading here. The real order of evaluation is: cached nearest hit > cached any-hit > stalled (potential) any-hit (by order of fulfilled request) > cached no-hit.
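As a sketch of that evaluation order plus the feedback channel (again hypothetical names, the real fulfillment logic obviously isn't public):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Candidate sources, ordered as described above: cached nearest hit >
// cached any-hit > stalled (potential) any-hit > cached no-hit.
enum class Source : int {
    CachedNearestHit = 0,
    CachedAnyHit     = 1,
    StalledAnyHit    = 2,       // ordered among themselves by fulfilment order
    CachedNoHit      = 3,
};

struct Candidate {
    Source   source;
    uint64_t fulfilmentOrder;   // only meaningful for StalledAnyHit
    uint64_t nodeId;
};

// Order the traversal candidates so the most useful entries are tried first.
inline void orderCandidates(std::vector<Candidate>& c) {
    std::stable_sort(c.begin(), c.end(), [](const Candidate& a, const Candidate& b) {
        if (a.source != b.source) return a.source < b.source;
        return a.fulfilmentOrder < b.fulfilmentOrder;   // tie-break for stalled requests
    });
}

// The feedback channel: once a nearest hit is accepted, it is pushed back into
// the cache so coherent neighbour rays try that node first and truncate early.
struct NearestHitFeedback {
    std::vector<uint64_t> recentNearestHits;
    void accept(uint64_t nodeId) { recentNearestHits.push_back(nodeId); }
};
```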
EDIT3: At least the fast-truncation approach of re-using the cached hierarchy of a previous nearest-hit can also be done in software on AMD hardware, via LDS!
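A minimal CPU-side sketch of that trick, simplified to caching the previously accepted nearest-hit triangles themselves rather than the node chain; in an actual compute/ray shader the per-wave array below would live in LDS (groupshared memory). Names are made up.

```cpp
#include <array>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Tri  { Vec3 a, b, c; };
struct Ray  { Vec3 org, dir; float tMin, tMax; };

inline Vec3  sub(Vec3 u, Vec3 v)   { return {u.x - v.x, u.y - v.y, u.z - v.z}; }
inline Vec3  cross(Vec3 u, Vec3 v) { return {u.y*v.z - u.z*v.y, u.z*v.x - u.x*v.z, u.x*v.y - u.y*v.x}; }
inline float dot(Vec3 u, Vec3 v)   { return u.x*v.x + u.y*v.y + u.z*v.z; }

// Moeller-Trumbore; returns the hit distance, or a negative value on miss.
inline float intersect(const Ray& r, const Tri& t) {
    Vec3 e1 = sub(t.b, t.a), e2 = sub(t.c, t.a);
    Vec3 p  = cross(r.dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return -1.0f;
    float inv = 1.0f / det;
    Vec3 s = sub(r.org, t.a);
    float u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return -1.0f;
    Vec3 q = cross(s, e1);
    float v = dot(r.dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return -1.0f;
    return dot(e2, q) * inv;
}

// Per-wavefront cache of previously accepted nearest-hit triangles. In a real
// shader this array would sit in LDS, shared by all lanes of the wave.
struct WaveHitCache { std::array<Tri, 8> tris{}; int count = 0; };

// Try the neighbours' nearest hits first: any confirmed hit yields a valid,
// usually very tight tMax, so the subsequent BVH traversal is truncated early.
inline void preTruncate(Ray& r, const WaveHitCache& cache) {
    for (int i = 0; i < cache.count; ++i) {
        float t = intersect(r, cache.tris[i]);
        if (t > r.tMin && t < r.tMax) r.tMax = t;
    }
}
```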
EDIT4: Nothing in that patent prevents you from providing a cache which serves only to record nearest hits and yields the estimated new nearest hit (or at least the best possible entry point into the hierarchy), thereby solving most of the any-hit / close-miss scenarios trivially. It only restricts the use of streaming caches for AS data within a fixed-function traversal unit, not when the cache is explicitly populated from shader code!
Manually enter nearest hits, as well as all known empty nodes for the no-hits, into the cache (in both cases with absolute bounding boxes), run rays only against that cache for the first round, and you can mostly decide with zero data fetches / indirection whether the ray will be truncated (on both the near and the far plane!) or not. In the worst case it will still give you a very good hint about the closest point where prefetched data is missing (near-plane truncation!); in the best case that cache alone will truncate both the near and far plane exactly, giving you the nearest hit and nothing else. On average it will still get you deep into the AS with a pre-computed global bounding box, so you can skip several rounds of pointer chasing.
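Sketched out in plain C++ (invented names, and with the simplification that a cached nearest-hit entry is only used as an entry-point hint rather than re-confirmed to clamp the far plane):

```cpp
#include <algorithm>
#include <cfloat>
#include <cstdint>
#include <vector>

struct Aabb { float lo[3], hi[3]; };
struct Ray  { float org[3], dir[3]; float tMin, tMax; };

enum class EntryKind { NearestHit, Empty };

// Shader-populated cache entry: an absolute (world-space) bounding box plus a
// reference back into the acceleration structure.
struct CacheEntry { Aabb box; EntryKind kind; uint64_t nodeId; };

// Slab test returning [tEnter, tExit]; tEnter > tExit means "missed the box".
inline void slab(const Ray& r, const Aabb& b, float& tEnter, float& tExit) {
    tEnter = r.tMin; tExit = r.tMax;
    for (int a = 0; a < 3; ++a) {
        float inv = 1.0f / r.dir[a];
        float t0  = (b.lo[a] - r.org[a]) * inv;
        float t1  = (b.hi[a] - r.org[a]) * inv;
        if (t0 > t1) std::swap(t0, t1);
        tEnter = std::max(tEnter, t0);
        tExit  = std::min(tExit, t1);
    }
}

struct FirstRoundResult {
    float    tMin;
    float    tMax;        // possibly tightened ray interval
    uint64_t entryNode;   // best entry point into the AS, 0 = none found
    float    entryT;
};

// First round: run the ray only against the explicit cache, with zero pointer
// chasing into the AS itself.
inline FirstRoundResult firstRound(const Ray& ray, const std::vector<CacheEntry>& cache) {
    FirstRoundResult res{ray.tMin, ray.tMax, 0, FLT_MAX};
    for (const CacheEntry& e : cache) {
        float tEnter, tExit;
        slab(ray, e.box, tEnter, tExit);
        if (tEnter > tExit) continue;                        // box not on the ray
        if (e.kind == EntryKind::Empty) {
            // Known-empty space starting at the current near plane: advance tMin.
            if (tEnter <= res.tMin + 1e-4f) res.tMin = std::max(res.tMin, tExit);
        } else if (tEnter < res.entryT) {
            // Previous nearest hit: remember the closest one as the entry hint.
            // (Re-confirming the cached hit itself would additionally clamp tMax.)
            res.entryT    = tEnter;
            res.entryNode = e.nodeId;
        }
    }
    // If res.tMin has reached res.tMax here, the ray is a guaranteed no-hit.
    return res;
}
```

Chaining several consecutive empty boxes works best if the cache entries are iterated in order of their entry distance along the ray.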
Only the coherent memory access on a non-cache-hit / any-hit is still blocked by that patent.
Last edited: