Ray Operation Scheduling
To provide high efficiency, the example non-limiting embodiment L0 cache 750 performs ray execution scheduling via the data path into the cache itself. In example non-limiting embodiments, the cache 750 bases this ray execution scheduling on the order in which it fulfills data requests. In particular, the cache 750 keeps track of which rays are waiting for the same data to be returned from the memory system and then, once it retrieves and stores the data in a cache line, satisfies the requests of all of those rays waiting for that same data at about the same time.
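For purposes of illustration only, the following minimal software-level sketch models this request-coalescing behavior under stated assumptions: it tracks which rays are waiting on the same cache line, issues a single memory fetch for that line, and releases all of the waiting rays together when the fill returns. The names used (CoalescingCache, onRayRequest, onFillReturn) and the data structures are hypothetical stand-ins chosen for the sketch; they are not the actual TTU 700 / L0 cache 750 hardware design or interface.

```cpp
// Illustrative only: a minimal software model of the request-coalescing idea
// described above. It tracks which rays are waiting on the same line and
// releases them all together when the fill returns. The class and method
// names are hypothetical, not the actual TTU/L0 hardware interface.
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>
#include <vector>

class CoalescingCache {
public:
    using LineAddr = std::uint64_t;
    using RayId    = std::uint32_t;

    // A ray asks for traversal data at 'addr'. Returns true on a hit.
    bool onRayRequest(RayId ray, LineAddr addr) {
        if (resident.count(addr)) return true;            // hit: ray proceeds immediately
        auto& waiters = pending[addr];
        if (waiters.empty()) fetchQueue.push_back(addr);  // one memory fetch per missing line
        waiters.push_back(ray);                           // remember the waiting ray
        return false;
    }

    // The memory system returns the line at 'addr': every ray that was waiting
    // on it is released in the same bunch, so those rays execute (and issue
    // their next requests) at about the same time.
    std::vector<RayId> onFillReturn(LineAddr addr) {
        resident.insert(addr);
        std::vector<RayId> bunch;
        auto it = pending.find(addr);
        if (it != pending.end()) { bunch = std::move(it->second); pending.erase(it); }
        return bunch;
    }

    std::vector<LineAddr> fetchQueue;                     // fetches issued to the memory system

private:
    std::unordered_set<LineAddr> resident;                // stand-in for resident cache lines
    std::unordered_map<LineAddr, std::vector<RayId>> pending;  // rays waiting per line
};

int main() {
    CoalescingCache l0;
    // Three independent rays happen to need the same data: only one fetch is issued.
    l0.onRayRequest(0, 0x1000); l0.onRayRequest(1, 0x1000); l0.onRayRequest(2, 0x1000);
    std::printf("fetches issued: %zu\n", l0.fetchQueue.size());    // prints 1
    // When the fill returns, all three rays are released together as one bunch.
    auto bunch = l0.onFillReturn(0x1000);
    std::printf("rays released together: %zu\n", bunch.size());    // prints 3
}
```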
The cache 750 thus imposes a time-coherency on the TTU 700's execution of any particular collection of currently-activated rays that happen to be waiting for the same data, by essentially forcing the TTU to execute on all of those rays at about the same time. Because all of the rays in the group execute at about the same time and each takes about the same time to execute, the cache 750 effectively bunches the rays into time-coherent execution by serving them at about the same time. These bunched rays go on to perform each iteration of a recursive traversal of the acceleration data structure in a time-coherent manner for as long as the rays continue to request the same data at each iteration.
By delivering to the bunched rays, at about the same time, the data they were waiting on, the cache 750 effectively schedules the TTU 700's next successive data requests for those rays to also occur at about the same time. This means the cache can satisfy those successive data requests at about the same time with the same new data retrieved from the memory system.
An advantage of the L0 cache 750 grouping rays in this way is that the resulting group of ray requests executing on the same data takes roughly the same traversal path through the hierarchical data structure and therefore will likely request the same data from the L0 cache 750 at about the same time for each of several successive iterations, even though no individual ray request is formally coordinated with any other ray request. Because the L0 cache 750 is scheduling via the data path to TTU blocks 710, 712, 740, it is effectively scheduling its own future requests to the memory system on behalf of the rays it has bunched together, minimizing latency while providing acceptable performance with a relatively small cache data RAM having a relatively small number of cache lines. This bunching also improves the locality of reference in the L1 cache and any downstream caches.
If rays in the bunch begin to diverge by requesting different traversal data, the cache 750 ceases serving them at the same time as the other rays in the bunch. Such divergence typically occurs as the bounding boxes in the BVH become smaller at lower levels of the hierarchy: differences in ray origin or direction that were minute and ignorable early in the traversal now cause previously-bunched rays to hit or miss those smaller bounding boxes differently.
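For purposes of illustration only, the following sketch shows why such divergence naturally appears lower in the hierarchy, assuming a conventional slab-based ray/box test (used here merely as a stand-in for whatever test the hardware actually performs): two rays whose directions differ only minutely both hit a large, high-level bounding box, but disagree on a much smaller, low-level box, so they stop requesting the same traversal data.

```cpp
// Illustrative only: why bunched rays diverge lower in the BVH. Two rays with a
// tiny difference in direction both hit a large bounding box but disagree on a
// much smaller child box. The slab test below is a standard ray/AABB
// intersection used as a stand-in; it is not TTU hardware code.
#include <algorithm>
#include <cstdio>
#include <utility>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 org, dir; };
struct AABB { Vec3 lo, hi; };

// Classic slab test: returns true if the ray hits the box for t >= 0.
bool hit(const Ray& r, const AABB& b) {
    float tmin = 0.0f, tmax = 1e30f;
    const float o[3]  = { r.org.x, r.org.y, r.org.z };
    const float d[3]  = { r.dir.x, r.dir.y, r.dir.z };
    const float lo[3] = { b.lo.x, b.lo.y, b.lo.z };
    const float hi[3] = { b.hi.x, b.hi.y, b.hi.z };
    for (int i = 0; i < 3; ++i) {
        float inv = 1.0f / d[i];
        float t0 = (lo[i] - o[i]) * inv, t1 = (hi[i] - o[i]) * inv;
        if (inv < 0.0f) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

int main() {
    Ray a{{0, 0, 0}, {0.00f, 0.0f, 1.0f}};
    Ray b{{0, 0, 0}, {0.02f, 0.0f, 1.0f}};            // minutely different direction
    AABB big  {{-10, -10, 10}, {10, 10, 20}};          // high-level node: both rays hit
    AABB small{{-0.1f, -0.1f, 18}, {0.1f, 0.1f, 19}};  // low-level node: the rays disagree
    std::printf("big:   a=%d b=%d\n", hit(a, big),   hit(b, big));    // 1 1
    std::printf("small: a=%d b=%d\n", hit(a, small), hit(b, small));  // 1 0
}
```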
Hierarchical Data Structure Traversal—How Ray Operations Are Activated
...
Additionally, by grouping the requests together so that many ray-complet tests against the same complet data are scheduled to be performed at more or less the same time, rays that are “coherent”—meaning that they are grouped together to perform their ray-complet tests against the same complet data—will remain grouped together for additional tests. By continuing to group these rays together as a bundle of coherent rays, the number of redundant memory accesses made to retrieve the same complet data over and over again is substantially reduced, and the TTU 700 therefore operates much more efficiently. In other words, the rays that are taking more or less the same traversal path through the acceleration data structure tend to be grouped together for purposes of executing the ray-complet test, not just for the current test execution but also for further successive test executions as this bundle of “coherent” rays continues its way down the traversal of the acceleration data structure. This substantially increases the efficiency of each request made out to memory by leveraging it across a number of different rays, and substantially decreases the power consumption of the hardware.
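For purposes of illustration only, the following sketch shows how a single fetch of complet data might be leveraged across every ray in a bunch. The Complet layout, the rayCompletTests function, and the ray/box test callback are hypothetical stand-ins, not the actual TTU 700 interface: the point is simply that the node data is read from memory once and then reused for every ray in the bunch, and that rays producing the same result tend to request the same child data again on the next iteration.

```cpp
// Illustrative only: one fetch of a node's data serves many ray tests.
// "Complet" is modeled here as a BVH node holding up to 8 child boxes;
// this layout and these names are assumptions for the sketch.
#include <array>
#include <bitset>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct AABB { float lo[3], hi[3]; };
struct Ray  { float org[3], dir[3]; };

// Hypothetical complet: one BVH node holding up to 8 child bounding boxes.
struct Complet { std::array<AABB, 8> child; int childCount = 0; };

using RayBoxTest = std::function<bool(const Ray&, const AABB&)>;  // e.g. a slab test

// The complet 'c' is fetched once and tested against every ray in 'bunch'.
// Each ray gets a bitmask of children it must visit next; rays with equal
// masks will request the same child data again and so tend to stay bunched.
std::vector<std::bitset<8>> rayCompletTests(const Complet& c,
                                            const std::vector<Ray>& bunch,
                                            const RayBoxTest& hit) {
    std::vector<std::bitset<8>> next(bunch.size());
    for (size_t i = 0; i < bunch.size(); ++i)
        for (int j = 0; j < c.childCount; ++j)
            if (hit(bunch[i], c.child[j])) next[i].set(j);
    return next;
}

int main() {
    Complet c; c.childCount = 2;
    c.child[0] = {{-1, -1, -1}, {1, 1, 1}};
    c.child[1] = {{ 5,  5,  5}, {6, 6, 6}};
    std::vector<Ray> bunch = {{{0, 0, -5}, {0, 0, 1}}, {{0.1f, 0, -5}, {0, 0, 1}}};
    // Trivial stand-in test for the demo: hit if the ray origin lies in the box's x slab.
    auto demoTest = [](const Ray& r, const AABB& b) {
        return r.org[0] >= b.lo[0] && r.org[0] <= b.hi[0];
    };
    auto next = rayCompletTests(c, bunch, demoTest);
    // Both rays produce the same mask here, so they remain bunched for the next level.
    for (const auto& mask : next) std::printf("child mask: %s\n", mask.to_string().c_str());
}
```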
...
Great advantages are obtained by the ability of the L0 caching structure 750 to group ray execution based on the data the grouped rays require to traverse the acceleration data structure. The SM 132 that presents rays to TTU 700 for complet testing, in the general case, may have no idea that those rays are “coherent.” The rays may exist adjacent to one another in three-dimensional space, but typically traverse the acceleration data structure entirely independently. Whether particular rays are thus coherent with one another depends not only on the spatial positions of the rays (which SM 132 can determine itself or through other utility operations, or even artificial intelligence), but also on the particular acceleration data structure the rays are traversing. Because the SM 132 requesting the ray-complet tests does not necessarily have direct access to the detailed acceleration data structure (nor would it have the time to analyze the acceleration data structure even if it did have access), the SM relies on TTU 700 to accelerate the data structure traversal for whatever rays the SM 132 presents to the TTU for testing. In the example non-limiting embodiment, the TTU 700's own L0 cache 750 provides an additional degree of intelligence that discovers, based on independently-traversing rays requesting the same complet data at about the same time, that those rays are coherently traversing the acceleration data structure. By initially grouping these coherent rays together so that they execute ray-complet tests at about the same time, these coherent rays are given the opportunity to be grouped together again for successive tests as they traverse the acceleration data structure. The TTU L0 cache 750 thus does not rely on any predetermined grouping of rays as coherent (although it does make use of the natural presentation order of rays by the requesting SM 132, based simply on spatial adjacency of the rays as presented by the SM for testing); instead, it observes, based on the data the rays require for testing as they traverse the acceleration data structure, that these rays are traversing the same parts of the acceleration data structure and can be grouped together for efficiency.