L1 and L2 cache latency can have an impact on ray tracing. The top-level nodes of your global space decomposition structure live in your caches, because they are hit all the time. The bottom levels almost always miss the caches and go to main memory (ray coherency notwithstanding).
On a CPU, a first-order approximation of cost is to treat traversing the cached top levels as free, but if the latency is as high as detailed above, that's certainly not the case for GCN.
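As a rough illustration of the access pattern I mean (a minimal sketch, assuming a simple array-of-nodes layout; BvhNode and traverse are made-up names, not any particular engine's format):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical node layout: internal nodes store child indices,
// leaves store a primitive range. Not any real engine's format.
struct BvhNode {
    float    bounds[6];    // AABB min/max
    uint32_t left, right;  // child indices (or a primitive range for leaves)
    bool     leaf;
};

// Depth-first traversal. The handful of nodes near the root are touched by
// nearly every ray, so they tend to stay resident in L1/L2. The deep interior
// nodes and the leaves are spread over a much larger footprint, so (absent
// ray coherence) each visit is likely a cache miss, and the full memory
// latency is paid before the next child index is even known.
void traverse(const std::vector<BvhNode>& nodes, uint32_t rootIndex) {
    uint32_t stack[64];
    int top = 0;
    stack[top++] = rootIndex;
    while (top > 0) {
        const BvhNode& node = nodes[stack[--top]]; // dependent load on the critical path
        if (node.leaf) {
            // intersect the primitives in [node.left, node.right) here
            continue;
        }
        // A real tracer would only push children whose AABBs the ray actually hits.
        stack[top++] = node.left;
        stack[top++] = node.right;
    }
}
```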
Cheers
There was a GDC 2018 optimization hot lap from AMD that gave ~114, ~190, ~350 cycles respectively for L1 hit, L2 hit, and L2 miss.
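Just to put those figures against the "cached top levels are free" approximation above, a back-of-the-envelope example (the depth split is invented purely for illustration):

```cpp
#include <cstdio>

int main() {
    // Approximate figures quoted from the GDC 2018 optimization hot lap.
    const int l1Hit = 114, l2Hit = 190, l2Miss = 350;

    // Invented example: a 20-level traversal where the top 4 levels hit L1,
    // the next 4 hit L2, and the remaining 12 go out to memory.
    const int cycles = 4 * l1Hit + 4 * l2Hit + 12 * l2Miss;
    std::printf("~%d cycles of serialized load latency per ray\n", cycles);

    // Even the "cheap" cached top levels contribute hundreds of cycles,
    // so treating them as free, as one might on a CPU, doesn't hold on GCN.
    return 0;
}
```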
AMD indicated elsewhere a roughly 10% latency improvement with Navi, although I didn't see it give specific values, as opposed to an overall improvement from the additional cache capacity throughout. The most recent Navi architecture slides indicate there's a lower-latency path for loads that bypass the sampling hardware, but give no concrete figures.
Perhaps a BVH block would see latency closer to the direct load path, whatever that value is. If it's not at least an order of magnitude better, that might explain why AMD's method has the BVH hardware defer to the SIMDs after each node evaluation. The CU's register file and LDS might be necessary to buffer sufficient context for a traversal method with such a long-latency critical loop. Perhaps it's counting on the CU's larger context storage to keep more rays in flight, or on its register file and LDS having more reasonable latency for frequently hit node data.
Reverse engineering of Nvidia's L1 in Turing shows it's on the order of 32 cycles, which, while vastly better than GCN, is sloth-like compared to CPUs. Less clear is which cache level Nvidia's RT cores interface with. It seems probable they're closely linked to the SM's memory pipeline, but some of Nvidia's patents might have them hooked up outside the L1 and reliant on the L2. The L2 is about as slow as GCN's, which might be problematic if that's how it's implemented. However, there were indications that the RT hardware has its own storage at presumably reasonable latency, and some hints of storage management done by the RT core for memory not clearly associated with the L1 or L2.
links:
GCN memory:
https://gpuopen.com/gdc-2018-presentation-links/, optimization hot lap
Turing's memory:
https://arxiv.org/pdf/1903.07486.pdf
edit: Navi reference
https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf
What are the possibilities of having a hybrid-access memory area that both the GPU and CPU can feed on "simultaneously" or sequentially, without the read and write penalty, for some kind of hybrid RT?
I'm not an expert on this, but some of the problems I've read about with a hybrid RT solution were exactly that. On a shared-memory system this could be a much simpler solution if both had direct access.
I'm unclear on the use of the term hybrid ray tracing. That's often used to describe a rendering engine that combines rasterization and ray tracing, but that's usually still on the GPU.
When you bring up a read/write penalty in the context of CPU and GPU cooperation, it sounds like you might mean the heavier synchronization barriers between them. Those generally exist because the GPU's memory model is much weaker than the CPU's, and the GPU's overall execution loop is vastly longer in latency and unpredictable in length.
How closely you think the CPU and GPU would be cooperating might make a difference. Directly reading and writing the same cache lines would be either brutally slow or error-prone. Trading intermediate buffers between certain synchronization points seems possible, but I'm not sure how much of a change that would be from some of the mechanisms already available.
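A purely illustrative sketch of that buffer-trading idea, with a std::thread standing in for the GPU and a std::promise playing the role of a fence (none of this is a real GPU API; it's just the shape of the handoff):

```cpp
#include <cstdio>
#include <future>
#include <thread>
#include <vector>

// Stand-in for ray/hit buffers in memory both sides can see.
struct SharedBuffers {
    std::vector<float> rays; // written by the "CPU" before the handoff
    std::vector<float> hits; // written by the "GPU" before the fence signals
};

int main() {
    SharedBuffers buffers;
    buffers.rays.assign(1024, 1.0f);          // CPU fills the ray queries
    buffers.hits.resize(buffers.rays.size());

    std::promise<void> fence;                 // plays the role of a GPU fence
    std::future<void> fenceSignaled = fence.get_future();

    // The "GPU": owns the buffers between submission and the fence signal.
    std::thread gpu([&buffers, &fence] {
        for (size_t i = 0; i < buffers.rays.size(); ++i)
            buffers.hits[i] = buffers.rays[i] * 0.5f; // pretend traversal result
        fence.set_value();                    // results are now safe to read
    });

    // The CPU does other work here rather than touching the shared buffers;
    // poking at the same cache lines mid-flight is the slow/error-prone case.

    fenceSignaled.wait();                     // the synchronization point
    std::printf("first hit value: %f\n", buffers.hits[0]);
    gpu.join();
    return 0;
}
```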
A more integrated approach likely means the GPU's memory and cache pipeline would be very different from what has been described so far, and I'm not sure disrupting a critical area like CPU memory handling is worth the risk.
We know that Navi has been in the works for a long time, but...
Well, how much will "Navi 1.2" change, really? Beyond just some tweaks?
We're throwing around version numbers like they mean something. We don't have a good way to define how much a given change should increment a counter, or whether AMD would care if we did. AMD resisted applying a number to GCN generations for quite some time, with sites like Anandtech going with terms like GCN 1.2 for the generation after Sea Islands to try to describe changes in an architecture AMD treated as an amorphous blob. Eventually AMD relented and labelled things GCN1 (Southern Islands), GCN2 (Sea Islands and the consoles), GCN3 ("GCN 1.2", Fiji/Tonga/Polaris), etc., then reverted to calling the next version the Vega ISA.
In that context, the consoles were already modifications of a GCN 1.x baseline, one which AMD decided to give a whole-number increment above the original hardware.
As for what is considered significant enough, hardware outside the CU array has often been updated relatively flexibly. The mid-gen consoles took on the delta color compression found in the Polaris and Vega products, and instructions for packed FP16 math from Vega showed up in the PS4 Pro. However, I think there's evidence that significant architectural changes like scalar memory writes from GCN3 did not show up, so outside of the additional hardware they were very close to the GCN2 baseline.
The transition from GCN to GCN2 may be a comparison point for the move from RDNA1 to RDNA+?. One significant change to the ISA was the addition of a new instruction group for flat addressing, alongside a modest number of new instructions being added and some being deprecated. Whether a new instruction or instructions for BVH node evaluation rises to the level of a whole addressing mode may be up to the observer.
How much time does that give devs with the new hardware before launch? Are you suggesting pushing back the launch a year to 2021, or is RDNA2 a zero-effort advance over RDNA, requiring no changes to existing code?
I may have missed confirmation of the details on the earliest dev kits for the current gen. I remember the rumor was PCs using GPUs like Tahiti.
The earliest Sea Islands GPU to be released was Bonaire in spring 2013, and Hawaii launched a little before the consoles did.
Early silicon for those GPUs would seem to be the absolute earliest point at which developers could have done anything with non-console hardware carrying Sea Islands features.