TechPowerUp did have the slides posted here; they cover some of the memory latency aspects:
https://www.techpowerup.com/review/amd-radeon-rx-6800-xt/2.html
I'm interested in seeing the endnotes for some of the slides, like the memory latency one. They might give some of the base values that go into the percentages. I'm not sure whether the Infinity Cache's latency improvement is a percentage of the total memory latency (L0, L1, L2, memory combined) or whether it's relative to the latency of the DRAM access alone.
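To show why the distinction matters, here's a toy calculation. Every number in it is a placeholder I made up, not a figure from AMD's slide:

```cpp
#include <cstdio>

int main() {
    // Placeholder latencies in ns -- illustrative only, NOT AMD's figures.
    double l0 = 20, l1 = 40, l2 = 80, dram = 250;
    double total = l0 + l1 + l2 + dram;  // full path for a request missing to DRAM
    double cut = 0.34;                   // example reduction; substitute whatever the slide quotes

    // Reading 1: the percentage applies to the whole memory path.
    printf("vs. total path: %.0f ns -> %.0f ns\n", total, total * (1 - cut));

    // Reading 2: the percentage applies only to the DRAM access itself.
    printf("vs. DRAM only:  %.0f ns -> %.0f ns\n",
           total, l0 + l1 + l2 + dram * (1 - cut));
    return 0;
}
```

The two readings give noticeably different effective latencies, which is why the endnotes' base values would be useful.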
On this point, I'm thinking that the retention or discarding of BVH structures by the cache may be a driver and application matter rather than being hardwired. That would also explain part of the need for specific optimization for AMD's ray tracing implementation.
Driver commits indicate it can happen at page granularity, and there are also flags for specific functionality types. It's not clear that BVH data fits into those, unless it hides under the umbrella of some of the metadata related to DCC or HiZ.
Some of those would seem to be better kept in-cache, since DCC in particular can suffer from thrashing of its metadata cache, injecting a level of latency sensitivity that normal accesses wouldn't have.
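To make the idea concrete, here's a purely hypothetical sketch of what per-range retention hints could look like. None of these names come from the actual driver commits; they're invented for this post:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical illustration only -- these enumerators and the function are
// invented for this post, not taken from the amdgpu driver commits.
enum CacheRetainHint : uint32_t {
    RETAIN_NONE     = 0,
    RETAIN_DCC_META = 1u << 0,  // keep DCC metadata resident to avoid thrashing
    RETAIN_HIZ_META = 1u << 1,  // HiZ/depth metadata
    RETAIN_BVH      = 1u << 2,  // speculative: would BVH get its own class?
};

// Stub: tag a page-aligned range as preferring cache retention for a given
// functionality type (page granularity, per the driver commits).
void set_retention_hint(uint64_t base_page, uint64_t num_pages, uint32_t hints) {
    printf("pages [%llu, %llu): hints 0x%x\n",
           (unsigned long long)base_page,
           (unsigned long long)(base_page + num_pages), hints);
}

int main() {
    // A driver might pin a surface's DCC metadata while leaving the BVH
    // to normal replacement -- exactly the policy question raised above.
    set_retention_hint(0x1000, 16, RETAIN_DCC_META);
    return 0;
}
```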
The vanilla 6800 has one fewer SE, and so one fewer rasterizer. But it has higher clocks, and it's not known whether the pre-cull numbers are the same.
Is there a source for this, or tests that can tell the difference between an entire SE being deactivated versus an equivalent number of shader arrays being disabled across the chip?
I thought the general consensus was that AMD disabled one entire SE.
AMD's Sienna Cichlid code introduced a function to track the disabling of formerly per-SE resources, like ROPs, at shader-array granularity. This might lead to similar outcomes.
When you bring evidence to back up your assumptions, I guess you'll have an argument.
We do have some basis for comparison in AMD's patent for BVH acceleration versus Nvidia's. There are some potential points of interest, such as the round trip that node traversal must make from the RT block back to the SIMD, and the implicit granularity of execution being SIMD-width.
There are some code commits that give instruction formats for BVH operations that look to be in line with the patent.
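A minimal software sketch of the loop structure the patent describes, assuming a shader-managed stack and a fixed-function node test. The node layout and all names here are mine, not AMD's:

```cpp
#include <algorithm>
#include <vector>

struct Ray  { float org[3], inv_dir[3], t_max; };
struct Aabb { float lo[3], hi[3]; };

// Simplified BVH4-style node: up to 4 children, each with a bounding box.
struct Node {
    Aabb box[4];
    int  child[4];   // child node index, or -1 if unused
    bool leaf;       // leaf nodes would hold triangles instead
};

// Slab test -- stands in for the fixed-function box intersection the RT
// block performs (exposed as a BVH-intersect instruction in the commits).
static bool hit_box(const Ray& r, const Aabb& b) {
    float t0 = 0.0f, t1 = r.t_max;
    for (int a = 0; a < 3; ++a) {
        float n = (b.lo[a] - r.org[a]) * r.inv_dir[a];
        float f = (b.hi[a] - r.org[a]) * r.inv_dir[a];
        if (n > f) std::swap(n, f);
        t0 = std::max(t0, n);
        t1 = std::min(t1, f);
    }
    return t0 <= t1;
}

// The loop itself runs on the SIMD: every node test is a round trip to the
// RT block, and the traversal stack stays in shader-managed memory. Each
// lane walks its own ray, so divergent rays pay SIMD-width execution costs.
float trace(const Ray& ray, const std::vector<Node>& bvh, int root) {
    std::vector<int> stack{root};             // shader-side stack
    float closest = ray.t_max;
    while (!stack.empty()) {
        const Node& n = bvh[stack.back()];
        stack.pop_back();
        for (int i = 0; i < 4; ++i) {         // "RT block" tests the children
            if (n.child[i] < 0 || !hit_box(ray, n.box[i])) continue;
            if (bvh[n.child[i]].leaf) { /* intersect triangles, shrink closest */ }
            else stack.push_back(n.child[i]); // shader decides what to visit next
        }
    }
    return closest;
}
```

The point of the sketch is the control flow: unlike a traversal engine that loops internally, every iteration returns to the SIMD, which is one of the potential points of interest noted above.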
Or there is a problem in an RBE, or in the scheduling hardware of that SE.
RBEs are something that can be disabled at a different granularity than SEs, though.
I asked a few times earlier in the thread, but I didn't get clarification. Recall that there are 4 Packers per Scan Converter, so Navi21 has 32 Packers, which works out to 8 Packers per Raster Unit (each Raster Unit pairing 2 Scan Converters). Up to 4 Packers dispatch to each Shader Array with optimised fragments, arranged as 1x2, 2x1 or 2x2 fragment groups as discussed below for VRS (my speculation). The efficiency gains come from these packed fragments.
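A back-of-envelope illustration of where that gain would come from, following my speculation above. With VRS coarse rates, one shader invocation covers a 1x2, 2x1 or 2x2 fragment group, so a packer emitting packed groups fills wavefront lanes with fewer invocations (the tile size is just an example):

```cpp
#include <cstdio>

int main() {
    const int tile_w = 32, tile_h = 32;              // pixels in a screen tile
    const int rates[][2] = {{1,1}, {1,2}, {2,1}, {2,2}};
    for (auto& r : rates) {
        // One invocation shades an r[0] x r[1] fragment group.
        int invocations = (tile_w / r[0]) * (tile_h / r[1]);
        printf("%dx%d rate: %4d invocations (%.1f wave32 waves)\n",
               r[0], r[1], invocations, invocations / 32.0);
    }
    return 0;
}
```

Going from 1x1 to 2x2 groups quarters the invocation count for the same pixel coverage, which is the kind of efficiency gain I'm attributing to the Packers.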
The packers I am thinking of are involved in primitive-order processing, which pertains to rasterizer ordered views rather than to how primitives are translated into wavefronts.
TSMC should probably consider researching the various EDRAM technologies to resolve the memory scaling issues, particularly for large cache arrays.
IBM and Intel already employ different integration methods, though these are very tightly tied to their particular manufacturing processes.
Perhaps as scaling falters, the pressure will resume to go back to EDRAM despite the cost and complexity penalties.
Neither IBM nor Intel has that technique available at smaller nodes. IBM's next Power chip dropped the capability since IBM sold off its fab to GlobalFoundries, which then gave up scaling to smaller nodes, and Power had been the standout for having EDRAM.