The point of my comparison: "work smarter, not harder".
AMD appears to be using primitive shaders "to work smarter" in RDNA2, so we already have proof, and that's before developers write mesh shaders.
Primitive shaders at least originally still slot into the traditional pipeline, which mesh shaders dispense with.
Some of the bottlenecks mesh shaders discard remain, although it's possible the description of what primitive shaders do has changed since AMD last discussed them in any depth with Vega. RDNA's primitive shaders have peak figures that are significantly more modest than what Vega promised and failed to deliver.
The primary culling benefit primitive shaders provide is something Nvidia mentioned off-hand as a possibility for mesh shaders, if developers felt the need. Nvidia cited one specific case where they seemed to allow for a benefit, but it sounded like Nvidia's own front-end hardware was capable enough that the additional step wasn't necessary.
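To make "the primary culling benefit" concrete, here's a rough sketch of the kind of per-triangle frustum/backface rejection a primitive or mesh shader can do before the fixed-function rasterizer ever sees the geometry. Python just to show the math; the function name, tuple layout, and winding convention are mine, not any vendor's actual API.

```python
def should_cull(v0, v1, v2):
    """Return True if the clip-space triangle (x, y, z, w) can be discarded
    before rasterization. Illustrative only, not any vendor's pipeline."""
    verts = (v0, v1, v2)
    # Frustum cull: all three vertices outside the same clip plane.
    for axis in range(3):  # x, y, z
        if all(v[axis] > v[3] for v in verts):   # beyond the +w plane
            return True
        if all(v[axis] < -v[3] for v in verts):  # beyond the -w plane
            return True
    # Backface cull via the signed area of the projected triangle.
    x0, y0 = v0[0] / v0[3], v0[1] / v0[3]
    x1, y1 = v1[0] / v1[3], v1[1] / v1[3]
    x2, y2 = v2[0] / v2[3], v2[1] / v2[3]
    area = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    # Assuming counter-clockwise front faces in a y-up projection;
    # real pipelines let you pick the winding.
    return area <= 0.0
```

The payoff is that triangles rejected here never consume fixed-function primitive-setup throughput, which is exactly where the traditional front end bottlenecks.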
Yes, it appears Nvidia is spending way more die area on ray acceleration... and all the comparisons so far are based on code optimised for that hardware.
Without getting a look at the silicon, the assumption is that Nvidia is adding more than the scheme in AMD's patent, which indicates minimal extra area. In absolute terms, it may be a comparison between different single-digit percentages of overall die area.
There's no doubt that you don't do a dumb port of a MIMD algorithm to SIMD hardware; we've got over 10 years of proof of that.
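A toy cost model of why the dumb port hurts: under SIMD execution masking, a wave runs until its slowest lane finishes, while independent MIMD cores each pay only for their own thread. The numbers and function names below are purely illustrative, not modelled on any real ISA.

```python
def simd_wave_cost(lane_iters):
    """Lanes are masked off as they finish, but the wave occupies the
    SIMD until the slowest lane is done: cost is the max iteration count."""
    return max(lane_iters)

def mimd_avg_cost(lane_iters):
    """Independent cores each pay only their own iteration count:
    cost per thread is the average."""
    return sum(lane_iters) / len(lane_iters)

# 32 'rays' with wildly different traversal depths:
iters = [1] * 31 + [100]
# simd_wave_cost(iters) -> 100; mimd_avg_cost(iters) -> ~4.1
```

One long-running lane inflates the whole wave's cost by ~25x in this example, which is why divergent algorithms get restructured (sorting, binning, persistent threads) rather than ported as-is.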
There's a long-standing synergy between rasterization, SIMD, caches, and DRAM. A lot of the sizing of the various elements, like screen tiles and their associated SIMD processors, is driven by the fact that DRAM buses and DRAM arrays work very well at those granularities. There may need to be a more thorough accounting of what can be changed. Rasterization is nice in that there's now an established set of techniques for rapidly building its acceleration structures on the fly, and the common case aligns with the hardware and memory architectures. Cache concepts and DRAM have changed even more slowly than the fixed-function pipeline.
A BVH isn't built at the same time as the geometry is being rasterized, and a lot of the research and complexity goes into trying to fit a divergent workload into the confines of an overall architecture that is not well-suited to it, all the way out to DRAM.
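A minimal traversal sketch shows where the divergence comes from: each ray pops a data-dependent sequence of nodes off its own stack, so neighbouring rays in a wave chase different pointers through memory, which is exactly the access pattern caches and DRAM dislike. This uses 1-D "AABBs" and a point query purely for brevity; the node layout is invented for illustration.

```python
# Node: (lo, hi, left_child, right_child, leaf_id). Internal nodes
# have leaf_id None; leaves have no children.
NODES = {
    0: (0.0, 8.0, 1, 2, None),
    1: (0.0, 4.0, None, None, 'A'),
    2: (4.0, 8.0, 3, 4, None),
    3: (4.0, 6.0, None, None, 'B'),
    4: (6.0, 8.0, None, None, 'C'),
}

def traverse(point):
    """Stack-based BVH walk for a point query.
    Returns (leaves hit, nodes visited) -- the visit count is the
    data-dependent part that diverges between rays."""
    hits, visited, stack = [], 0, [0]
    while stack:
        lo, hi, left, right, leaf = NODES[stack.pop()]
        visited += 1
        if not (lo <= point <= hi):
            continue  # missed this box: prune the whole subtree
        if leaf is not None:
            hits.append(leaf)
        else:
            stack.extend((left, right))
    return hits, visited
```

Two queries landing in different parts of the tree visit different node counts in a different order, so a wave of them shares neither control flow nor cache lines.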
A custom BVH traversal algorithm that can sample from a texture might be useful. Too bad the sampling would compete directly with the BVH program running through the same cache, at least in AMD's case.