There are a number of different efficiency metrics.
Let's never forget that in the desktop market, by far the most significant is performance/$. AMD is pretty much competitive here, and differences between the manufacturers will be small unless one player in the duopoly makes a major push for market share.
The desktop market is in the doldrums these days with respect to performance for the price paid. The tiers below enthusiast and high-end tend to be most cost-sensitive, while there is more opportunity to extract revenue beyond the performance improvement in the early-adopter and bleeding-edge market.
Being competitive in a range below the leading edge means a product can't distance itself much from existing inventory or already-satisfied demand: it arrives after many buyers have already bought marginally lower-performing chips, or now have the opportunity to buy them at a discount.
AMD's not unique in this, but I think the timing and cost structure give it a less-forgiving position.
Nvidia's numbers seem to be dropping as well, which may be in part due to this--if that can be teased out from the pricing effect. Perhaps adding something new like RT was part of a ploy to make the newer generation differentiate itself more within the manufacturing and power constraints at hand.
It looks like it does indirectly, through occupancy considerations. My interpretation is that it's constrained to integer splits of the 256-VGPR budget (4x64):
256/1 = 256 (4x64, 1 wave)
256/2 = 128 (4x32, 2 waves)
256/3 = 84 (4x21, 3 waves, rounded down)
256/4 = 64 (4x16, 4 waves)
256/5 = 48 (4x12, 5 waves, rounded down)
etc.
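The binning above can be sketched in a few lines of Python. The allocation granularity of 4 registers is an assumption here (AMD documents 4 or 8 depending on the chip):

```python
# Sketch of the occupancy binning above: the 256-VGPR budget is divided
# by the wave count, then rounded down to the allocation granularity.
GRANULARITY = 4  # assumption; AMD documents 4 or 8 depending on the chip

def vgpr_budget(waves: int, total: int = 256) -> int:
    """Per-wave VGPR budget for a given wave count on one SIMD."""
    return (total // waves) // GRANULARITY * GRANULARITY

for waves in range(1, 6):
    print(f"256/{waves} = {vgpr_budget(waves)}")
```

Running it reproduces the bins above (256, 128, 84, 64, 48), including the rounded-down 3- and 5-wave cases.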
As you can see, there is no divisor between 1 and 2 that would allow 4x48 VGPRs as the result. The compiler then decides to maximize register use within the occupancy bin.
That would seem to be the choice made by the compiler, but that doesn't point to the hardware needing this.
A SIMD can host up to 10 wavefronts, which requires an average allocation of 24 or fewer registers per wavefront. The granularity at a minimum is 24 registers as far as what the hardware must be able to do, and AMD's documentation gives the actual granularity as 4 or 8.
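As a sketch of that constraint (the 10-wave cap and 256-entry register file are from the discussion above; rounding allocations up to a granularity of 4 is an assumption):

```python
# Hypothetical inverse view: how many wavefronts fit on one SIMD given a
# shader's per-wave VGPR usage, with GCN's cap of 10 waves per SIMD.
MAX_WAVES = 10
TOTAL_VGPRS = 256

def waves_per_simd(vgprs_used: int, granularity: int = 4) -> int:
    # Allocation is assumed to round up to the granularity before
    # dividing the register file among waves.
    alloc = -(-vgprs_used // granularity) * granularity
    return min(MAX_WAVES, TOTAL_VGPRS // alloc)

print(waves_per_simd(24))  # 10 -- 24 registers still reaches the cap
print(waves_per_simd(25))  # 9  -- one more register drops a wave
print(waves_per_simd(48))  # 5
```

This is just the "24 or fewer registers for full occupancy" point in code form: 25 registers would be allocated as 28, and 256/28 only fits 9 waves.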
I'm not following what you mean by 4x64 when discussing the register budget for a wavefront. A single wavefront can address up to 256 registers, and, to match, each SIMD has that many of its own.
We can make some assumptions from the AMD RT patent, and from what we can guess about NV:
I ran across a link, perhaps on another board or reddit to something from Nvidia, which might be a better starting point than using AMD's decisions to speculate on Nvidia.
http://www.freepatentsonline.com/y2016/0070820.html or perhaps
http://www.freepatentsonline.com/9582607.html
There's a stack-based traversal block containing logic that evaluates nodes and decides on traversal direction, like AMD's, but there's also additional logic that performs the looping that AMD's method passes back to the SIMD hardware.
There may also be some memory compression of the BVH handled by this path.
Which means the shader core likely becomes available to other pending tasks after this command (like hit-point shading, async compute, ...).
Also, we have no indication that NV's RT cores would use the TMUs or share the cache to access the BVH.
From the above, it seems like the SM's local cache hierarchy would be separate from the L0 in the traversal block.
The conclusion is that NV's RT is likely faster but takes more chip area. AMD likely again offers more general compute performance, which could compensate for this.
Possibly also more power efficiency for Nvidia. The AMD method has to re-expand its work at every node back to the width of a wavefront, and involves at a minimum several accesses to the full-width register file.
After that it would make sense to decrease ROPs and increase RT cores, up to the point where rasterization is implemented only with compute. (Texture filtering remains, ofc.)
At least for now, no clear replacements for the order guarantees or optimizations like the Z-buffer and other hardware present themselves. Nvidia is counting on the areas where rasterization is very efficient remaining very efficient, lest they lose the spare room in the power/compute budget that RT is being inserted into.
Turing caught up to GCN somewhat in terms of async compute, at least in graphics workloads. That said, Navi has its own improvements to graphics workloads with single-cycle SIMD32.
At least from a feature perspective and dynamic allocation, I think Pascal might have had similar checkboxes to early GCN. There are some features that AMD touts for later generations, though how many are broadly applicable or noticeable hasn't been clearly tested. The latency-oriented ones seem to be focused on VR or audio, although I'm not sure recent Nvidia chips have garnered many complaints for the former and I'm not sure many care for the latter.