AFAIK (and I really need to double check this), NVIDIA’s L1 cache still returns all misses in order. That is, if all of a warp’s load requests hit in the L1, that warp can be rescheduled ahead of an older warp whose request missed - but if even a single thread misses in the L1, the warp might have to wait until another warp’s data comes back from the far DRAM controller, even though its own request hit in the L2 and it could otherwise be rescheduled much sooner.
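To make that concrete, here's a toy Python model of the behaviour as I understand it. It's only a sketch: the latencies are made-up round numbers (not measured on any NVIDIA part), `ready_times` is a hypothetical helper, and the "misses return in order" rule is exactly the assumption I said I need to double check.

```python
# Toy model: at which cycle can each warp be rescheduled after its load?
# All latencies are illustrative assumptions, not real hardware numbers.
L1_HIT, L2_HIT, DRAM = 30, 200, 600

def ready_times(requests, in_order_misses):
    """requests: list of (warp_name, issue_cycle, latency) in issue order.
    Returns a dict mapping warp_name to the cycle its data is available."""
    ready = {}
    last_miss_return = 0  # return cycle of the most recent older miss
    for name, issue, latency in requests:
        done = issue + latency
        if latency > L1_HIT and in_order_misses:
            # Assumed rule: a miss cannot return before any older miss.
            done = max(done, last_miss_return)
            last_miss_return = done
        ready[name] = done
    return ready

requests = [
    ("warp0", 0, DRAM),    # oldest warp: misses L1 and L2, goes to DRAM
    ("warp1", 1, L2_HIT),  # misses L1 but hits L2
    ("warp2", 2, L1_HIT),  # hits L1
]

in_order = ready_times(requests, in_order_misses=True)
out_of_order = ready_times(requests, in_order_misses=False)
print("in-order misses:    ", in_order)
print("out-of-order misses:", out_of_order)
```

With these numbers, warp2 (the L1 hit) is ready at cycle 32 in both cases, but warp1 (the L2 hit) is stuck until cycle 600 behind warp0's DRAM access in the in-order case, versus cycle 201 if misses could return out of order.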
A lot of these patents therefore make little sense in the context of the current architecture, and the bigger question imo is how NVIDIA’s cache hierarchy will evolve in Blackwell.
The main benefit of that patent seems to be earlier submission of memory ops from the SM’s perspective, which is a pretty nice benefit even if the memory subsystem itself isn’t any faster.
The hardware cost for all of the scoreboarding required is likely not worth it though. I think we talked about this same paper a while back.