Thought experiment, assuming that the basic premise of the tweets was correct:
The likely places Series X could fall behind are: SSD throughput/latency, GPU bandwidth, and the GPU front-end (clock speed).
I can't think of anything else that would be really obvious.
I'm just speaking hypothetically, but thinking about GPU bandwidth, the only thing I can come up with is that the split of the memory pools is not quite right. Series X has 10GB at the full GPU-optimized bandwidth and 6GB at a slower rate, so maybe games need a little more than that 10GB for fast GPU accesses.
The rest of my thoughts really come down to bandwidth mitigation, the memory model, and how things are streamed from the NVMe drive. They are using some kind of virtual memory setup where the NVMe drive is addressable through a new API called DirectStorage. Maybe DirectStorage is a solution to CPU overhead and latency, but it comes with complexity. For example, on PS5 maybe there's just one way to access the drive, through the filesystem: a basic open, read, write, close API, asynchronous or synchronous, like any other. Maybe DirectStorage has a different programming model, so accessing data from the CPU and from the GPU are different paths. There could be unwanted complexity there. On top of the raw bandwidth disadvantage, maybe the programming model is just harder to use.
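To make the contrast concrete, here's a minimal C++ sketch of the two programming models I'm imagining. The `StorageQueue` interface is entirely made up for this post, not the actual DirectStorage API; it just illustrates the batched request-queue style versus a plain read call.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Model A: the "basic filesystem" style I'm imagining for PS5.
// One call, one destination buffer, same path for everything.
bool SimpleRead(const char* path, uint64_t offset,
                void* dst, size_t size) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::fseek(f, static_cast<long>(offset), SEEK_SET);
    size_t got = std::fread(dst, 1, size, f);
    std::fclose(f);
    return got == size;
}

// Model B: a hypothetical batched request queue, in the spirit of
// what a DirectStorage-like API might look like. NOT the real API.
enum class Destination { CpuMemory, GpuMemory };

struct ReadRequest {
    const char* path;
    uint64_t    offset;
    size_t      size;
    void*       dst;
    Destination where;  // CPU and GPU destinations are distinct paths
};

class StorageQueue {
public:
    void Enqueue(const ReadRequest& r) { pending_.push_back(r); }

    // Submit the whole batch; completion would be signaled via a
    // fence/event in a real implementation. Here we just drain it.
    void Submit() {
        for (const ReadRequest& r : pending_) {
            // A GPU destination would go through a DMA/upload path;
            // this sketch only handles the CPU case.
            if (r.where == Destination::CpuMemory)
                SimpleRead(r.path, r.offset, r.dst, r.size);
        }
        pending_.clear();
    }

private:
    std::vector<ReadRequest> pending_;
};
```

The point is just that model B asks the developer to think in batches, destinations, and completion signals, which is more to learn and more to restructure an engine around, even if it amortizes per-request CPU cost better.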
As for GPU bandwidth, the Series X really seems to be built around efficiency of accesses rather than raw bandwidth. For example, Sampler Feedback's intention is to access only the parts of textures that are needed, versus loading the whole texture into memory for sampling. The Sampler Feedback API lets you record which parts of a texture were actually sampled, and then load only those parts. The thing is, it seems to fit best into a particular model of virtual textures with a tile cache. You have a bit of added complexity in terms of learning a new API, but you're also somewhat forced to adopt a particular memory management model for textures. Maybe that's not compatible with how some existing code bases are already set up; I don't know how CryEngine works right now.

So, as a thought: if the engine is streaming large textures from the NVMe drive into RAM without that model, you're now exceeding the 10GB because you're loading entire textures instead of just the necessary parts, and you're hitting the limit of the SSD bandwidth because you're not reading selectively.
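Here's a toy CPU-side sketch of the tile-cache idea, just to show where the memory and bandwidth savings come from. The feedback map is a plain array of bools standing in for what the GPU would write via sampler feedback; none of these names are the actual D3D12 API, and the tile/texture sizes are made-up round numbers.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr size_t kTileBytes   = 64 * 1024;  // one 64KB texture tile
constexpr size_t kTilesPerTex = 1024;       // a 64MB texture

// Stand-in for the GPU-written feedback map: one flag per tile,
// set if any sample touched that tile last frame.
using FeedbackMap = std::vector<bool>;

struct TileCache {
    std::unordered_map<uint32_t, std::vector<uint8_t>> resident;
    size_t bytesStreamed = 0;

    // Hypothetical streaming hook; a real engine would read the
    // tile from disk here instead of allocating zeros.
    void LoadTile(uint32_t tileIndex) {
        if (resident.count(tileIndex)) return;  // already cached
        resident[tileIndex] = std::vector<uint8_t>(kTileBytes);
        bytesStreamed += kTileBytes;
    }
};

// Naive path: stream the whole texture regardless of usage.
size_t StreamWholeTexture(TileCache& cache) {
    for (uint32_t t = 0; t < kTilesPerTex; ++t) cache.LoadTile(t);
    return cache.bytesStreamed;
}

// Feedback-driven path: stream only the tiles that were sampled.
size_t StreamSampledTiles(TileCache& cache, const FeedbackMap& fb) {
    for (uint32_t t = 0; t < kTilesPerTex; ++t)
        if (fb[t]) cache.LoadTile(t);
    return cache.bytesStreamed;
}
```

If only, say, 5% of a texture's tiles are actually sampled in a frame, the feedback-driven path moves roughly 3MB instead of 64MB, and that kind of multiplier is exactly how a design built on access efficiency can keep up with a higher raw bandwidth number.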
As for the front-end, I think it's somewhat the same situation. Mesh Shaders are a total rewrite of the render pipeline before rasterization: a fully threaded, compute-driven approach. It will take time for developers to learn and optimize for mesh shaders. If you haven't done that yet, you're left with the existing pipeline, which has bottlenecks that will favour higher clock speeds.
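To illustrate what "compute-driven" means here, a rough sketch of the meshlet model that mesh shaders are built around. In reality this work runs on the GPU as amplification/mesh shaders written in HLSL; the C++ below just mimics the per-cluster culling loop on the CPU, and every name in it is invented for the example.

```cpp
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// A meshlet: a small cluster of triangles with its own bounding
// sphere, so it can be culled as a unit before rasterization.
struct Meshlet {
    Vec3     boundsCenter;
    float    boundsRadius;
    uint32_t firstTriangle;
    uint32_t triangleCount;
};

struct Plane { Vec3 n; float d; };  // inside if n.p + d >= 0

static float Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Old pipeline, roughly: every triangle hits the fixed-function
// front end, which processes primitives at a rate tied to clocks.
// Meshlet pipeline, roughly: a threaded pass throws away whole
// clusters first, so the fixed-function stages see far less work.
std::vector<uint32_t> CullMeshlets(const std::vector<Meshlet>& meshlets,
                                   const Plane frustum[6]) {
    std::vector<uint32_t> visible;
    for (uint32_t i = 0; i < meshlets.size(); ++i) {
        const Meshlet& m = meshlets[i];
        bool inside = true;
        for (int p = 0; p < 6; ++p) {
            if (Dot(frustum[p].n, m.boundsCenter) + frustum[p].d
                    < -m.boundsRadius) {
                inside = false;  // whole cluster outside this plane
                break;
            }
        }
        if (inside) visible.push_back(i);  // only these get rasterized
    }
    return visible;
}
```

Until an engine restructures its geometry into clusters like this, it keeps feeding the old front end, and that's exactly where a higher-clocked front end pulls ahead.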
This is all speculation on my part.