Phil Spencer just strongly believes in the image as the developer intended and native resolutions.
In all seriousness I don't feel it's as simple as it's being made out to be.
In hindsight (thanks to Nvidia I guess) these ML scaling techniques are now (well, kind of) showing their worth as marketable differentiators, but that sentiment developed more so over time. The opening of this post certainly wasn't an uncommon reaction to DLSS, and the sentiment is still prevalent for reconstruction in general. Also with hindsight we now know games are going to struggle in some cases to even match last gen resolutions, so upscaling is needed.
At the moment we have no idea what it actually costs to develop these ML scaling techniques. You're looking at both dev costs and training costs for the model, and those need to be amortized somehow over what you sell. I think a lot of discussions on this, and on software add-on features in general, by the general public just seem to assume the costs are trivial or something (and as a byproduct hate any sort of hardware feature locking). Sony can amortize the costs through the PS5 Pro; there is no avenue here for Microsoft to do so.
Any gatekeeping of a solution by Microsoft might also be problematic. People are likely going to accept Sony gatekeeping PSSR to the PS5 Pro, but what about Microsoft in this case? One of the Xboxes? Another Xbox? Xbox only and not Windows? Controversy all around here.
When ML hardware was added to the Xbox, my guess is it was a catch-all, as the direction this was going wasn't established yet. I don't think ML being used directly for graphics was in most people's idea of where the most prevalent and impactful use case in games would be. The thinking was likely more in terms of uses that actually impact gameplay, as determined by the developers. From what I remember, the prevalent concepts at the time were of ML being used for things like game AI, or adapting the game world to the player (not to the extent of the current LLM integrations).
This was a MS Devblog post with respect to machine learning and gaming back in 2018 prior to Turing/DLSS -
https://devblogs.microsoft.com/directx/gaming-with-windows-ml/
And remember, even with Turing and DLSS (1), Nvidia wasn't exactly sure where to take this technology at the time either. At the outset it was per-game training, for instance, as opposed to a more universal model.
As an aside we might also want to see PSSR in practice first. I don't know if it's the best idea to just assume it's easy to match DLSS and XeSS.