Why are we assuming they are sending only 5.1/7.1 and downmixing?
Because it gets them the effect they're after with minimal complexity and cost, and it doesn't require any changes at the application/engine level for developers. Getting HRTF audio localization to sound 'right' is a matter of 'best fit' because the result depends entirely on the user's physiology, so developing an all-new audio stack that has no guarantee of sounding more accurate would be a waste of time and money. The question isn't 'Why are they only doing X?', but rather 'Why would they do more than X?'