There's a little more nuance to what MS built into XSX. We're waiting for more details, but it's a minor addition to a larger piece of the problem. Yes, the decompressor is a hardware block, as said before. It's the lack of the other IO hardware that takes you from 100% of the SSD's rated speed down to the ~20% real-world result; see Cerny's slide. No software can claw back that 80% without eating a lot of CPU resources.
One might look at 360 emulation as being entirely software, but there is a small custom hardware component that emulates one very specific part which would have been too difficult to do in software.
In the same way, XVA is mainly software, but there is a small hardware component governing how the GPU accesses SSD storage that isn't done in software. Unfortunately we know little about it, which makes it hard to discuss, and easy either to dismiss or to hold up as critical.
"We observed that typically, only a small percentage of memory loaded by games was ever accessed," reveals Goossen. "This wastage comes principally from the textures. Textures are universally the biggest consumers of memory for games. However, only a fraction of the memory for each texture is typically accessed by the GPU during the scene. For example, the largest mip of a 4K texture is eight megabytes and often more, but typically only a small portion of that mip is visible in the scene and so only that small portion really needs to be read by the GPU."
As textures have ballooned in size to match 4K displays, efficiency in memory utilisation has got progressively worse - something Microsoft was able to confirm by building in special monitoring hardware into Xbox One X's Scorpio Engine SoC. "From this, we found a game typically accessed at best only one-half to one-third of their allocated pages over long windows of time," says Goossen. "So if a game never had to load pages that are ultimately never actually used, that means a 2-3x multiplier on the effective amount of physical memory, and a 2-3x multiplier on our effective IO performance."
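Goossen's "eight megabytes" for the top mip lines up with a 4096x4096 texture in a 4-bits-per-texel block-compressed format such as BC1. A quick back-of-envelope sketch (my numbers, not from the article):

```python
def mip_chain_bytes(width, height, bytes_per_texel):
    """Byte size of each level in a full mip chain, finest level first.
    (Real GPU block formats round tiny mips up to 4x4 blocks; ignored here.)"""
    sizes = []
    w, h = width, height
    while True:
        sizes.append(int(w * h * bytes_per_texel))
        if w == 1 and h == 1:
            break
        w, h = max(w // 2, 1), max(h // 2, 1)
    return sizes

# BC1 block compression stores 4 bits = 0.5 bytes per texel.
sizes = mip_chain_bytes(4096, 4096, 0.5)
print(sizes[0] / 2**20)     # top mip alone: 8.0 MiB
print(sum(sizes) / 2**20)   # full chain: ~10.67 MiB
```

The top mip dominates the whole chain, which is exactly why loading it wholesale when only a sliver is visible wastes so much residency.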
A technique called Sampler Feedback Streaming - SFS - was built to more closely marry the memory demands of the GPU, intelligently loading in the texture mip data that's actually required with the guarantee of a lower quality mip available if the higher quality version isn't readily available, stopping GPU stalls and frame-time spikes. Bespoke hardware within the GPU is available to smooth the transition between mips, on the off-chance that the higher quality texture arrives a frame or two later.
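A conceptual sketch of the residency-fallback behaviour described above (my illustration of the idea, not Microsoft's implementation; mip 0 is the finest level):

```python
def sample(residency, requested_mip, coarsest_mip, feedback):
    """Return the finest resident mip at or below the requested detail,
    recording a miss so the streamer can load the finer mip later."""
    if requested_mip not in residency:
        feedback.add(requested_mip)    # sampler feedback: queue mip for streaming
    for mip in range(requested_mip, coarsest_mip + 1):
        if mip in residency:           # fall back to a coarser resident mip
            return mip
    return coarsest_mip                # coarsest mip assumed always resident

resident, wanted = {3, 4, 5}, set()
print(sample(resident, 0, 5, wanted))  # falls back to mip 3, no GPU stall
print(wanted)                          # {0}: stream in mip 0 for later frames
```

The guarantee in the article maps to the loop: the shader never stalls waiting for the fine mip, it just samples coarser data until the stream catches up.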
If you recall from earlier, MS used XBO code to test 4K on Scorpio. Scorpio gave MS the data they needed to optimize for 4K, because so many titles were targeting it, and they could see what was happening in game code with respect to memory usage. Then they focused on building this. So now they have a software solution, plus bespoke hardware to improve it. I'm not going to tell you this is the superior solution or anything. I'm just saying that Sony and MS had different data to use on the same problem, and each approached it differently based on their requirements.
So what I think I see here is that MS opted for slower SSD hardware in the hope that they could develop a solution in which the streamed asset footprint is significantly smaller. Sony may have looked at more traditional streaming methods and worked out that they needed a significantly faster drive to do the same thing.
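To put rough numbers on that trade-off: the raw throughput figures below are the public uncompressed rates for each console, while the multiplier is Goossen's own 2-3x claim from above, so treat this as purely illustrative arithmetic, not a benchmark:

```python
xsx_raw, ps5_raw = 2.4, 5.5   # GB/s uncompressed, publicly stated figures
# If SFS really delivers the claimed 2-3x effective-IO multiplier:
xsx_effective = [round(xsx_raw * m, 1) for m in (2, 3)]
print(xsx_effective)          # [4.8, 7.2] GB/s: in PS5-raw territory
```

Which would explain why MS might feel a 2.4 GB/s drive was enough, if the software/hardware stack actually delivers that multiplier in practice.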
It might be easy to say that Sony can just leverage these learnings and integrate them into PS5, but unfortunately it's not that straightforward to just pick up and go. So we don't know if Sony will support these features; they may have architected their API/chip differently here. And it happens: you can sometimes program your shit in such a way that the only way to get a new feature in is to rewrite everything, and sometimes that's not worth it.
I would not be surprised if the bespoke hardware operates in a similar manner to how your phone responds to 'Hey Google' or 'Hey Siri': a short 3-5 word voice command can be processed by a tiny NN locally, but anything much longer has to be sent to the cloud for processing.
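As a toy illustration of that split (the threshold and labels are made up for the analogy, not any real assistant's logic):

```python
def route_command(utterance, local_word_limit=5):
    """Short commands fit the tiny on-device model; longer ones go to the cloud."""
    return "local" if len(utterance.split()) <= local_word_limit else "cloud"

print(route_command("hey google lights off"))                              # local
print(route_command("hey google remind me to call mum tomorrow morning"))  # cloud
```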
"We knew that many inference algorithms need only 8-bit and 4-bit integer precision for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms," says Andrew Goossen. "So we added special hardware support for this specific scenario. The result is that Series X offers 49 TOPS for 8-bit integer operations and 97 TOPS for 4-bit integer operations. Note that the weights are integers, so those are TOPS and not TFLOPs. The net result is that Series X offers unparalleled intelligence for machine learning. (on a console, cough - iroboto's add)"
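Those TOPS figures are consistent with packed math running on the same shader ALUs, assuming a 4x rate for int8 and 8x for int4 relative to FP32 (my arithmetic and my rate assumption, not from the quote):

```python
fp32_tflops = 12.15            # Series X GPU compute, public figure
int8_tops = fp32_tflops * 4    # assume 4 packed int8 ops per FP32 lane
int4_tops = fp32_tflops * 8    # assume 8 packed int4 ops per FP32 lane
print(round(int8_tops), round(int4_tops))   # 49 97: matches the quoted numbers
```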
A small, quick-running NN for texture up-resolution could cover the scenario where the drive isn't fast enough.