Xbox Series X [XBSX] [Release November 10 2020]

They said it is a machine learning algorithm, tuned game by game and can be turned off.

They have hw for inference, but would it cost zero for the rest of the hw?

If they are using the GPU for ML for auto-HDR there should be an impact.
The simplest explanation is that there is so much margin when running these BC titles that the impact from Auto HDR is a non-factor.
Another explanation: they were particular in mentioning that running BC titles means using the GCN instruction set, which completes a wavefront every 4 cycles and only then launches a new one, whereas RDNA can launch a new wavefront every cycle. I do wonder if it's possible for MS to schedule Auto HDR into those free cycles.
 
I understand the arguments, but Auto HDR is only an example; my original point is what impact the dual pipe would have on the XS hw, since Navi 21 apparently doesn't use it.
 
The only problem with that being done outside the system reservation is that it's still using GPU resources that would otherwise be used.
Unless they can guarantee that it fits in a bubble of async compute.
So I would see it being both lower precision and system reservation.

Wasn't there some portion that was done as part of the extended instruction slots -- where it's a use it or not benefit from it kind of thing? I need to look at the HotChips and AMD RDNA documentation again to see how the TOPs fit in with FLOPS.
 
Wasn't there some portion that was done as part of the extended instruction slots -- where it's a use it or not benefit from it kind of thing? I need to look at the HotChips and AMD RDNA documentation again to see how the TOPs fit in with FLOPS.
Get better utilization if you only need lower precision.

But still using the same INT pipeline. Otherwise I'd see it as a lot more than just adding lower precision and something more akin to tensor cores, or a parallel lower-precision pipeline, which would be a lot more work than it sounded like they did.
That was my impression though, so it would be interesting to get further input.
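On the TOPS versus FLOPS question, here is a rough back-of-the-envelope sketch. It uses the publicly stated Series X figures (52 CUs at 1.825 GHz) and assumes, as a simplification, that INT8 and INT4 throughput is just the FP32 rate multiplied by how many values pack into a 32-bit lane. The helper name `peak_tera_ops` is made up for illustration.

```python
# Publicly stated Series X GPU figures.
CUS = 52
LANES_PER_CU = 64      # FP32 ALU lanes per CU
CLOCK_GHZ = 1.825
OPS_PER_FMA = 2        # a fused multiply-add is counted as two operations


def peak_tera_ops(values_per_lane: int) -> float:
    """Peak tera-ops/s if each 32-bit lane holds `values_per_lane` packed values.

    Assumption: throughput scales linearly with packing density.
    """
    giga_ops = CUS * LANES_PER_CU * OPS_PER_FMA * values_per_lane * CLOCK_GHZ
    return giga_ops / 1000.0


fp32_tflops = peak_tera_ops(1)   # one FP32 value per lane
int8_tops = peak_tera_ops(4)     # four INT8 values per lane
int4_tops = peak_tera_ops(8)     # eight INT4 values per lane

print(f"FP32: {fp32_tflops:.2f} TFLOPS")
print(f"INT8: {int8_tops:.1f} TOPS")
print(f"INT4: {int4_tops:.1f} TOPS")
```

The results land close to the 12.15 TFLOPS, 49 TOPS and 97 TOPS that Microsoft quoted, which is consistent with (though not proof of) the lower-precision math sharing the same ALU lanes rather than running on a separate pipeline.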
 
Really shitty analysis UI. You don't move the scales around all the time.
My takeaway was solid 60fps with no frame tearing..
Looked like it was bouncing around 61fps though which may not be good.
Have to admit I never listened much to what was said, I'll wait for DF for more nuanced breakdown.
 
We believe for AutoHDR that Microsoft is using the INT8 / INT4 portions of the GPU, that is entirely unused for BC games, hence the "for free" statement.

I vaguely recall it being a parallel path to the rest of the GPU, so it doesn't have a resource impact. But that part is fuzzy in my memory.
The problem with that is the fact that it's the same CU's no matter if you're doing INT8/INT4 or FP32, so it could affect performance too. It's of course possible that it's just so fast to do it won't affect performance in any noticeable way.
edit: or that backwards compatible titles will just always have free CUs.
 
We believe for AutoHDR that Microsoft is using the INT8 / INT4 portions of the GPU, that is entirely unused for BC games, hence the "for free" statement.

The problem with that is the fact that it's the same CU's no matter if you're doing INT8/INT4 or FP32, so it could affect performance too. It's of course possible that it's just so fast to do it won't affect performance in any noticeable way.

I think..

Calculating INT4 or FP32 costs the same, one cycle

edit: or that backwards compatible titles will just always have free CUs.

That would be against what they said "absolutely no performance cost to the CPU, GPU or memory and there is no additional latency"
 
I think..

Calculating INT4 or FP32 costs the same, one cycle
You can calculate 8x INT4 and 4x INT8 in a single cycle vs 1x FP32.
That's just rapid packed math.
With the inclusion of the ML features into the CUs, they can also perform mixed dot products between INT4 and INT8 in a single cycle if required.
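To make the packed-math idea concrete, here is a minimal software sketch of a 4-way INT8 dot product on values packed into one 32-bit word. It only emulates the arithmetic; the function names (`pack_i8x4`, `unpack_i8x4`, `dot4_i8`) are hypothetical and not taken from any AMD or Microsoft API.

```python
import struct


def pack_i8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_i8x4(word):
    """Split a 32-bit word back into four signed 8-bit integers."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def dot4_i8(a_word, b_word, acc=0):
    """Four multiplies plus an accumulate on two packed words.

    A hardware packed dot product would do this in one issue; here it is
    plain Python, purely to show what is being computed.
    """
    pairs = zip(unpack_i8x4(a_word), unpack_i8x4(b_word))
    return acc + sum(x * y for x, y in pairs)


a = pack_i8x4([1, -2, 3, 4])
b = pack_i8x4([5, 6, -7, 8])
result = dot4_i8(a, b)   # 1*5 + (-2)*6 + 3*(-7) + 4*8
print(result)            # 4
```

The point is that one 32-bit register slot carries four INT8 operands (or eight INT4), so a single instruction does several multiply-adds where FP32 would do one.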

That would be against what they said "absolutely no performance cost to the CPU, GPU or memory and there is no additional latency"
Not necessarily. BC titles can only submit work once every 4 cycles, so there is idle time that can still be taken advantage of.
 
You can calculate 8x INT4 and 4x INT8 in a single cycle vs 1x FP32.
That's just rapid packed math.
With the inclusion of the ML features into the CUs, they can also perform mixed dot products between INT4 and INT8 in a single cycle if required.

I know...

It may be my English, but I believe that the friend had suggested that because it is a simpler calculation (INT) it would be done "faster" so I pointed out that it doesn't matter if it is INT or FP the cost is the same.


Not necessarily. BC titles can only submit work once every 4 cycles, so there is idle time that can still be taken advantage of.

OK, that would make sense in BC emulating GCN, but it doesn't explain the BVH offline story.

Says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."
 
I know...

It may be my English, but I believe that the friend had suggested that because it is a simpler calculation (INT) it would be done "faster" so I pointed out that it doesn't matter if it is INT or FP the cost is the same.




OK, that would make sense in BC emulating GCN, but it doesn't explain the BVH offline story.

Says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."
I guess doing 8x INT4 or 4x INT8 would be much faster than FP32 if those are the precisions being used. Machine learning tends to be highly parallel, doing the same calculation over and over again. Some algorithms are serial, though, and I'm not sure what MS is doing here.

What is the BVH offline story? This is the first time I've heard of it. I think our original understanding of MS's statements here is that the ray tracing hardware can perform all the intersection tests while the shader is running at full performance. But shaders are still required to traverse the BVH tree IIRC.
 
That's been around since XBO was developed. This particular passage seems to describe the command known as
DX12: ExecuteIndirect
Vulkan: VK_NV_device_generated_commands
CUDA: kernels can be launched from within kernels (<<< >>>) since Kepler

There are some customizations by MS that allow for slightly more flexibility in ExecuteIndirect than what is available in the PC space. In terms of what we think is available in functionality:
PC < Xbox One < Xbox One X < XBSX

ExecuteIndirect deals with drawcall stalls IIRC; it's supposed to help with reducing stalls on drawcalls for GPU instructions from what (limited) bits I've read on it.
 
Two more weeks, excluding today. For the longest time, these didn't feel totally real. That new hardware feeling is starting to settle in.
 
But shaders are still required to traverse the BVH tree IIRC.

It was also talked about at Hot Chips ...
"shade can run in parallel for BVH traversal, material shading, etc"

Based on Andrew Goossen's talks, I think it's reasonable to assume that the Xbox sends a set of instructions to WGP0 and WGP1, so that when a job on one goes on hold (I don't know, maybe while requesting data from memory) the other executes, improving occupancy.

Riiiiiight?
 
That would be against what they said "absolutely no performance cost to the CPU, GPU or memory and there is no additional latency"
Auto HDR is added for backwards compatibility mode. Typically, such modes are relegated to a subset of the CU and memory resources available to the full console. If everything is relative to the performance of the original title, there is no loss since the full console may not be used for the game.

Not necessarily. BC titles can only submit work once every 4 cycles, so there is idle time that can still be taken advantage of.
I'm not sure there's a gap in submitting work like that. The only 4 cycles I'm aware of is the 4-cycle issue cadence for wavefronts in GCN, but the SIMDs perform work and write back results every cycle, since each instruction is executed over 4 cycles.
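A toy model of that cadence, to show why a 4-cycle issue interval does not by itself mean idle ALUs. It assumes a GCN-style CU with four SIMD16 units, a 64-wide wavefront taking 4 cycles per instruction on one SIMD, a scheduler that visits one SIMD per cycle in round-robin order, and a ready wavefront always waiting on every SIMD. Stalls, memory latency and scalar work are ignored, and `busy_simds_per_cycle` is just an illustrative name.

```python
SIMDS_PER_CU = 4        # GCN-style CU: four SIMD16 units
CYCLES_PER_INSTR = 4    # 64 work-items on 16 lanes take 4 cycles


def busy_simds_per_cycle(total_cycles):
    """Return, for each cycle, how many SIMDs are executing an instruction."""
    busy_until = [0] * SIMDS_PER_CU   # cycle at which each SIMD becomes free
    busy_counts = []
    for cycle in range(total_cycles):
        simd = cycle % SIMDS_PER_CU   # round-robin: one issue slot per cycle
        if busy_until[simd] <= cycle:
            # Assumption: a ready wavefront is always available to issue.
            busy_until[simd] = cycle + CYCLES_PER_INSTR
        busy_counts.append(sum(1 for t in busy_until if t > cycle))
    return busy_counts


counts = busy_simds_per_cycle(12)
print(counts)   # [1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4]
```

After the first few cycles of ramp-up, every SIMD is busy on every cycle in this model, even though each individual SIMD only accepts a new instruction every fourth cycle. Whether real BC workloads leave spare capacity is a separate question that this sketch does not answer.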

I know...

It may be my English, but I believe that the friend had suggested that because it is a simpler calculation (INT) it would be done "faster" so I pointed out that it doesn't matter if it is INT or FP the cost is the same.




OK, that would make sense in BC emulating GCN, but it doesn't explain the BVH offline story.

Says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."
If you are referencing the article from Digital Foundry: https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs
RDNA 2 fully supports the latest DXR Tier 1.1 standard, and similar to the Turing RT core, it accelerates the creation of the so-called BVH structures required to accurately map ray traversal and intersections, tested against geometry. In short, in the same way that light 'bounces' in the real world, the hardware acceleration for ray tracing maps traversal and intersection of light at a rate of up to 380 billion intersections per second.

"Without hardware acceleration, this work could have been done in the shaders, but would have consumed over 13 TFLOPs alone," says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."
I note that the first paragraph is not from Goossen, but is a statement by the author of the article. I think that paragraph has a good chance of being mistaken.
Goossen's statement could readily map to the fixed-function intersection and node evaluation hardware in the RDNA2 RT block. It might make more sense interpreted this way, since his scenario has a shader calling on the RT functionality while continuing to run in parallel. BVH construction would precede any shader that might depend on it; a shader wouldn't be trying to build a BVH and do something else in the meantime.
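For reference, the "well over 25 TFLOPs" figure in the Goossen quote appears to be a simple sum. A quick sketch, taking the quoted 13 TFLOPs as a lower bound and deriving the shader rate from the public 52 CU, 1.825 GHz specification (variable names are just for illustration):

```python
# Shader throughput: 52 CUs x 64 lanes x 2 ops per FMA x 1.825 GHz.
SHADER_TFLOPS = 52 * 64 * 2 * 1.825 / 1000.0

# Goossen's "over 13 TFLOPs" shader-equivalent cost of the RT work,
# treated here as a lower bound.
RT_EQUIVALENT_TFLOPS = 13.0

effective = SHADER_TFLOPS + RT_EQUIVALENT_TFLOPS
print(f"{SHADER_TFLOPS:.2f} + {RT_EQUIVALENT_TFLOPS:.0f} = {effective:.2f} TFLOPs equivalent")
```

That is an equivalence claim about work the fixed-function hardware saves the shaders, not extra general-purpose compute.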
 