Xbox Series X [XBSX] [Release November 10 2020]

iroboto · Oct 26, 2020

mawver said:
They said it is a machine learning algorithm, tuned game by game and can be turned off.

They have hw for inference, but would it cost zero for the rest of the hw?

If they are using the GPU for ML for auto-HDR there should be an impact.
The simplest explanation is that there is so much margin when running these BC titles that the impact of Auto HDR is a non-factor.
Another explanation: they made a point of mentioning that running BC titles means using the GCN instruction set, which completes a wavefront every 4 cycles and therefore launches a new one only then, compared to RDNA, which can launch a new wavefront every cycle. I do wonder if it's possible for MS to slot Auto HDR into those free cycles.

mawver · Oct 26, 2020

I understand the arguments, but Auto HDR is just an example. My original point is what impact the dual-pipe would have on the XS hw, since Navi 21 apparently doesn't use it.

BRiT · Oct 26, 2020

Jay said:
The only problem with that being done outside the system reservation is that it's still using GPU resources that would otherwise be used.
Unless they can guarantee that it can fit in a bubble of async compute.
So I would see it being both lower precision and system reservation.

Wasn't there some portion that was done as part of the extended instruction slots -- where it's a use it or not benefit from it kind of thing? I need to look at the HotChips and AMD RDNA documentation again to see how the TOPs fit in with FLOPS.

Jay · Oct 26, 2020

BRiT said:
Wasn't there some portion that was done as part of the extended instruction slots -- where it's a use it or not benefit from it kind of thing? I need to look at the HotChips and AMD RDNA documentation again to see how the TOPs fit in with FLOPS.

You get better utilization if you only need lower precision.

But it's still using the same INT pipeline. Otherwise I'd see it as a lot more than just adding lower precision, and something more akin to tensor cores or a parallel lower-precision pipeline, which would be a lot more work than it sounded like they did.
That was my impression though, so it would be interesting to get further input.

Jay · Oct 26, 2020

Playable PUBG on console

Strange · Oct 26, 2020

Jay said:
Playable PUBG on console

Really shitty analysis UI. You don't move the scales around all the time.

Jay · Oct 26, 2020

Strange said:
Really shitty analysis UI. You don't move the scales around all the time.

My takeaway was a solid 60fps with no frame tearing.
It looked like it was bouncing around 61fps though, which may not be good.
I have to admit I never listened much to what was said. I'll wait for DF for a more nuanced breakdown.

Kaotik · Oct 26, 2020

BRiT said:
We believe for AutoHDR that Microsoft is using the INT8 / INT4 portions of the GPU, that is entirely unused for BC games, hence the "for free" statement.

I vaguely recall it being a parallel path to the rest of the GPU, so it doesn't have a resource impact. But that part is fuzzy in my memory.

The problem with that is that it's the same CUs no matter if you're doing INT8/INT4 or FP32, so it could affect performance too. It's of course possible that it's just so fast to do that it won't affect performance in any noticeable way.
edit: or that backwards compatible titles will just always have free CUs.

Unknown Soldier · Oct 26, 2020

Well Bleeding Edge might not be making it onto the console.

https://www.windowscentral.com/has-xbox-and-ninja-theorys-bleeding-edge-been-abandoned

mawver · Oct 26, 2020

BRiT said:
We believe for AutoHDR that Microsoft is using the INT8 / INT4 portions of the GPU, that is entirely unused for BC games, hence the "for free" statement.

Kaotik said:
The problem with that is that it's the same CUs no matter if you're doing INT8/INT4 or FP32, so it could affect performance too. It's of course possible that it's just so fast to do that it won't affect performance in any noticeable way.

I think..

Calculating INT4 or FP32 costs the same, one cycle

Kaotik said:
edit: or that backwards compatible titles will just always have free CUs.

That would be against what they said "absolutely no performance cost to the CPU, GPU or memory and there is no additional latency"

iroboto · Oct 26, 2020

mawver said:
I think..

Calculating INT4 or FP32 costs the same, one cycle

You can calculate 8x INT4 or 4x INT8 in a single cycle vs 1x FP32.
That's just rapid packed math.
With the inclusion of the ML features in the CUs, they can also perform mixed dot products between INT4 and INT8 in a single cycle if required.
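
To make the packing arithmetic concrete, here is a minimal Python sketch. It is only an illustration, not how the hardware or any Microsoft/AMD API actually works: it packs four signed INT8 values (or eight signed INT4 values) into one 32-bit word, the same width as a single FP32 lane, and then does a dot product with accumulate across the packed lanes. The names `pack`, `unpack` and `packed_dot` are hypothetical.

```python
def pack(values, bits):
    """Pack signed integers of width `bits` into one 32-bit word, lowest lane first."""
    lanes = 32 // bits  # 4 lanes for INT8, 8 lanes for INT4
    assert len(values) == lanes, f"need exactly {lanes} values"
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(values):
        assert lo <= v <= hi, f"{v} does not fit in a signed {bits}-bit lane"
        word |= (v & mask) << (i * bits)  # two's-complement encode into lane i
    return word


def unpack(word, bits):
    """Recover the signed lane values from a packed 32-bit word."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    out = []
    for i in range(32 // bits):
        raw = (word >> (i * bits)) & mask
        out.append(raw - (1 << bits) if raw & sign else raw)
    return out


def packed_dot(a_word, b_word, bits, acc=0):
    """Dot product of two packed words plus an accumulator (one 'instruction' here)."""
    return acc + sum(x * y for x, y in zip(unpack(a_word, bits), unpack(b_word, bits)))


a8 = pack([1, -2, 3, -4], 8)
b8 = pack([5, 6, -7, 8], 8)
print(packed_dot(a8, b8, 8))  # -60
```

The point is only that the same 32 bits carry 1, 4 or 8 operands depending on precision, so the work per lane per instruction scales with how narrow the data is.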

mawver said:
That would be against what they said "absolutely no performance cost to the CPU, GPU or memory and there is no additional latency"

Not necessarily. BC titles can only submit work once every 4 cycles, so there is idle time that can still be taken advantage of.

mawver · Oct 26, 2020

iroboto said:
You can calculate 8x INT4 or 4x INT8 in a single cycle vs 1x FP32.
That's just rapid packed math.
With the inclusion of the ML features in the CUs, they can also perform mixed dot products between INT4 and INT8 in a single cycle if required.

I know...

It may be my English, but I believe that the friend had suggested that because it is a simpler calculation (INT) it would be done "faster" so I pointed out that it doesn't matter if it is INT or FP the cost is the same.

iroboto said:
Not necessarily. BC titles can only submit work once every 4 cycles, so there is idle time that can still be taken advantage of.

OK, that would make sense in BC when simulating GCN, but it doesn't explain the BVH offline story.

Says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."

AzBat · Oct 26, 2020

Official Walkthrough...

https://news.xbox.com/en-us/2020/10/26/what-to-expect-when-you-boot-next-gen-xbox/

Tommy McClain

iroboto · Oct 26, 2020

mawver said:
I know...

It may be my English, but I believe that the friend had suggested that because it is a simpler calculation (INT) it would be done "faster" so I pointed out that it doesn't matter if it is INT or FP the cost is the same.

OK, that would make sense in BC when simulating GCN, but it doesn't explain the BVH offline story.

Says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."

I guess doing 8x INT4 or 4x INT8 would be much faster than FP32 if those are the precisions being used. Machine learning tends to be highly parallel, doing the same calculation over and over again. Some algorithms are serial, though, and I'm not sure what MS is doing here.

What is the BVH offline story? This is the first time I've heard of it. I think our original understanding of MS's statements here is that the ray tracing hardware can perform all the intersection tests while the shader is running at full performance. But shaders are still required to traverse the BVH tree IIRC.
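
A rough sketch of that split, under my own assumptions and not the actual RDNA 2 or DXR implementation: `ray_box_hit` stands in for the fixed-function intersection test, while the stack-based loop in `traverse`, which decides which node to visit next, stands in for the shader side. The `Node` layout and all names are hypothetical, and leaves here only hold a primitive's bounding box rather than a real triangle test.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3                      # min corner of the bounding box
    hi: Vec3                      # max corner of the bounding box
    children: List["Node"] = field(default_factory=list)
    prim_id: Optional[int] = None  # set on leaves only


def ray_box_hit(origin, direction, lo, hi):
    """Slab test: the part assumed to run on dedicated intersection hardware."""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:
                return False  # parallel to this slab and outside it
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def traverse(root, origin, direction):
    """Traversal loop: the part assumed to stay in shader code."""
    hits, box_tests = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        box_tests += 1  # each visit issues one box test to the "hardware"
        if not ray_box_hit(origin, direction, node.lo, node.hi):
            continue
        if node.prim_id is not None:
            hits.append(node.prim_id)
        else:
            stack.extend(node.children)
    return hits, box_tests


leaf_a = Node((0, 0, 0), (1, 1, 1), prim_id=0)
leaf_b = Node((4, 4, 4), (5, 5, 5), prim_id=1)
root = Node((0, 0, 0), (5, 5, 5), children=[leaf_a, leaf_b])
print(traverse(root, (-1.0, 0.5, 0.5), (1.0, 0.0, 0.0)))  # ([0], 3)
```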

thicc_gaf · Oct 26, 2020

iroboto said:
That's been around since XBO was developed. This particular passage seems to describe the command known as
DX12: ExecuteIndirect
Vulkan: VK_NV_device_generated_commands
CUDA: Kernels can be launched from within kernels since Kepler <<< >>>

There are some customizations by MS that allow for slightly more flexibility on executeIndirect than what is available on the PC space. In terms of what we think is available in functionality
PC < Xbox One < Xbox One X < XB |SX

ExecuteIndirect deals with drawcall stalls IIRC; it's supposed to help with reducing stalls on drawcalls for GPU instructions from what (limited) bits I've read on it.

chris1515 · Oct 26, 2020

https://twitter.com/x/status/1320713836920508416

https://twitter.com/x/status/1320722670170918912

Very strange if true. Maybe a later update.

iceberg187 · Oct 26, 2020

Two more weeks, excluding today. For the longest time, these didn't feel totally real. That new hardware feeling is starting to settle in.

mawver · Oct 26, 2020

iroboto said:
But shaders are still required to traverse the BVH tree IIRC.

It was also talked about at Hot Chips...
"shader can run in parallel for BVH traversal, material shading, etc"

Based on Andrew Goossen's talks, I think it's reasonable to assume that the Xbox sends a set of instructions to WGP0 and WGP1, and when a job on one goes on hold (I don't know, maybe while requesting data from memory), the other executes, improving occupancy.

Riiiiiight?

3dilettante · Oct 26, 2020

mawver said:
That would be against what they said "absolutely no performance cost to the CPU, GPU or memory and there is no additional latency"

Auto HDR is added for backwards compatibility mode. Typically, such modes are relegated to a subset of the CU and memory resources available to the full console. If everything is relative to the performance of the original title, there is no loss since the full console may not be used for the game.

iroboto said:
Not necessarily. BC titles can only submit work once every 4 cycles, so there is idle time that can still be taken advantage of.

I'm not sure there's a gap in submitting work like that. The only 4 cycles I'm aware of is the 4-cycle issue cadence for wavefronts in GCN, but the SIMDs perform work and write back results every cycle, since each instruction is applied over 4 cycles.
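
A toy model of that cadence, assuming the commonly described GCN layout (four SIMD16 units per CU, 64-wide wavefronts) and assuming a ready wavefront is always available; the function name and the bookkeeping are hypothetical, not a real simulator. Each SIMD is handed a new instruction only every 4th cycle, yet once the pipeline fills, all 64 lanes of the CU are doing work on every cycle.

```python
WAVE_SIZE = 64
SIMD_WIDTH = 16
SIMDS_PER_CU = 4
CYCLES_PER_INSTR = WAVE_SIZE // SIMD_WIDTH  # 64 work-items over 16 lanes = 4 cycles


def simulate(num_cycles):
    """Return (busy lanes per cycle, instructions issued per SIMD)."""
    remaining = [0] * SIMDS_PER_CU  # cycles left on each SIMD's current instruction
    issued = [0] * SIMDS_PER_CU
    busy_lanes_per_cycle = []
    for cycle in range(num_cycles):
        simd = cycle % SIMDS_PER_CU  # scheduler visits one SIMD per cycle, round-robin
        if remaining[simd] == 0:
            remaining[simd] = CYCLES_PER_INSTR  # issue the next wavefront instruction
            issued[simd] += 1
        busy = 0
        for s in range(SIMDS_PER_CU):
            if remaining[s] > 0:
                busy += SIMD_WIDTH  # this SIMD processes 16 work-items this cycle
                remaining[s] -= 1
        busy_lanes_per_cycle.append(busy)
    return busy_lanes_per_cycle, issued


busy, issued = simulate(8)
print(busy)    # [16, 32, 48, 64, 64, 64, 64, 64]
print(issued)  # [2, 2, 2, 2]
```

In this model the only bubble is the three-cycle ramp at the very start. Whether real BC workloads leave other idle slots (stalls on memory, low occupancy) is a separate question the sketch does not address.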

mawver said:
I know...

It may be my English, but I believe that the friend had suggested that because it is a simpler calculation (INT) it would be done "faster" so I pointed out that it doesn't matter if it is INT or FP the cost is the same.

OK, that would make sense in BC when simulating GCN, but it doesn't explain the BVH offline story.

Says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."

If you are referencing the article from Digital Foundry: https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

RDNA 2 fully supports the latest DXR Tier 1.1 standard, and similar to the Turing RT core, it accelerates the creation of the so-called BVH structures required to accurately map ray traversal and intersections, tested against geometry. In short, in the same way that light 'bounces' in the real world, the hardware acceleration for ray tracing maps traversal and intersection of light at a rate of up to 380 billion intersections per second.

"Without hardware acceleration, this work could have been done in the shaders, but would have consumed over 13 TFLOPs alone," says Andrew Goossen. "For the Series X, this work is offloaded onto dedicated hardware and the shader can continue to run in parallel with full performance. In other words, Series X can effectively tap the equivalent of well over 25 TFLOPs of performance while ray tracing."

I note that the first paragraph is not from Goossen, but is a statement by the author of the article. I think that paragraph has a good chance of being mistaken.
Goossen's statement could readily map to the fixed-function intersection and node evaluation hardware in the RDNA2 RT block. It might make more sense interpreted this way, since his scenario has a shader calling on the RT functionality while continuing to run in parallel. BVH construction would precede any shader that might depend on it; a shader wouldn't be trying to build a BVH and do something else in the meantime.
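
For reference, the arithmetic behind the "well over 25 TFLOPs" line can be sketched as below. The 52 CUs and 1.825 GHz are the published Series X specs and the 13 TFLOPs is Goossen's quoted figure; the 4 ray-box tests per CU per clock is my assumption, chosen because it reproduces the roughly 380 billion figure in the article. The function names are hypothetical.

```python
CUS = 52                      # active compute units on Series X
LANES_PER_CU = 64             # FP32 lanes per CU
FLOPS_PER_LANE_PER_CLOCK = 2  # one fused multiply-add counts as 2 FLOPs
CLOCK_GHZ = 1.825
RT_EQUIVALENT_TFLOPS = 13.0   # Goossen's quoted shader-equivalent cost of the RT work
BOX_TESTS_PER_CU_PER_CLOCK = 4  # assumption, not taken from the quoted article


def shader_tflops():
    """Peak FP32 throughput of the shader array in TFLOPs."""
    return CUS * LANES_PER_CU * FLOPS_PER_LANE_PER_CLOCK * CLOCK_GHZ / 1000.0


def effective_tflops_while_ray_tracing():
    """Shader peak plus the shader-equivalent work done by the RT hardware."""
    return shader_tflops() + RT_EQUIVALENT_TFLOPS


def box_tests_per_second_billions():
    """Peak ray-box tests per second, in billions, under the assumption above."""
    return CUS * BOX_TESTS_PER_CU_PER_CLOCK * CLOCK_GHZ


print(f"{shader_tflops():.2f}")                       # 12.15
print(f"{effective_tflops_while_ray_tracing():.2f}")  # 25.15
print(f"{box_tests_per_second_billions():.1f}")       # 379.6
```

So the 25 TFLOPs number is the shader peak plus what the intersection work would have cost in shaders, which fits reading his statement as being about intersection testing rather than BVH construction.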

iroboto · Oct 26, 2020

thicc_gaf said:
ExecuteIndirect deals with drawcall stalls IIRC; it's supposed to help with reducing stalls on drawcalls for GPU instructions from what (limited) bits I've read on it.

No, not necessarily. ExecuteIndirect allows the GPU to call its own kernels without the assistance of the CPU.
