Xbox One November SDK Leaked

Not really sure if this is a standard thing in modern GPUs these days?

Also, can anyone think of an application for this? If not for performance, what other scenario would benefit from going off-chip?

Tessellation
Describes the advantages and disadvantages of onchip and offchip tessellation. Applicable only to Hull Shaders and Geometry Shaders.

Onchip tessellation mode
By default, the D3D11 driver always configures the GPU for onchip tessellation mode. In onchip tessellation mode, all of the data for input and output control points and per-patch constants is stored in LDS, including the tessellation factors. Because LDS is memory internal to the GPU, this means that no additional memory bandwidth is generated, and access to the data is guaranteed to be low-latency.

However, because this data is needed by the DS threads generated by tessellating the patches in a threadgroup, all such DS threads will need to run in the same CU that ran the VS and HS threads for the same threadgroup. This poses a severe limitation to the GPU’s ability to load-balance work amongst the 12 available CUs, especially when the tessellation factors are high.

With high tessellation factors, a single threadgroup might generate many waves of DS threads, and the LDS memory used will be blocked for any other use, including other threadgroups which might otherwise be able to run in the CU. (LDS is also used for threadgroup-local data in compute shaders, for PS interpolant values, and for onchip GS mode.)

Offchip tessellation mode
Offchip tessellation is an option that enables the use of non-LDS memory with hull and domain shaders.

The GPU's offchip tessellation mode is enabled by specifying the flag D3D11X_TESSELLATION_OFFCHIP. This mode uses the same amount of LDS as the onchip mode, but the HS also writes all output control points, tessellation factors, and per-patch constants to a memory buffer.

A heuristic is then used to run some DS waves in the same CU in onchip mode, to read the data from LDS, and to run other DS waves in other CUs, using offchip mode and reading the data from memory. This allows the GPU to release the threadgroup’s LDS memory before all DS waves are finished, or even launched, and it also allows the GPU to load-balance DS waves better across all CUs.

The advantage of doing tessellation off-chip is that LDS memory is freed up for other graphics purposes. Whether off-chip tessellation actually yields a performance improvement depends very much on the title code; it might not.
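For anyone who hasn't touched the tessellation stages, here is a minimal sketch of the standard PC D3D11 state the passage above assumes (hull shader, domain shader, patch-list topology). The Xbox-specific D3D11X_TESSELLATION_OFFCHIP flag mentioned in the doc would presumably be supplied through the platform's extended API somewhere in this setup; the exact entry point isn't in the quoted text, so it's only noted in a comment.

```cpp
// Minimal sketch (stock PC D3D11, not the Xbox-specific D3D11X API) of what a
// tessellated draw needs: a hull shader, a domain shader, and patch-list
// topology. On Xbox One the D3D11X_TESSELLATION_OFFCHIP flag from the SDK doc
// would presumably be applied via the platform's extended creation/state path;
// that entry point isn't shown in the quoted text, so it is omitted here.
#include <d3d11.h>

void SetupTessellatedDraw(ID3D11Device* device,
                          ID3D11DeviceContext* ctx,
                          const void* hsBytecode, SIZE_T hsSize,
                          const void* dsBytecode, SIZE_T dsSize)
{
    ID3D11HullShader*   hs = nullptr;
    ID3D11DomainShader* ds = nullptr;

    // Compiled HLSL bytecode for the hull and domain shaders is assumed to
    // already exist (hsBytecode / dsBytecode).
    device->CreateHullShader(hsBytecode, hsSize, nullptr, &hs);
    device->CreateDomainShader(dsBytecode, dsSize, nullptr, &ds);

    // Tessellation requires patch-list input topology; 3 control points here.
    ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_3_CONTROL_POINT_PATCHLIST);
    ctx->HSSetShader(hs, nullptr, 0);
    ctx->DSSetShader(ds, nullptr, 0);

    // ...bind VS/PS and buffers, then ctx->Draw(...) as usual.
    // (The title would keep hs/ds around and Release() them at shutdown.)
}
```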
 
However, because this data is needed by the DS threads generated by tessellating the patches in a threadgroup, all such DS threads will need to run in the same CU that ran the VS and HS threads for the same threadgroup. This poses a severe limitation to the GPU’s ability to load-balance work amongst the 12 available CUs, especially when the tessellation factors are high.

With high tessellation factors, a single threadgroup might generate many waves of DS threads, and the LDS memory used will be blocked for any other use, including other threadgroups which might otherwise be able to run in the CU.
Welcome to 4 years ago:

https://forum.beyond3d.com/threads/amd-r9xx-speculation.47074/page-44#post-1371522

So the count of DS threads in flight is open to question. The more of them, the more aggregate LDS is available to support the output from HS and TS. But DS count is locked to HS count by SIMD usage, which means that HS/DS load-balancing isn't independent - it would be like having a non-unified GPU, back to the bad old days of VS-dedicated and PS-dedicated shader pipes.
Actually I like that text, since it affirms the view I had back then: that Radeon cannot load-balance HS and DS workloads in high-tessellation scenarios (EDIT: keeping all data on chip). Mintmaster never accepted that this is the fundamental problem with tessellation in Radeon.

And it still is, which is pretty tragic.
 
Your notes on the CPU seem bang on with the recent SDK revelations; I wonder if the second part is coming some time later.

As an aside, I was wondering why processing voice for the OS gets a separate slice of CPU reserve/power from voice for the game. Is the phrase "Xbox record that" much easier to process than "archers release" in Ryse, given they are handled in the shared OS?

The Xbox OS (I know, I know, it has 3) presumably needs to be able to guarantee the CPU it requires, independent of the game having its own set of voice commands to check against and act upon.
 
GDS would be possible if it's large enough; otherwise it's probably the same memory as the one used for spilling.
 
He says the eSRAM has a real-world performance of 102-130 GB/s according to the document.
That doesn't seem to agree with what Turn 10 (I assume it was them, since they were speaking of a 1080p game and ROPs at the time) was producing. They peaked eSRAM at 140 GB/s in game code, and that was stated in the Xbox architects interview, which was some time ago, before the SDK improvements. edit: Thanks for the correction, Newguy. Yeah, it was mentioned in the interview, but that slide deck was what I wanted to reference. Don't quote me on the Turn 10 claim until I can find it.
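Just to put the numbers being thrown around side by side, here's the back-of-the-envelope arithmetic as I remember it from the architects interview (853 MHz GPU clock, a 128-byte one-way path, writes squeezed into 7 of 8 cycles). Treat it as a rough model only; none of this comes from the leaked doc itself.

```cpp
// Rough model of the publicly quoted eSRAM figures; the clock and width are
// the publicly stated Xbox One numbers, the 7-of-8-cycles overlap is the
// explanation given in the architects interview for the combined peak.
#include <cstdio>

int main()
{
    const double clockHz     = 853e6;   // Xbox One GPU clock
    const double bytesPerClk = 128.0;   // one-direction eSRAM width per cycle

    const double oneWay       = clockHz * bytesPerClk / 1e9;   // ~109 GB/s
    const double combinedPeak = oneWay * (1.0 + 7.0 / 8.0);    // ~205 GB/s (quoted as 204)

    std::printf("one-way peak:  %.0f GB/s\n", oneWay);
    std::printf("combined peak: %.0f GB/s\n", combinedPeak);

    // The 102-130 GB/s range from the leaked doc sits around the one-way
    // figure, while the 140 GB/s game-code figure from the interview is
    // partway toward the combined read+write peak.
    return 0;
}
```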
 
This is not complicated. The Host OS runs the hypervisor. It's basically a completely stripped-down version of something like Hyper-V. Then there's the System OS and the Title OS. The Title OS is lightweight as well.
 
hmm... ONION bus is not coherent?
It's not cache coherent. It allows the GPU and CPU to read and write to the same pages in memory. Those pages can be cached normally by the CPU and read/written without worrying about serious performance cliffs, and it does allow the GPU to be aware of data cached by the CPUs. It's the lack of snooping of the GPU's caches that is the problem. If the data isn't treated as IO-coherent (coarse synchronization, with an explicit flush of the GPU caches to memory before anything else uses it), there are going to be problems.
The Onion+ bus "solves" this problem by forcing traffic to bypass the GPU caches entirely.
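To make the "coarse synchronization" point concrete, here's a toy sketch of the ordering involved. The helpers (gpuWriteJob, flushGpuCachesAndSignal, waitForFence) are made-up placeholders standing in for whatever the platform actually exposes, not real D3D11/XDK calls; only the ordering matters.

```cpp
// Toy illustration of coarse synchronization over a shared (Onion) page.
// All helper names below are hypothetical placeholders, not a real API.
#include <cstdint>
#include <cstddef>

struct SharedBuffer { volatile std::uint32_t* cpuVisible; std::size_t count; };

static void          gpuWriteJob(SharedBuffer&)   {}            // kick a GPU job that writes the buffer
static std::uint64_t flushGpuCachesAndSignal()    { return 1; } // flush GPU caches, signal a fence
static void          waitForFence(std::uint64_t)  {}            // CPU-side wait on that fence

void ProduceOnGpuConsumeOnCpu(SharedBuffer& buf)
{
    gpuWriteJob(buf);                                 // results may still sit in the GPU's L1/L2
    std::uint64_t fence = flushGpuCachesAndSignal();  // explicit flush, because the GPU is not snooped
    waitForFence(fence);                              // CPU blocks until the flush has completed

    // Only now is it safe to read over Onion: the CPU's own caches stay
    // coherent with memory, so a normal load is enough.
    std::uint32_t first = buf.cpuVisible ? buf.cpuVisible[0] : 0;
    (void)first;

    // Over Onion+ the GPU writes bypass the GPU caches entirely, so the
    // explicit flush step above would not be needed (at a bandwidth cost).
}
```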

As well, if the Host OS is a hypervisor, where is the 3rd OS?
I think the hypervisor is the third OS.
 
@mosen I just want to clarify your thoughts from earlier: just working with the facts, it was revealed that an additional hardware graphics context was provided to the X1, going from 6 to now 7.
Earlier I asked what 'makes' a hardware graphics context, but... it seems it could be a wording change from deferred contexts in D3D 11/12?

So today, for Xbox One, we're looking at 7 CPU cores creating/generating 7 D3D11 contexts (one of which will be the immediate context, the other 6 being deferred contexts), and these contexts should push directly into both command processors, which can receive 7 game title contexts and 1 system OS context.

Ignoring the dual pipeline and GCP for a moment, I'm just looking for an 'OK' to assume that the hardware graphics contexts on the GPU line up perfectly with the Direct3D deferred/immediate contexts, which are in turn filled by separate CPU cores.
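For reference, this is roughly what the CPU-side split looks like in stock D3D11: one immediate context per device, with deferred contexts recorded on worker threads and replayed on the immediate one. Whether those deferred contexts map one-to-one onto hardware graphics contexts is exactly the question, so this only shows the API side.

```cpp
// Sketch of the stock D3D11 immediate/deferred context pattern (PC API).
// Thread/core assignment is entirely up to the title.
#include <d3d11.h>

void RecordAndSubmit(ID3D11Device* device, ID3D11DeviceContext* immediate)
{
    // Each worker core records into its own deferred context...
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);

    // ...state and draw calls are recorded here on the worker thread,
    // e.g. deferred->PSSetShader(...), deferred->Draw(...), etc. ...

    ID3D11CommandList* cmdList = nullptr;
    deferred->FinishCommandList(FALSE, &cmdList);

    // ...and the single immediate context replays the command lists in
    // whatever order the title chooses.
    immediate->ExecuteCommandList(cmdList, FALSE);

    cmdList->Release();
    deferred->Release();
}
```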
 
If you read Cerny's and Goossen's comments, you'll see that both of them are saying the same thing:

Thirdly, said Cerny, "The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands -- the idea is if you have some asynchronous compute you want to perform, you put commands in one of these 64 queues, and then there are multiple levels of arbitration in the hardware to determine what runs, how it runs, and when it runs, alongside the graphics that's in the system."

Andrew Goossen: The number of asynchronous compute queues provided by the ACEs doesn't affect the amount of bandwidth or number of effective FLOPs or any other performance metrics of the GPU. Rather, it dictates the number of simultaneous hardware "contexts" that the GPU's hardware scheduler can operate on any one time. You can think of these as analogous to CPU software threads - they are logical threads of execution that share the GPU hardware. Having more of them doesn't necessarily improve the actual throughput of the system - indeed, just like a program running on the CPU, too many concurrent threads can make aggregate effective performance worse due to thrashing. We believe that the 16 queues afforded by our two ACEs are quite sufficient.

The graphics/compute hardware contexts are there so the scheduler can determine which one of them should run on the system. Considering both Cerny's and Goossen's comments, it seems that the standard GCN architecture only supports one source of graphics commands (one graphics hardware context).

Cerny:

The original AMD GCN architecture allowed for one source of graphics commands, and two sources of compute commands. For PS4, we’ve worked with AMD to increase the limit to 64 sources of compute commands

On PS4 you can put your compute commands on one of those 64 queues, which are described as 64 sources of compute commands. So, on the original GCN architecture there should be only one queue for graphics commands, or one source of graphics commands.

Goossen:

Rather, it dictates the number of simultaneous hardware "contexts" that the GPU's hardware scheduler can operate on any one time. You can think of these as analogous to CPU software threads - they are logical threads of execution that share the GPU hardware. Having more of them doesn't necessarily improve the actual throughput of the system - indeed, just like a program running on the CPU, too many concurrent threads can make aggregate effective performance worse due to thrashing. We believe that the 16 queues afforded by our two ACEs are quite sufficient.

Goossen described the ACE queues as "hardware contexts" (that the GPU's hardware scheduler can operate on at any one time), compared them with CPU software threads, and suggested that having more of them won't necessarily increase the actual throughput of the system.

So, I think that the XB1 GPU has 8 graphics hardware contexts/queues/sources of graphics commands afforded by its two GCPs, seven of which are available to titles.
 
If you read Cerny's and Goossen's comments, you'll see that both of them are saying the same thing:
The graphics/compute hardware contexts are there so the scheduler can determine which one of them should run on the system. Considering both Cerny's and Goossen's comments, it seems that the standard GCN architecture only supports one source of graphics commands (one graphics hardware context).
On PS4 you can put your compute commands on one of those 64 queues, which are described as 64 sources of compute commands. So, on the original GCN architecture there should be only one queue for graphics commands, or one source of graphics commands.
I'm not sure if this is correct; someone else with more GCN knowledge will be able to chip in here.

Contexts, as I understand them (if they aren't that different from how an OS handles a context), are just switching containers: the registers and memory belonging to the context are moved together to be operated on. It's essentially how an OS is able to multitask, giving each program a small time slot. So a GPU in this case, dealing with multiple contexts, will change between contexts as required. IIRC, according to my DX11 textbook, prior to GCN compute and graphics required a context switch, which meant that if you decided to draw something, do compute on it, and draw again after, the context would change from graphics to compute and back to graphics. That switch caused massive overhead, so it was recommended to do your best to keep all the graphics work together and then switch to the compute context (see the sketch after the next paragraph).

With GCN, the ACE was developed as a separate compute command processor. Whenever a compute job is needed, that context will run on its own, completely separate from the graphics contexts. GCN command processors are required to run 'in order', or at the least need to finish execution in order. The ACE and the graphics command processor can shuffle the work up within the GPU provided it meets this criterion, as I understand it.
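A quick sketch of that textbook advice, using plain D3D11 calls: interleaving Draw and Dispatch on one context forces a graphics/compute transition at every boundary, while batching keeps it to a single transition. (On GCN the ACEs let compute go down its own queue anyway, which is the point above.)

```cpp
// Interleaved vs batched graphics/compute submission (stock D3D11 calls).
#include <d3d11.h>

void InterleavedVersion(ID3D11DeviceContext* ctx,
                        ID3D11ComputeShader* cs, UINT groups)
{
    for (int i = 0; i < 3; ++i)
    {
        ctx->Draw(36, 0);                  // graphics work
        ctx->CSSetShader(cs, nullptr, 0);
        ctx->Dispatch(groups, 1, 1);       // compute work
        ctx->CSSetShader(nullptr, nullptr, 0);
    }   // a graphics<->compute transition at every draw/dispatch boundary
}

void BatchedVersion(ID3D11DeviceContext* ctx,
                    ID3D11ComputeShader* cs, UINT groups)
{
    for (int i = 0; i < 3; ++i)
        ctx->Draw(36, 0);                  // all graphics first

    ctx->CSSetShader(cs, nullptr, 0);
    for (int i = 0; i < 3; ++i)
        ctx->Dispatch(groups, 1, 1);       // then all compute: one transition
}
```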

What I'm curious about are the graphics contexts, though; whether they are hardware or software would mainly just come down to how fast they can be swapped. I'm under the assumption that all GCN-based GPUs have hardware contexts, because GCN supports 'fast context switching', which I'm going to assume is the result of going with a hardware context. Right now, as we know it, MS has specifically aligned the number of possible DirectX contexts (8) with the number of available hardware graphics contexts (8). In this scenario, I'm assuming context switching will just occur as normal, unless having 8 contexts is considered special. The only advantage I could see in having the contexts line up is to reduce the overhead of context switching in the scenario where there are more DX contexts than there are hardware contexts.

I haven't a clue how many hardware contexts are supposed to be available on GCN; we just know the Xbox One has 8. All GCN GPUs, and Nvidia GPUs from the 660 on, will support D3D12, so how could they have guessed back then how many hardware contexts to make to line up with the number of CPU cores? It just seems like, if the number of cores lining up with the number of contexts really matters, that information should be available to consumers.

The reasoning behind that 2nd GCP can't be about supporting more contexts, IMO; we don't know whether each command processor on the X1 supports 4 or 8 contexts.

a) If 4, then they work together to form 8 contexts.
b) If 8 contexts per command processor, then each command processor holds a replica of every context.

What i"m thinking now, if it's (a) the second command processor is only utilized when it's required and _maybe_ but proofless that it can run 2 independent contexts at the same time, likely there will be some waiting, so we're talking about performance ranging anywhere between 1x (completely running serially)- 2x (perfectly threaded no contention), but likely averaging to being only better than 1x.

Or if it's option (b) [which I am leaning towards], it's about instant context switching, like a dual-clutch transmission. Instead of switching from context 7 to context 8, for instance, since both GCPs hold all 8 contexts, the GPU will just start pulling context 8 from GCP 2, as opposed to having to do a hardware context swap. Then GCP 1 switches to context 1 (while the other context is being worked on), and when the pipeline is clear the GPU pulls from GCP 1; then GCP 2 switches to context 2, and when the pipeline is clear it pulls from GCP 2, and so forth.

edit: I suppose there could be an option C that mixes both A and B together. Hmmm that would be seriously cool.
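Just to restate the option (b) idea in runnable form, here's a toy round-robin model of two GCPs that each hold all eight contexts and take turns feeding the pipeline while the other rolls to the next context in the background. It's purely a thought experiment based on the speculation above, not anything from the SDK.

```cpp
// Toy "dual clutch" model of option (b): two command processors ping-pong,
// so a context switch is just handing the pipeline to the other GCP.
#include <cstdio>

int main()
{
    const int kContexts = 8;
    int feedingGcp = 0;               // which GCP currently feeds the pipeline

    for (int ctxId = 0; ctxId < kContexts; ++ctxId)
    {
        std::printf("GCP%d feeds context %d; GCP%d rolls to context %d behind it\n",
                    feedingGcp, ctxId,
                    1 - feedingGcp, (ctxId + 1) % kContexts);
        feedingGcp = 1 - feedingGcp;  // ping-pong: the other GCP takes over next
    }
    return 0;
}
```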
 
Hello everyone, I'm new to these forums and am not a trained engineer or programmer. Or untrained for that matter. I love tech and I'm trying to understand so please be gentle in your responses.

I've read most of the recent posts and was wondering about the feasibility of a few theories. I've read the talk of there seeming to be extra parts, and about their potential use. There have been rumors over time of dual GPUs, and there appear to be parts on the die that could maybe support that. And there is the extra GCP. Given the OS design, is it possible that the Xbox One could connect to either a second Xbox One or the rumored Xbox Surface for a local compute cloud?

I'm just thinking about the various rumors over the years: forward compatibility, dual GPUs, two GPUs writing to the same scene.

Is a Hyper-V cluster possible with a tablet containing a modified Beema APU?
 
Hi. I have read a lot of papers and articles about the GCN architecture, and a big part of the XDK (not all of it, of course), and there are some interesting things.

Tessellation

I think that off-chip tessellation is a common feature of the GCN architecture. In this presentation we can see a direct reference, and in AnandTech's article too:

http://www.slideshare.net/DevCentralAMD/gs4152-michael-mantor (page 9).
http://www.anandtech.com/show/7457/the-radeon-r9-290x-review/3

AnandTech says:
Alongside the additional geometry processors AMD has also improved both on-chip and off-chip data flows, with off-chip buffering being improved to further improve AMD’s tessellation performance, while the Local Data Store can now be tapped by geometry shaders to reduce the need to go off-chip at all.


8 graphics contexts

There are a lot of references to the graphics contexts in the XDK, especially in PIX. In "PIX Shader Timelines" we can see how these contexts work. This made me wonder whether it was a unique feature of the Xbox One or common to the GCN architecture. I especially recommend this reading:

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/10/si_programming_guide_v2.pdf

There are a lot of similarities between the XDK and this paper. Given the definition in the XDK, and the concept of a context roll in the paper, I believe that this feature is common to the whole GCN architecture.

Of course Microsoft could have modified these features.

Bye.
 
Having read a ton of articles over the years, I have noticed that Microsoft has been designing chip architectures for some time now. I believe they started around 2004 with a small group and expanded to over 200 engineers. When they introduced the SoC for the 360 at Hot Chips, it seemed a lot of the work was done in-house. The RRoD was caused by a change one of their people made. Even one of the AMD engineers said Microsoft came to them with a roadmap for die shrinks and the eventual combining of the CPU and GPU. I wouldn't underestimate what Microsoft does in-house. They spend close to ten billion a year on R&D, and a huge chunk of it is in the Devices division.
 
Holy shit, that's a lot of money. One would think they could have come up with better hardware for that kind of investment.
 