Xbox One November SDK Leaked

Would the virtual addressing for GPU buffers allow developers to control precisely which sections of a buffer were in ESRAM? I'm talking about a much finer level than just "the top 30%".

Allocation is tracked via page-table translations, so it can be allocated at page granularity. The supported page sizes are 4 KB and 64 KB.
I've not determined just how arbitrary the current interfaces allow the mappings to be, but that seems to be what it is at the hardware level.
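To make the page-granularity point concrete, here's a trivial standalone sketch (nothing in it is the actual XDK API; the real mapping call would be whatever the title interface exposes). It only shows that residency can be decided per 64 KB page of a buffer's virtual range rather than as one contiguous split:

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical planning step only; the actual ESRAM mapping call is whatever
// the title interface exposes. The point is that residency can be chosen per
// 64 KB (or 4 KB) page of the buffer's virtual range.
constexpr uint64_t kPageSize = 64 * 1024;

struct PagePlacement
{
    uint64_t pageIndex; // index of the page within the buffer
    bool     inEsram;   // true if this page should be backed by ESRAM
};

std::vector<PagePlacement> PlanPlacement(uint64_t bufferBytes, uint64_t esramBudgetBytes)
{
    const uint64_t pageCount  = (bufferBytes + kPageSize - 1) / kPageSize;
    const uint64_t esramPages = esramBudgetBytes / kPageSize;

    std::vector<PagePlacement> plan;
    plan.reserve(pageCount);
    for (uint64_t i = 0; i < pageCount; ++i)
    {
        // Any subset of pages could be chosen; putting the first N in ESRAM
        // is just one policy, nothing forces a "top 30%" style split.
        plan.push_back({ i, i < esramPages });
    }
    return plan;
}

int main()
{
    for (const PagePlacement& p : PlanPlacement(8ull * 1024 * 1024, 2ull * 1024 * 1024))
        std::printf("page %llu -> %s\n",
                    static_cast<unsigned long long>(p.pageIndex),
                    p.inEsram ? "ESRAM" : "DDR");
}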
 
I found the following interesting; not sure if it's standard in other architectures or not, and/or if OpenGL exposes such things in its APIs.

The following explicit pieces of state exist in the Xbox One, whose contents are saved by the D3D runtime during "suspend":

context registers, CE RAM, ESRAM, GDS, Index buffers, some GPU registers (not all the GPU registers are readable), and CP internal memory

CE RAM = constant engine RAM

Each stage of the "traditional" graphics pipeline has its own CE RAM:

typedef enum D3D11X_CERAM_OFFSET
{
    D3D11X_CERAM_OFFSET_CS,
    D3D11X_CERAM_OFFSET_VS,
    D3D11X_CERAM_OFFSET_HS,
    D3D11X_CERAM_OFFSET_DS,
    D3D11X_CERAM_OFFSET_GS,
    D3D11X_CERAM_OFFSET_PS,
    D3D11X_CERAM_OFFSET_VB,
    D3D11X_CERAM_OFFSET_PS_UAV_INC,
    D3D11X_CERAM_OFFSET_CS_UAV_INC,
    D3D11X_CERAM_OFFSET_LIMIT
} D3D11X_CERAM_OFFSET;

Also GDS support

GDS = Global data store

On the Xbox One GPU, the UAV counters are stored in GDS. Performing atomic operations is faster in GDS than in main memory, so AppendBuffers and ConsumeBuffers use GDS for their counters by default, to reduce contention on main memory.

D3D11X now has explicit GDS support

typedef enum _D3D11X_GDS_OPERATION_FLAGS
{
    D3D11X_GDS_OPERATION_DEFAULT,
    D3D11X_GDS_OPERATION_READ_AT_EOS,
    D3D11X_GDS_OPERATION_READ_AT_TOP,
    D3D11X_GDS_OPERATION_WRITE_COPY_DATA_TO_CB,
    D3D11X_GDS_OPERATION_WRITE_USE_DATA_BY_POINTER,
    D3D11X_GDS_OPERATION_COMPUTE
} D3D11X_GDS_OPERATION_FLAGS;
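For reference, this is what the counter path looks like in plain, portable D3D11 (not the D3D11X GDS interface): an append UAV's hidden counter, which per the text above is the thing that lives in GDS on Xbox One, is reset when the UAV is bound and can be copied back out with CopyStructureCount. Error handling omitted:

// Plain, portable D3D11; the hidden append counter here is what the doc says
// is GDS-resident on Xbox One.
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct Particle { float pos[3]; float life; };

void CreateAppendBuffer(ID3D11Device* device,
                        ComPtr<ID3D11Buffer>& buffer,
                        ComPtr<ID3D11UnorderedAccessView>& uav,
                        UINT maxElements)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth           = sizeof(Particle) * maxElements;
    desc.Usage               = D3D11_USAGE_DEFAULT;
    desc.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(Particle);
    device->CreateBuffer(&desc, nullptr, &buffer);

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format             = DXGI_FORMAT_UNKNOWN;
    uavDesc.ViewDimension      = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.NumElements = maxElements;
    uavDesc.Buffer.Flags       = D3D11_BUFFER_UAV_FLAG_APPEND; // hidden counter
    device->CreateUnorderedAccessView(buffer.Get(), &uavDesc, &uav);
}

void DispatchAndReadCount(ID3D11DeviceContext* ctx,
                          ID3D11UnorderedAccessView* uav,
                          ID3D11Buffer* countReadback) // 4-byte buffer for the count
{
    UINT initialCount = 0; // resets the append counter when the UAV is bound
    ctx->CSSetUnorderedAccessViews(0, 1, &uav, &initialCount);
    ctx->Dispatch(64, 1, 1);
    // Copies the counter value into an ordinary buffer for later readback.
    ctx->CopyStructureCount(countReadback, 0, uav);
}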
 
The GPU is definitely doing the graphics rendering, and it is definitely virtualized using something like RemoteFX and Hyper-V. You can definitely run games in a virtualized environment on a GPU using RemoteFX. It's just that the workstation cards are the only ones that support it in the drivers, as far as I know. In a console, making their own driver, that's not really a limitation.

There is absolutely no hidden hardware in the Xbox One, besides what's in the SDK. The move engines are just slightly customized DMA. That's in the SDK. No hidden magic. The hardware is as described in the doc.
 

When we talked about GCP earlier, you mentioned that there was little to no detail about the 2nd GCP in the SDK. I'm beginning to think this GCP could be dedicated to the system/OS now, or possibly even Kinect at a system level.
 
There are two for sure. And it is not normal. I guess we can look into future AMD IPs and see if it shows up, but if it does not, it is very specific to Xbox and likely specific to a function that could require that sort of separation.
 

Whatever it's doing, it must be transparent to the developers, because I didn't see it referenced anywhere.
 
The move engines are just slightly customized DMA. That's in the SDK. No hidden magic. The hardware is as described in the doc.

For those interested, here are the quotes from the Xbox One architects' interview:

Digital Foundry: So often you're CPU bound. That explains why so many of the Data Move Engine functions seem to be about offloading CPU?

Andrew Goossen: Yeah, again I think we under-balanced and we had that great opportunity to change that balance late in the game. The DMA Move Engines also help the GPU significantly as well. For some scenarios there, imagine you've rendered to a depth buffer there in ESRAM. And now you're switching to another depth buffer. You may want to go and pull what is now a texture into DDR so that you can texture out of it later and you're not doing tons of reads from that texture so it actually makes more sense for it to be in DDR. You can use the Move Engines to move these things asynchronously in concert with the GPU so the GPU isn't spending any time on the move. You've got the DMA engine doing it. Now the GPU can go on and immediately work on the next render target rather than simply move bits around.

Nick Baker: From a power/efficiency standpoint as well, fixed functions are more power-friendly on fixed function units. We put data compression on there as well, so we have LZ compression/decompression and also motion JPEG decode which helps with Kinect. So there's a lot more to the Data Move Engines than moving from one block of memory to another.


Digital Foundry: What were your takeaways from your Xbox 360 post-mortem and how did that shape what you wanted to achieve with the Xbox One architecture?

Nick Baker: It's hard to pick out a few aspects we can talk about here in a small amount of time. I think one of the key points... We took a few gambles last time around and one of them was to go with a multi-processor approach rather than go with a small number of high IPC [instructions per clock] power-hungry CPU cores. We took the approach of going more parallel with cores more optimised for power/performance area. That worked out pretty well... There are a few things we realised like off-loading audio, we had to tackle that, hence the investment in the audio block. We wanted to have a single chip from the start and get everything as close to memory as possible. Both the CPU and GPU - give everything low latency and high bandwidth - that was the key mantra.

Some obvious things we had to deal with - a new configuration of memory, we couldn't really pass pointers from CPU to GPU so we really wanted to address that, heading towards GPGPU, compute shaders. Compression, we invested a lot in that so hence some of the Move Engines, which deal with a lot of the compression there... A lot of focus on GPU capabilities in terms of how that worked. And then really how do you allow the system services to grow over time without impacting title compatibility. The first title of the generation - how do you ensure that that works on the last console ever built while we value-enhance the system-side capabilities.
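The depth-buffer scenario Goossen describes is basically "start the copy, keep rendering, only wait right before you sample the copy". I'm not going to guess the D3D11X move-engine API names, but here's the same overlap pattern as a plain CPU-side analogue using std::async, just to show the shape of it:

// CPU-side analogue only, not the Xbox move-engine API. The "move engine"
// copy runs asynchronously while the "GPU" keeps working on the next target,
// and the consumer only waits right before it needs the data.
#include <cstdio>
#include <cstring>
#include <future>
#include <vector>

int main()
{
    std::vector<char> esramDepth(4 * 1024 * 1024, 1); // the finished depth target "in ESRAM"
    std::vector<char> ddrCopy(esramDepth.size());     // its destination "in DDR"

    // Kick the move asynchronously (stand-in for a move-engine copy plus fence).
    std::future<void> moveDone = std::async(std::launch::async, [&]
    {
        std::memcpy(ddrCopy.data(), esramDepth.data(), esramDepth.size());
    });

    // ..."GPU" immediately starts on the next render target, no stall...
    std::puts("rendering the next target while the copy is in flight");

    moveDone.wait(); // only block right before the DDR copy is sampled as a texture
    std::puts("copy complete; safe to texture from the DDR copy");
}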
 
I've never seen that page in the document.

The document has details on 4 "Move Engines" that "perform various types of fast direct memory access (DMA)".

Move engine 1
plain copy, swizzle/unswizzle (title exclusive use)

Move engine 2
plain copy, swizzle/unswizzle (title exclusive use)

Move engine 3
plain copy, swizzle/unswizzle, Lempel-Ziv (LZ) lossless encode/decode (title exclusive use)

Move engine 4
plain copy, swizzle/unswizzle, JPEG decode (title/system shared use)

All four can handle arbitrary source and destination addresses in main RAM or ESRAM.

In other places they are referred to as DMA engines 0-3.
DMA engines 0 and 2 are on one 25 GB/s bus (bus 1), and DMA engines 1 and 3 are on another 25 GB/s bus (bus 0).


That seems to be all there is to it.
 
I have the SDK open in front of me. Unfortunately the CHM viewer I got is primitive and doesn't even have search or copy/paste ... wtf.

Anyway, in the sections detailing the hardware overview or the memory systems, there is nothing mentioned about an "xdma chip." DMA means Direct Memory Access. It's a part of the GPU. The GPU block diagram shows 4 Move Engines, as we've expected. They are detailed in the SDK as we expected. There doesn't seem to be anything more.
 
Hey Scott. If you don't mind but could you link your reader? I wouldn't mind diving into the documentation myself now.

I have a MacBook, so I downloaded CHMSimpleViewer from the App Store because it was free. There may be something better out there that can actually do basic things like search, copy and paste.
 
Please keep this thread about the actual technical SDK. If you're new here, that means: don't post novice, unrelated questions in it. Do that in a different, non-technical thread.

Thanks...
 
There is actually quite a bit of information about the command processors in the SDK. It can be found in the "Caches and Coherency on the Xbox One GPU" topic under White Papers.

I don't know how they are supposed to work, but there are people here who are more knowledgeable about it.
But here are some things I would like to know more about, if you know them!

On a very basic level, a GPU can be thought of as a pipeline in which draw calls and compute-shader dispatches enter from the command processor and retire after they have been fully executed. Even though draws and dispatches can run concurrently on the GPU, they don’t overtake each other and always retire in order.

But there are 2 separate CPs, so there must be some sort of communication happening between the two to make sure that they are retiring in order, right?

The following is just information about the command processor - this is the first time I've read about a CP in this much depth, so here is the info. The interesting bit was that I didn't know the CP could stall. So it got me thinking that maybe the 2nd one is there to continue if the 1st one stalls?
Pipeline stages at a glance
A look at the pipeline as far as synchronization is concerned:

At the very top of the pipeline is the pre-fetch processor (PFP), which is the part of the command processor that reads memory for the micro-engine (ME). The PFP also kicks off the direct memory access (DMA) for the vertex geometry and tessellation (VGT) block, specifically for index buffers and indirect draw buffers.

The PFP is responsible for:

  • Reading command-buffer data.
  • Triggering index-buffer DMA transfers for the VGT.
  • Reading indirect buffer parameters.
  • Reading predication information.
  • Communicating and synchronizing with the ME.
The next stage in the pipeline is the ME. The PFP and the ME, both of which are parts of the command processor, are connected by two first-in, first-out queues. Before the ME can start executing commands, all of the data must have been read from memory by the PFP.

The command processor always tries to run as far ahead of the rest of the GPU as possible and always attempts to execute command packets if it is not stalled. Synchronization at the PFP level is more expensive than synchronization at the ME level because it’s farther away from the GPU and because the mechanism for PFP-to-ME synchronization is expensive.

The command processor also contains a unit called the CP DMA, which can be used to perform generic copies of memory and global data store (GDS) through direct memory access and through the GPU L2 cache. The CP DMA can run asynchronously or synchronously with regard to the rest of the command processor packets and to itself. It can be kicked off by the PFP or the ME, but it’s used mostly from the ME.

For the purposes of this discussion, it is useful to consider the shader-execution blocks of the GPU as a monolithic block. Unless manual synchronization using shader atomics is involved, that part of the GPU pipeline is relatively straightforward when it comes to synchronization.

The fixed-function part of the GPU that executes color-buffer writes, blending, depth tests, and so forth can also be seen as a monolithic block.

Conceptually, the synchronization process through the pipeline is as follows:

  1. A draw packet or a dispatch command packet enters the command processor through the pre-fetch processor.
  2. The packet is executed by the micro-engine.
  3. The packet’s shader stages are executed by the shader processor input (SPI) and the sequencer (SQ).
  4. The fixed-function render back end (comprising the shader export block [SX], depth block [DB], and color block [CB]) finishes the non–shader related work.
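That retirement model is the CP-internal view. For contrast, the portable D3D11 way for a title to confirm that everything submitted so far has fully retired is an event query; a rough sketch:

// Portable D3D11 (not the Xbox-specific CP mechanics above): an event query
// the CPU can poll to confirm that all previously submitted GPU work,
// draws and dispatches included, has fully retired.
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void WaitForGpuIdle(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_EVENT;

    ComPtr<ID3D11Query> query;
    device->CreateQuery(&desc, &query);

    ctx->End(query.Get());   // marks this point in the command stream
    ctx->Flush();            // make sure the commands are actually submitted

    // Poll until the GPU has retired everything up to the marker.
    BOOL done = FALSE;
    while (ctx->GetData(query.Get(), &done, sizeof(done), 0) != S_OK || !done)
    {
        // A real title would do useful work or sleep here instead of spinning.
    }
}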

Lastly, looking around, there is a particular driver call, called Surface Sync, that will purposefully stall the CP.
Surface sync
A surface sync is a top-of-pipe command that the driver uses both to trigger various cache flushes and invalidates and to ensure that the write-confirms of memory writes have been received. This is called a coherency operation. Coherency operations can be performed on color blocks and depth blocks and on four arbitrary memory locations, which are used for stream-out (SO) synchronization. Both a cache-invalidate operation and a coherency operation can be specified in the same packet.

A coherency operation works by first inspecting whether a surface given by the base pointer and size parameters is now being used by active graphics contexts, then stalling the CP until those graphics contexts are finished and write-confirms for the surfaces are received. A surface sync operation involves a context roll unless it is issued between a context register setting and a draw. Note that a pipelined event is associated with a current context, so a sequence of Draw/ContextRoll/InsertPipelinedEvent/InsertSurfaceSync-that-is-waiting-on-the-pipelined-event will not work because the freshly rolled context has no draws on it.

Why might a coherency operation be required? Triggering a cache-invalidate doesn’t make that memory immediately available and visible to other GPU blocks. Sometimes it takes thousands of cycles to finish the cache-flush operation. Apart from making cache flushes really expensive, it also means that a mechanism is needed to ensure the flush is complete before starting a draw call that consumes this data.

My question is: is stalling normal on GCN? Is it a desirable behaviour or a behaviour you try to avoid? And lastly, is the 2nd one there to continue operations if the first one is stalled? Do you really gain much from that type of scenario?

edit: @Scott_Arm : Even after reading all the way through this, still not a word about that second command processor. So I'm actually sitting on your side of the fence now. Maybe it was just mislabelled as a second GCP. It might have been just indicating the 2 inputs / 2 outputs on the GCP (1 draw call + 1 compute shader dispatch).
 
@iroboto Here is the AMD GCN white paper http://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

So far, I haven't seen anything in detail that would suggest Xbox One's command processor is anything but normal. Maybe that was a mislabeled slide as you suggest.
Agreed.

There is this:
Suspend and Resume events
On Xbox One, Game OS apps can be suspended and resumed by the Process Lifetime Manager (PLM). In the suspended state, the app’s memory is left intact, but the app has no CPU or GPU resources. It is an XR requirement for Xbox One that Game OS apps implement suspend and resume.

If an app receives the Suspending event, the app must call Suspend, otherwise the app will be terminated. Both the Suspend and Resume calls must operate on the title's render thread, to ensure that GPU state is saved off correctly by the Suspend call.

When the Suspend call is made, the Direct3D runtime will save the state of the context registers, CE RAM, ESRAM, GDS, Index buffers, some GPU registers (not all the GPU registers are readable), and CP internal memory. This state will be restored on the Resume call.

It doesn't go into depth as to how this is accomplished. But I wonder whether the way to suspend a game is to stall one of the two CPs while the title is suspended, and have the second command processor do the other things in the meantime.
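To be clear about what the doc actually requires of the title: only the "handle the Suspending event, then call Suspend/Resume on the render thread" contract is stated. A rough sketch of that wiring, with the device call left as a placeholder comment since the doc doesn't give the exact interface here:

#include <atomic>

std::atomic<bool> g_suspendRequested{false};
std::atomic<bool> g_resumeRequested{false};

// Called from the PLM Suspending / Resuming event handlers (any thread).
void OnSuspending() { g_suspendRequested = true; }
void OnResuming()   { g_resumeRequested  = true; }

// Render-thread loop. The doc requires Suspend/Resume to run on the render
// thread so the runtime can save and restore the GPU state it lists
// (context registers, CE RAM, ESRAM, GDS, index buffers, CP internal memory).
void RenderThreadTick(/* device pointer omitted */)
{
    if (g_suspendRequested.exchange(false))
    {
        // device->Suspend(...);  // the Suspend call the doc requires;
        //                        // exact interface/arguments not shown here
        // ...stop submitting GPU work until resumed...
    }
    if (g_resumeRequested.exchange(false))
    {
        // device->Resume();      // restores the saved state per the doc
    }
    // ...normal frame rendering...
}

int main()
{
    OnSuspending();
    RenderThreadTick(); // the Suspend work would happen here, on the render thread
}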
 