Xbox One (Durango) Technical hardware investigation

Here it is on the slide titled: "CPU and GPU Bandwidth Interaction"

http://develop.scee.net/files/presentations/gceurope2013/ParisGC2013Final.pdf

It shows the maximum bandwidth with no CPU utilisation as being under 140 GB/s (I'd say ~135 GB/s). That should resemble "most situations" then, no?

Make no mistake, I am truly a Sony fan, I just brought it up :)

Btw I initially stated:

Thanks!

Then, if the ESRAM is well used, can we assume ~140 GB/s?
And that GDDR5 is also ~140 GB/s?

Not the same thing, of course; the ESRAM is only 32 MB, but for some ops can we expect roughly the same bandwidth?
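To put very rough numbers on that comparison, here's a quick sketch using only figures quoted in this thread (the ~135-140 GB/s GDDR5 number from the SCEE slide, the ~140-150 GB/s "real world" ESRAM figure the architects mentioned, and the usual theoretical peaks); the midpoints are placeholders of mine, not measurements:

Code:
# Rough efficiency comparison using figures quoted in this thread.
# Midpoints are placeholders, not measurements.
peak_gb_s = {
    "PS4 GDDR5": 176.0,
    "X1 ESRAM (read+write)": 204.0,
}
real_world_gb_s = {
    "PS4 GDDR5": 137.5,              # ~135-140 GB/s per the SCEE slide
    "X1 ESRAM (read+write)": 145.0,  # ~140-150 GB/s per the architects
}

for name, measured in real_world_gb_s.items():
    efficiency = measured / peak_gb_s[name]
    print(f"{name}: ~{measured:.0f} of {peak_gb_s[name]:.0f} GB/s peak "
          f"({efficiency:.0%})")

So, to the question: in absolute terms the two land in roughly the same 135-150 GB/s ballpark, with the obvious caveat that the ESRAM is only 32 MB.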
 
It's been discussed numerous times and I don't understand the desperation or need to bring it up again and again.

The theoretical bandwidth of the X1 ESRAM is 204 GB/s with a caveat, which is dual-porting in a read/write mode; if the title isn't utilizing this, it'll be at 109 GB/s "guaranteed" for uni-directional ops, in the same sense that the 176 GB/s for the PS4 and the 68 GB/s for the X1 DDR3 are "guaranteed".

"Guaranteed" in the sense that this much bandwidth is available to be used regardless of the scenario (the 109/176/68), and the dual-porting will allow developers to exceed that 109 GB/s uni-directional limit only when reads and writes are done at the same time.

And the Xbox architects have already mentioned that "real world" measurements show the ESRAM typically reaching 140-150 GB/s in practice, meaning that dual-porting is already being utilized.
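For reference, those figures fall out of simple arithmetic if you take the numbers from the architects' interview at face value: an 853 MHz ESRAM clock, a 128-byte path in each direction, and the concurrent write being usable on roughly 7 of every 8 cycles (those inputs are my assumptions, not an official breakdown):

Code:
# Back-of-the-envelope for the ESRAM figures; inputs are assumptions.
clock_hz = 853e6          # ESRAM clock after the upclock
bytes_per_cycle = 128     # 1024-bit path in each direction
write_duty = 7 / 8        # concurrent write usable ~7 of every 8 cycles

one_way_gb_s = clock_hz * bytes_per_cycle / 1e9
peak_gb_s = one_way_gb_s * (1 + write_duty)

print(f"one-way (guaranteed): {one_way_gb_s:.1f} GB/s")  # ~109 GB/s
print(f"theoretical peak:     {peak_gb_s:.1f} GB/s")     # ~204-205 GB/s

The measured 140-150 GB/s then just says that, in practice, a concurrent write is being found for a decent fraction of cycles rather than nearly all of them.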
 
...Plus the DDR3 bandwidth of 68 GB/s (max), ~50 GB/s real world. That gives the X1 a realistically achievable total bandwidth of ~200 GB/s, once the devs get all their ESRAM ducks in a row... like the stage-4 ESRAM usage mentioned earlier.
 
The theoretical bandwidth of the X1 ESRAM is 204 GB/s with a caveat, which is dual-porting in a read/write mode; if the title isn't utilizing this, it'll be at 109 GB/s "guaranteed" for uni-directional ops, in the same sense that the 176 GB/s for the PS4 and the 68 GB/s for the X1 DDR3 are "guaranteed".

Dual porting would have no difficulty getting full theoretical peak in situations that demand it.
The solution is described as being banked.

"Guaranteed" in the sense that this much bw is available to be used regardless of the scenario (the 109/176/68), and the dual-porting will allow developer to exceed that 109GB uni-directional limit when read/write are done at the same time.
The guarantee for minimum bandwidth should be stronger for the ESRAM.
DRAM makes no promises of acceptable bandwidth if you run a problematic access pattern.
 
Dual porting would have no difficulty getting full theoretical peak in situations that demand it.
The solution is described as being banked.

Right, a true dual port implementation would allow that, and a banked dual port is a cheaper implementation but not as ideal (arbitration has a cost). There are a few different implementations.

The guarantee for minimum bandwidth should be stronger for the ESRAM.
DRAM makes no promises of acceptable bandwidth if you run a problematic access pattern.
Yes, there are advantages to the SRAM; it's just unclear whether they'd show up under GPU access patterns.
 
So is it correct to say the individual SRAM banks are single-ported at 109 GB/s, but the memory bus is full-duplex at 109/109?

I thought "Dual Ported RAM" meant the cells are dual ported, like registers of something. It doesn't seem to be the case here.
 
I thought "Dual Ported RAM" meant the cells are dual ported, like registers of something. It doesn't seem to be the case here.

It's not. What you are describing is a true dual-port memory, where each cell is dual-ported.
That would probably make the other implementations "fake"? But it's not like one has direct access to the memory cells to begin with.
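If it helps, here's a toy model of the banked arrangement being described (purely illustrative; this is not the actual ESRAM arbitration scheme): one read stream and one write stream against a banked array, where hitting the same bank in the same cycle forces the two accesses to serialize, while different banks proceed in parallel.

Code:
import random

# Toy banked "dual-port" array: one read and one write attempted per cycle,
# but a single bank can only service one of them at a time.
# Purely illustrative -- not the real ESRAM arbitration scheme.
NUM_BANKS = 8
CYCLES = 100_000

random.seed(0)
serviced = 0
pending_write = None   # a write that lost arbitration and retries next cycle

for _ in range(CYCLES):
    read_bank = random.randrange(NUM_BANKS)
    write_bank = pending_write if pending_write is not None else random.randrange(NUM_BANKS)
    if read_bank == write_bank:
        serviced += 1              # conflict: read wins, write waits
        pending_write = write_bank
    else:
        serviced += 2              # different banks: both complete
        pending_write = None

print(f"average accesses per cycle: {serviced / CYCLES:.2f}")

A true dual-port array would service both every cycle (2.0 accesses per cycle); the banked version lands somewhere below that, depending on how often the two streams collide.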
 
It's been discussed numerous times and I don't understand the desperation or need to bring it up again and again.

The theoretical bandwidth of the X1 ESRAM is 204 GB/s with a caveat, which is dual-porting in a read/write mode; if the title isn't utilizing this, it'll be at 109 GB/s "guaranteed" for uni-directional ops, in the same sense that the 176 GB/s for the PS4 and the 68 GB/s for the X1 DDR3 are "guaranteed".

"Guaranteed" in the sense that this much bandwidth is available to be used regardless of the scenario (the 109/176/68), and the dual-porting will allow developers to exceed that 109 GB/s uni-directional limit only when reads and writes are done at the same time.

And the Xbox architects have already mentioned that "real world" measurements show the ESRAM typically reaching 140-150 GB/s in practice, meaning that dual-porting is already being utilized.

If this were right, then why is this from Sony to its developers?

http://develop.scee.net/files/presentations/gceurope2013/ParisGC2013Final.pdf

Page 13 looks like 135 GB/s to me...
 
What's the importance of GPU compute context switching and GPU graphics pre-emption? Better GPGPU? More efficiency? Or does it reduce CPU overhead?

[Image: AMD_hsa_evolution.jpg]
 
From what I know so far, compute shaders can only work with buffers and textures. You can do this while the rest of the pipeline is doing its thing, so this would be the context switch I think the diagram is referring to. Essentially you are increasing the utilization of the GPU.

As for graphics pre-emption, I think what Shifty Geezer alluded to is attempting to set up or render frame N+1. This is to further increase the utilization of the GPU.
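A crude way to picture the utilization argument (a toy timeline with made-up numbers, not how a real GPU scheduler behaves): graphics passes leave idle gaps within a frame, and independent compute work can be slotted into those gaps.

Code:
# Toy frame timeline; all numbers are made up purely for illustration.
FRAME_MS = 16.7
graphics_passes_ms = [3.0, 4.5, 2.0, 3.5]   # e.g. shadows, g-buffer, lighting, post
async_compute_ms = 4.0                      # independent compute with no pipeline deps

gfx_ms = sum(graphics_passes_ms)
idle_ms = FRAME_MS - gfx_ms

util_graphics_only = gfx_ms / FRAME_MS
util_with_compute = (gfx_ms + min(async_compute_ms, idle_ms)) / FRAME_MS

print(f"graphics only:      {util_graphics_only:.0%} busy")
print(f"with async compute: {util_with_compute:.0%} busy")

Context switching and pre-emption are what let the hardware juggle those independent jobs, and pull one off the GPU if something more latency-critical arrives.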
 
6. GPU compute context switch and GPU graphics pre-emption: GPU tasks can be context switched, making the GPU in the APU a multi-tasker. Context switching means faster application, graphics and compute interoperation. Users get a snappier, more interactive experience. As UI's are becoming increasing more touch focused, it is critical for applications trying to respond to touch input to get access to the GPU with the lowest latency possible to give users immediate feedback on their interactions. With context switching and pre-emption, time criticality is added to the tasks assigned to the processors. Direct access to the hardware for multi-users or multiple applications are either prioritized or equalized

http://www.anandtech.com/show/5847/...geneous-and-gpu-compute-with-amds-manju-hegde

2.4. Preemption and Context Switching
TCUs provide excellent opportunities for offloading computation, but the current generation of TCU hardware does not support pre-emptive context switching, and is therefore difficult to manage in a multi-process environment. This has presented several problems to date:

• A rogue process might occupy the hardware for an arbitrary amount of time, because processes cannot be preempted.
• A faulted process may not allow other jobs to execute on the unit until the fault has been handled, again because the faulted process cannot be preempted.

HSA supports job preemption, flexible job scheduling, and fault-handling mechanisms to overcome the above drawbacks. These concepts allow an HSA system (a combination of HSA hardware and HSA system software) to maintain high throughput in a multi-process environment, as a traditional multi-user OS exposes the underlying hardware to the user.

To accomplish this, HSA-compliant hardware provides mechanisms to guarantee that no TCU process (graphics or compute) can prevent other TCU processes from making forward progress within a reasonable time.

http://developer.amd.com/wordpress/media/2012/10/hsa10.pdf
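A minimal sketch of why that matters, under my own simplification (a toy round-robin scheduler, not HSA's actual mechanism): without preemption, a job that never yields starves everything queued behind it, while a time quantum keeps every job making forward progress.

Code:
from collections import deque

# Toy illustration of the "rogue process" problem described above.
# Not HSA's actual scheduling mechanism -- just cooperative vs preemptive.
def run(jobs, quantum=None, budget=100):
    """jobs: name -> remaining work units. quantum=None means no preemption."""
    queue = deque(jobs.items())
    progress = {name: 0 for name in jobs}
    while queue and budget > 0:
        name, remaining = queue.popleft()
        step = remaining if quantum is None else min(quantum, remaining)
        step = min(step, budget)
        budget -= step
        progress[name] += step
        if remaining - step > 0:
            queue.append((name, remaining - step))
    return progress

jobs = {"rogue": 10**9, "ui_compute": 5, "physics": 8}
print("no preemption: ", run(jobs))            # rogue hogs the whole budget
print("with quantum=4:", run(jobs, quantum=4)) # everyone makes forward progress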
 
Thanks, very informative.
 
Can anyone find power draw figures for the Xbox One in game, but without Kinect attached?

It'd be interesting to see how much power it draws in "standard console" mode...
 
So in the end Kinect not only required a 10% GPU reservation but also a 5% CPU reservation. I had never heard of this before, tbh.

The Crew developers achieved 1080p on the Xbox One and freeing up the CPU helped a lot.

http://gamingbolt.com/the-crew-deve...p-resolution-on-xbox-one-5-cpu-reserve-helped
Seems like it. This could explain why the CPU was weaker than expected.
Well, at least I expected that the Kinect CPU resources were part of the two OS cores, but it seems that this wasn't the case.
Also, if 5% CPU and ~10% GPU were reserved, how much memory bandwidth did Kinect use, or how much was reserved?
 
Actually, I've been thinking about the effect of the Kinect reserve on effective CPU performance since a post I made in the esram thread last night.

Someone clever once said something funny about quoting themselves to make conversation more betterer, or something, so pretend I've come up with something witty like that and then read this:

Incidentally, after MS's comments about bandwidth increases I found myself wondering if the reserved slice had applied to both ESRAM and main RAM, especially given how tightly coupled the two memory pools seem to be (the ROPs can write just as easily to both; reserving the ESRAM bus but not the main bus would seem inconsistent).

If you were taking 10% - or some percentage of the dram bus - and reserving it with a time slice in order to make sure that you could feed the GPU that you were also reserving a 10% slice of, then it'd cause quite a hit on CPU performance. Without main memory the CPU isn't going to be able to do much of anything.

So you might as well take a good chunk of CPU time and reserve that too, because game threads would stall anyway.

Sooo ... GPU reserve requires main memory bus time slice reserve -> main memory bus slice reserve will kill CPU during reserved period -> might as well reserve a good chunk of CPU time (all cores) too.

... surely?
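Putting rough numbers on that chain of reasoning, using only the peak figures already discussed in this thread and assuming a flat 10% time slice at 60 fps (the flat-slice model is my assumption, not a known detail of the reservation):

Code:
# Rough per-frame bandwidth budgets with and without a 10% time slice.
# Peak figures are the theoretical ones discussed in this thread.
FRAME_S = 1 / 60
RESERVE = 0.10

peaks_gb_s = {"DDR3": 68.0, "ESRAM one-way": 109.0, "ESRAM peak": 204.0}

for name, bw in peaks_gb_s.items():
    per_frame = bw * FRAME_S                 # GB available per frame, no reserve
    with_reserve = per_frame * (1 - RESERVE)
    print(f"{name}: {per_frame:.2f} GB/frame -> {with_reserve:.2f} GB/frame "
          f"with a 10% slice reserved")

Which is the point: if the bus is sliced along with the GPU, game threads on the CPU lose that same slice of memory access whether you formally reserve CPU time or not.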
 
To some degree, but the XB1 CPU is clocked 9.4% higher than the PS4's, so even with a 5% reservation it should still perform ~4% better.

As it is, it performs significantly worse and it's a mystery as to why...

We don't know if that 5% is the extent of the reservation though. The GPU and at least one and possibly both memory buses had a 10% reservation, so it seems possible (probable) that the CPU reservation was at least 10%. After all, there is still at least some Kinect stuff to manage as a matter of course.

Plus, the current implementation of DX11 may be quite overhead-heavy relative to what runs on the PS4. Xbox One stands to see a reduction in CPU overhead when DX12 lands, as well as better support for multi-threaded rendering.

Even now, some nine months after launch, Xbone still hasn't hit the base level of performance it will offer games.
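For the clock-speed point quoted above, the arithmetic looks like this (assuming the commonly cited 1.75 GHz and 1.6 GHz clocks, treating the reservation as a straight percentage of CPU time, and ignoring whatever the PS4 reserves on its side, as the quoted comparison does):

Code:
# How "~4% better" falls out of a 5% reserve, and what a 10% reserve would do.
XB1_GHZ, PS4_GHZ = 1.75, 1.60   # commonly cited clocks

for reserve in (0.05, 0.10):
    effective = XB1_GHZ * (1 - reserve)
    delta = effective / PS4_GHZ - 1
    print(f"XB1 with {reserve:.0%} CPU reserved: {effective:.3f} GHz effective "
          f"({delta:+.1%} vs PS4)")

So a 5% reserve still leaves a nominal ~4% clock advantage, while a 10% reserve would already tip it slightly the other way, which is why the size of the real reservation matters here.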
 
I've noticed the PR side of things has been using "bandwidth" loosely. There's a brief interview published today about Destiny on Xbox One being confirmed as 1080p, in which the rendering engineer mentioned it's [just] extra GPU time. It's such a complex system that I feel taking a few parts out is at best an apples-to-oranges comparison.
 
For the ESRAM, "GPU time" includes both processing and ESRAM bus use, so effective 'flops' and 'ops' and 'BW' increase proportionally. MS's use of 'BW' is as correct as their use of 'GPU time' or 'GPU power' or whatever (for whatever that's worth).

In terms of the GPU, reserving 10% of GPU time but reserving no main RAM bandwidth to feed it would seem ... odd ... seeing as it's doing nothing without main memory.

Anyway, we know there was a 5% CPU reserve freed up now, so that's both DRAM and ESRAM BW that's been effectively released, if you look beyond what is instantaneously available to the game during its slice and instead at GB/frame.
 
In terms of the GPU, reserving 10% of GPU time but reserving no main RAM bandwidth to feed it would seem ... odd ... seeing as it's doing nothing without main memory.
You can't not reserve memory BW, so it doesn't need to be explicitly specified. That is, for every clock the processors are working on the system, they can't be accessing the RAM for the game, which means those clocks of RAM access are lost to the game, regardless of whether the processors are actually consuming data or not. Or, if the RAM is capable of 2133 million transactions per second, and the processors are tied up for 10% doing system stuff, you can't then access the RAM 2133 million times in the remaining 90% of the second for the game.
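That transactions-per-second example, spelled out (same simplifying assumption: a flat 10% of time where the system, rather than the game, owns the processors and hence the RAM accesses):

Code:
# The 2133-million-transactions example from the post above, spelled out.
TRANSACTIONS_PER_S = 2133e6   # e.g. DDR3-2133
SYSTEM_SHARE = 0.10           # fraction of time spent on system work

left_for_game = TRANSACTIONS_PER_S * (1 - SYSTEM_SHARE)
print(f"transactions/s left for the game: {left_for_game / 1e6:.0f} million")
# The 'lost' ~213 million accesses can't be made up in the remaining 90%.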
 