Xbox One November SDK Leaked

For the current gen, they all were substantially incomplete; the Wii U massively so, and the PS4 still can't suspend to RAM.
The Xbox One had a very abrupt 180 on its always-on connectivity requirement, which had to be re-engineered at a time when the platform should have been solidifying.

One side note: the VMEM documentation indicates a GPU L1 miss can take 50+ clock cycles to serve, an L2 miss to DRAM can take 200+ cycles, while an L2 miss to ESRAM can take 75+ cycles. This is in the context of GPU units, so these look like GPU cycles.
This seems to put 250+ for a miss all the way to DRAM and 125+ for a miss to ESRAM, or roughly half the latency. In a different portion, there is an expectation of a texture access realistically taking more than 100 cycles and possibly around 400 if there is a miss.
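
Just to make the assumption behind that arithmetic explicit, here is how I'm combining the quoted figures (treating the L1 miss cost and the downstream miss cost as simply additive, which the doc doesn't actually state; the real pipeline may overlap parts of them):

    // Hypothetical combination of the quoted figures, assuming the costs simply add.
    constexpr int kL1Miss    = 50;   // GPU cycles to service an L1 miss
    constexpr int kL2ToDram  = 200;  // additional cycles if the L2 misses to DRAM
    constexpr int kL2ToEsram = 75;   // additional cycles if the L2 misses to ESRAM

    static_assert(kL1Miss + kL2ToDram  == 250, "full miss out to DRAM");
    static_assert(kL1Miss + kL2ToEsram == 125, "full miss to ESRAM, roughly half");
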
You're right, the other consoles have issues software-side as well. I am not in any way condemning the X1; I own and enjoy the console a lot. Just from the little I have read of the document leaked above (with my limited knowledge), it seems like they were behind driver- and API-wise. They launched with 2 APIs: a standard vanilla DX 11.1 and the monolithic, low-to-the-metal proper console API. They switched to only the Mono version halfway through this year. I remember reading a DF article with the tech head of 4A Games where he mentions both and states they did not have time to use the Mono API. It brings the whole COD and BF4 resolution thing to mind. I'm sure launch-window titles are always difficult to make, but it seems they were working with a set of placeholder drivers and APIs. I am glad things have improved so much. I hope they continue improving.
 
We are a bit off-topic, but could they really be blamed? They didn't want a reversal of the situation where the 360 beat the PS3 to market.

It was a ballsy move for sure, but I wonder how many people bought a PS4 because of the resolution/framerate discrepancy of early games but would have waited to choose a platform even if the Xbox One had launched later.

The early stuff, COD: Ghosts and Tomb Raider in particular, were not flattering. I think they may have done better over the long haul if they had waited a year.
 
I think late 2015 Xbox One titles will show great enhancements over 2014 titles.
I wonder if Forza 6 is set to launch in 2015?
I would not mind a "1080pr" resolution if the IQ is clean.
 
As I quietly mentioned in the past, MS pulled all their OS guys and all their engineers onto Xbox before launch.

All of them? Nobody left to work on Windows, Office, VS et al.? That's quite a few engineers and OS guys (are the latter not engineers? Did they put all non-engineering staff from the OS group into Xbox before launch...would explain a few things!).
 
mm... Thanks. So for shaders & tex ops, does it come down to higher utilization (fewer gaps) with the lower latency?

Maybe I'll move this...

The VMEM documentation shows that this can be complicated.
One thing of note that I was curious about when GCN was announced: vector memory operations monopolize the vector memory path.
The documentation breaks a vector memory operation's execution into four stages once a wavefront reaches the point of issuing it:

1) Issue: send operands for this instruction down three parallel buses. Command (type of access, texture and sampler data if necessary), Address (number of coordinates), Data (if writing)
Whichever takes longest determines the cycle count, with buffer accesses being simplest and fastest.

2) Address computation and data request
Depending on the type of request, particularly the kind of filtering, multiple addresses are computed and sent. A memory instruction can sit many cycles here spitting out addresses for the L1.
More complex cases like anisotropy can actually be a bottleneck here.

3) The requests go to the L1, and it starts collecting the necessary data. This is the point where the operation becomes what the rest of the system recognizes as memory requests.
This is the stage that is heavily influenced by memory latency. The cache is described as being non-blocking, so the L1 is able to overlap latencies by keeping multiple accesses in flight. This means that if there are one or more misses, the latency of a miss can be overlapped by other hits or even other misses (a toy sketch of this overlap follows after stage 4). This, plus measures to increase locality, keeps the more complex accesses from scaling to insanely high latencies, while also making the issue and address-generation latencies a measurable portion of the overall latency that memory has no influence over.

4) When the L1 gathers the data, it submits the data to the path responsible for filtering and conversion, and this then returns to the shader. This is subject to the filtering/blend rates, which neither DRAM nor ESRAM can influence.
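
To put a rough number on that stage-3 point, here's a toy model of my own (not from the doc; the miss count, latency and issue interval are made-up figures): with several misses in flight behind a non-blocking L1, the stall looks more like one full round trip plus some issue overhead, rather than every miss serialized back to back.

    #include <cstdio>

    // Toy model of stage 3 with a blocking vs. non-blocking L1 (hypothetical numbers).

    // Blocking: every miss pays the full round trip, one after another.
    int serialized_cycles(int misses, int miss_latency) {
        return misses * miss_latency;
    }

    // Non-blocking: misses overlap, so the cost is roughly one full round trip
    // plus the rate at which further requests can be issued behind it.
    int overlapped_cycles(int misses, int miss_latency, int issue_interval) {
        return miss_latency + (misses - 1) * issue_interval;
    }

    int main() {
        // A wavefront touching 8 lines that all miss out to DRAM (~250 cycles each).
        printf("blocking:     %d cycles\n", serialized_cycles(8, 250));    // 2000
        printf("non-blocking: %d cycles\n", overlapped_cycles(8, 250, 4)); // 278
        return 0;
    }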

I think, going by this, that the 125 or so cycles saved matter more for accesses that have very low issue overhead and filtering/conversion, which means buffers and possibly compute in particular. The next requirement is something like a producer/consumer relationship or poor enough locality.
This may lend itself to compute, and perhaps the ROPs could find this useful.

This might reduce the necessary level of occupancy to hide memory latency somewhat. Various matrix multiply kernels already favor heftier contexts and perversely lower occupancy in exchange for better cache blocking / register use / data share use, so perhaps even more aggressive measures could be taken without the reduced latency hiding becoming a problem.
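
For anyone unfamiliar with that trade-off, here is a minimal sketch of the standard blocked multiply, written in CUDA purely for illustration (it is not from the doc, and a GCN/console shader would obviously look different). Each block stages tiles in shared memory (the data share) and each thread holds its accumulator in registers, which costs occupancy but cuts trips to memory; the lower the memory latency, the less that lost occupancy hurts.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Blocked matrix multiply C = A * B for square N x N matrices, N a multiple of TILE.
    // The shared-memory tiles and per-thread accumulator trade occupancy for data reuse;
    // larger tiles or more outputs per thread push the trade-off further.
    #define TILE 16

    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];   // block-wide tile of A in shared memory
        __shared__ float Bs[TILE][TILE];   // block-wide tile of B in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;                  // lives in a register for the whole loop

        for (int t = 0; t < N / TILE; ++t) {
            // Every thread loads one element of each tile; the whole block reuses them.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }

    int main() {
        const int N = 512;
        const size_t bytes = N * N * sizeof(float);
        float *A, *B, *C;
        cudaMallocManaged(&A, bytes);
        cudaMallocManaged(&B, bytes);
        cudaMallocManaged(&C, bytes);
        for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

        matmul_tiled<<<dim3(N / TILE, N / TILE), dim3(TILE, TILE)>>>(A, B, C, N);
        cudaDeviceSynchronize();
        printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);

        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }
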
I saw somewhere, but cannot remember where, something about divvying up the workload so that, at least for some scenarios, texturing can be satisfied by DRAM, because it is latency tolerant and not the heaviest bandwidth consumer.
ROPs burn serious bandwidth to hide their limited tolerance for latency, so there may be a benefit where reduced latency also means the ROPs can use less bandwidth or be better utilized.
In the case of ROPs, it might be nice to know if their latency is different, like if they can bypass even more with the ESRAM than the vector path can.
 
All of them? Nobody left to work on Windows, Office, VS et al.? That's quite a few engineers and OS guys (are the latter not engineers? Did they put all non-engineering staff from the OS group into Xbox before launch...would explain a few things!).
IIRC, two teams had to join the Xbox team. One was the OS team for sure; I don't recall the other. The practice is not uncommon; I believe the OS X team will sometimes join iOS when a deadline is close. This was just more than one team, IIRC.

But it was a lot of hands on deck. I don't think the Office team was pulled, lol, but good catch on my poorly worded post.
 
I think late 2015 Xbox One titles will show great enhancements over 2014 titles.
I wonder if Forza 6 is set to launch in 2015?
I would not mind a "1080pr" resolution if the IQ is clean.
If it follows the cycle they've claimed of late, we should see it at E3. Part of me hopes they give it more time in the oven and go for 2016. Then again, they did already port some form of their F5 engine to alpha DX12 code eight months or so ago.

You're right, the other consoles have issues software-side as well. I am not in any way condemning the X1; I own and enjoy the console a lot. Just from the little I have read of the document leaked above (with my limited knowledge), it seems like they were behind driver- and API-wise. They launched with 2 APIs: a standard vanilla DX 11.1 and the monolithic, low-to-the-metal proper console API. They switched to only the Mono version halfway through this year. I remember reading a DF article with the tech head of 4A Games where he mentions both and states they did not have time to use the Mono API. It brings the whole COD and BF4 resolution thing to mind. I'm sure launch-window titles are always difficult to make, but it seems they were working with a set of placeholder drivers and APIs. I am glad things have improved so much. I hope they continue improving.

Either a) they were playing the long game, knowing the future API would be rolled out at a later date, b) they screwed up somewhere, or c) they rushed the launch by a year.

I don't buy c, as even if it launched today the API is still not what we will see come the full DX12/Win10 OS.
 
If your PS4/XOne game is CPU dependent, you're doing it wrong...

No more so than if you're GPU dependent.

As the developer of Metro put it, your bottleneck shifts around many, many times during the creation of a frame. At some points this'll be the CPU, and this is probably happening on every game with a non-trivial CPU load, especially with CPUs as weak as the PS4's and Xbox One's.

I'm quite confident that you'll be seeing a lot more developers "doing it wrong" (as you put it) as the generation progresses.
 
non-trivial CPU load

Why would you need a non-trivial CPU load?
I think the GPU is the modern CPU, because massive multi-core is the future. The performance of x86 CPUs is growing at slower and slower rates every day. Using the CPU for any task is a dead end.
Yes, there are problems with current GPUs, the most obvious being that they are not directly programmable (they are still accessed through fixed-function interfaces and command processors, and use proprietary byte code), but they are still the only way to get performance out of any modern gaming-oriented architecture.
In the near future the CPU should be the "command processor" of today, and the GPU should be used for all other tasks.
And in fact you can work this way on PS4 even now (yeah, some hacks would be needed, but not too many).
 
It was a ballsy move for sure, but I wonder how many people bought a PS4 because of the resolution/framerate discrepancy of early games but would have waited to choose a platform even if the Xbox One had launched later.

The early stuff, COD: Ghosts and Tomb Raider in particular, were not flattering. I think they may have done better over the long haul if they had waited a year.
I'm sure it's a huge factor, if not directly then at least indirectly through word of mouth and simplification. I've had many other parents whose kids tell them they want a PS4 because it's better, and they ask me to verify. I explain to them what "better" means, and in the end we often settle on: "Get your PS4 now and appease your child, though truthfully I have a strong feeling you're going to end up buying an X1 eventually."

Despite that though, if MS can "hold on" in North America for the first 2 operating years and these SDK improvements finally get X1 where MS wants it, then it might be worth it as opposed to launching a year later.
 
Why would you need a non-trivial CPU load?
I think the GPU is the modern CPU, because massive multi-core is the future. The performance of x86 CPUs is growing at slower and slower rates every day. Using the CPU for any task is a dead end.
Yes, there are problems with current GPUs, the most obvious being that they are not directly programmable (they are still accessed through fixed-function interfaces and command processors, and use proprietary byte code), but they are still the only way to get performance out of any modern gaming-oriented architecture.
In the near future the CPU should be the "command processor" of today, and the GPU should be used for all other tasks.
And in fact you can work this way on PS4 even now (yeah, some hacks would be needed, but not too many).

Yes, GPU performance is growing faster than CPU performance. But that's not the issue at hand.

The issue at hand is: why would a PS4Bone developer leave the CPU underutilised (and used for only trivial tasks)? Moving work over to the GPU isn't done just for the sake of it, and dropping frames because you're GPU limited isn't inherently better than dropping frames because you're CPU limited.

Also bear in mind that many (almost all) games are multiplatform in nature, and different hardware has different traits. PCs can be relied upon to have increasingly beefy CPUs but cannot be relied upon to have fast GPUs. The PS4, on the other hand, has a decent GPU (especially given console TDPs) but a poor CPU.

Compromise is the nature of the business. CPU workloads are more likely to increase over the generation than decrease.
 
leave the CPU underutilised

I didn't say "underutilized", just don't put any performance (bandwidth)-sensitive code there. Use it for latency-sensitive code, for example; the CPU has quite big caches to help with latency.

different hardware has different traits

Cool so you want to reuse code between platforms? Sorry, not going to happen.
Unless you want to sacrifice performance and then bitch about "poor CPU". :)

CPU workloads are more likely to increase over the generation than decrease.

Why? Because people are not smart enough to work properly with GPUs?
P.S. on the "it should also run on PC" stuff: it should not, right now PC has the biggest problem with using GPU properly: the drivers do not support it, API does not support it. And I'd say: screw that, just don't launch on PC.
 
Can we take the arguments about programming models for GPU vs CPU elsewhere, unless it's specific to the leak or X1? I'd rather read posts with insight to the actual document.
 
Cool so you want to reuse code between platforms? Sorry, not going to happen.
Unless you want to sacrifice performance and then bitch about "poor CPU". :)

Code gets shared between platforms a lot! Not only is it happening, it's been happening for more than a decade now, in basically almost every multiplatform game, and yes, that includes performance-sensitive stuff like the sim-side work.

Why? Because people are not smart enough to work properly with GPUs?

Has absolutely nothing to do with smarts, and everything to do with economic realities of making more and more ambitious games.

P.S. on the "it should also run on PC" stuff: it should not, right now PC has the biggest problem with using GPU properly: the drivers do not support it, API does not support it. And I'd say: screw that, just don't launch on PC.

Meanwhile, in development studios all over the world ... :p
 
Can we take the arguments about programming models for GPU vs CPU elsewhere, unless it's specific to the leak or X1? I'd rather read posts with insight to the actual document.

Sorry dad. Yeah, we've gone way off topic. No more on that from me.

Now that this leak has put so much info in the public domain, perhaps more developers will feel safe to speak out. It's clear that much of the Bone's "gap closing" is due to changes in the platform and development environment. If this platform were open to XNA-style development we'd already have lots of examples of speedups made available on things like large buffers that spill into DRAM, but due to the secret nature of the platform we've so far got nowt.
 
Was the mystery about what the 3 display planes did (I thought we had narrowed it down to OS, HUD + game?), or was it about the 2 graphics command processors and their purpose in managing the 3 display planes? Is there anything in the SDK documentation about those GCPs?

I don't think there was really much of a mystery about the display planes, except maybe what options there were for upscaling. There seemed to be a lot of crazy ideas persisting about what they did, and what they were for, but those should pretty much be put to bed.

I haven't looked for any info on the command processors. Don't know anything about it.
 
There is a page on a split render target sample, but unfortunately it doesn't have much info. The picture shows about 70% of the render target in ESRAM and 30% in DRAM; that would be the top 20% of the screen and the bottom 10% of the screen in DRAM. They say there typically is not a lot of overdraw in sky and ground. Seems like Z, G and colour buffers are all split, but are able to be split differently as needed. It mentions that HTile should reside in ESRAM, but I don't know what that is (assuming it has something to do with tiled resources, the tile buffer maybe?).
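
Out of curiosity, here's some rough footprint math on that split, assuming a 1920x1080 32-bit (RGBA8) colour target as a hypothetical example; the sample doesn't state the format, and the percentages are just the ones quoted above. It gives a feel for how much of the 32 MB ESRAM the split frees up for other targets, at the cost of the sky/ground rows living in slower DRAM.

    #include <cstdio>

    int main() {
        // Hypothetical 1080p 32-bit colour target, split per the sample's picture:
        // middle ~70% of rows in ESRAM, top 20% + bottom 10% of rows in DRAM.
        const int    width = 1920, height = 1080, bytes_per_pixel = 4;
        const double total_mb = width * height * bytes_per_pixel / (1024.0 * 1024.0);

        printf("full target: %.1f MB\n", total_mb);           // ~7.9 MB
        printf("ESRAM part:  %.1f MB\n", total_mb * 0.70);    // ~5.5 MB of the 32 MB ESRAM
        printf("DRAM part:   %.1f MB\n", total_mb * 0.30);    // ~2.4 MB spilled to DRAM
        return 0;
    }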
 