Xbox One (Durango) Technical hardware investigation

The article still has the theoretical max at 204 GB/s.
The sidebar of the article points to a peak mix of full-rate reads and a bubble on the write pipeline; that is the apparent best case.

You referring to this?
"The same discussion with ESRAM as well - the 204GB/s number that was presented at Hot Chips is taking known limitations of the logic around the ESRAM into account. You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... one out of every eight cycles is a bubble so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM. And then if you say what can you achieve out of an application - we've measured about 140-150GB/s for ESRAM.

But it is confusing: it sounds like when they measured with that "one app" it was only able to do combined R/W at 204GB/s total, but for all reads or all writes they get 109GB/s?
 

It seems like there are limiting factors in the controllers, or latencies in eSRAM access that can be pipelined in pure read or pure write traffic but cannot be hidden when the two types are being juggled.
Apparently full-issue reads can still happen, but not writes.
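For reference, a rough sketch of the arithmetic behind the figures in the quote, assuming the 853MHz eSRAM clock and a 128-byte path per direction (which is consistent with the ~109GB/s one-way number):

```python
# Rough eSRAM bandwidth arithmetic for the figures quoted above.
# Assumptions: 853 MHz eSRAM clock, 128 bytes/cycle in each direction,
# and writes losing one cycle in eight (the "bubble" Baker describes).

ESRAM_CLOCK_HZ = 853e6      # post-upclock GPU/eSRAM clock
BYTES_PER_CYCLE = 128       # per direction

one_way = ESRAM_CLOCK_HZ * BYTES_PER_CYCLE / 1e9   # ~109 GB/s: pure reads or pure writes
write_duty = 7 / 8                                 # one write bubble every eight cycles
combined_peak = one_way * (1 + write_duty)         # ~205 GB/s: reads plus throttled writes

print(f"one-way peak:  {one_way:.1f} GB/s")        # ~109.2
print(f"combined peak: {combined_peak:.1f} GB/s")  # ~204.7, quoted as 204
print("measured in an application: 140-150 GB/s (per the interview)")
```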
 
Glad ya remembered my field of research! :D



Charlie over at S|A had suggested the display planes (the scalers you are talking about) might be a bigger deal than anyone expects, as per insider info; he claimed that after Hot Chips. I mention it only because he also claimed to have other insider info saying the real-world eSRAM bandwidth was 140GB/s-150GB/s, which is verbatim the range given offhand by Baker in the DF article.

Just something I thought I'd note... I had personally heard 142GB/s for eSRAM, but I found it interesting that Charlie's specific range was quoted by Baker precisely. Wonder if Charlie is right about the display planes too.



Slightly off topic, and mods are welcome to edit or move this part if need be, but does this info about how a 53MHz clock boost did more to improve real game code than 2 CUs (in terms of graphics rendering, which is what MS was testing) suggest anything useful for the 14/4 speculation about PS4's setup?

If PS4 is balanced at 14 CUs, that's 1.435TFLOPS of raw performance, but if the gain thereafter for every 2 CUs is similar to what MS's testing showed (in real game code I mean, not on-paper spec), then by my calculations you'd be looking at a huge falloff in CU utilisation versus raw spec of around 60% (1.6TFLOPS effective for 18 vs 1.435TFLOPS for only 14). Maybe that is what Sony/Cerny meant in that regard? Not sure what thread this part should go in... so apologies in advance. :/
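For anyone checking the raw-spec side of those numbers, this is just the standard GCN throughput formula (CUs × 64 lanes × 2 ops per clock); the 1.6TFLOPS figure above is the poster's effective-performance estimate, not a raw spec. A quick sketch with the commonly cited console clocks:

```python
# Raw single-precision throughput for a GCN GPU:
#   FLOPS = CUs * 64 lanes * 2 ops/cycle (FMA) * clock
def gcn_tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000.0

print(f"{gcn_tflops(12, 0.853):.2f} TFLOPS")  # Xbox One: 12 CUs @ 853 MHz -> ~1.31
print(f"{gcn_tflops(14, 0.800):.2f} TFLOPS")  # 14 CUs @ 800 MHz           -> ~1.43
print(f"{gcn_tflops(18, 0.800):.2f} TFLOPS")  # PS4: 18 CUs @ 800 MHz      -> ~1.84
```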

I think npr is talking about the actual scaler, not the display planes.
 
What has changed? Are we saying the PPU in the PS3 is somehow more capable than 8 x86 Jaguar cores?
The PS3 was GPU bound only because their CPU was a beast with, theoretically, 2x the performance of the 360. The 360 was CPU bound in most games, due to the in-order processor being capable of _very_ crap performance at times, and the GPU being easier to optimise for.

For the HD DVD player, we offloaded everything we could to the GPU, we were running with something crazy like 10-15 frames in flight so that we could keep the ALUs busy, we used Memexport heavily for a GPGPU-like solution, and we were _still_ CPU bound.
 

I'm talking about general processing, the PPU, not the SPUs. There is no way that meager PPU can outclass 8 Jaguar cores.
 
To a certain extent they are obviously playing up the clock-speed increase over enabling all 14 of the 14 CUs because of yield and cost issues.

Imagine the hit on yields if they couldn't accommodate defects in any compute units.

I'm disappointed that, this being a "versus" article, he didn't immediately pick up on that and question their statements.
 
The reactions to the DF interview are, imo, not putting the gaming community in a good light, to say the least. I've got to the point where I actually believe that your average customer, ignorant about anything technical, is less dumb and less biased than a lot of the pretty vocal people on the web acting under the pretense that they represent the gaming market.
Anyway, interesting; too bad they don't get further into the details, though it is an interview. I would be really interested to see a lengthy presentation about the trade-offs they made, their measurements, etc.
Though most likely the main reaction (but at some point why care about dumb asses...) would be the same: pointless flame war.
I think they gave a pretty comprehensive overview of the trade-offs they made, the eSRAM and how the system is put together the way it is; it is just an interview after all.

They spoke about the speed bumps, and what I get from their choices is that bumping the CPU clock speed was more relevant than bumping the GPU's.
I won't dispute their conclusion (number of CUs vs clock speed); they made their measurements and their analysis, and even if they were wrong, I won't be the one able to make the call. Devs could if they had access to the measurements made by MSFT => NDA; the unavailability of the data means you either believe it or you don't. I will; actually, it sounds right.

That is where I agree with Shifty: power constraints had a profound effect on both Durango and Orbis; both went with pretty conservative clock speeds if you compare those designs to shipping GPUs.
Pretty much following their line of thinking, one could be better off with, say, 8 CUs clocked @1GHz, compromising further the raw throughput of the chip but not real-world performance (the same would apply to the PS4). The issue is power consumption.

So to make it short, if I were to criticize the design in the context of what those two high-ranking engineers did, I would not criticize their choices (I'm not in a position to do so to begin with, and we should be able to assess the end result (shipping games) pretty soon) but possibly the power constraints they had to deal with.

A better Durango could be different from what people on the web would want, aka more CUs, more raw theoretical throughput, etc.
My understanding of their POV is that a better Durango would be something like this:
6 CPU cores @2GHz
8 CU / 16 ROP GPU @1GHz
Keep all the embedded processors
Invest all the "saved" silicon in more eSRAM.
The trade-off: less PR friendly (actually I think it would be a disaster, extremely tough to manage/explain) and a significant jump in power consumption (though I would bet far from unmanageable).

I can only imagine what the reactions would be if people were indeed comparing a PS4 as it is now and an Xbox such as described above => web implosion.
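Using the same GCN arithmetic as earlier in the thread, a quick check of what that hypothetical trade-off means in raw terms (the 8 CU @ 1GHz spec is purely the thought experiment above, not a real part):

```python
# Raw throughput of the hypothetical "better Durango" GPU vs the shipping one.
def gcn_tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000.0

print(f"{gcn_tflops(12, 0.853):.2f} TFLOPS")  # shipping Xbox One: 12 CUs @ 853 MHz -> ~1.31
print(f"{gcn_tflops(8, 1.000):.2f} TFLOPS")   # hypothetical 8 CUs @ 1 GHz          -> ~1.02
# Roughly 20% less raw throughput, traded (per the post above) for higher clocks
# and more eSRAM, in the hope of similar real-world performance.
```
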
Yes, I am ashamed of some of the comments in the article and they make me worry about some part of the human race. C'mon people, you know you can do better.

I agree with the guy who said that the Xbox One is the new Gamecube, as I'd rate it as the GC of the 128-bit era, both power-wise and in terms of particular capabilities.

My only disappointment with the console is that I would have liked an even more wild design, a la PS2 for instance, :smile:, but the inclusion of SHAPE makes everything worthwhile for me. I LOVE that part of the design.

It's my existential angst when it comes to consoles, wild designs. Other than that, "Aye Bonny Shape" -as my best friend on Live would say, just applied to Scotland instead of SHAPE-
 
From what I've seen, the games that run on PS3/360/PC are more GPU bound if anything, apart from the few titles that limit player numbers due to CPU issues (yes, I'm looking at you, BF3).

I can hardly think of a recent game that really stresses the CPU more than the GPU, apart from a few that suffer from badly written code/engines that even top-of-the-line i7s can't solve.

I may be missing something... but why, on a console, in your busiest scene, would you not want to be CPU bound? If you are not, then you could have made it busier. This is a simplification to make the point, but... look at DR3. Each on-screen zombie is going to take a certain amount of CPU ops to run for AI/physics etc. If you are only running 50% utilised with your max number of zombies, why wouldn't you add more zombies? Or make the AI smarter to use up those free cycles?

PS: when I say CPU bound I don't mean bouncing off 100%... you need some headroom... but getting up that way.
 
I'm talking about general processing, the PPU, not the SPUs. There is no way that meager PPU can outclass 8 Jaguar cores.
And if the PS3 only used the PPU for CPU tasks, you'd be right. But audio, AI, and almost everything they could manage was shunted off to the SPE units. In some games, the PPU was mainly a glorified scheduler, handing out jobs to the rest of the CPU. So, yes, if you only count the PPU, it cannot outclass an 8-core Jaguar. But if you include the entire CPU, it's twice as fast as an 8-core Jaguar. In theory. In practice, not so much. The X1 CPU and the 360 CPU have the same FLOPS, in theory. In real running code, the X1 is about 6-8x faster. But optimize a VMX calculation on the 360, and it will execute twice as fast as the X1. Same for the PS3.
 
I have HUGE curiosity related to CPU power and... the cloud.
I really hope the times are now a bit more mature for talking about the "cloud" topic.

Apart from all the talk about extra AI and extra physics achievable via the cloud (which seems to be something they are really implementing in Titanfall, by the way, but we will see), at the moment I would like to talk only about the purely "multiplayer" aspect.

What would be the impact on the CPU of the "multiplayer management system" in a game like Titanfall if the multiplayer had to be managed the old-fashioned way (without dedicated servers)?

Some of my "tech wiser" folks seem to believe that the impact could be quite big. And this is without mentioning the much weaker performance in terms of gameplay (and the absence of extra AI and extra physics).

My 2 questions are:

1) Would Titanfall be doable without dedicated servers?

2) What would be the impact on the CPU without dedicated servers?
 
I can hardly think of a recent game that really stresses the CPU more than the GPU, apart from a few that suffer from badly written code/engines that even top-of-the-line i7s can't solve.
That's only because most games are console ports, and thus are optimized to work on 8-year-old (in-order PPC) CPUs. Your i7 is way faster than the (game design) target CPU. It's easy to scale up graphics without redesigning game play (add post-process effects, increase resolution, improve filtering quality, improve antialiasing quality, etc). Scaling up game play, on the other hand, is hard. If you add more enemies or improve your AI, the levels become harder, and you need to rebalance your whole game design. If you add more physics (destructible content) to the levels, that likely affects game difficulty and progress as well (you need to test every case again, as falling obstacles might block the way and you get stuck, or players might find new shortcuts that ruin the game progress/storyline).

I have been working in the console industry for more than 10 years, and on every (released) console that I have programmed we have always been slightly more CPU bound than GPU bound. The only reason why PC games aren't at all CPU bound right now is: current gen consoles are 8 years old. Any currently sold dual core CPU beats these old CPUs. Simple as that.
 
Yes, I am ashamed of some of the comments in the article and they make me worry about some part of the human race. C'mon people, you know you can do better.
Well, what is bothering is that those 2 guys have been pretty honest; they haven't claimed that their system is more potent, etc. They explained why, from their POV, their system should hold its own, explained their choices, and so on. They did not enter a flame war with Sony, and they showed respect for Sony's design choices.
There might be a hint of PR flavour to how they present things, and so what?
The reaction to such a posture might be discouraging for the PR guys, aka why not simply lie and go back to showing "target renders", etc.
I think the web is turning into a more and more terrible place; lots of shows on GameTrailers (and elsewhere), for example, are just a display of ego, as if people were not already entitled enough in their opinions.
There is a proliferation of stupid memes and a reliance on stupid four-word sentences that mean nothing and can apply vaguely to anything, etc. It is as if people went from being in love with their belly button to starting a cult of their belly button :LOL:

Anyway, I want to learn more about their custom command processor and in which way it alleviates some work for the CPU (I wonder if it could make it in some form into the PC world, lowering the load drivers put on the CPU).
 
The main thing I take from this is that the CPU choice for both consoles was pretty underwhelming... reading bkilian's posts.
I know we have 8 physical cores which are out of order and less latency sensitive... but as bkilian points out, in pure numbers they don't seem much of a step up, even if in the real world they are to a certain extent.

The article also mentions how they picked that setup (eSRAM + DDR3) to allow a large chunk of RAM and get an acceptable bandwidth figure... whilst keeping a respectable power consumption. Fine... but what he fails to explain is how the PS4 managed comparable RAM, a higher average memory bandwidth and a more powerful GPU... in the same power profile using GDDR5... which is what he said they couldn't have done by going with GDDR5. So did they make the best choice or not?

Finally, we have heard Microsoft talk up the latency benefit of the eSRAM... something we were all excited about very early on but were befuddled about why Microsoft didn't shout it from the rooftops... at least we know now they had factored that benefit into their thinking when picking eSRAM, which is good.
Still, I would have liked a more in-depth explanation of the non-Kinect benefits such low latency would allow over the competing setup... there was a small mention of GPGPU... but would it greatly enhance GPGPU compute? Would it make both consoles comparable in that regard, given Sony have opted to add extra execution units + more ACEs to get the same effect... interesting.

I know some on here have explained that 16 ROPs is a good balance for the bandwidth... but I would have liked an explanation of that in the article... surely double the ROPs is going to make some difference in certain scenarios... was this subject deliberately avoided in the interview? Or does that mean it wasn't considered a worthwhile performance differential talking point?

Please correct me if I'm wrong, but are there also more TMUs? Also, they keep quoting 200+ GB/s bandwidth when their own internal testing gets a maximum of 150... why not then just quote 150 and not mention the 200 figure? Creative marketing? Or can the Xbone achieve this bandwidth figure in select scenarios? If so, why not in internal testing?

Finally, the combined bandwidth figure does seem a little misleading, as part of that (100GB/s or so real-world) is from just 32MB of eSRAM... are there any scenarios in which this 32MB data limit becomes a bottleneck? I have read some very good posts from some members on this subject (sebbbi?) but I still need some convincing.

Cheers.

Edit: I would also like to add that one unexpected upshot of going with eSRAM was the die-space budget for the SHAPE audio chip. How much of a CPU advantage this turns out to be is yet to be seen, but in the article he does not mention that Sony will have a less glamorous competing sound chip which would also save CPU cycles.

Still, it's interesting that both SoCs are neck and neck if you take into consideration the CPU clocks of the Xbone.
 

They get the 200GB/s figure because the "real-world" measured performance of eSRAM bandwidth utilisation can be 150GB/s. In addition, the CPU/GPU/move engines have access to around 50GB/s to main system RAM. Apparently it's been measured getting 200GB/s. Smoking. Hopefully not literally!

From what I can gather, the 32MB of eSRAM will be sufficient. They have some fancy hardware compression to maximise its use. Further, because of the pipelined nature of the GPU process, where the output of one stage is the input to the next, you are working with comparatively low volumes of data, just hitting them a lot.

I thought it was commendable for the MS tech guys to use the "real-world", "measured" figures for bandwidth rather than spin people the theoretical maximums, even though it knocks 25% off the headline figures...
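A quick sketch of how those figures stack up, assuming DDR3-2133 on a 256-bit bus for the ~68GB/s peak and taking the measured numbers quoted in the interview and the posts in this thread at face value:

```python
# Rough combined-bandwidth arithmetic for the Xbox One memory system.
# Peak numbers are theoretical; "measured" numbers are the ones quoted
# in the Digital Foundry interview and the posts above.

ddr3_peak = 2.133e9 * 32 / 1e9   # ~68 GB/s: DDR3-2133, 256-bit bus (32 bytes/transfer)
esram_peak = 204                 # GB/s: combined read+write peak over the eSRAM
esram_measured = 150             # GB/s: upper end of the measured in-app figure
ddr3_measured = 50               # GB/s: rough real-world figure from the post above

print(f"theoretical combined:       {esram_peak + ddr3_peak:.0f} GB/s")       # ~272
print(f"measured eSRAM + ~50 DDR3:  {esram_measured + ddr3_measured} GB/s")   # ~200
print(f"measured eSRAM + peak DDR3: {esram_measured + ddr3_peak:.0f} GB/s")   # ~218
```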
 

Yeah, I thought it was commendable also. I was perhaps using the wrong terminology, and I forgot about the move engines; interesting.
 
They quote peak bandwidth, as does everyone. There has to be a scenario where this is possible.

Around and above 200GB/s is realistic with their eSRAM and DDR3.
 
eSRAM average 150GB/s. DDR3, 68GB/s. Magically, 218GB/s.

French, you are correct... that's a lot of convoluted work. The main constraints on the XB1 were Kinect and the RAM choice. With different choices in both regards you would get a design closer to the PS4.
 