Xbox One (Durango) Technical Hardware Investigation

Wait, the OS sometimes takes away 2 more cores than the 2 already reserved? Is this another B3D exclusive?!?! :oops:

I think he is saying that when you go to the OS it will use 4 CPU cores instead of the 2 that it uses when you are playing games.


So while playing a game it's 6 cores for the game & 2 for the OS, but when you go to the OS with the game in the background you have 4 cores for the OS & 4 keeping the game in background mode.


Edit:

I'm not sure how they'll do it exactly, but Durango games run in VMs (B3D rumour exclusive ;) ), so that offers a lot more versatility in backgrounding and suspend/resume etc.

Games only 'see' a 5GB/6-core machine; using VMs makes resource sharing with the OS much easier - e.g. when Durango games are backgrounded, the system can fold two of their 6 virtual cores onto their other 4 physical ones, and devs don't have to do anything special with thread affinity etc.
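As a rough host-side analogy of what that "folding" could look like (purely an illustrative sketch: the helper, the core numbers and the Linux affinity API are my own assumptions, not anything Durango-specific):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Sketch: a hypervisor-like host shrinking the set of physical cores a
 * guest's vCPU threads may run on. The guest still sees 6 virtual cores;
 * only the vCPU -> physical-core mapping changes underneath it. */
static int fold_vcpus_onto(pid_t vcpu_thread, const int *pcores, int count)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < count; i++)
        CPU_SET(pcores[i], &set);
    return sched_setaffinity(vcpu_thread, sizeof(set), &set);
}

int main(void)
{
    int foreground[] = {2, 3, 4, 5, 6, 7}; /* 6 physical cores while the game has focus */
    int background[] = {4, 5, 6, 7};       /* folded onto 4 once it is backgrounded */
    pid_t vcpu = 0;                        /* 0 = calling thread; stands in for a vCPU thread */

    fold_vcpus_onto(vcpu, foreground, 6);
    /* ... game gets backgrounded ... */
    fold_vcpus_onto(vcpu, background, 4);
    printf("game vCPUs now confined to 4 physical cores\n");
    return 0;
}
```

The game code itself never touches affinity; from its point of view the same 6 virtual cores are always there.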


Wait, what? Do you mean like Turbo mode?
 
I think he is saying that when you go to the OS it will use 4 CPU cores instead of the 2 that it uses when you are playing games.


So while playing a game it's 6 cores for the game & 2 for the OS, but when you go to the OS with the game in the background you have 4 cores for the OS & 4 keeping the game in background mode.

Yes, I really didn't think it'd be that hard to understand.
 
I think he is saying that when you go to the OS it will use 4 CPU cores instead of the 2 that it uses when you are playing games.


So while playing a game it's 6 cores for the game & 2 for the OS, but when you go to the OS with the game in the background you have 4 cores for the OS & 4 keeping the game in background mode.

Oh, I was confused. I didn't (and still don't) see why a game needs any cores while it's in the background if it's not active; I thought it was "paused" like on iOS.
 
They're using VMs as it's the best way to ensure proper sharing of resources between the game and the OS.

The game only sees a 5GB, 6-core machine (it can't even see the remaining cores and memory, never mind access them), and when games are backgrounded, 2 more cores are taken away from them (and given to the OS), with their 6 virtual cores folded onto their remaining 4 physical cores - all without devs needing to do anything special with thread affinity and such.

As if you would need a VM to limit such OS resources for non-qualified applications.
 
Oh, I was confused. I didn't (and still don't) see why a game needs any cores while it's in the background if it's not active; I thought it was "paused" like on iOS.

Yeah, I don't understand that concept either. Why does a game in a suspended state need 4 physical cores when it seems they would mostly sit there idle?

I wonder whether that extra GB of unused RAM is there to facilitate placing games in a suspended state, where the number of games allowed to be suspended is determined by the amount of unused RAM and the ability of the two OS cores to handle multiple VMs in the background.
 
They're using VMs as it's the best way to ensure proper sharing of resources between the game and the OS.

The game only sees a 5GB, 6-core machine (it can't even see the remaining cores and memory, never mind access them), and when games are backgrounded, 2 more cores are taken away from them (and given to the OS), with their 6 virtual cores folded onto their remaining 4 physical cores - all without devs needing to do anything special with thread affinity and such.

I'm not sure that a VM is the best way to do hardware abstraction, if that is the only end goal. There has to be a lot more of a reason to do it than just that. I can understand if there were a multitude of other reasons.

How would this work exactly? Can VMs talk to each other? Wouldn't the game VM require some light-weight OS to handle the DirectX API calls and system services? In which VM do services like voice chat, parties, matchmaking reside? If my game is in one VM and my services for chat and parties are in another, how does the game know which chat channel to use (party, in game) etc? Does each VM run a version of the same service? How are they synchronized?
 
I'm not sure that a VM is the best way to do hardware abstraction, if that is the only end goal. There has to be a lot more of a reason to do it than just that. I can understand if there were a multitude of other reasons.

How would this work exactly? Can VMs talk to each other? Wouldn't the game VM require some light-weight OS to handle the DirectX API calls and system services? In which VM do services like voice chat, parties, matchmaking reside? If my game is in one VM and my services for chat and parties are in another, how does the game know which chat channel to use (party, in game) etc? Does each VM run a version of the same service? How are they synchronized?
All good questions. And something like this would not be just for hardware abstraction; a simple hypervisor like the 360 has could do that. Hardware partitioning and security, on the other hand, could benefit greatly from a system like this.
 
One question

First the question:
We know that Jaguar cores have no FMA units, though Jaguar somehow supports AVX instructions.
So the question is: "does the Jaguar core support the FMA instruction?"
---------------------

Now some details about the question in case it is not clear; what I mean is:
"can Jaguar decode the FMA instruction (and split it into the matching operations) by itself, or does the compiler do it?"
I would think so (the same thing happens for the other AVX instructions), but I don't know for sure.

Now, why I asked: I wrote a big post yesterday that was both unfocused and, I think, missing part of the point, so I deleted it.

My idea is that on top of the other benefits (convenience) of virtual machines as Interference described them, I wonder if MSFT learned something while dealing with BC on the 360.
Emulation of the Xbox on the 360 was a headache to pull off, and was achieved through a virtual machine (and also, as others pointed out, translation from one ISA to another).
Now I think that MSFT won't move away from the x86 ISA (at least with the next Xbox), so I think that if MSFT designs the "virtual machine" beforehand, it could provide them enough of an abstraction layer to deploy the architecture across significantly different pieces of hardware.

For me the main win would be cost reduction. Interference says that Durango can fold its 6 virtual cores onto 4 physical ones (as any virtual machine can do); the thing is, if done properly, I think that in the future Durango (as a virtual machine) could actually run on only 4 cores (an arbitrary number - I mean fewer cores, depending on their performance).

I think it could have been extremely clever for MSFT to go down that road: virtualize the system so it can be deployed later on significantly different hardware architectures without having to deal with the constraints set by the technology at the time of the Durango v1 design.

Though there was something I did not take into account yesterday (I was not thinking properly):
binary compatibility, especially AVX support.
Jaguar cores provide partial support for AVX, and I expect the follow-up from AMD to offer more complete support for it (I don't expect them to jump straight to AVX2).
So for MSFT to leverage the higher performance of the new architecture, they would need to "enforce" the use of AVX now, so later renditions of Durango can leverage the benefits in performance without having to recompile the code (hence my question at the beginning of the post).

Looking at things from that perspective makes a lot of sense to me: we have noise about MSFT keeping the devs away from the metal, and noise about a possible "refresh" for the next Xbox.
If they manage to have a proper virtual machine, they might want to keep the devs away from the metal - further than on the 360, closer than what happens on Windows, for example. And the noise about a "refresh" might not be about offering improved performance (though if the software is designed as described above, it should tolerate slight differences in how the underlying hardware performs); instead of trying to shrink the system to save costs, they may re-design, every time a new node is available, a system that meets the requirements of Durango as a virtual machine.

That could be a pretty massive win down the road. An interposer + DRAM as in Haswell, faster memory, and a better/faster CPU and GPU might allow for pretty massive savings in silicon and possibly power budget.
With regard to R&D cost I think it would also be a win, not to mention the elegance of the solution: instead of implementing outdated logic on a new process, you do the lesser job of integrating existing blocks on the process you are going to use (still quite a job - there is nothing trivial in putting an SoC together).

For the record, it seems foundries are mostly passing on 22nm lithographies to jump toward 14/16nm processes, or a 14/22 hybrid (GF). Dealing with a 256-bit bus, for example, could become troublesome pretty fast; I actually made calculations in the post I deleted, but jumping a node and a half, even with pretty dreadful scaling, results in a chip that seems too tiny to me.
All those processes should rely on FinFET transistors and might offer a lot of opportunities for higher clocks, lower power, or the proper blend of both.
 
First the question:
We know that Jaguar cores have no FMA units, though Jaguar somehow supports AVX instructions.
So the question is: "does the Jaguar core support the FMA instruction?"
---------------------
All indications from the description of the architecture and the compiler flags for an alleged benchmark of a Jaguar-based chip indicate no.

Now some details about the question in case it is not clear; what I mean is:
"can Jaguar decode the FMA instruction (and split it into the matching operations) by itself, or does the compiler do it?"
The compiler (and this is potentially iffy), if not the developer.
FMA does not behave the same as a standard MUL and ADD. Some kind of special instruction and hardware behavior would need to be added to Jaguar, and that hasn't been disclosed.

A microcoded split of an FMA onto hardware lacking an FMA pipe or lacking internal extensions to pass around unrounded results would be a performance disaster.
Even with hardware extensions, if it's even a short microcoded sequence, it would be hard to justify.
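To make the rounding point concrete, here is a small C sketch (my own illustration, not tied to any particular compiler or chip): a fused multiply-add rounds once, after the full-precision multiply-add, while a separate MUL then ADD rounds twice, so the results can differ.

```c
#include <math.h>   /* fma(); link with -lm on some systems */
#include <stdio.h>

int main(void)
{
    /* (1 + 2^-27) * (1 - 2^-27) = 1 - 2^-54 exactly. */
    double a = 1.0 + 0x1.0p-27;
    double b = 1.0 - 0x1.0p-27;
    double c = -1.0;

    double separate = a * b + c;    /* product rounds to 1.0 first -> 0.0   */
    double fused    = fma(a, b, c); /* single rounding at the end -> -2^-54 */

    printf("mul+add: %a\nfma:     %a\n", separate, fused);
    return 0;
}
```

That difference in behaviour is why an FMA can't simply be cracked into a MUL and an ADD on hardware that has no way to pass the unrounded product along.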
 
Honestly, I don't remember the 360 being treated any differently, at least not here. The common argument was "how can the flops of the 360 CPU/GPU compete with Sony's 2 TFLOP monster?". A year was spent talking about how unimpressive 360 sales were while it was being outsold by the PS2. We got an after-holiday session on how MS stuffed the channel with 360s to claim "first to 10 million consoles". We spent 2 years talking about how Sony would eventually get over its BOM problems and how the 360's days were "numbered" - and not "5-6 years later numbered" either.

But can you blame anybody, given MS's track record? The negativity is a reflection of MS never being able to create a product as successful as Windows or Office. Outside of the Xbox, MS's endeavors have never really panned out well.

I wasn't talking about the discussion here at B3D. The discussion here has been pretty thoughtful and informative. I really like the collection of ideas here, even if some tend to be a bit too dismissive for a hardware investigation thread. I thought it would include some broader ideas, but it's been an entertaining read so far.

You seem to be down on most things Microsoft though. lol. It was certainly not my intention to drag the entire company's history into this; I was only mentioning the near-100% negativity that's reported in articles, speculation and shared rumors. Every bit of information, from hardware specs to the slightest nuance, is spun with doom and gloom.

I'm only saying that it's ALL too dark and gloomy to be true. If all the news was 100% positive I'd think that would be too good to be true as well.
 
First the question:
We know that Jaguar cores have no FMA units, though Jaguar somehow supports AVX instructions.
So the question is: "does the Jaguar core support the FMA instruction?"
AVX and FMA are two separate instruction extensions. Jaguar supports AVX but not FMA of any kind (neither FMA3 nor FMA4).
It supports AVX 128, rather than the full AVX1 or 2.
That statement makes no sense. It fully supports AVX(1); it can execute all of its instructions, including the 256-bit ones. It just executes them on 128-bit wide SIMD units, so the throughput using 256-bit AVX instructions is the same as using 128-bit instructions.
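For what it's worth, "full AVX on 128-bit units" just means code like the trivial sketch below is perfectly legal on Jaguar (my own example, nothing Durango-specific); the 256-bit ops simply get cracked into two 128-bit halves internally, so you don't gain throughput over the 128-bit version.

```c
#include <immintrin.h>  /* AVX intrinsics; build with e.g. gcc -mavx */

/* Adds two float arrays 8 lanes at a time with 256-bit AVX.
 * On a 128-bit-wide SIMD core like Jaguar this still executes fine,
 * just at the same throughput as the equivalent 128-bit SSE loop. */
void add_arrays(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)      /* scalar tail */
        out[i] = a[i] + b[i];
}
```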
 
I once again made a post that in my opinion is way too long for what it tries to say, but I give up and will leave it as it is...

I would sum up the whole thing like this: Durango, from the developers' and software perspective, is a PC with fixed specifications, if not fixed hardware. The specifications are defined as an extensive set of QoS requirements.

I think that the GPU could also be virtualized: the "virtual CPU" sees "virtual" queues. It doesn't have to see all the queues the physical GPU offers; some could be reserved for the "real" OS/system (for example, if Kinect benefits from some GPGPU computation and Kinect is managed by the real underlying OS, you may want some direct access to the GPU). Then you have the black-box GPU (like on PC, where the underlying hardware is mostly hidden by the API+driver layer), and at the end of the chain the virtual memory space (which includes the ESRAM).

I've come to think about why MSFT may reserve as many as 2 cores while playing games and up to 4 in media/UI/whatever mode.
Jaguar cores are pretty potent, and they are not clocked that low if you compare them to the hardware found in mobile devices (especially tablets). I would think that Jaguar cores are, per cycle, every bit as good as an A15 for example, possibly better, and they should run at at least 1.6 or 1.8 GHz (I do not remember AMD's last presentation on the matter). That is quite some computing power.
My answer is that I think that in game mode, on top of managing background tasks and some services (which shouldn't take that many resources), the "real OS", some form of Windows 8, is actually running a plain driver for the GPU. That could indeed take one core by itself and leave one core for background tasks.

So indeed I think that the whole thing is fully virtualized: the GPU is hidden behind the driver/API. From the developers' POV, you have your virtual CPU cores, your virtual queues for the GPU (command lists that are handled physically by the command processor and ACEs), and then the virtual memory space (a part of which is mapped to the ESRAM).

I think the whole thing is abstracted with no more access to the underlying hardware than in a plain PC.

To some extent I even wonder if it could allow MSFT to use hardware that is not provided by AMD (or Intel) if they had to.
 
AVX and FMA are two separate instruction extensions. Jaguar supports AVX but not FMA of any kind (neither FMA3 nor FMA4). That statement makes no sense. It fully supports AVX(1); it can execute all of its instructions, including the 256-bit ones. It just executes them on 128-bit wide SIMD units, so the throughput using 256-bit AVX instructions is the same as using 128-bit instructions.

I thought it only supported the 128-bit subset of AVX1. My mistake.
 
I once again made a post that in my opinion is way too long for what it tries to say, but I give up and will leave it as it is...

I would sum up the whole thing like this: Durango, from the developers' and software perspective, is a PC with fixed specifications, if not fixed hardware. The specifications are defined as an extensive set of QoS requirements.

I think that the GPU could also be virtualized: the "virtual CPU" sees "virtual" queues. It doesn't have to see all the queues the physical GPU offers; some could be reserved for the "real" OS/system (for example, if Kinect benefits from some GPGPU computation and Kinect is managed by the real underlying OS, you may want some direct access to the GPU). Then you have the black-box GPU, and at the end of the chain the virtual memory space (which includes the ESRAM).

I've come to think about why MSFT may reserve as many as 2 cores while playing games and up to 4 in media/UI/whatever mode.
Jaguar cores are pretty potent, and they are not clocked that low if you compare them to the hardware found in mobile devices (especially tablets). I would think that Jaguar cores are, per cycle, every bit as good as an A15 for example, possibly better, and they should run at at least 1.6 or 1.8 GHz (I do not remember AMD's last presentation on the matter). That is quite some computing power.
My answer is that I think that in game mode, on top of managing background tasks and some services (which shouldn't take that many resources), the "real OS", some form of Windows 8, is actually running a plain driver for the GPU. That could indeed take one core by itself and leave one core for background tasks.

So indeed I think that the whole thing is fully virtualized: the GPU is hidden behind the driver/API. From the developers' POV, you have your virtual CPU cores, your virtual queues for the GPU (command lists that are handled physically by the command processor and ACEs), and then the virtual memory space (a part of which is mapped to the ESRAM).

I think the whole thing is abstracted with no more access to the underlying hardware than in a plain PC.

To some extent I even wonder if it could allow MSFT to use hardware that is not provided by AMD (or Intel) if they had to.

AMD seems to have some kind of GPU virtualization in Windows 8 (http://en.wikipedia.org/wiki/RemoteFX) on their firepro GPUs. I don't know what special hardware that requires compared to their consumer desktop flavours.

RemoteFX seems to have all of the features they would need to run games/apps inside VMs. Mind you, Windows 8 and RT do not support those features, but they are in Windows 8 Pro and Windows 8 enterprise.

http://blogs.msdn.com/b/rds/archive...es-for-windows-8-and-windows-server-2012.aspx
 
All indications from the description of the architecture and the compiler flags for an alleged benchmark of a Jaguar-based chip indicate no.


The compiler (and this is potentially iffy), if not the developer.
FMA does not behave the same as a standard MUL and ADD. Some kind of special instruction and hardware behavior would need to be added to Jaguar, and that hasn't been disclosed.

A microcoded split of an FMA onto hardware lacking an FMA pipe or lacking internal extensions to pass around unrounded results would be a performance disaster.
Even with hardware extensions, if it's even a short microcoded sequence, it would be hard to justify.

AVX and FMA are two separate instruction extensions. Jaguar supports AVX but not FMA of any kind (neither FMA3 nor FMA4). That statement makes no sense. It fully supports AVX(1); it can execute all of its instructions, including the 256-bit ones. It just executes them on 128-bit wide SIMD units, so the throughput using 256-bit AVX instructions is the same as using 128-bit instructions.
First, I did not read your posts before posting the previous one (it took me a while to post it, for various reasons).

Thanks for bringing me that information :)
OK, so no easy FMA support down the road; existing titles would not benefit from it, and it's pointless without breaking BC.

Though I guess it would still be quite a win for MSFT to enforce the use of AVX, more precisely having the devs treat the FP vector width as 8 instead of 4, even if Jaguar cores handle those operations at half speed (or less, if there is a slight performance overhead vs. native handling of 4-wide vectors).

For example, on those ~14nm processes AMD should have the budget to improve the SIMD units so they handle 256-bit FP operations natively, and maybe also widen the data path along with the load/store units. Anyway, whether they do only the former or both, it should result in a significant increase in per-cycle SIMD throughput (obviously greater if they did both, but Intel only got there with Haswell, so it could very well not happen).
Still quite a win: along with further architectural improvements and the clock speeds those new cores can reach, I could ultimately see MSFT using only 4 cores to run the "virtual machine" (obviously as far as the CPU is concerned). The end result would be that they only have to pack 6 cores in Durango V2.

I would also think that the API+drivers hide, for the most part, the ESRAM and the move engines. I could also see MSFT pass on ESRAM altogether in the next revision in favor of an interposer and a beefier amount of fast memory (the same solution Intel used in Haswell).
Again that could be quite a win: if the API hides/takes charge of the movement of the various render targets between the ESRAM and the main RAM, then in a "Haswell-like" solution the API+driver could map every part of the virtual memory space used by render targets to the pool of fast memory on the interposer, removing the need to move data around.
Pretty much, the API (given hints by the devs) would decide in Durango V1 where the ROPs read/write; in Durango V2 things would be simplified and the ROPs would always read/write to the new pool of RAM (I would think that the fact that the system uses virtual memory should make that not too hard).
Down that line of thinking, you may no longer need as much bandwidth to main RAM; actually the bus could be scaled back to 128 bits with a slight bump in RAM speed.

If the system is indeed programmed like a PC, those slight differences in how the system performs should not prevent it from "running"; if anything, it could actually meet the QoS requirements more often.

To paint a picture, Durango V2 could be composed of 6 "Jaguar 2"-type cores (though, if it met the power consumption requirement, possibly a 2-module/4-core Excavator could do even better), a 12-SIMD GPU, no ESRAM, fast access to a pool of RAM on an interposer, and a 128-bit bus to faster DDR4.
It would be built around newer IP blocks from AMD, Jaguar 2 and GCN2 (or whatever they are called).
Using those "~14nm" processes, we are speaking of a pretty tiny (and cool) piece of silicon that should be mass-producible for cheap.
 
AMD seems to have some kind of GPU virtualization in Windows 8 (http://en.wikipedia.org/wiki/RemoteFX) on their firepro GPUs. I don't know what special hardware that requires compared to their consumer desktop flavours.

RemoteFX seems to have all of the features they would need to run games/apps inside VMs. Mind you, Windows 8 and RT do not support those features, but they are in Windows 8 Pro and Windows 8 enterprise.

http://blogs.msdn.com/b/rds/archive...es-for-windows-8-and-windows-server-2012.aspx
Actually I wonder if they would in fact need to go that far: if there is a driver running on the "real" system, the situation is no different than on a PC, where the underlying OS can balance access to the GPU between the game, the UI and other services.
Thinking more about it, I think it is unnecessary to virtualize the GPU; the virtual machine would be a client of the GPU (through the driver+API layer) in the same way a plain game is under Windows, so not an issue.

Actually I wonder if the point of virtualization is really about being able to pause/resume a game, especially when you shift between games and the data ultimately has to be dumped from RAM.
For pause/resume it may help to allocate resources, but when you swap games, I think it offers nothing that saving doesn't already provide, nor would it have a performance advantage.

The more I think about it, the more I believe that the main reason for the use of virtualization could be security related, as even the kind of compatibility I'm ranting about doesn't require it; PC games have done that forever. You just need proper coding practices (not getting close to the hardware) and the game runs on different CPUs and GPUs with different amounts of RAM, etc. You can put together two PCs with different parts that perform close enough; there is no need for virtualization.

But maybe designing a virtual machine along with a DirectX-type API, the driver for the GPU, etc., all together was the best way to do it from MSFT's POV. Maybe they did not want the devs to toy with the OS, as the OS could pin different processes to different cores (or more or fewer cores), etc., all the same.
I guess that could be it: preventing the devs from accessing the OS, plus security concerns.

Altogether it could be a pretty big win with regard to hardware down the road, as I tried to explain (if it made sense).

EDIT

Actually I thought about it more... and virtualization of the GPU could be useful for security reasons too. (Assuming always-online) there could be 2 checks: when you log into the system you access the "OS", but the OS has no access to the GPU - so to speak, the UI interface, etc. could also be included in a virtual machine.
Then another (online) check to start whatever virtual machine you need to use.
There could be different virtual machines: the "6 cores + game GPU" one for games, and a "3 cores + lesser GPU" one for the "visible OS" (folded onto one core during games; the API+driver that runs that lesser GPU could have no idea there were ever 12 SIMDs and ESRAM in the system, for example).
The real OS would run on one core: a Win 8 kernel + hypervisor, but the drivers+API for the GPU(s) would be partially embedded into the different virtual machines. That same core could run the driver+API (or part of it, as ultimately it just arbitrates between the different clients and should not be that busy), but as long as no virtual machine is started it has pretty much "nothing" to run.
So pretty much, if somebody managed to break the security and log onto the "real" OS, he wouldn't find anything handy and ready to use, from the API+drivers to the UI, etc., so he couldn't conveniently develop whatever he wants on the system (homebrew).
So multiple online checks, from the real login to green-lighting the start of the various virtual subsystems.
 
I'm not really sure how AMD's GPU virtualization works in Windows 8. All I know is it can serve multiple clients from a single GPU. Whether that is through some kind of scheduling or actual partitioning of resources into logical units, I do not know. I'm guessing it's probably the former, not the latter. It's probably not that complicated, because only the program in "focus" would be using the GPU heavily anyway. Any other game would be suspended in the background and not taxing the GPU. I guess there is the case of online games like COD that can't be paused if you're looking at the OS UI.
 
I was going to edit, but you posted in the meantime :)

I forgot one virtual machine, which would have no access to the GPU at all: the "idle" state virtual machine that deals with background downloads, etc., when the system is in sleep mode.

I could put it this way.
You have 8 cores, from 0 to 7.

When the system is plugged in:
Core 0 is started, booting the OS (minimal Win8 kernel) and hypervisor.
It may ask you either to create an account or automatically log you in (no access to the GPU, à la safe mode in Windows).
Once you are logged in,
you have three choices: sleep mode, UI mode and game mode.
Say you go into UI mode by default.
The hypervisor checks online to start the proper virtual machine.
That machine has access to up to 3 cores and most likely only a fraction of the GPU power (GPU in a low-power state, really low clock speed, ESRAM shunted).
Now core 0 runs the underlying OS + hypervisor (around peanuts in cycles) and the driver+API for the virtual machine (or part of it).
The three cores (1, 2, 3) are fully available to whatever you are doing, from browsing to Kinect, etc.
You start a game.
The UI/"visible OS" virtual machine folds onto core 1 only.
The game virtual machine is deployed on cores 2, 3, 4, 5, 6, 7.
Now core 0 runs the OS + hypervisor (should be around peanuts in cycles), the API+drivers for the UI virtual machine (which should now be pretty low as far as CPU usage is concerned, even if Kinect uses some GPGPU implementation), and the API+drivers for the game machine (which should take the bulk of the cycles) (or part of both).
The 6 cores from 2 to 7 are freely available to the devs.

Now you switch to the UI: cores 2 to 7 are folded onto cores 4 to 7, etc.

Now the scary part, the idle mode: I guess that on top of managing background tasks, core 0, while running the "sleep mode" virtual machine, has to have access to Kinect, at least partially, so it can recognize a restricted set of commands able to pull the system out of sleep => Big Brother :LOL:
Edit: with regard to the resources required to do voice recognition, I could think that it is actually handled by the cloud, at least for a limited set of cases; the low-power state could be one of those.
The idle system could monitor the dB level in a room, and a specific spike in dB could trigger a search on the cloud. I'm kidding about Big Brother, for the obvious reason of costs...
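To restate that hypothetical split as data (entirely my own speculation; the masks just encode the core numbers above, with bit i meaning physical core i):

```c
/* Sketch of the per-mode core split described above; purely speculative. */
enum mode { MODE_SLEEP, MODE_UI, MODE_GAME };

struct core_plan {
    unsigned host_mask; /* real OS + hypervisor + driver/API layer */
    unsigned ui_mask;   /* "visible OS" / UI virtual machine       */
    unsigned game_mask; /* game virtual machine                    */
};

static const struct core_plan plan[] = {
    /* Sleep: the idle/background-download VM shares core 0 with the host. */
    [MODE_SLEEP] = { .host_mask = 0x01, .ui_mask = 0x00, .game_mask = 0x00 },
    /* UI mode: core 0 for the host, cores 1-3 for the UI VM.              */
    [MODE_UI]    = { .host_mask = 0x01, .ui_mask = 0x0E, .game_mask = 0x00 },
    /* Game mode: core 0 host, UI VM folded onto core 1, cores 2-7 game.   */
    [MODE_GAME]  = { .host_mask = 0x01, .ui_mask = 0x02, .game_mask = 0xFC },
};
```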

EDIT
With the lack of reaction, I feel pretty lonely now, just as if I wrote a bunch of nonsense... :(

Anyway, that looks pretty efficient to me: 1 core for the services while gaming, another that runs the OS + hypervisor and part of the fused API+driver subsystem, then the cores handling the game as usual, I would say.
If it functions remotely close to that, it is far from a waste of resources ("only 6 cores are available", insert system comparison, etc.).
It could take the system to a really low price tag once 14nm processes hit the road, a price tag that the PS3/360 have failed to reach so far (and even lower with a subscription).
 