Xbox One (Durango) Technical hardware investigation

Status
Not open for further replies.
It's possible a Haswell-EX server chip might have that much last level cache, but nothing has been suggested that consumer chips would have that much.
Intel's restricted transactional memory extensions are initially implemented to work only within the L1 cache. The LLC is currently the cache the system checks for coherence, and putting values there would expose transactions in progress, breaking TM.
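
To illustrate the isolation point above, here is a toy software model, not a description of how Intel's hardware actually works. The class `ToyTransaction` and its `private`/`shared` dictionaries are hypothetical names: `private` stands in for speculative lines held in the L1, `shared` for the level that other cores can observe. Conflict detection is left out entirely.

```python
class ToyTransaction:
    """Toy model: speculative writes stay private until commit."""

    def __init__(self, shared):
        self.shared = shared   # stands in for the coherent, visible level
        self.private = {}      # stands in for speculative lines held in L1

    def write(self, key, value):
        # Speculative write: nothing outside the transaction can see it yet.
        self.private[key] = value

    def read(self, key):
        # The transaction sees its own writes first, then the shared state.
        return self.private.get(key, self.shared.get(key))

    def commit(self):
        # All speculative writes become visible in one step.
        self.shared.update(self.private)
        self.private.clear()

    def abort(self):
        # Discard the speculative writes; shared state was never touched.
        self.private.clear()


shared = {"x": 1}
txn = ToyTransaction(shared)
txn.write("x", 2)
print(shared["x"], txn.read("x"))  # 1 2
```

If the speculative value were written straight into `shared`, other readers would see it before commit, which is the kind of exposure the post says would break TM.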

TM really needs changes in the cache and CPU pipelines to function, and nothing disclosed for Jaguar indicates it has the necessary changes, nor is it clear that the memory in Durango is being used as a last-level cache.

Hmm. Not everyone has come to that conclusion.

http://www.realworldtech.com/haswell-tm/3/

Jaguar has received changes to both its cache and pipelines.

http://semiaccurate.com/2012/08/28/...of-the-bag-with-the-jaguar-core/#.URhZbaVkyAg

However, I am not an engineer, so I might be completely wrong. It should be noted that AMD's unified shader tech first showed up in the 360, and this may end up being AMD using the 720 to explore transactional memory before implementing it on the PC side.

We don't know anything about Durango's SRAM, so any exploration into its possible function can be met with "nor is it clear that the memory in Durango is being used as an...".
 
In my opinion, this has dropped out of relevance for this particular Durango thread, and probably is irrelevant for Durango discussion in general.

That first article was an educated guess on the likely implementation given the information available in February 2012.
Have you read the Haswell architecture article from November 2012 that featured the correct version?

Just saying there were changes to Jaguar means very little. The indicated modifications to pipe stages, units, and L2 cache do not say anything about what is required for TM. Large caches are not strictly necessary, and Intel's first effort certainly does not depend on them.
As far as the on-die memory goes, while it is not clear what all it is being used for, its apparent place on the output path of the GPU's non-coherent traffic seems to hint against it being an LLC.
 
The November article links to the February article for a more detailed look at TSX, and I don't remember reading anything that discounts the early speculated use of the L3.

Furthermore, Jaguar is not a TM chip, so it wouldn't be officially described in any way that hints at that functionality.

Haswell is not a radical redesign, and a lot of its changes aren't specifically for TSX.

http://www.realworldtech.com/wp-content/uploads/2012/10/haswell-5.png?dc2136

And what makes you believe that the SRAM is on the output path of the GPU's non-coherent traffic? Maybe I am missing something.
 
Is there a possibility of having a much stronger CPU to compensate for the GPU specs?

Everything is possible. Whether it's likely or unlikely, that's the question. :)

In other words, maybe it's more powerful, maybe it's less powerful, or maybe it's basically the same core CPU.

Hopefully that answered your question. :D

Regards,
SB
 
One can also consider that the quoted LZ decoder speeds aren't great either (IIRC a single SPE would outperform that handily). I'd suspect these units cost next to nothing to have in there.

I think they wanted LZ decompression for DXT texture data. The quoted 200MB/s compressed stream is 30% faster than a single core on my 2600S-based workstation. They'd need two Jaguar cores to get that kind of performance.

The 200MB/s compressed data would decompress to 300-400MB/s of DXT data, or 300-800 MTexels/s, which is 5-10 MTexels per 60 Hz frame. Probably fast enough by any measure.
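
For anyone who wants to check the texel figures, here is a rough sketch of the arithmetic. It assumes DXT1 at 4 bits per texel and DXT5 at 8 bits per texel, which the post does not state, and the function names are made up for illustration. With those assumptions the upper end per frame comes out somewhat above the quoted 10.

```python
# Rough texel-rate estimate from the figures quoted above.
# Assumption (not from the post): DXT1 is 4 bits/texel, DXT5 is 8 bits/texel.
MB = 1_000_000


def texels_per_second(dxt_bytes_per_second, bits_per_texel):
    """Texels delivered per second for a given decompressed DXT byte rate."""
    return dxt_bytes_per_second * 8 / bits_per_texel


def texels_per_frame(texel_rate, frame_rate_hz=60):
    """Texels available per displayed frame at the given refresh rate."""
    return texel_rate / frame_rate_hz


low = texels_per_second(300 * MB, 8)   # DXT5 at the low end of the range
high = texels_per_second(400 * MB, 4)  # DXT1 at the high end of the range

print(low / 1e6, high / 1e6)  # MTexels/s at each end of the range
print(texels_per_frame(low) / 1e6, texels_per_frame(high) / 1e6)  # per frame
```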

I'm guessing they integrated a 3rd party decompressor/compressor block which is why there is a compressor in the first place. The compressor seems to be of very limited use, other than possibly speeding up mass storage transfers.

Once they had the LZ decompressor, they already had a symbol decoder, and adding an IDCT and re-quantizer to add JPEG support was super cheap (as Ethatron pointed out earlier).

Cheers
 
I think they wanted LZ decompression for DXT texture data. The quoted 200MB/s compressed stream is 30% faster than a single core on my 2600S-based workstation. They'd need two Jaguar cores to get that kind of performance.

Far more than that, surely. A Jaguar core runs at less than half the clock speed of a 2600, so if Jaguar were the equal of Sandy Bridge clock for clock you'd still need at least 3 cores to cover off that level of performance. But as it's not the equal clock for clock, I'd guess you'd need at least 4 cores to achieve that performance level.
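
The core-count estimate can be sketched as below. The helper `jaguar_cores_needed` is a hypothetical name, the 0.5 clock ratio is the upper bound implied by "less than half", and the 0.7 per-clock ratio is an invented placeholder, not a measured figure. It also assumes the work scales perfectly across cores, which LZ decompression of a single stream may well not do.

```python
import math


def jaguar_cores_needed(target_vs_one_big_core, clock_ratio, ipc_ratio):
    """Cores needed, assuming perfect scaling across cores (optimistic)."""
    per_core = clock_ratio * ipc_ratio  # one small core relative to one big core
    return math.ceil(target_vs_one_big_core / per_core)


target = 1.3  # the block is quoted as 30% faster than one 2600S core

print(jaguar_cores_needed(target, 0.5, 1.0))  # 3, equal clock for clock
print(jaguar_cores_needed(target, 0.5, 0.7))  # 4, with the placeholder 0.7 ratio
```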
 
Thanks for pointing that out. I sometimes forget how computationally intensive decompression is. That becomes even more of a win with the relatively slow processors (compared to the PCs that I'm used to using) in the upcoming consoles.

Regards,
SB
 
Far more than that, surely. A Jaguar core runs at less than half the clock speed of a 2600, so if Jaguar were the equal of Sandy Bridge clock for clock you'd still need at least 3 cores to cover off that level of performance. But as it's not the equal clock for clock, I'd guess you'd need at least 4 cores to achieve that performance level.
Fafalada said one SPE could handily outperform that figure. So, an SPE's performance, at this task, is better than 4 Jaguar cores?
 
Today we present another Durango GPU custom feature shown in our first exclusive article: the display planes.

The Durango GPU supports three independent display planes, which are conceptually similar to three separate front buffers. The display planes have an implied order. The bottom plane is combined with the middle plane using the middle plane’s alpha channel as an interpolation factor. The result of this operation is combined with the top plane using the top plane’s alpha channel as an interpolation factor. Blending occurs at 10-bit fixed-point precision. The following diagram illustrates the sequence of operations.

[Diagram: display plane blending sequence (image j5JGAYg.jpg)]
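
The blending order described above can be sketched for a single color channel as follows. This is only an illustration: the function names `lerp10` and `composite` are made up, and the round-to-nearest step is an assumption, since the article only says blending happens at 10-bit fixed-point precision.

```python
MAX10 = 1023  # largest value representable in 10 bits


def lerp10(below, above, alpha):
    """Interpolate between two 10-bit values using a 10-bit alpha.

    alpha = 0 keeps `below`, alpha = 1023 gives `above`.
    Rounding to nearest is an assumption, not taken from the article.
    """
    return (below * (MAX10 - alpha) + above * alpha + MAX10 // 2) // MAX10


def composite(bottom, middle, middle_alpha, top, top_alpha):
    """Blend one channel: bottom with middle first, then the result with top."""
    title = lerp10(bottom, middle, middle_alpha)
    return lerp10(title, top, top_alpha)


# Fully transparent middle and top planes leave the bottom plane untouched.
print(composite(100, 900, 0, 500, 0))     # 100
# An opaque top (system) plane hides both title planes.
print(composite(100, 900, 512, 500, 1023))  # 500
```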

The three display planes are independent in the following ways, among others:

They can have different resolutions.
They can have different precisions (bits per channel) and formats (float or fixed).
They can have different color spaces (RGB or YCbCr, linear or sRGB).

Each display plane can consist of up to four image rectangles, covering different parts of the screen. The use of multiple screen rectangles can reduce memory and bandwidth consumption when a layer contains blank or occluded areas.
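
As a rough sense of the saving, the sketch below compares a full-screen layer with a layer stored as four smaller rectangles. All sizes are invented for illustration, as is the 4 bytes per pixel figure and the helper name `plane_bytes`; nothing here comes from the article beyond the four-rectangle limit.

```python
def plane_bytes(rects, bytes_per_pixel=4):
    """Memory for a plane stored as a list of (width, height) rectangles."""
    if len(rects) > 4:
        raise ValueError("a display plane holds at most four rectangles")
    return sum(width * height for width, height in rects) * bytes_per_pixel


# Hypothetical full-screen UI layer versus four HUD rectangles.
full_screen = plane_bytes([(1920, 1080)])
hud = plane_bytes([(400, 200), (400, 200), (600, 100), (300, 300)])

print(full_screen, hud)           # bytes for each layout
print(round(hud / full_screen, 3))  # fraction of the full-screen cost
```

The same ratio applies to the bandwidth needed to scan the plane out each refresh, since only the rectangles have to be read.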

The display hardware contains three different instances of various image processing components, one per display plane, including:

A hardware scaler.
A color space converter.
A border cropper.
A data type converter.

Using these components, the GPU converts all three display planes to a common output profile before combining them.

The bottom and middle display planes are reserved for the running title. A typical use of these two planes is to render the game world at a fixed title-specified resolution, while rendering the UI at the native resolution of the connected display, as communicated over HDMI. In this way, the title keeps the benefits of high-quality hardware rescaling, without losing the pixel-accuracy and sharpness of the interface. The GPU does not require that all three display planes be updated at the same frequency. For instance, the title might decide to render the world at 60 Hz and the UI at 30 Hz, or vice-versa. The hardware also does not require the display planes to be the same size from one frame to the next.

The system reserves the top display plane for itself, which effectively decouples system rendering from title rendering. This decoupling removes certain output constraints that exist on the Xbox 360. For example, on Durango the system can update at a steady frame rate even when the title does not. The system can also render at a lower or higher resolution than the title, or with different color settings.
 
Video streaming to tablets and other devices? :?:

Not according to that. 2 planes for the running title and 1 for the system. The 2 planes for the title would be the game and the HUD on top. The system plane would be for any OS notifications, menus, etc. that pop up.

Interesting thing to note is that each plane can be made up of 4 rectangles (tiles) and the resolution can change from one frame to the next. Looks in line with the display planes patent. A focus on framerate is nice, and making dynamic resolution a bit easier should promote that concept.

Seems like this unit has some pretty nice features. Hopefully we get mostly gamma-correct games this gen.

The confusing statement from this: "Each display plane can consist of up to four image rectangles, covering different parts of the screen. The use of multiple screen rectangles can reduce memory and bandwidth consumption when a layer contains blank or occluded areas."

Are they suggesting tile based rendering, or is there something I'm not clueing into about Display Planes that would reduce bandwidth or memory consumption? How much does this help the ROPs on the GPU?

Also, what is the data type converter, since it is listed separately from the colour space converter? Edit: Sorry, float - fixed conversion.
 
The system reserves the top display plane for itself, which effectively decouples system rendering from title rendering. This decoupling removes certain output constraints that exist on the Xbox 360. For example, on Durango the system can update at a steady frame rate even when the title does not. The system can also render at a lower or higher resolution than the title, or with different color settings.

I like this. It basically means that, if MS wanted to, they could certainly have a PiP video window overlaid on a game you are playing.

That could be a 24 FPS film quality movie, a 60 FPS made for TV movie, a video chat with friends or relatives, whatever. The system could overlay a tracklist of music that is playing in the background, social network feeds updated in realtime, stock quotes, whatever.

Not saying any of that is necessarily what they intend or what they will do, but it opens up a lot of possibilities in managing multiple sources of information while playing a game. Now if they do any of this, here's to hoping they make it very user configurable.

As well, as I personally expected, they'd only use 2 display planes for the games themselves. I thought 3 planes for gaming was a bit excessive myself, potentially leading to over-complicated game development.

Regards,
SB
 
Far more than that, surely. A Jaguar core runs at less than half the clock speed of a 2600, so if Jaguar were the equal of Sandy Bridge clock for clock you'd still need at least 3 cores to cover off that level of performance. But as it's not the equal clock for clock, I'd guess you'd need at least 4 cores to achieve that performance level.

I have turbo disabled, so the 2600s runs at 2800MHz. The decompression must be limited by the symbol decoding, since reading the sliding window should run at full tilt, which means load-to-use latency and overall pipeline latency is a factor. These last two parameters are rather good for Jaguar cores.

Cheers
 
Are they suggesting tile based rendering, or is there something I'm not clueing into about Display Planes that would reduce bandwidth or memory consumption?

It seems it wouldn't be a lot of memory, but you could quadrant the UI HUD and only push updates to the needed rectangle, no? Thus, every frame wouldn't have to have all the unchanged info too.
 
The confusing statement from this: "Each display plane can consist of up to four image rectangles, covering different parts of the screen. The use of multiple screen rectangles can reduce memory and bandwidth consumption when a layer contains blank or occluded areas."

Are they suggesting tile based rendering, or is there something I'm not clueing into about Display Planes that would reduce bandwidth or memory consumption?
My interpretation is the layer is divided into corner tiles, and if one tile is obscuring another, it isn't rendered/composited. So a HUD in the top left won't need any compositing with the other three tiles. Although a quarter is a larger area to obscure by the OS in the case of something like PIP.

Also, what is the data type converter, since it is listed separately from the colour space converter? Edit: Sorry, float - fixed conversion.
Bit reduction? HDR>32>24 bit colour formats?
 