Xbox One (Durango) Technical hardware investigation

Status
Not open for further replies.
Fafalada said one SPE could handily outperform that figure. So, an SPE's performance, at this task, is better than that of 4 Jaguar cores?

Doubtful, even though this form of decompression is a good fit for SPEs: the symbol table and the sliding window fit in LS, with the rest of the LS used for output-buffering the decompressed stream.

Also, I measured using gunzip, which is probably not optimized for ultimate performance.

Cheers
 
Sounds like you're allowed to create custom rectangles instead of creating a whole frame. Say you're only updating a HUD that's 1920x100 along the top or bottom of the screen. No need to create a plane that's 1920x1080 just for a simple HUD. Looks like that will reduce memory & bandwidth to me.
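Rough numbers for that saving (illustrative assumptions on my part: 32-bit RGBA pixels, updates at 60 Hz; the leak doesn't give these figures):

```python
# Illustrative assumptions, not leaked specs: 32-bit RGBA pixels, 60 Hz updates.
BYTES_PER_PIXEL = 4
FPS = 60

def plane_bytes_per_second(width, height):
    """Bytes written per second to refresh a plane of the given size."""
    return width * height * BYTES_PER_PIXEL * FPS

full_frame = plane_bytes_per_second(1920, 1080)  # updating a full 1080p plane
hud_strip  = plane_bytes_per_second(1920, 100)   # updating only a 1920x100 HUD strip

print(f"full plane: {full_frame / 1e6:.1f} MB/s")   # ~497.7 MB/s
print(f"HUD strip:  {hud_strip / 1e6:.1f} MB/s")    # ~46.1 MB/s
print(f"saving:     {(1 - hud_strip / full_frame) * 100:.0f}%")  # ~91%
```

So for a HUD-only update, the write traffic drops in proportion to the rectangle's area.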

Tommy McClain
 
Sounds like you're allowed to create custom rectangles instead of creating a whole frame. Say you're only updating a HUD that's 1920x100 along the top or bottom of the screen. No need to create a plane that's 1920x1080 just for a simple HUD. Looks like that will reduce memory & bandwidth to me.

Tommy McClain

So basically any extra OS functions and any game HUDs drawn on the screen will take a smaller GPU hit compared to what we see on PS4?

Seems like Microsoft is truly aiming at something beyond just gaming.
 
So, I have a question: how does this thing handle vsync? If this is a fixed piece of hardware, does it automatically operate at a fixed rate (60Hz, 120Hz)? What if your game is not vsynced?
 
Sounds like you're allowed to create custom rectangles instead of creating a whole frame. Say you're only updating a HUD that's 1920x100 along the top or bottom of the screen. No need to create a plane that's 1920x1080 just for a simple HUD. Looks like that will reduce memory & bandwidth to me.

Tommy McClain

Also, if I am reading this correctly, you could have an 800x600 tile in the middle of the game plane that is rendered at full resolution (1080p, or even double resolution, 2160p), an overlapping plane of 1080x1080 at 1/2 resolution with the center 800x600 part not rendered, and two 420x1080 planes, one on each side of the screen, rendered at 1/4 resolution (or whatever). You could also have varying degrees of IQ processing on each of the tiles, so you could really stack the IQ on the center plane and gradually back off.

A quick pixel-counting analysis of the following scenario: tile 1 at 2160p, tile 2 at 1080p with occlusion, and 2 side tiles at 540p, yields 2.1 Mpixels versus 2.07 Mpixels for straight 1080p.
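The arithmetic roughly checks out under one reading (an assumption on my part, not something the article states): the fractions are linear scale factors, so "1/2 resolution" means 1/4 the pixels and "1/4 resolution" means 1/16 the pixels:

```python
# Pixel-count check for the tiled scenario above. Assumption (mine): fractions
# are linear scale factors, so "1/2 resolution" = 1/4 the pixels, etc.
center = (800 * 600) * 4                 # 800x600 tile at 2x linear (2160p-class)
ring   = (1080 * 1080 - 800 * 600) / 4   # 1080x1080 ring at 1/2 linear, center cut out
sides  = 2 * (420 * 1080) / 16           # two 420x1080 side tiles at 1/4 linear

total = center + ring + sides
print(f"tiled: {total / 1e6:.2f} Mpixels")        # ~2.15, close to the quoted 2.1
print(f"1080p: {1920 * 1080 / 1e6:.2f} Mpixels")  # 2.07
```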

edit - after rereading the article, it is unclear whether the custom rectangles can have different resolutions prior to compositing that plane or not, so the above may not apply.
 
Sounds like you're allowed to create custom rectangles instead of creating a whole frame. Say you're only updating a HUD that's 1920x100 along the top or bottom of the screen. No need to create a plane that's 1920x1080 just for a simple HUD. Looks like that will reduce memory & bandwidth to me.

Tommy McClain

It sounds like Apple's QuartzCore/CoreAnimation layer.

If you're rendering a full, standalone game, it doesn't really save bandwidth or memory. In a game/app, you should be able to allocate an arbitrarily sized off-screen framebuffer and refresh only part of the front buffer. The scan-out engine will then "move" the entire frame to the monitor.

The saving will come into play if you have overlapping windows. This is because the QuartzCore layer has a systematic way to determine what gets obscured and refuses to update those parts. It's like a low-level window or view management scheme.

On a Mac/PC, this would be handled mostly via software while the actual framebuffers are GPU-assisted. On Durango, it looks like MS tries to make the high-level view management hardware-assisted as well. The trade-off is that it's less flexible (only 4 quadrants per plane, and only 3 total display planes, whereas on a Mac it can be any number of planes/layers with the covered parts calculated on-the-fly).

In short, it is a resource-saving feature for an OS with overlapping windows. But the game itself is still updated at its native framerate, unabated.
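A toy sketch of that obscured-region test (names and the single-covering-window simplification are mine; a real compositor would clip against the union of covering windows):

```python
# Minimal sketch of "refuse to update obscured parts" with axis-aligned
# rectangles given as (x, y, w, h). A lower layer skips its redraw when a
# single opaque window above fully covers the dirty region.
def covers(outer, inner):
    """True if rect `outer` fully contains rect `inner`."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ox + ow >= ix + iw and oy + oh >= iy + ih

def needs_redraw(dirty_rect, windows_above):
    """Skip the redraw only if some window above hides the dirty region entirely."""
    return not any(covers(w, dirty_rect) for w in windows_above)

# A full-screen window hides a HUD update beneath it; a smaller dialog does not.
print(needs_redraw((0, 980, 1920, 100), [(0, 0, 1920, 1080)]))    # False
print(needs_redraw((0, 980, 1920, 100), [(600, 200, 700, 500)]))  # True
```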
 
Well, I wonder if those pieces of hardware have a purpose that goes further; a bit like the decompression units free the CPU, I can see those units freeing the GPU/ROPs of the "dumbest" operations.
Pretty much, I wonder if the GPU could render the render targets and then be completely done with it.
Along with the DMEs, those units would load the different RTs, do the scaling and blending, and send the result to the display.

I don't know how those units interface with the ESRAM and the main RAM, but in the best-case scenario (the system is designed so those units do the blending for free as far as the GPU is concerned: they are provided with enough bandwidth, etc.) I could see that being a massive win, a lot more relevant to the system's rendering capabilities than the move engines.

As for the UI, it would make sense to me if it were part of the "system" screen. It would make sense for MSFT to present a sort of API for the UI that includes proper policies, scaling, etc.
 
Interesting, Patsu. I figured they had to put a limit on the # of planes in order to keep from over-taxing the GPU. Hopefully they do support custom-size rectangles, because on re-reading, it doesn't actually say you can. Just that there are 4 of them. Would suck if they're fixed sizes. Also, anybody notice that there wasn't anything specific about IllumiRoom support? Guess you could still use 2 planes for it. Plane 0 would be for IllumiRoom & Plane 1 for the TV.

Tommy McClain
 
So, I have a question: how does this thing handle vsync? If this is a fixed piece of hardware, does it automatically operate at a fixed rate (60Hz, 120Hz)? What if your game is not vsynced?

It wouldn't matter. Whatever is in the display buffer when the signal goes to the TV will be sent.

So if the TV is 60 Hz, the system overlay (SO) is 60 Hz, the game UI (UI) is 30 Hz, and the game (G) is 40 Hz, you'd have:

Frame 1 - SO1, UI1, G1
Frame 2 - SO2, UI1, G1
Frame 3 - SO3, UI2, G2
Frame 4 - SO4, UI2, G3
Frame 5 - SO5, UI3, G3
...etc...

Where the number after the display plane designation is the frame being rendered. So each SO plane gets updated every time the display is refreshed. The game UI is only refreshed every 2 frames. And the game rendering plane is refreshed every 2 frames, then 1 frame, then 2 frames, then 1 frame, etc.
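The cadence above can be sketched in a few lines (my own illustration, not anything from the leak): at each 60 Hz scan-out, each plane simply shows its most recently completed source frame.

```python
# At refresh t (0-based) of a 60 Hz display, a source running at source_hz
# has completed floor(t * source_hz / 60) frames; show the latest (1-based).
def source_frame(refresh_index, source_hz, display_hz=60):
    return refresh_index * source_hz // display_hz + 1

for t in range(5):
    so = source_frame(t, 60)   # system overlay, 60 Hz
    ui = source_frame(t, 30)   # game UI, 30 Hz
    g  = source_frame(t, 40)   # game, 40 Hz
    print(f"Frame {t + 1} - SO{so}, UI{ui}, G{g}")
```

Running it reproduces the table: the game plane repeats in a 2-1-2-1 pattern, the UI every 2 frames.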

But that said, I believe that all 3 planes get composited every screen refresh. It's just that the above is when each individual display plane is refreshed prior to compositing.

Regards,
SB
 
Well, I wonder if those pieces of hardware have a purpose that goes further; a bit like the decompression units free the CPU, I can see those units freeing the GPU/ROPs of the "dumbest" operations.
Pretty much, I wonder if the GPU could render the render targets and then be completely done with it.
Along with the DMEs, those units would load the different RTs, do the scaling and blending, and send the result to the display.

Then 1.2TFs/12CUs could be "enough"?
 
Newbie question

Can this have any impact on 3D?

Like a background process creating depth from a frame rendered by the GPU when splitting it up into two separate planes?
 
Don't see how it would be useful for 3D, where you'd have 2 basically identical planes. It wouldn't save you anything.

Unless you could somehow have only part of the frame in 3D by using planes.
 
Well, I wonder if those pieces of hardware have a purpose that goes further; a bit like the decompression units free the CPU, I can see those units freeing the GPU/ROPs of the "dumbest" operations.
Pretty much, I wonder if the GPU could render the render targets and then be completely done with it.
Along with the DMEs, those units would load the different RTs, do the scaling and blending, and send the result to the display.

I don't know how those units interface with the ESRAM and the main RAM, but in the best-case scenario (the system is designed so those units do the blending for free as far as the GPU is concerned: they are provided with enough bandwidth, etc.) I could see that being a massive win, a lot more relevant to the system's rendering capabilities than the move engines.

As for the UI, it would make sense to me if it were part of the "system" screen. It would make sense for MSFT to present a sort of API for the UI that includes proper policies, scaling, etc.

Following up from 360's built-in scaler...

For this display plane technology, it is also a convenience feature since, like with PS3's split memory pools, someone needs to move or blend the final frame data from whichever pool to its final resting place (Durango's display planes). Its advantage is that we don't have to care what exactly the apps/games do (e.g., wireless display, 3D, 4K TV, etc.); everything works uniformly through the display planes.

Resource-saving techniques can be done in software also, so I don't necessarily think it's a big saving win comparatively. However, because the OS knows the final frame will always be _structured_ in the display planes, it can employ more tricks uniformly across all the apps' and games' output.
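As an illustration of what that final compositing step might look like (the standard "over" operator is my assumption here; the leak doesn't specify the blend math), three planes could combine per pixel, back to front:

```python
# Sketch: composite game plane, game UI, and system overlay with the
# Porter-Duff "over" operator on straight-alpha RGBA pixels in [0, 1].
def over(top, bottom):
    """Composite one RGBA pixel over another; returns the blended RGBA."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    a = ta + ba * (1.0 - ta)
    if a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)
    def blend(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / a
    return (blend(tr, br), blend(tg, bg), blend(tb, bb), a)

game    = (0.2, 0.4, 0.6, 1.0)  # opaque game pixel
ui      = (1.0, 1.0, 1.0, 0.5)  # translucent HUD pixel
overlay = (0.0, 0.0, 0.0, 0.0)  # transparent system-overlay pixel

pixel = over(overlay, over(ui, game))
print(pixel)  # roughly (0.6, 0.7, 0.8, 1.0)
```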

On Orbis, Sony will have to do it via software libraries for each of these modes (3D, wireless display, general purpose OS integration, multi-console rendering, blah).

Don't see how it would be useful for 3d where you'd have 2 basically identical planes. It wouldn't save you anything.

Unless you could somehow have only part of the frame 3d by using planes.

I suspect the display planes will be complemented by a set of library calls to enable more interesting (and flexible) use cases, but that's just my guess.

EDIT:
If my guess is correct, then people probably can't see the real benefits of these display planes until they see the Durango OS running for real. I think resource saving is secondary (and there are substitutes using software); these display planes should enable a new and consistent user experience in the final OS. ^_^
 
Newbie question

Can this have any impact on 3D?

Like a background process creating depth from a frame rendered by the GPU when splitting it up into two separate planes?

The only way it could be used is if you were using the red/blue or cyan/magenta or other such dual-color-glasses type of 3D. Otherwise, all other methods rely on interleaved frames or alternate frames for stereoscopic 3D display.

Although I suppose for a half-SBS (side by side) or OU (over/under) you could have one plane doing one half while the other plane does the other half. But, er, what would be the point of that? :)

Basically, the only method that would benefit from 2 planes being composited together is the one with glasses using 2 different colors.
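In that color-glasses case the composite of two planes is the final image directly; a per-pixel sketch (my own illustration, assuming a red/cyan scheme):

```python
# Red/cyan anaglyph: the left-eye plane supplies the red channel, the
# right-eye plane supplies the green and blue (cyan) channels.
def anaglyph(left_rgb, right_rgb):
    lr, _, _ = left_rgb
    _, rg, rb = right_rgb
    return (lr, rg, rb)

print(anaglyph((200, 10, 10), (10, 180, 190)))  # (200, 180, 190)
```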

Regards,
SB
 
Actually, I'm starting to have my doubts about the efficiency of the setup. The DMEs can steal bandwidth from embedded and external RAM. The 8 cores steal from external RAM bandwidth. The GPU steals from embedded and external RAM bandwidth. The RAMDAC steals from external memory bandwidth. The display-planes feature forces you not only to have an additional r/w bandwidth-eating resolve, it also forces you to double-buffer if you want to keep the GPU busy frame-to-frame without waiting for the resolve. Which in turn interferes with other memory accesses (and AFAWK there is no prioritization of memory consumers), which can be any or all of GPU, DME or CPU. Non-vsync may be less attractive, assuming the display-plane feature is in the RAMDAC, as the latency of backbuffer updates is now dictated by max composition time, not raw bandwidth.

UMA with so many consumers itself appears a bit problematic from a bus-locking and/or memory-access-queue perspective, except if all consumers are connected to an LLC (last-level cache), including the RAMDAC. But then it isn't really UMA; it's NUMA again for some consumers because of the embedded RAM, forcing you to do the undesirable manual data-coherence management.

I expect you'll have full RAM contention faster than you can say "RAM", and many months of tweaking and cursing and hacking lie ahead.
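To put rough numbers on the resolve cost being worried about (all figures are illustrative assumptions of mine, not leaked specs: three 1080p RGBA planes read and one composited frame written, at 60 Hz):

```python
# Back-of-the-envelope resolve traffic. Assumptions (mine): 3 source planes
# at 1080p, 4 bytes per pixel, one composited write, 60 Hz.
BYTES_PER_PIXEL = 4
PIXELS_1080P = 1920 * 1080
HZ = 60

reads  = 3 * PIXELS_1080P * BYTES_PER_PIXEL * HZ  # reading the three planes
writes = 1 * PIXELS_1080P * BYTES_PER_PIXEL * HZ  # writing the composite

print(f"resolve traffic: {(reads + writes) / 1e9:.2f} GB/s")  # ~1.99 GB/s
```

Whether that traffic actually hits main RAM (rather than being consumed on scan-out) depends on where the compositor sits, which the leak doesn't settle.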
 
All these little custom pieces of processing hardware are starting to add up to quite something, aren't they?
 
All these little custom pieces of processing hardware are starting to add up to quite something, aren't they?

Yes, to something like the prior poster has hinted: too convoluted. I thought in the beginning that Durango could be elegant, above all when we talked here about the gains that low latency could provide to a GPU, but after all these leaks the "elegant" adjective is moving far away from my mind... as fast as the Frankenstein approach is moving in.
 
I'd wait to see the final hardware before declaring MS engineers incapable of solving this issue...
 