Xbox One (Durango) Technical hardware investigation

GPU uses virtual addressing to access both the ESRAM and DDR3. It looks like the GPU has full throughput to both pools. CPU does not appear to have access to ESRAM.

Read pages 1 and 2: http://www.vgleaks.com/durango-gpu/1/

Yup.



What would be the benefit if the CPU could access the ESRAM? How much do the CPU and GPU need to talk if they address the same address space?

Is it possible that the eSRAM is addressable over the NB via the CPU?
 
GPU uses virtual addressing to access both the ESRAM and DDR3. It looks like the GPU has full throughput to both pools. CPU does not appear to have access to ESRAM.

Read pages 1 and 2: http://www.vgleaks.com/durango-gpu/1/
wow I missed that one, I don't know how though :LOL:

So it's pretty much confirmed that it's a GPU based on the GCN architecture, OK.

I don't think they could have gotten the GPU + scratchpad under 185mm^2; if it had to go higher, they might as well have opted for a single-chip solution.

I'm no longer that confident they have the level of price advantage (against Sony's solution) I was anticipating, which is especially bothersome if performance is below the PS4's and the price is a wash.
 
If Durango can save post transform polys that straddle or fall on the other side of a tile boundary to a cache in main memory then the tiling penalty that the 360 saw might be considerably reduced.

Perhaps one of the move engines could be used to DMA said polys into main ram without impacting on GPU performance.

Tiling makes an awful lot of sense, especially after looking at the amount of Wii U GPU taken up with edram.
 
You have much higher compression ratios with jpeg than DXTC variants.
Yeah, but as you note, there are more modern, better algos than ole plain-jane jpeg.

If they operated at full speed, the CPU and GPU alike would be starved for data.
That could easily be handled in hardware. Even the old Amiga had a mechanism that made sure DMA didn't starve the CPU. As Durango seems to be designed now, if the CPU or GPU aren't using much memory bandwidth, the move engines still can't utilize it all. It's wasted. Also, don't forget that video scanout, for example, apparently happens across the same internal bus as data transfers (which is several hundred MB/s in scattered bursts at full-HD rez), and who knows what overhead is associated with switching from one device "owning" the bus to another.

LZ77 is a good choice for lossless compression/decompression; it is fast and used everywhere.
But these days we have Ogg, for example. No need to pick a ubiquitous algo in a proprietary piece of hardware. You can use whatever the F you like; it makes no difference, since you have complete control over every piece of media that goes through that device.
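
Since LZ-class decompression keeps coming up with the move engines, here's a toy LZ77-style decoder sketched in C to show why it's cheap to do in hardware or on a single core; the token format is made up for illustration and is not whatever format the DMEs actually use:

    #include <stddef.h>
    #include <stdint.h>

    /* Toy LZ77-style decoder. Hypothetical token format: a control byte, then
     * either one literal byte (ctrl == 0) or a back-reference where ctrl is
     * the match length and the next two bytes are a little-endian distance
     * into the already-decoded output. */
    size_t lz77_decode(const uint8_t *src, size_t src_len, uint8_t *dst, size_t dst_cap)
    {
        size_t si = 0, di = 0;
        while (si < src_len) {
            uint8_t ctrl = src[si++];
            if (ctrl == 0) {                            /* literal byte follows */
                if (si >= src_len || di >= dst_cap)
                    break;
                dst[di++] = src[si++];
            } else {                                    /* back-reference */
                if (si + 1 >= src_len)
                    break;
                uint16_t dist = (uint16_t)(src[si] | (src[si + 1] << 8));
                si += 2;
                if (dist == 0 || dist > di)
                    break;                              /* malformed stream */
                for (uint8_t n = 0; n < ctrl && di < dst_cap; ++n, ++di)
                    dst[di] = dst[di - dist];           /* copy out of the sliding window */
            }
        }
        return di;                                      /* bytes produced */
    }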

Memory access can be a lot more efficient using a dedicated chunk of silicon like this.
Gaining you HOW MUCH exactly, really...? A few tenths of a percent, what? It can't be any huge amounts, that's for sure. Copying data must only take a tiny fraction of frame time.

Also, considering how much faster a CPU core is than the memory it's connected to, just coding up a subroutine to do odd bits of copies on one CPU core can't be much of a chore, if for whatever bizarre reason the regular DMA mechanisms all GPUs have had since Luther nailed up his words of complaint on that church door don't suffice. As for saying it would save some CPU time: well, what game is seriously going to load 8 CPU cores fully all the time...? You gotta use those cycles for something or they're just wasted.
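
A minimal sketch in C of that "just do the copy on a spare core" argument; the names are mine and nothing here is Durango-specific:

    #include <pthread.h>
    #include <string.h>

    /* Illustrative only, not any actual console API. */
    struct copy_job {
        void       *dst;
        const void *src;
        size_t      bytes;
    };

    static void *copy_worker(void *arg)
    {
        struct copy_job *job = arg;
        memcpy(job->dst, job->src, job->bytes);   /* one core playing "move engine" */
        return NULL;
    }

    /* Kick the copy off on a spare core and keep going; join (or poll a flag)
     * whenever the destination data is actually needed. */
    int copy_async(struct copy_job *job, pthread_t *tid)
    {
        return pthread_create(tid, NULL, copy_worker, job);
    }

The copy itself is the easy part; the argument for a dedicated move engine is presumably doing this without tying up a core or thrashing its caches.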

That is a really poor fit for a hardware prefetcher on a CPU and if you use a GPU, you'll have your massive shader-array sit idle while you just load and store values.
The GPU already has dedicated DMA hardware, you wouldn't need to use shader programs just to move stuff around in RAM...and I don't see how a "data movement engine" would automagically lead to greater GPU utilization just by existing.
 
wow I missed that one, I don't know how though :LOL:
I think the initial diagram has been misleading. It shows CPU, ESRAM, GPU, and DDR3 all with bidirectional IO into the northbridge, but doesn't show what communication between components there is. It's pretty fair to assume from that diagram that the CPU can communicate with the ESRAM via the northbridge just as it can with the DDR3. I'm not seeing anything saying the CPU doesn't have the same virtual addressing though, so I wouldn't bet on Scott being right just yet. It may not have a direct connection to the ESRAM, but may still have complete access to it.
 
Putting two and two together, I am wondering if this box is designed for virtual texturing as tunafish suggests. The ESRAM is described as supporting virtual assets spread across memory pools including partially loaded assets. The DME's include tile copies and LZ decompression. What if MS's tools include a VT creation tool that generates a set of virtual textures and the system is designed to stream them on the hardware level, saving a fair bit of redundant texturing? If so, the design choices can be consider entirely in those terms, such as tunafish says with 4megapixels of texture being plenty for a 720p or even 1080p screen (four texels per pixel. See sebbbi's post on VT for how that should be enough).
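
As a rough sanity check on that texel budget (my arithmetic, not from the leak): 1280 x 720 is about 0.92 million pixels, so four texels per pixel is roughly 3.7 million resident texels, about 15 MB uncompressed at 4 bytes per texel and far less with DXT. At 1920 x 1080 the same ratio is about 8.3 million texels (roughly 33 MB uncompressed), so a 1080p cache would either want to stay block-compressed or drop closer to two unique texels per pixel.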
 
I think the initial diagram has been misleading. It shows CPU, ESRAM, GPU, and DDR3 all with bidirectional IO into the northbridge, but doesn't show what communication between components there is. It's pretty fair to assume from that diagram that the CPU can communicate with the ESRAM via the northbridge just as it can with the DDR3. I'm not seeing anything saying the CPU doesn't have the same virtual addressing though, so I wouldn't bet on Scott being right just yet. It may not have a direct connection to the ESRAM, but may still have complete access to it.

Well there's this from the VGLeaks GPU article:

VGLeaks said:
The difference in throughput between ESRAM and main RAM is moderate: 102.4 GB/sec versus 68 GB/sec. The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU, I/O, and display output.
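
For what it's worth, those two figures line up with the obvious configurations (back-of-envelope on my part, assuming the leaked DDR3-2133 on a 256-bit bus and the rumoured 800 MHz GPU clock): 2133 MT/s x 32 bytes per transfer is about 68.3 GB/s, and 102.4 GB/s is exactly 128 bytes per cycle at 800 MHz.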
 
Virtual texturing would conserve a lot of bandwidth, and with DMA support it would also run faster than on a regular GPU.

And it doesn't have to be unique texturing - after all, Carmack has been asking for virtualization for more than a decade, when Megatexture could not even have been a theory (content creation tools have gone through small revolutions in that time, allowing for several orders of magnitude of increase in detail).

Also consider that Lionhead has had a complete engine using VT for at least 1.5-2 years now, with plenty of experience but no released product. As far as I know they're owned by Microsoft, so it could have been the testbed application to study hw usage patterns and bottlenecks for VT.

I also don't like the magic secret sauce talk, but this could make up for a lot of the supposed performance deficiencies. Maybe even get us some nice anisotropic filtering on ground textures, which has been quite lacking in all X360 titles ;)
 
Well there's this from the VGLeaks GPU article:

If the CPU only accesses the eSRAM to interface with the GPU, is it true contention? Sure, the GPU may be trying to read data at the same time the CPU is placing it there for the GPU, but it's not like the CPU is going to be pulling instructions from it like a cache.

Also consider that Lionhead has had a complete engine using VT for at least 1.5-2 years now, with plenty of experience but no released product. As far as I know they're owned by Microsoft, so it could have been the testbed application to study hw usage patterns and bottlenecks for VT.

Lionhead launch title confirmed ;) . But to be serious, Lionhead has never been that graphically exciting on the Xbox or 360. Seems an odd studio to incubate such a thing.
 
Virtual texturing would conserve a lot of bandwidth, and with DMA support it would also run faster than on a regular GPU.

And it doesn't have to be unique texturing - after all, Carmack has been asking for virtualization for more than a decade, when Megatexture could not even have been a theory (content creation tools have gone through small revolutions in that time, allowing for several orders of magnitude of increase in detail).
We can even go one step further. What if Durango is a TBDR with VT? That'd make for interesting discussion! Although it's not a hardware TBDR I'm guessing, so devs could be free to branch out if the APIs are flexible enough. But it could be designed around small tile rendering.
 
Yes, Lionhead's tech was extremely impressive, to some extent looking even better than Rage (the lake scene).

As for TBDR, I'm wondering about the geometry related issues, binning would require lots of extra steps and memory. Overdraw isn't such a big issue nowadays anyway, but geometry is becoming more and more complex. Not sure if CD Projekt is involved with nextgen consoles, but they do promise tessellated terrain for the Witcher III, for example.
So what advantages could this approach bring, apart from making good use of only 32MB of fast memory?
 
Durango’s Move Engines

Moore’s Law imposes a design challenge: How to make effective use of ever-increasing numbers of transistors without breaking the bank on power consumption? Simply packing in more instances of the same components is not always the answer. Often, a more productive approach is to move easily encapsulated, math-intensive operations into hardware.

The Durango GPU includes a number of fixed-function accelerators. Move engines are one of them.

Durango hardware has four move engines for fast direct memory access (DMA).

These accelerators are truly fixed-function, in the sense that their algorithms are embedded in hardware. They can usually be considered black boxes with no intermediate results that are visible to software. When used for their designed purpose, however, they can offload work from the rest of the system and obtain useful results at minimal cost.

More after the link
http://www.vgleaks.com/world-exclusive-durangos-move-engines/

Pretty intriguing stuff.
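
Purely as an illustration of what a "black box" fixed-function copy engine usually looks like from the software side, here's a hypothetical job descriptor sketched in C, based only on the capabilities described above (plain copies, tiling/untiling, LZ decompression); none of these names or fields come from the leak or any actual SDK, it's just the general shape such descriptors tend to take:

    #include <stdint.h>

    /* Hypothetical, for illustration only. */
    enum dme_op { DME_COPY, DME_TILE, DME_UNTILE, DME_LZ_DECODE };

    struct dme_job {
        enum dme_op op;         /* which fixed-function path to run             */
        uint64_t    src_va;     /* source virtual address (DDR3 or ESRAM)       */
        uint64_t    dst_va;     /* destination virtual address (DDR3 or ESRAM)  */
        uint32_t    row_bytes;  /* bytes per row (1D copies use a single row)   */
        uint32_t    rows;       /* row count for 2D / tiled copies              */
        uint32_t    src_pitch;  /* source stride in bytes                       */
        uint32_t    dst_pitch;  /* destination stride in bytes                  */
        uint64_t    fence;      /* value written back on completion so the CPU
                                   or GPU can wait on the job                   */
    };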
 
If Durango can save post transform polys that straddle or fall on the other side of a tile boundary to a cache in main memory then the tiling penalty that the 360 saw might be considerably reduced.

Perhaps one of the move engines could be used to DMA said polys into main ram without impacting on GPU performance.

Tiling makes an awful lot of sense, especially after looking at the amount of Wii U GPU taken up with edram.
I could see the G-buffer being rendered straight into the main RAM, pretty much avoiding the geometry overhead and the restriction on its size (KZ2 went with a 64MB G-buffer, right?). (I'm speaking more of tile-based deferred shading than of tile-based rendering, deferred or not.)

Then, if tiles are streamed to the scratchpad to be used by the GPU, I can't see the DMEs filling the whole scratchpad with tiles in advance; there may be a better use for it.
Aiming at a closed-box system, I could see them optimizing the tile size more with respect to the ROPs' properties; I would guess the ROP caches offer more bandwidth than the scratchpad, and one would choose a tile size that "works" with the ROPs.

Overall, if you go down that road, I wonder to what extent the DMEs + scratchpad make things significantly better than properly optimized deferred shading, since as I understand it the critical factor is having tiles that fit well into the ROPs and their caches; I see that as more important than how much bandwidth you have to either the main RAM or the scratchpad.
You could make the G-buffer fit in the scratchpad (by making compromises, etc.), but the difference in bandwidth between the scratchpad and the RAM is hardly going to make a night-and-day difference.
I guess devs may choose depending on their needs.
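
To put rough numbers on whether a G-buffer fits (my assumptions, not from the leak): a fairly typical deferred layout at 1080p with four 32-bit render targets plus a 32-bit depth buffer is 1920 x 1080 x 20 bytes, roughly 41 MB, which already overshoots the 32MB scratchpad before any MSAA; at 720p the same layout is about 18 MB and fits easily. So the G-buffer either gets slimmed down, tiled, or pushed out to DDR3 as suggested above.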

It got me wondering whether MSFT has plans for less obvious uses of the scratchpad, and so I wonder about the CPU being able to access it.

I think the initial diagram has been misleading. It shows CPU, ESRAM, GPU, and DDR3 all with bidirectional IO into the northbridge, but doesn't show what communication between components there is. It's pretty fair to assume from that diagram that the CPU can communicate with the ESRAM via the northbridge just as it can with the DDR3. I'm not seeing anything saying the CPU doesn't have the same virtual addressing though, so I wouldn't bet on Scott being right just yet. It may not have a direct connection to the ESRAM, but may still have complete access to it.
Interesting. It would make sense to me that it's actually the CPU that "sets up" the DMEs; I mean, it's fixed-function hardware, but it has to receive commands from somewhere.
They state somewhere that one DME is used by the system (I guess there's an API/driver for these things), which hints at the CPU being in control and aware of what those units are doing.

The support for generic lossless compression really makes me wonder if the scratchpad could be used as some form of "software L3" for the CPU too. Software would scatter/gather compressed or decompressed data for the system as a whole, not only to assist the GPU.
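
A minimal sketch in C of that "software L3" idea, under two loud assumptions: that the CPU can actually read the ESRAM (which the leaks don't confirm), and that assets are split into independently compressed chunks so one chunk can be expanded on demand; decompress() is a stand-in for whatever LZ-class decoder (a DME or plain software) does the work:

    #include <stddef.h>
    #include <stdint.h>

    struct chunk {
        const uint8_t *comp;      /* compressed bytes, resident in ESRAM */
        size_t         comp_len;
        size_t         raw_len;   /* size when expanded                  */
    };

    /* Stand-in for the real decoder (hardware or software). */
    size_t decompress(const uint8_t *src, size_t n, uint8_t *dst, size_t cap);

    /* Expand one chunk into a small, L2-friendly working buffer the CPU can
     * operate on, instead of keeping the whole asset uncompressed in DDR3. */
    size_t fetch_chunk(const struct chunk *c, uint8_t *work, size_t cap)
    {
        if (c->raw_len > cap)
            return 0;             /* caller's buffer is too small */
        return decompress(c->comp, c->comp_len, work, cap);
    }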
 
All of this actually makes a lot of sense. The previous generation had a lot more uncertainty at the beginning; a lot of techniques and approaches were invented on the fly, and we've seen huge changes from the start and throughout the cycle. Even the latest titles from the Uncharted, Halo, Gears, and COD series are all very different from the first releases.

But now the approaches are more and more similar; we've seen a lot of convergence, and quite a lot of general, must-have features have crystallized. It is quite unlikely we'll see significant changes in the structure of the rendering pipeline now, at least until realtime raytracing becomes truly affordable.
So it makes sense to design a more special purpose architecture which is tuned for these convergent engines.

It might limit experimentation with radically new approaches like a fully voxelized engine - but such new approaches would also require a complete overhaul of the content creation pipeline, which is the most expensive aspect of game development and thus the hardest to change. By the time all major studios could be convinced to do that, the current generation would possibly be close to its end anyway.
I wonder what Tim Sweeney thinks about this, though.
 
Cool, sounds like the Virtual Texturing discussion is right on the money.

What did the authors implement their demos on?

IMHO, workflow advancement is a more important leap than console specs.

The presentation said they were intending for this to run on 360, but I'm guessing the demos are running on the creator tools on the PC. Maybe some of these tools have found their way into the new SDK.
 
I wonder what Tim Sweeney thinks about this, though.
Really interesting post as usual :)

I wonder too. I was reading those two posts again, 1 and 2 from sebbbi; actually the talk is great :)

I mean, it seems a lot of the shortcomings of CPUs could be overcome by a fair amount of on-board memory (in the case he discussed, the L3 in Intel processors). Clearly, packing enough computational power into CPU cores while also having lots of on-chip cache is still out of reach.

Though now we see this scratchpad in Durango, made out of SRAM, in beefy quantity.
The DMEs don't turn the scratchpad into an L3, but they handle a nice trick: compression/decompression on the fly (you could load whatever you want into the scratchpad, keep it compressed, and stream tiles to the CPU's L2).

I could definitely see T. Sweeney being the kind of guy to think that, with ~300mm^2 as your silicon budget, you'd actually want something like this:
Get those Jaguar cores to support AVX natively (doubling the FP throughput).
Get plenty of them.
Get that scratchpad and more move engines, with reasonably low-latency access to it.
Lots of memory.

The Tim Sweeney "next-generation system":
32 reworked Jaguar cores with native AVX support @ 1.6 GHz: 896 GFLOPS
32MB of scratchpad.
More move engines.

=> you will get fewer, though better, pixels :LOL:

Though I'm definitely joking, it is interesting to notice that a CPU-based setup would not be that far off in "big numberz".
 