Xbox One (Durango) Technical hardware investigation

This will be an important speedup for next gen, and it is one of the reasons why a single APU with a 5W TDP (!!!) and a thick DirectX API can render Dirt: Showdown fluently at 1920x1080 while an 80W current-gen game console can't. I find this most impressive! This is progress!

No, this is wrong. You're implying that HSA makes a system so much more efficient that a 5W HSA system can outperform an 80W non-HSA system, but your reasoning is totally flawed. Dirt is not built for HSA and therefore would take no advantage of the improved communication between CPU and GPU in Temash. It runs so well purely because Temash is pretty powerful in the traditional sense of a decent CPU and a decent GPU. For the record, it doesn't even run all that well; the framerate is clearly in the 20s at best, probably in the teens.

If what you were saying about HSA automatically making systems so much more powerful/efficient were true, then why isn't Trinity displaying mind-blowing performance? It may not be as fully HSA as Kaveri, but it's certainly a lot more unified than discrete systems.

Sure, coding specifically for HSA will allow new approaches to certain operations to be exploited. Probably most significantly, operations that would previously have had to be performed on the CPU but which lend themselves well to SIMD execution (like physics). So the biggest impact will likely be in areas that have traditionally been seen as CPU limited. It's not going to suddenly make a 1.2 TFLOP GPU behave like a 2 TFLOP GPU. It might make a 100 GFLOP CPU behave like a 500 GFLOP CPU though - at the expense of graphics performance.
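To make the "SIMD-friendly CPU work" point concrete, here is a minimal sketch (plain C++, every name made up) of the kind of embarrassingly parallel loop in question - something traditionally run on the CPU that maps naturally onto wide SIMD, or onto a GPU compute job without a copy once both sides share an address space:

[code]
#include <cstddef>
#include <vector>

struct Particles {                 // structure-of-arrays layout, friendly to SIMD/GPU
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

void integrate(Particles& p, float dt, float gravity)
{
    const std::size_t n = p.px.size();
    for (std::size_t i = 0; i < n; ++i) {   // every iteration is independent:
        p.vy[i] += gravity * dt;            // trivially vectorisable on a CPU,
        p.px[i] += p.vx[i] * dt;            // or dispatchable as a GPU compute
        p.py[i] += p.vy[i] * dt;            // job with no copy if the memory
        p.pz[i] += p.vz[i] * dt;            // is shared (the HSA argument above)
    }
}
[/code]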
 
Though now we see this scratchpad in Durango made out of SRAM, cache-like, in beefy quantity.
The DMEs don't turn the scratchpad into an L3, but they do handle a nice trick: compression/decompression on the fly (you could load whatever you want into scratchpad memory, keep it compressed, and stream tiles to the CPU L2).


Only 2 of the 4 DME units have compress/decompress hardware if I understand correctly.

My impression was that it would be used more for copying buffers between the edram and ddr3 than for more efficient texturing.
 
The 102GB/s eSRAM bandwidth number is derived from the 800MHz GPU clock and a 1024-bit (128-byte) connection. But if it's part of the APU, why shouldn't the factor be the internal GPU data bus (around 5Gbps at 800MHz) for the possible limits? The internal data bus is designed to handle 384-bit GDDR5 bandwidth.

And if it's an external connection on an MCM, there's also DDR/QDR SRAM. The bandwidth difference between 68GB/s and 102GB/s, even with lower latency in an L1/L2-style setup, doesn't look big enough to me to be worth the effort.
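For reference, the quoted numbers fall straight out of the clocks and bus widths. A quick back-of-the-envelope check (the DDR3 line assumes the widely reported 256-bit DDR3-2133 interface, which isn't stated in this thread):

[code]
#include <cstdio>

int main()
{
    // eSRAM: 1024-bit (128-byte) port at the 800 MHz GPU clock
    const double esram = 800e6 * 128;         // bytes per second
    // DDR3: assumed 256-bit interface at 2133 MT/s (32 bytes per transfer)
    const double ddr3  = 2133e6 * 32;         // bytes per second
    std::printf("eSRAM: %.1f GB/s\n", esram / 1e9);   // ~102.4 GB/s
    std::printf("DDR3 : %.1f GB/s\n", ddr3  / 1e9);   // ~68.3 GB/s
    return 0;
}
[/code]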
 
Only 2 of the 4 DME units have compress/decompress hardware if I understand correctly.

My impression was that it would be used more for copying buffers between the edram and ddr3 than for more efficient texturing.

They are also pretty slow.

Around 400MB/s for JPEG + LZ combined, IIRC; that works out to a couple of MB per frame at most.
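Quick sanity check on what 400MB/s actually means per frame:

[code]
#include <cstdio>

int main()
{
    const double decodeRate = 400e6;                    // ~400 MB/s combined JPEG + LZ
    for (double fps : {30.0, 60.0})
        std::printf("%2.0f fps -> ~%.1f MB decoded per frame\n",
                    fps, decodeRate / fps / 1e6);       // ~13.3 MB at 30, ~6.7 MB at 60
    return 0;
}
[/code]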
 
Only 2 of the 4 DME units have compress/decompress hardware if I understand correctly.
And only one of them can do jpegs, which are much more relevant for natural-looking textures than zip. However, just that one is easily enough for virtual texturing, as you don't recycle your entire atlas every frame.

My impression was that it would be used more for copying buffers between the edram and ddr3 than for more efficient texturing.

Virtual texturing is a technique that (among other things) exploits the massive efficiency difference between texture compression and traditional image compression. Textures need to be addressable at arbitrary offsets -- you need to be able to say "I want this texel here" without having to process the entire texture on every access. Most images don't have this requirement; this :oops: little smiley icon, for instance, is processed as a whole when it's drawn. This means that JPEG/LZ77/whatever compression can achieve much better compression ratios than texture compression.

So, in virtual texturing you have a small atlas that holds ready-to-use textures, and you keep most of your textures compressed with JPEG or the like. As you move about in the game world, the game tries to keep the atlas filled with textures that are relevant to your surroundings, uncompressing the JPEG textures as you walk closer to them and placing them resident in the texture cache. For a good example of this in use, check out Chivalry. Whenever you spawn after being away from your spawn, everything around you looks fuzzy for a second. Then it snaps to sharp and stays that way, regardless of where you walk or what you do. So long as you are not unpredictably teleporting around, the VT scheme can reduce texture memory use and allow a much higher amount of texture detail by keeping only nearby surfaces ready to use at high detail.

That was a mile-high overview with obvious gaps -- if you want more detail on virtual texturing, ask Sebbbi. I haven't actually ever implemented a game engine that uses it.

The reason the DMEs sound like they're made for virtual texturing is that one important part of it is turning those small chunks of JPEG into textures on demand, with low latency and ideally without consuming a lot of system resources. The pop-in problems widely reported in Rage were there because the latency of conversion was too high. The decompressing DME seems like an ideal fit for this -- it's basically a plug-in solution. You don't even need to use any GPU or CPU resources for it, beyond figuring out which of the textures you want to uncompress this frame.
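For what it's worth, here's a very rough sketch of the per-frame update loop described above. Everything in it is hypothetical (PageId, requestDecode(), the decode budget are made-up names); it's just the shape of the idea, not anyone's actual engine code:

[code]
#include <cstddef>
#include <unordered_set>
#include <vector>

// A virtual-texture page: tile coordinates plus mip level.
struct PageId { int x, y, mip; };
inline bool operator==(const PageId& a, const PageId& b)
{ return a.x == b.x && a.y == b.y && a.mip == b.mip; }

struct PageIdHash {
    std::size_t operator()(const PageId& p) const {
        return (std::size_t(p.x) * 73856093u) ^
               (std::size_t(p.y) * 19349663u) ^
               (std::size_t(p.mip) * 83492791u);
    }
};

// Hypothetical stand-in: hands a compressed (e.g. JPEG) page to whatever
// decodes it (a CPU job, a GPU job, or a fixed-function unit like a DME).
void requestDecode(const PageId& /*page*/) { /* issue an async decode job */ }

struct Atlas { std::unordered_set<PageId, PageIdHash> resident; };

void updateVirtualTexture(Atlas& atlas,
                          const std::vector<PageId>& visiblePages,
                          int decodeBudgetPerFrame)
{
    int issued = 0;
    for (const PageId& page : visiblePages) {
        if (atlas.resident.count(page)) continue;    // already sharp, nothing to do
        if (issued >= decodeBudgetPerFrame) break;   // respect the per-frame decode rate
        requestDecode(page);                         // decoded asynchronously; the page
        atlas.resident.insert(page);                 // "snaps to sharp" a frame or two later
        ++issued;
    }
}
[/code]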
 
Blast, I originally replied to this in the wrong thread. Moving it here, but keeping the original quote that I quoted. :p

VGleaks have now published some information on Durango's Move Engines.



http://www.vgleaks.com/world-exclusive-durangos-move-engines/

So, assuming the info is correct of course, we know what these are. I must say they don't look ground-breaking, but this sort of hardware can sometimes be used in interesting ways.

Some of that is pretty interesting. I wasn't expecting 2 of the DMEs to be different in nature.

Depending on whether data is compressible/decompressible using LZ77 (apparently many EA games use this), that could either increase effective bandwidth to the CPU/GPU and/or decrease the CPU/GPU load required to decompress data stored in LZ77 format. The CPU/GPU load could potentially be significant, as the JPEG XR decode for Rage's assets could be quite slow without GPGPU assistance, and even then could be quite slow if not enough GPGPU resources were available. I'm not sure how resource-intensive decompression of LZ77 is compared to JPEG XR, however. I'm going to guess less resource-intensive, but I'm not sure by how much.

[strike]It also implies that the method they go with can result in up to an 8:1 compression ratio, which effectively means up to an 8x increase in effective bandwidth (25.6 GB/s for one DME, up to ~200 GB/s data throughput after compression/decompression). Without using the DME, you'd have to use CPU or GPU resources for this. So that effectively makes it "free" on Durango as long as the compression used (if pre-shipped with compressed assets) is that which the DME supports. Which in the case of console titles for Durango, will most likely be the case.

BTW - before fans of one console or another jump on this. That isn't a magic bullet. If the data is highly compressible, that can result in more effective bandwidth. On the other hand if it is already highly compressed using LZ77, that just reduces the potential CPU/GPU load. If it uses some other form of compression and thus isn't highly compressible then the benefits are significantly lower. However, it's likely that developers targeting a game at Durango will likely try to take advantage of this.

A fan of Durango might point out that provides a theoretical 350+ GB/s bandwidth if all the stars aligned (accessing DDR3, ESRAM, and highly compressible data simultaneously) but that's only a wet dream. It's not going to happen, not even close. Non fans of Durango might point out that it has no benefit if everything is highly compressed in something other than LZ77 or the supported JPEG formats. That also isn't going to happen. The benefit lies somewhere in between.[/strike]

Somewhat disappointed that the JPEG decoder doesn't support more advanced JPEG formats such as JPEG XR. But it's another bandwidth reducing and/or CPU/GPU resource reducing function.

The rest is pretty standard, though the tile/untile is potentially interesting.

The move engines seem very underwhelming to me. They don't operate at full memory bus speed, and they all share bandwidth with each other and other on-chip devices.

If you're tiling in small chunks then you don't need to use all the bandwidth. And as mentioned, if they did use all the bandwidth then other parts of the system would become bandwidth starved, and at that point what's the point of having them? These are meant to do their job when the bandwidth isn't being fully utilized, or when they can use a part of it better than whatever else is accessing it.

Yeah, but as you note, there are more modern, better algos than ole plain-jane jpeg.

-snip-

But these days we have Ogg, for example. No need to pick a ubiquitous algo in a proprietary piece of hardware. You can use whatever the F you like; it makes no difference, since you have complete control over every piece of media that goes through that device.

Increased compression complexity also increases complexity of the silicon used to decode it. I'm willing to bet that LZ77 was chosen as a good compromise between efficiency and cost of implementation.

In other words, how much bloat do you want to add to the SOC for how much benefit? If it requires 4x the silicon space for 1.5x the speed, is it worth it?

The GPU already has dedicated DMA hardware, you wouldn't need to use shader programs just to move stuff around in RAM...and I don't see how a "data movement engine" would automagically lead to greater GPU utilization just by existing.

But the whole point of this is to free up GPU resources. Why have the GPU do it if you can have something else do it while the GPU gets on with its rendering tasks, fetching what it needs from ESRAM when possible?

On top of that, the added functionality in 2 of the DMEs allows things to be done that would otherwise require GPU compute resources or CPU resources - again, things that could be better used for running the game than for compressing/decompressing data.

Regards,
SB
 
The DMEs don't run at full speed when using the compression hardware.

It's a combined rate of roughly 400-450MB/s when using both the JPEG and LZ units; nowhere near the peak rate of 25GB/s.
 
Sucks that pretty much all of the posters who have implemented virtual texturing, and who could give some hints as to how the DMEs might help with it, are probably under NDA and won't comment on this. Looking at Sebbbi's post on virtual texturing, they used over 50MB of RAM for their implementation in Trials (at 720p), and I'm unsure which parts would benefit the most from being in ESRAM.
 
LZ77 is quite sub-optimal for the kind of data processed by GPUs. What I see as a possibility is, for example, writing a massively parallel predictive image compressor using the hardware Huffman coder required by RFC1951; I have one implemented here which would only require minor changes to actually operate that way. It would have to be two-pass -- or one-pass if the DPCM is done directly in the data consumer, without storing the residuals -- but it would gain ~20% over LZ77. It may be possible to misuse the <length,offset> pair to duct-tape a run-length coder (if the sliding window can overlap with the current position), which might make it possible to evade the 1 bit/symbol limit of Huffman coding; under optimal circumstances that could bring it to within ~2% of an arithmetic coder. The 1k/4k window limitation of the DME makes LZ77 pretty much a joke for offline data.
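To illustrate the DPCM idea (my illustration, not the poster's actual code): a trivial left-neighbour predictor turns pixels into small residuals, which is what you would then feed to the RFC1951 Huffman coder as literals:

[code]
#include <cstddef>
#include <cstdint>
#include <vector>

// Left-neighbour DPCM: residual[i] = pixel[i] - pixel[i-1] (mod 256).
// On natural images the residuals cluster around zero, which a Huffman coder
// exploits far better than the raw pixel distribution.
std::vector<uint8_t> dpcmResiduals(const std::vector<uint8_t>& row)
{
    std::vector<uint8_t> residuals(row.size());
    uint8_t prev = 0;
    for (std::size_t i = 0; i < row.size(); ++i) {
        residuals[i] = static_cast<uint8_t>(row[i] - prev);
        prev = row[i];
    }
    return residuals;
}

// Decoding is the inverse prefix sum: pixel[i] = pixel[i-1] + residual[i].
[/code]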

The only really useful thing is the encoder, if it runs at the promised speed. I guess the 1k/4k limit is related to the internal coder's block cache, to avoid using chip-external bandwidth. LZ77 is asymmetric, balancing the computational burden towards the encoder; an efficient encoder consumes heaps of memory accesses.
 
The Durango actually makes quite a lot of sense, and indeed looks as if it is optimised for tiling multi-pass renderers. Which is not a problem, as that's a clear trend in modern graphics engines and will likely stick around for long enough.

The DMEs are not standard DMA units; they also offer conversion between tiled and linear memory layouts, and that does increase efficiency. Indeed, the GPU can copy between ESRAM and RAM on its own using DMA engines, but AFAIK they are only one-way fetching, so it would have to fire up its DMA for every copy -- i.e. copy a chunk from RAM to cache, then set it up again to copy from cache to ESRAM; plus, that would have to be repeated for every line of a texture if it's being tiled, while you can just set up a DME once and have everything copied without ever stopping.

Multi-pass rendering would hide the relatively low speed of the DMEs - they would be able to load a new chunk of data into ESRAM while the previous one is being processed. While the actual graphics processing's memory access happens at 51GB/s (remember, you have to read AND write) and has to happen several times over, data is meanwhile streamed in by the DMEs at 25GB/s (one "transaction" only works in one direction).
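In pseudo-C++ (all function names here are made-up stand-ins, not anything from a real SDK), that overlap would look roughly like this:

[code]
#include <cstddef>

// Hypothetical stand-ins -- queueDmeCopy/waitDmeCopy/renderPass are invented
// for illustration only. The point is just the ping-pong overlap.
struct EsramRegion { void* ptr; std::size_t size; };
void queueDmeCopy(EsramRegion /*dst*/, const void* /*src*/, std::size_t /*size*/) {}
void waitDmeCopy(EsramRegion /*dst*/) {}
void renderPass(const EsramRegion& /*input*/) {}

// Stream chunk i+1 from DDR3 into one ESRAM region while the GPU renders
// from the other, so the DME's modest copy rate is hidden behind GPU work.
void renderChunked(const void* ddr3Data, std::size_t chunkSize, int chunkCount,
                   EsramRegion esram[2])
{
    const char* src = static_cast<const char*>(ddr3Data);
    queueDmeCopy(esram[0], src, chunkSize);              // prefetch the first chunk
    for (int i = 0; i < chunkCount; ++i) {
        EsramRegion cur = esram[i & 1];
        if (i + 1 < chunkCount)                          // kick off the next chunk now
            queueDmeCopy(esram[(i + 1) & 1],
                         src + std::size_t(i + 1) * chunkSize, chunkSize);
        waitDmeCopy(cur);                                // make sure this chunk has landed
        renderPass(cur);                                 // GPU only ever reads from ESRAM
    }
}
[/code]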

Essentially, all the GPU should ever have to access is ESRAM. Its lower latency should also contribute to speed in multi-pass renderers, though I don't think it would be a significant change - GPU architectures are designed to hide latencies, after all.

The usefulness of the LZ77 and JPEG units is questionable to me, as their throughput isn't high. I'd even think the CPU would be better at that, but as there are mentions that Rage had latency problems decoding its megatexture, they are probably useful. Still, as the JPEG decoder apparently doesn't support decoding into DXTC, only into a bitmap, the CPU would have to be used anyway. LZ decompression could be used to unpack streamed geometry for a megamesh system. I don't have ideas on how to use the compressor, though.
 
That could easily be handled in hardware. Even the old Amiga had a mechanism that made sure DMA didn't starve the CPU.

The Amiga is a great example of data starvation. The blitter either used all the chip-memory bus cycles (blitter-nasty mode), or it would use three cycles, then yield one. The latter mode was avoided when possible because it took an extra bus cycle for the blitter to yield the bus, so you only used 4 out of 5 bus cycles (3 for the blitter, one for the M68K, one wasted).

As Durango seems to be designed now, if the CPU or GPU aren't using much memory bandwidth, the move engines still can't utilize it all. It's wasted.

If the CPU and GPU aren't using much memory bandwidth, odds are that not a lot of data needs to be moved around.

But these days we have Ogg, for example. No need to pick a ubiquitous algo in a proprietary piece of hardware. You can use whatever the F you like; it makes no difference, since you have complete control over every piece of media that goes through that device.

Ogg is just a container format, and only for media.

Your choice basically boils down to Lempel-Ziv variants, with or without Huffman symbol coding on top, or block-sorting compression. The former is used in gzip, rar and the most common form of zip compression; the latter is used in bzip2. The latter has better compression ratios, but much bigger memory footprint and bandwidth requirements.

The option not to use Huffman coding on top of LZ tells us MS values a low latency, high throughput and cheap compression/decompression method higher than ultimate compression ratio.

A perfectly valid choice.

Gaining you HOW MUCH exactly, really...? A few tenths of a percent, what? It can't be any huge amounts, that's for sure. Copying data must only take a tiny fraction of frame time.

It's all about bus utilization.

Imagine a CPU doing the swizzling, loading and storing data. Do you use temporal or non-temporal memory ops? Either way the CPU quickly issues a series of loads and stores, then it stalls waiting for data.

Some of the accesses are adjacent, so the prefetcher fires up; this helps with subsequent adjacent loads - good. But because of the swizzling and the boundaries (remember, we can copy to and from subregions of textures), the next load is somewhere completely different and again we have a stall. Meanwhile the prefetcher has already fetched data ahead of the first series of loads, wasting bandwidth.

So we waste expensive silicon (our CPU core) moving data around, wasting bandwidth doing so. We're not talking a few percent here, more like 25-50%.
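To make the scattered-access point concrete, here is a generic example (an 8x8 block-tiled layout picked purely for illustration - not Durango's actual tiling format) of what an untile copy has to do:

[code]
#include <cstddef>
#include <cstdint>

constexpr int TILE = 8;

// Offset of texel (x,y) in a tiled image that is widthInTiles tiles wide:
// texels inside each 8x8 tile are contiguous, tiles are stored row-major.
std::size_t tiledOffset(int x, int y, int widthInTiles)
{
    const int tileX = x / TILE, tileY = y / TILE;
    const int inX   = x % TILE, inY   = y % TILE;
    return (std::size_t(tileY) * widthInTiles + tileX) * TILE * TILE + inY * TILE + inX;
}

// Copy a sub-rectangle from a tiled source into a linear destination.
// The destination writes are nicely sequential, but the source reads keep
// jumping between tiles -- exactly the pattern that defeats the prefetcher.
void untileRect(uint8_t* linearDst, const uint8_t* tiledSrc, int srcWidthInTiles,
                int x0, int y0, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            linearDst[std::size_t(y) * w + x] =
                tiledSrc[tiledOffset(x0 + x, y0 + y, srcWidthInTiles)];
}
[/code]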

Alternatively you could run the texture cookie-cutter on the GPU; that won't waste any bandwidth, but the CU doing the moving will have its shader array just sit there while you copy data around, and if your JPEG-decode-to-texture is part of a demand-loading texture pipeline, you'd have a lot of CPU overhead setting it up.

Saying it would save some CPU time, well, what game is seriously going to load 8 CPU cores fully all the time...? You gotta use those cycles for something or they're just wasted.

Seriously, are you saying developers won't know what to do with all those CPU cores ?

Cheers
 
The option not to use Huffman coding on top of LZ tells us MS values a low latency, high throughput and cheap compression/decompression method higher than ultimate compression ratio.

Is this a typo? RFC1951 clearly describes "deflate", which is Huffman on top of LZ, and no one says you actually have to send anything but literals, which makes the LZ part optional.
 
Is this a typo? RFC1951 clearly describes "deflate", which is Huffman on top of LZ, and no one says you actually have to send anything but literals, which makes the LZ part optional.

Not a typo, an oversight on my part. Re-reading the VGleaks article, it clearly states RFC1951 compliance, which is LZ77+Huffman. I only latched on to the LZ77 at first.
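For anyone who wants to poke at this on a PC: zlib implements exactly that LZ77+Huffman combination (its compress()/uncompress() wrap a raw RFC1951 deflate stream in a small RFC1950 header/trailer). A minimal round trip, nothing Durango-specific:

[code]
#include <zlib.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const char* text = "aaaaaaaaaabbbbbbbbbbaaaaaaaaaa";       // nicely compressible
    const uLong srcLen = uLong(std::strlen(text)) + 1;

    std::vector<Bytef> packed(compressBound(srcLen));
    uLongf packedLen = uLongf(packed.size());
    if (compress(packed.data(), &packedLen,
                 reinterpret_cast<const Bytef*>(text), srcLen) != Z_OK)
        return 1;

    std::vector<Bytef> restored(srcLen);
    uLongf restoredLen = uLongf(restored.size());
    if (uncompress(restored.data(), &restoredLen, packed.data(), packedLen) != Z_OK)
        return 1;

    std::printf("%lu bytes -> %lu bytes -> \"%s\"\n",
                (unsigned long)srcLen, (unsigned long)packedLen,
                reinterpret_cast<const char*>(restored.data()));
    return 0;
}
[/code]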

Cheers
 
I assume the JPEG decoder and the LZ implementation share common logic - both use canonical Huffman codes, and both are in the same DME. Probably a cheap two-in-one offer for a few cents.
 
I also don't like the magic secret sauce talk, but this could make up for a lot of the supposed performance deficiencies. Maybe even get us some nice anisotropic filtering on ground textures, which has been quite lacking in all X360 titles ;)

The raw texel rate is almost five times that of the 360. The Anisotropic filtering algorithms have also evolved enormously since 2005.

The system seems optimized for megamesh/megatexture-type rendering: hardware-assisted decompression together with the GPU using virtual address translation (which might remove the need for the indirection in the texture lookup, cutting the cost of anisotropic filtering).

Cheers
 
How effectively can an early dev kit approximate all of these custom, fixed-function pieces of silicon? Wondering if current devs can really have a clear picture of performance, for better or worse, without actually having final silicon?

(This compared to Orbis which seems (at least to me anyway) much more straightforward and better approximated with off-the-shelf PC parts.)
 
The raw texel rate is almost five times that of the 360. The Anisotropic filtering algorithms have also evolved enormously since 2005.

The system seems optimized for megamesh/megatexture-type rendering: hardware-assisted decompression together with the GPU using virtual address translation (which might remove the need for the indirection in the texture lookup, cutting the cost of anisotropic filtering).

Cheers

Are these DMEs the same thing as a zlib decoder, or do they just have the same function?
 
How effectively can an early dev kit approximate all of these custom, fixed-function pieces of silicon?

Early dev kits are probably only good for general architecture and feature testing; it's quite impossible to fine-tune performance on them.

There is a short behind-the-scenes movie on Gamespot about Spartan Ops which shows 343 building and testing the levels on a PC - but the rendering engine is quite bare-bones:
http://blogs.halowaypoint.com/post/...and-the-Future-of-Halos-Episodic-Content.aspx

I imagine current development should be something like this, with two versions of the engine in use:
- one for testing the renderer features at some pretty bad FPS
- one for building levels and gameplay, stripped of most renderer features

How they can plan for 30fps without final silicon, I have no idea.

(This compared to Orbis which seems (at least to me anyway) much more straightforward and better approximated with off-the-shelf PC parts.)

In some ways it's actually similar to PS3 vs. Xbox 360 - MS went with a then-new architecture using unified shaders and EDRAM, whereas Sony just picked a generic GPU.
 