Xbox One (Durango) Technical hardware investigation

Discussion in 'Console Technology' started by Love_In_Rio, Jan 21, 2013.

Thread Status:
Not open for further replies.
  1. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
No, this is wrong. You're implying that HSA makes a system so much more efficient that a 5W HSA system can outperform an 80W non-HSA system. But your reasoning is totally flawed. Dirt is not built for HSA and therefore would take no advantage of the improved communication between CPU and GPU in Temash. It runs so well purely because Temash is pretty powerful in the traditional sense of a decent CPU and a decent GPU. For the record, it doesn't even run all that well; the framerate is clearly in the 20s at best, probably in the teens.

    If what you were saying about HSA automatically making systems so much more powerful/efficient were true then why isn't Trinity displaying mind blowing performance? It may not be as fully HSA as Kaveri but it's certainly a lot more unified than discrete systems.

    Sure, coding specifically for HSA will allow new approaches to certain operations to be exploited. Probably most significantly, operations that would previously have had to be performed on the CPU but which lend themselves well to SIMD execution (like physics). So the biggest impact will likely be in areas that have traditionally been seen as CPU limited. It's not going to suddenly make a 1.2 TFLOP GPU behave like a 2 TFLOP GPU. It might make a 100 GFLOP CPU behave like a 500 GFLOP CPU though - at the expense of graphics performance.
     
  2. inefficient

    Veteran

    Joined:
    May 5, 2004
    Messages:
    2,121
    Likes Received:
    53
    Location:
    Tokyo

    Only 2 of the 4 DME units have compress/decompress hardware if I understand correctly.

    My impression was that it would be used more for copying buffers between the edram and ddr3 than for more efficient texturing.
     
  3. Nisaaru

    Regular

    Joined:
    Jan 19, 2013
    Messages:
    867
    Likes Received:
    195
    The 102GB/s eSRAM bandwidth number is derived from the 800MHz GPU clock and a 1024-bit (128-byte) connection. But if it's part of the APU, why shouldn't the limiting factor be the internal GPU data bus (around 5Gbps at 800MHz)? The internal data bus is designed to handle 384-bit GDDR5 bandwidth.

    And if it's an external connection on an MCM, there's also DDR/QDR SRAM. The bandwidth difference between 68GB/s and 102GB/s, even with lower latency in an L1/L2 infrastructure, doesn't look big enough to me for the effort.
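    The quoted 102GB/s figure is easy to sanity-check; a minimal back-of-envelope sketch, using only the clock and bus width given above:

    ```python
    # Sanity check of the quoted eSRAM figure: a 1024-bit (128-byte) port
    # moving one transfer per cycle at the 800 MHz GPU clock.
    clock_hz = 800e6        # 800 MHz
    port_bytes = 1024 // 8  # 1024-bit bus = 128 bytes per transfer

    print(f"eSRAM peak: {clock_hz * port_bytes / 1e9:.1f} GB/s")  # 102.4 GB/s
    ```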
     
  4. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,544
    Likes Received:
    10
    Location:
    In the land of the drop bears
    They are also pretty slow.

    Around 400MB/s for both combined (JPEG + LZ); IIRC that works out to a couple of MB per frame at most.
     
  5. expletive

    Veteran

    Joined:
    Jun 4, 2005
    Messages:
    3,583
    Likes Received:
    59
    Location:
    Bridgewater, NJ
    On page 31 he talks about texture compression formats and techniques. Can anyone with more technical understanding connect the dots with his findings and the leak today with the DMEs?
     
  6. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    541
    Likes Received:
    170
    And only one of them can do jpegs, which are much more relevant for natural-looking textures than zip. However, just that one is easily enough for virtual texturing, as you don't recycle your entire atlas every frame.

    Virtual texturing is a technique that (among other things) exploits the massive efficiency difference between texture compression and traditional image compression. Textures need to be addressable at arbitrary offsets -- you need to be able to say "I want this texel here" without having to process the entire texture on every access. Most images don't have this requirement; instead, this :shock: little smiley icon is processed as a whole to be drawn. This means that jpeg/lz77/whatever compression can achieve much better ratios than texture compression.

    So, in virtual texturing you have a small atlas that has ready-to-use textures, and keep most of your textures compressed with jpeg or the like. As you move about in the game world, the game tries to keep the atlas filled with textures that are relevant to your surroundings, uncompressing the jpeg textures as you walk closer to them and placing them resident in the texture cache. For a good example of this in use, check out Chivalry. Whenever you spawn after being away from your spawn, everything around you looks fuzzy for a second. Then it snaps to sharp and stays that way, regardless of where you walk or what you do. So long as you are not unpredictably teleporting around, the VT scheme can reduce texture memory use and allow for much higher amount of texture detail by keeping only nearby surfaces ready to use at high detail.

    That was a mile-high overview with obvious gaps -- if you want better detail on virtual texturing, ask Sebbbi. I haven't actually ever implemented a game engine that uses it.

    The reason the DMEs sound like they were made for virtual texturing is that one important part of it is turning those small chunks of JPEG into textures on demand, with low latency and ideally without consuming a lot of system resources. The pop-in problems widely reported in Rage were there because the latency of conversion was too high. The decompressing DME seems like an ideal fit for this -- it's basically a plug-in solution. You don't even need to use any GPU or CPU resources for it, beyond figuring out which of the textures you want to decompress this frame.
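    The residency logic described above can be sketched as a tiny LRU page cache. This is a toy illustration only -- the `VirtualTextureCache` name and its helpers are made up, and a real engine tracks thousands of atlas pages with GPU feedback deciding what to fetch:

    ```python
    from collections import OrderedDict

    ATLAS_CAPACITY = 4  # toy size; real atlases hold thousands of pages

    class VirtualTextureCache:
        """Keeps the few 'nearby' texture pages decoded and resident."""
        def __init__(self, capacity=ATLAS_CAPACITY):
            self.capacity = capacity
            self.resident = OrderedDict()  # page_id -> decoded texels

        def fetch(self, page_id):
            """Return a resident page, decoding on a miss (the step a
            decompression DME could take off the CPU/GPU)."""
            if page_id in self.resident:
                self.resident.move_to_end(page_id)  # mark most-recently-used
                return self.resident[page_id]
            texels = self._decode(page_id)          # stand-in for JPEG decode
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)   # evict least-recently-used
            self.resident[page_id] = texels
            return texels

        def _decode(self, page_id):
            return f"texels-of-{page_id}"  # placeholder for the real decoder

    cache = VirtualTextureCache()
    for page in ["a", "b", "a", "c", "d", "e"]:  # "a" stays hot, "b" is evicted
        cache.fetch(page)
    print(list(cache.resident))  # ['a', 'c', 'd', 'e']
    ```

    The "fuzzy for a second" effect described above is exactly the miss path here: the page is on screen before `_decode` has finished, so a low-detail fallback is shown until the decoded texels land in the atlas.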
     
  7. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,051
    Likes Received:
    5,002
    Blast, I originally replied to this in the wrong thread. Moving it here, but keeping the original quote that I quoted. :p

    Some of that is pretty interesting. I wasn't expecting 2 of the DME's to be different in nature.

    Depending on whether data is compressible/decompressible using LZ77 (apparently many EA games use this), that could either increase effective bandwidth to the CPU/GPU and/or decrease the CPU/GPU load required to decompress data stored in LZ77 format. The CPU/GPU load could potentially be significant, as the JPEG XR decode for Rage's assets could be quite slow without GPGPU assistance, and even then could be quite slow if not enough GPGPU resources were available. I'm not sure how resource intensive decompression of LZ77 is compared to JPEG XR, however. I'm going to guess less resource intensive, but I'm not sure by how much.

    [strike]It also implies that the method they go with can result in up to an 8:1 compression ratio, which effectively means up to an 8x increase in effective bandwidth (25.6 GB/s for one DME, up to ~200 GB/s data throughput after compression/decompression). Without using the DME, you'd have to use CPU or GPU resources for this. So that effectively makes it "free" on Durango as long as the compression used (if pre-shipped with compressed assets) is that which the DME supports. Which in the case of console titles for Durango, will most likely be the case.

    BTW - before fans of one console or another jump on this: that isn't a magic bullet. If the data is highly compressible, that can result in more effective bandwidth. On the other hand, if it is already highly compressed using LZ77, that just reduces the potential CPU/GPU load. If it uses some other form of compression and thus isn't further compressible, then the benefits are significantly lower. However, developers targeting a game at Durango will likely try to take advantage of this.

    A fan of Durango might point out that this provides a theoretical 350+ GB/s bandwidth if all the stars aligned (accessing DDR3, ESRAM, and highly compressible data simultaneously), but that's only a wet dream. It's not going to happen, not even close. Non-fans of Durango might point out that it has no benefit if everything is highly compressed in something other than LZ77 or the supported JPEG formats. That also isn't going to happen. The benefit lies somewhere in between.[/strike]

    Somewhat disappointed that the JPEG decoder doesn't support more advanced JPEG formats such as JPEG XR. But it's another bandwidth reducing and/or CPU/GPU resource reducing function.

    The rest is pretty standard, though the tile/untile is potentially interesting.

    If tiling in small chunks then you don't need the use of all the bandwidth. And as mentioned if they did use all the bandwidth then other parts of the system become bandwidth starved and at that point what's the point of having them? These are meant to do their function when the bandwidth isn't being fully utilized or when it could use a part of it better than whatever else is accessing it.

    Increased compression complexity also increases complexity of the silicon used to decode it. I'm willing to bet that LZ77 was chosen as a good compromise between efficiency and cost of implementation.

    In other words, how much bloat do you want to add to the SOC for how much benefit? If it requires 4x the silicon space for 1.5x the speed, is it worth it?

    But the whole point of this is to free up GPU resources. Why have the GPU do it if you can have something else do it while the GPU goes along with the rendering tasks and fetching what it needs from ESRAM when possible.

    As well with the added functionality in 2 of the DMEs that allows for things to be done which would require GPU resources in the form of compute resources or CPU resources. Again things that could be better used for running the game than compressing/decompressing data.

    Regards,
    SB
     
  8. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,544
    Likes Received:
    10
    Location:
    In the land of the drop bears

    The DMEs don't run at full speed when using the compression.

    It's a roughly combined rate of 400-450MB/s when using both the JPEG and LZ decompression; it's nowhere near the peak rate of 25GB/s.
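    Taking the quoted ~400MB/s combined rate at face value, the per-frame budget is easy to work out (a rough sketch; the leak doesn't make clear whether the rate counts input or output bytes):

    ```python
    # Per-frame decompression budget at the quoted ~400 MB/s combined
    # JPEG+LZ rate (assumption: the figure is sustained, not burst).
    combined_rate_mb_s = 400.0

    for fps in (30, 60):
        print(f"{fps} fps: ~{combined_rate_mb_s / fps:.1f} MB per frame")
    ```

    That's a handful of MB of freshly decompressed data per frame at 60 fps -- in line with the "couple of MB per frame" ballpark above, and plenty for trickling virtual-texture pages in, but nothing like a general-purpose bandwidth multiplier.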
     
  9. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,051
    Likes Received:
    5,002
    Oh bloody hell, you're right. I misread the VGleaks article.

    Regards,
    SB
     
  10. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    13,221
    Likes Received:
    3,654
    Sucks that most of the posters who have implemented virtual texturing, and who could give some hints as to how the DMEs might help with that, are probably under NDA and won't comment on this. Looking at Sebbbi's post on virtual texturing, they used over 50MB of RAM for their implementation in Trials (at 720p), and I'm unsure which parts would benefit the most from being in ESRAM.
     
  11. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    858
    Likes Received:
    260
    LZ77 is quite sub-optimal for the kind of data processed by GPUs. What I see as a possibility is, for example, to write a massively parallel predictive image compressor using the hardware Huffman coder required by RFC1951; I have one implemented here which would only require minor changes to actually operate that way. It would have to be two-pass -- or one-pass if the DPCM is directly in the data consumer, without storing the residuals -- but it would gain ~20% over LZ77. It may be possible to misuse the <length,offset> pair to duct-tape a run-length coder (if the sliding window can overlap with the current position), which might make it possible to evade the 1 bit/symbol limit of Huffman coding and, under optimal circumstances, get within 2% of an arithmetic coder. The 1k/4k window limitation of the DME makes LZ77 pretty much a joke for offline data.

    The only thing useful is the encoder, if it runs at the promised speed. I guess the 1k/4k limit is related to the internal coder's block cache avoiding chip-external bandwidth. LZ77 is asymmetric, balancing the computational burden towards the encoder; an efficient encoder consumes heaps of memory accesses.
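    The window-size point is easy to demonstrate with stock zlib: deflate can't reference matches farther back than its sliding window, so redundancy at longer range is simply lost. A small experiment (Python's zlib stands in for the DME's coder here; wbits of 10, 12 and 15 give 1 KB, 4 KB and 32 KB windows, the first two matching the limits quoted above):

    ```python
    import os
    import zlib

    # Data whose ONLY redundancy sits 8 KB apart: a random (incompressible)
    # 8 KB block repeated four times. A window smaller than 8 KB can never
    # see the repeats, so the stream stays essentially uncompressed.
    block = os.urandom(8 * 1024)
    data = block * 4

    for wbits, window in ((10, "1 KB"), (12, "4 KB"), (15, "32 KB")):
        c = zlib.compressobj(9, zlib.DEFLATED, wbits)
        out = c.compress(data) + c.flush()
        print(f"{window:>5} window: {len(data)} -> {len(out)} bytes")
    ```

    With the 32 KB window the three repeats collapse to matches (roughly 8 KB total output); with the 1 KB or 4 KB windows the stream stays at essentially full size, which is the "joke for offline data" point above.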
     
  12. Nevod

    Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    18
    Likes Received:
    0
    Location:
    Krasnoyarsk, RF
    The Durango actually makes quite a lot of sense, and indeed looks as if it is optimised for tiling multi-pass renderers. Which is not a problem, as that's a clear trend in modern graphics engines and will likely stay around for quite some time.

    The DMEs are not standard DMA units; they also offer conversion between tiled and linear memory layouts, and that does increase efficiency. Indeed, the GPU can copy between ESRAM and RAM on its own using its DMA engines, but AFAIK those are only one-way fetching, so it would have to set up its DMA for every copy - i.e. copy a chunk from RAM to cache, then set it up again to copy from cache to ESRAM - and that would have to be repeated for every line of a texture if it's being tiled, while you can set up a DME once and have everything copied without ever stopping.

    Multi-pass rendering would hide the relatively low speed of the DMEs - they would be able to load a new chunk of data into ESRAM while the previous one is being processed. The actual graphics processing's memory access happens at 51GB/s each way (remember, you have to read AND write), and has to happen several times over; in the meanwhile data is streamed by the DMEs at 25GB/s (one "transaction" only works in one direction).

    Essentially, all the GPU should ever have to access is ESRAM. Its lower latency should also contribute to speed in multi-pass renderers, though I don't think it would be a significant change - GPU architectures are designed to hide latencies, after all.

    The LZ77 and JPEG units' usefulness is questionable to me, as their throughput isn't high. I'd even think that the CPU would be better at that, but as there are mentions that Rage had latency problems decoding its megatexture, they are probably useful. Still, as the JPEG decoder apparently doesn't support decoding into DXTC, only into a bitmap, the CPU would have to be used anyway. LZ decompression could be used to unpack streamed geometry for a megamesh system. I don't have ideas on how to use the compressor, though.
     
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    The Amiga is a great example of data starvation. The blitter either used all the chip memory bus cycles (blitter nasty mode), or it would use three cycles, then yield one. The latter mode was avoided when possible because it took an extra bus cycle for the blitter to yield the bus, so you only used 4 out of 5 bus cycles (3 for the blitter, one for the M68K, one wasted)

    If the CPU and GPU aren't using much memory bandwidth, odds are that not a lot of data needs to be moved around.

    Ogg is just a container format, and only for media.

    Your choice basically boils down to Lempel-Ziv variants, with or without Huffman symbol coding on top, or block-sorting compression. The former is used in gzip, rar and the most common form of zip compression; the latter is used in bzip2. The latter has better compression ratios, but a much bigger memory footprint and bandwidth requirements.

    The option not to use Huffman coding on top of LZ tells us MS values a low latency, high throughput and cheap compression/decompression method higher than ultimate compression ratio.

    A perfectly valid choice
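    The trade-off is easy to see with the stock Python bindings for the two families (zlib for LZ+Huffman, bz2 for block sorting); the sample data and resulting numbers here are purely illustrative:

    ```python
    import bz2
    import zlib

    # Synthetic, repetitive "asset-like" text data (made up for illustration).
    data = b"".join(b"vertex %d 0.0 1.0 0.5\n" % i for i in range(20000))

    lz = zlib.compress(data, 9)   # LZ77 + Huffman (deflate)
    bwt = bz2.compress(data, 9)   # Burrows-Wheeler block sorting + Huffman
    print(f"raw {len(data)}, zlib {len(lz)}, bz2 {len(bwt)}")
    ```

    bzip2 typically squeezes out the better ratio, but it needs far more working memory and memory traffic per block - exactly the cost a cheap fixed-function decoder wants to avoid, which supports the "perfectly valid choice" reading above.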

    It's all about bus utilization.

    Imagine a CPU doing the swizzling, loading and storing data. Do you use temporal or non-temporal memory ops? Either way the CPU quickly issues a series of loads and stores, then it stalls waiting for data.

    Some of the accesses are adjacent so the prefetcher fires up, this helps with subsequent adjacent loads, - good. But because of the swizzling and boundaries (remember we can copy to and from subregions of textures) the next load is somewhere completely different and again we have a stall. The prefetcher has already fetched data ahead of the first series of loads, wasting bandwidth.

    So we waste expensive silicon (our CPU core) moving data around, wasting bandwidth doing so. We're not talking a few percent here, more like 25-50%.
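    To see why the access pattern is so prefetcher-hostile, consider Morton (Z-order) swizzling, one common tiling scheme (a sketch only -- Durango's actual tiling format isn't public): walking a single linear row of the image touches widely scattered tiled offsets.

    ```python
    def morton_2d(x, y):
        """Interleave the bits of x and y into a single Morton (Z-order)
        index, a common way GPUs tile textures for 2D locality."""
        result = 0
        for bit in range(16):
            result |= ((x >> bit) & 1) << (2 * bit)       # x bits -> even
            result |= ((y >> bit) & 1) << (2 * bit + 1)   # y bits -> odd
        return result

    # One linear row of texels maps to hops all over the tiled layout:
    row_offsets = [morton_2d(x, 1) for x in range(8)]
    print(row_offsets)  # [2, 3, 6, 7, 18, 19, 22, 23]
    ```

    A CPU copying this row strides 2, 3, then jumps to 6, then 18... so the hardware prefetcher keeps guessing wrong, which is where the wasted-bandwidth estimate above comes from. A DME with the tiling logic baked in can walk both layouts in their natural order.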

    Alternatively you could run the texture cookie-cutter on the GPU, that won't waste any bandwidth, but the CU doing the moving will have its shader array just sit there while you copy data around and if your jpeg-decode-to-texture is part of a demand-loading texture pipeline, you'd have a lot of CPU overhead setting it up.

    Seriously, are you saying developers won't know what to do with all those CPU cores ?

    Cheers
     
    #873 Gubbi, Feb 7, 2013
    Last edited by a moderator: Feb 7, 2013
  14. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    858
    Likes Received:
    260
    Is this a typo? RFC1951 clearly describes "deflate", which is Huffman on top of LZ, and no one says you actually have to send anything but literals, which makes the LZ part optional.
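    For what it's worth, stock zlib can demonstrate the literals-only case: its Z_HUFFMAN_ONLY strategy produces a conforming deflate stream that never emits a <length,offset> pair. A quick sketch, using raw RFC1951 streams via negative wbits:

    ```python
    import zlib

    data = b"the quick brown fox jumps over the lazy dog " * 200

    full = zlib.compressobj(9, zlib.DEFLATED, -15)  # normal LZ77+Huffman
    huff = zlib.compressobj(9, zlib.DEFLATED, -15, 9, zlib.Z_HUFFMAN_ONLY)

    lz_size = len(full.compress(data) + full.flush())
    lit_size = len(huff.compress(data) + huff.flush())
    print(f"raw {len(data)}, deflate {lz_size}, literals-only {lit_size}")
    ```

    The literals-only stream still shrinks the data (Huffman alone beats 8 bits per byte on text), but it is far larger than full deflate, which finds the 45-byte repeats.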
     
  15. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    Not a typo, an oversight on my part. Re-reading the VGleaks article, it clearly states RFC1951 compliance, which is LZ77+Huffman. I only latched onto the LZ77 at first.

    Cheers
     
  16. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    858
    Likes Received:
    260
    I assume the JPEG decoder and the LZ implementation share common logic; both use canonical Huffman codes, and both are in the same DME. Probably a cheap two-in-one offer for a few cents.
     
  17. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    The raw texel rate is almost five times that of the 360, and anisotropic filtering algorithms have also evolved enormously since 2005.

    The system seems optimized for megamesh/megatexture type rendering: the hardware-assisted decompression features work together with the GPU's virtual address translation (which might remove the need for indirection in texture lookup, cutting the cost of anisotropic filtering).

    Cheers
     
  18. expletive

    Veteran

    Joined:
    Jun 4, 2005
    Messages:
    3,583
    Likes Received:
    59
    Location:
    Bridgewater, NJ
    How effectively can an early dev kit approximate all of these custom, fixed-function pieces of silicon? I'm wondering if current devs can really have a clear picture of performance, for better or worse, without actually having final silicon.

    (This compared to Orbis, which seems (at least to me anyway) much more straightforward and better approximated with off-the-shelf PC parts.)
     
  19. scently

    Regular

    Joined:
    Jun 12, 2008
    Messages:
    926
    Likes Received:
    81
    Are these DMEs the same thing or do they have the same function as a zlib decoder?
     
  20. Laa-Yosh

    Laa-Yosh I can has custom title?
    Legend Subscriber

    Joined:
    Feb 12, 2002
    Messages:
    9,568
    Likes Received:
    1,452
    Location:
    Budapest, Hungary
    Early dev kits are probably only good for general architecture and feature testing; it is quite impossible to fine-tune performance at all.

    There is a short behind-the-scenes movie on Gamespot about Spartan Ops, which shows 343 building and testing the levels on a PC - but the rendering engine is quite bare bones:
    http://blogs.halowaypoint.com/post/...and-the-Future-of-Halos-Episodic-Content.aspx

    I imagine current development should be something like this, with two versions of the engine used:
    - one for testing the renderer features at some pretty bad FPS
    - one for building levels and gameplay, stripped of most renderer features

    How they can plan for 30fps without final silicon, I have no idea.

    In some ways it's actually similar to PS3 vs. Xbox360 - MS went with a then new architecture using unified shaders and EDRAM, whereas Sony just picked a generic GPU.
     