Next Generation Hardware Speculation with a Technical Spin [pre E3 2019]

Discussion in 'Console Technology' started by TheAlSpark, Dec 31, 2018.

Thread Status:
Not open for further replies.
  1. vipa899

    Regular

    Joined:
    Mar 31, 2017
    Messages:
    922
    Likes Received:
    354
    Location:
    Sweden
    Was thinking: if a 2013 Killzone uses 3GB of VRAM, then how much would GoW or Spider-Man use?

    Isn't that how HZD is done, or at least partially?

    Seems like a lot, but then we're two years away.
     
  2. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,516
    Likes Received:
    24,424
    My initial knee-jerk numbers are 20 GB:
    • 4 GB OS (Includes the Streaming Buffers)
    • 16 GB Game
    That's a 3x improvement over the PS4's 5GB for games.
     
    milk, Shoujoboy, pharma and 2 others like this.
  3. Tkumpathenurpahl

    Tkumpathenurpahl Oil Monsieur Geezer
    Veteran

    Joined:
    Apr 3, 2016
    Messages:
    1,910
    Likes Received:
    1,929
    Killzone SF :p

    As Shifty's pointed out, it's really hard to guess. The X1X is capable of delivering native 4K games with higher resolution textures using 12GB of GDDR5. So that, coupled with marketing requirements, indicates that slightly more than 12GB is the bare minimum.

    Should the next-gen consoles contain an NVME SSD, it could fill up 16GB of memory in 1 second. As Shifty said, the amount of memory required for buffering could reduce substantially.

    Should they contain a secondary pool of memory for UI and apps, that will increase the amount of main memory accessible to games. For example, let's say they cut OS requirements (and I mean OS, not UI) down to 1GB, whilst increasing main memory to 16GB. Although only double the size, we'd see nearly triple the capacity.

    Bandwidth seems to be the greatest concern, and 24GB of GDDR6 on a 384bit bus is the best, cheapest way of meeting that requisite bandwidth. Let's be kind of conservative, and assume 1.5GB needs to be reserved for the OS - apps and UI dwelling in secondary memory. That's four times the PS4's available capacity, and approximately four times the bandwidth, depending on the speed of GDDR6.

    What I'd love to see though - and I know this is highly unlikely - is two stacks of HBM3, totalling 24GB and 1TB/s. Plenty of bandwidth and enough capacity to last a good 8-10 years. Also, it'd use less power than the same capacity GDDR6, leaving power budget to be spent on higher clocks for the GPU/CPU.

    There's a veeeeeeery slender chance we'll get that if there's any truth to the claim that HBM is the future because GDDR is beginning to hit its limits. I know nothing about the veracity of that claim.
     
    vipa899 likes this.
  4. anexanhume

    Veteran

    Joined:
    Dec 5, 2011
    Messages:
    2,078
    Likes Received:
    1,535
    We haven’t seen anyone talk about HBM3 in quite a while. It’s likely too far off, and can’t hit the volumes needed for consoles.

    They could do 12Gb chips and get 18GB of GDDR6 on a 384-bit bus.
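    The capacity arithmetic behind that option can be sketched like this — a minimal illustration assuming a 32-bit interface per GDDR6 chip (function name and layout are mine, not from the post):

    ```python
    def gddr_capacity_gb(bus_width_bits, chip_density_gbit, clamshell=False):
        """Total capacity in GB. Each GDDR chip has a 32-bit interface, so a
        384-bit bus hosts 12 chips; clamshell mode pairs two chips per 32-bit
        channel, doubling capacity at the same bus width."""
        chips = bus_width_bits // 32 * (2 if clamshell else 1)
        return chips * chip_density_gbit / 8  # Gbit to GB

    print(gddr_capacity_gb(384, 12))                  # 12 chips of 12Gb -> 18.0 GB
    print(gddr_capacity_gb(256, 16, clamshell=True))  # 16 chips of 16Gb -> 32.0 GB
    ```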
     
  5. Tkumpathenurpahl

    Tkumpathenurpahl Oil Monsieur Geezer
    Veteran

    Joined:
    Apr 3, 2016
    Messages:
    1,910
    Likes Received:
    1,929
    Yeah, I think you're 99.999999% correct. But I'm hopeful that there'll be some unexpectedly positive news at CES or its ilk. Is there an equivalent event dedicated to memory?

    I was really expecting more news regarding it this year, with 7nm ramping up, but news about it has been weirdly absent.

    Cool. What bandwidth would we be looking at? And what's the formula to calculate that please?
     
  6. anexanhume

    Veteran

    Joined:
    Dec 5, 2011
    Messages:
    2,078
    Likes Received:
    1,535
    Number of chips times 32 pins per chip times the chips’ transfer speed divided by 8 to convert bits to bytes.

    Example: 14Gbps chips on a 256 bit bus (8 chips) is 448GB/s.
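    That formula drops straight into code — a small sketch (helper name is mine, purely illustrative):

    ```python
    def gddr_bandwidth_gb_s(chips, pin_speed_gbps):
        """Peak bandwidth in GB/s: chips x 32 data pins per chip x per-pin
        transfer rate in Gbps, divided by 8 to convert bits to bytes."""
        return chips * 32 * pin_speed_gbps / 8

    # 8 chips (256-bit bus) at 14 Gbps per pin
    print(gddr_bandwidth_gb_s(8, 14))   # → 448.0 GB/s
    # 12 chips (384-bit bus) at 12 Gbps per pin
    print(gddr_bandwidth_gb_s(12, 12))  # → 576.0 GB/s
    ```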
     
    Tkumpathenurpahl likes this.
  7. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    If we assume 10-12TF is the most reasonable target, would 576GB/s be enough? They can choose the least expensive among 16/18Gbps 256-bit, 12/14Gbps 384-bit, or two stacks of HBM2/HBM2E. Every two years the memory manufacturers have a new node which allows faster bins, making the previous top-end bin speeds ripple down into mainstream volume.

    GDDR6 16Gbps 256-bit would be a good choice for a conservative 10TF on 512GB/s. Add next-gen features like FP16 (20TF) plus AI and RT (40TF), combine that with a much faster CPU, and it's a nice next gen which can have its GPU doubled at mid-gen.

    Four HBM2 stacks are out of the question, but two stacks are not impossible. It might fall to a reasonable cost in the next two years. It's the right bandwidth and would help a lot with power issues. And it will get a speed bump too, so 307GB/s per stack won't be the top speed anymore in 2020; it should be a midrange bin.

    Then we have HBM3 which should fill those requirements with a single stack. If the major cost of HBM is integration, not the memory dies themselves, this option could be very low cost. Don't know if that can be ready for 2020 (rumored to be planned for 2019/2020 but that doesn't mean mainstream volume).
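    The candidate configs above can be compared with the same pins x speed / 8 formula — numbers taken from the post, layout mine (HBM2 assumed at a 1024-bit interface per stack):

    ```python
    # Peak bandwidth for each candidate memory config, in GB/s.
    # GDDR6: bus width x per-pin Gbps / 8. HBM2: 1024 bits per stack.
    configs = {
        "GDDR6 16Gbps, 256-bit":    256 * 16 / 8,        # 512.0
        "GDDR6 12Gbps, 384-bit":    384 * 12 / 8,        # 576.0
        "GDDR6 14Gbps, 384-bit":    384 * 14 / 8,        # 672.0
        "2x HBM2 stacks @ 2.4Gbps": 2 * 1024 * 2.4 / 8,  # ~614.4 (307.2 per stack)
    }
    for name, bw in configs.items():
        print(f"{name}: {bw:.1f} GB/s")
    ```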
     
  8. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
    If they go back to split memory pools, they'll need more total memory, probably 16GB of RAM and 8GB of VRAM. That's still enough for even the most high-end PC games right now. They might be able to get away with less if it's a single pool.
     
  9. DieH@rd

    Legend

    Joined:
    Sep 20, 2006
    Messages:
    6,387
    Likes Received:
    2,411
    HBM would also save dozens of watts.
     
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Enough for what? That's the same question as how much RAM is enough, which is the same as how cheap does the RAM need to be. ;)

    Given x amount of bandwidth and y amount of compute, devs will favour techniques that prefer bandwidth if there's an excess, or processing if BW is limited. The only case BW becomes a real issue is when one platform has significantly less than the others (including PC targets) and cross-platforms that aren't optimised for it tank the framerate (or simplistically reduce quality) to fit.

    I suppose the real question you're posing is what's the typical BW per flop needed in current and predicted future GPU workloads. Someone might have typical data for BW-limited games, but future requirements are nigh impossible to pin down if we don't know what the hardware will be doing. If raytracing is a target, would a large L3 cache on the GPU, big enough to fit a useful data structure (a BVH tree), be useful? Or will we forever be thrashing main RAM and need as much main RAM BW as possible? Are we going to start to see materials computed in realtime instead of reading loads of 4K textures?
     
    anexanhume and TheAlSpark like this.
  11. anexanhume

    Veteran

    Joined:
    Dec 5, 2011
    Messages:
    2,078
    Likes Received:
    1,535
    It’s not just about cost.

    https://www.extremetech.com/computi...double-hbm2-manufacturing-fail-to-meet-demand

    And yes, HBM burns less power, but the advantage of GDDR6 is that your power is mostly burned off-package. Burning 20W more isn't that costly system-wise (perhaps a 10% system bump), but with HBM you have to cool the die plus the memory stacks in one package instead of one or two spread-out hot spots, which makes your cooling system more complex. You also potentially save the cost of an interposer.
     
    Tkumpathenurpahl likes this.
  12. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    Hmm... possibly. Maybe someone else with more hardware knowledge could provide some insight into memory bandwidth usage when running a model. I know training takes a megaton.

    We can use AI NNs to generate textures now. It's a long way out because I don't know what the workflow would look like, but in theory, if they trained models on all their textures in advance, the textures could be generated and re-generated per LOD parameters, and maybe even reach super-close texture detail.

    If that's to be mainstream, alongside DLSS and AI denoising, and now AI NN texture generation, the desire for tensor hardware seems to go up even more. I'll take 100+ TF of tensor flops over standard compute power.
     
    vipa899 likes this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My understanding of Samsung's discussion at the time was a low-cost HBM standard that traded some of the bandwidth, power efficiency, and features of HBM away for cheaper and easier manufacturing and integration. That cost-reduced HBM was something Samsung said it was shopping around to see if there was any interest. I have not seen any announcement or discussion of that idea since.

    Hynix mentioned some aspirations for HBM3 (the slide I saw said HBMx, however). Hynix essentially hoped that it should be faster, cheaper, and more broadly adopted without going into any detail as to what new elements would go into the standard, or how far along it was at the time.

    There's something of a marketing conundrum even in calling it HBM3. From the JEDEC standard, for which there is a standard for each numbered memory type, there is only one HBM standard. HBM2 was the marketing name applied to the finalized variant of the standard. HBM was the preliminary version that was used in AMD's Fury products and seemingly nowhere else. HBM2 as we know it is what happened after various blank spaces in the JEDEC standard were filled in and tentative features were finalized/deleted. Unless I missed a separate HBM2 JEDEC standard being copy-pasted from the 2016 finalized revision, I'm not sure if the various manufacturers would keep an implicit +1 in all their marketing, or JEDEC could be compelled to skip 2 and go straight to 3.

    The individual channels split their data lines between the two clamshell chips, 8 to each. GDDR5 does the same; it just starts with twice as many data pins before splitting them between chips.

    I'm aware of games that used procedural generation at level load time or on a streaming basis as new assets were loaded. That allowed for a much smaller footprint in the game's on-disk storage and helped alleviate the SATA bus bottleneck. I'm not aware of games using it on a per-frame basis to save RAM capacity. The concept trades a potentially significant amount of computation to produce an asset, and it's a serial component that is coalesced into loading times or is given a fraction of the frame budget for new objects entering the margins of what is viewable. What is already loaded or generated is re-used for many frames, and there's not enough storage on-die to make the storage pool anywhere but RAM.
    It would seemingly require a limited amount of transformation from input to output to fit the generation process into every frame's usage of an asset.

    Some of Nvidia's early research into BVH acceleration and adapting GPU execution to ray-tracing had test scenes that could take the memory footprint into the tens to hundreds of MB, which on-die storage is unlikely to scale to.
    The examples I found in https://users.aalto.fi/~ailat1/publications/aila2010hpg_paper.pdf were also from 2010, to give an idea of where the contemporary levels of complexity were at the time those figures were given.
     
    egoless, milk, mrcorbo and 3 others like this.
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    You've got neural-net-itis! There are many ways to create textures procedurally. Perhaps the most straightforward is to execute the actions of the artists, so compile and execute a Substance material in realtime rather than baking it. See the classic .kkrieger FPS in 96kB.
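    A toy sketch of that idea — composing a texture from a few procedural layers at runtime instead of storing baked texels (pattern, parameters, and function name are all mine, purely illustrative):

    ```python
    import math

    def procedural_brick(width, height, brick_w=8, brick_h=4, mortar=1):
        """Generate a grayscale brick-pattern texture at runtime instead of
        loading it from disk (the .kkrieger approach, greatly simplified).
        Returns a row-major list of floats in [0, 1]."""
        texels = []
        for y in range(height):
            # Stagger alternate rows of bricks by half a brick width.
            row_offset = (y // brick_h % 2) * (brick_w // 2)
            for x in range(width):
                bx = (x + row_offset) % brick_w
                by = y % brick_h
                if bx < mortar or by < mortar:
                    texels.append(0.2)  # dark mortar lines
                else:
                    # Cheap per-texel variation standing in for a noise pass.
                    texels.append(0.6 + 0.2 * math.sin(x * 12.9898 + y * 78.233))
        return texels

    tex = procedural_brick(64, 64)  # 4K texels from ~20 lines, zero storage
    ```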

    It's a mostly hypothetical statement illustrating the choices engineers face.

    That's what I imagined, but could a smaller subset of the top of the tree be stored for faster sorting, going to RAM only when you need to descend into the lower levels? Potentially, keep a permanent top-level map plus a cache of a smaller lower level to load in necessary spaces? I suppose that only works with convergent rays, so reflections. Scattered light traces absolutely anywhere.

    So for RT next-gen, BW is going to be a premium?
     
  15. Tkumpathenurpahl

    Tkumpathenurpahl Oil Monsieur Geezer
    Veteran

    Joined:
    Apr 3, 2016
    Messages:
    1,910
    Likes Received:
    1,929
    Interesting. And also, it looks like my 0.000001% chance of HBM appearing in a next-gen console has just evaporated :frown:
     
  16. cheapchips

    Veteran

    Joined:
    Feb 23, 2013
    Messages:
    2,493
    Likes Received:
    2,665
    Location:
    UK
    You could reinstate that tiny % chance by assuming that the reason supply is constrained is because one of the console makers has secured a large chunk of Samsung's output. ;-)
     
  17. Globalisateur

    Globalisateur Globby
    Veteran Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,592
    Likes Received:
    3,412
    Location:
    France
    It just confirms for me that they are going to use 32GB in clamshell mode with either 18Gbps or 20Gbps GDDR6 chips (IMO). The two channels per chip will reduce the memory contention problems we have on all current-gen consoles, and bandwidth will be sufficient.
     
  18. anexanhume

    Veteran

    Joined:
    Dec 5, 2011
    Messages:
    2,078
    Likes Received:
    1,535
    At this point, HBM3 is essentially our best hope to deliver on the premise of LCHBM, because LCHBM itself has evaporated. Customers want high bandwidth and can pay for it, so LCHBM and short-stack HBM have been de-prioritized.

    https://translate.google.de/translate?sl=auto&tl=en&js=y&prev=_t&hl=de&ie=UTF-8&u=https://pc.watch.impress.co.jp/docs/column/kaigai/1112390.html&edit-text=

    Enter STT-MRAM. It's how we'll get more density than SRAM with near-SRAM level performance.

    Thanks. I ended up adding this to my first revision because I thought it was relevant enough to consider, especially since HBM was considered on X1X but decided against with access granularity being one of the drawbacks mentioned.
     
    #98 anexanhume, Jan 6, 2019
    Last edited: Jan 6, 2019
  19. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    :oops:
    A symptom of now working in the field; it tends to dominate my thought process as of late.
    But honestly, it's really effective stuff for tackling a variety of derivatives and variations, something hard-coded algorithms struggle with a bit more, especially when we get into content creation.

    Agreed, but the computational or labour resources may be significantly higher there. I prefer the NN approach ;) it seems to scale better in AAA environments, as they have the resources to purchase and make full use of data sets and hardware for games that require that fidelity. Effectively, the same companies that do PBR and outsource their work can also build catalogs of images for NNs to train on, like different types of sheet metal etc.
     
  20. milk

    milk Like Verified
    Veteran

    Joined:
    Jun 6, 2012
    Messages:
    3,977
    Likes Received:
    4,102
    It's probably the third or fourth time I'm saying this, but my prediction for next gen tech is that implementing exactly that will be one of the next big things next gen. With virtual texturing, it can be cached in texture space. It greatly simplifies the real time shaders as a lot of the compositing is moved into the runtime texture bake stage, so it solves problems of too many different shaders, as discussed in RT threads. Once you have a robust dynamic virtual texture system, you can start experimenting with texture space shading/lighting much more easily too. Half the work has already been done.
    And finally, if we do finally enter the age of all textures having dynamic displacement, be it through POM-like shaders or actual geometry tessellation, or a mix of both (tessellation for low-frequency + large-scale, POM for high-frequency pixel-sized displacements), compositing your textures in texture space also allows multiple heightmaps to be mixed and merged in various interesting ways, even for dynamic decals. That solves a common problem I see in many games today where some surfaces have multiple layers of materials on the same mesh, but only one of them has POM and the other ones seem to float above it.
    It solves so many problems, and adds so many new possibilities, that I think the most forward-looking graphics programmers just have to try it out. And the fact that Nvidia added a texture space shading extension to their API tells me they feel the same as me.
     
    chris1515 and pharma like this.