Next Generation Hardware Speculation with a Technical Spin [pre E3 2019]

Firstly, what does a 3 GB game look like?

Was thinking: if 2013's Killzone uses 3GB of VRAM, then how much would GoW or Spiderman use?

If you had enough power, you could even procedurally generate everything and need even less RAM than we have now.

Isn't that how HZD is done, at least partially?

So 20 to 24 GB of total RAM is possible.

Seems like a lot, but then we're two years away.
 
Firstly, what does a 3 GB game look like?

Killzone SF :p

How much memory do next-gen consoles need as a minimum, you think? Maybe if we look at how much RAM current games use we get an idea.
I only know of one example, Killzone Shadow Fall. It supposedly uses 3072MB for VRAM and about 2GB for game logic, audio, physics etc. How does that look for GoW, Spiderman, The Last of Us 2, or any of the AAA games?
For Shadow Fall, 3GB seems a lot, as the game doesn't look like a title with that much VRAM usage. Do Spiderman and HZD, which look much better, see a much larger VRAM usage?

As Shifty's pointed out, it's really hard to guess. The X1X is capable of delivering native 4K games with higher-resolution textures using 12GB of GDDR5. So that, coupled with marketing requirements, indicates that slightly more than 12GB is the bare minimum.

Should the next-gen consoles contain an NVMe SSD, it could fill up 16GB of memory in a matter of seconds. As Shifty said, the amount of memory required for buffering could reduce substantially.

Should they contain a secondary pool of memory for UI and apps, that will increase the amount of main memory accessible to games. For example, let's say they cut OS requirements (and I mean OS, not UI) down to 1GB, whilst increasing main memory to 16GB. Although that's only double the PS4's total, games would see nearly triple the capacity (roughly 15GB versus the ~5.5GB available to PS4 games).

Bandwidth seems to be the greatest concern, and 24GB of GDDR6 on a 384-bit bus is the best, cheapest way of meeting that requisite bandwidth. Let's be kind of conservative and assume 1.5GB needs to be reserved for the OS, with apps and UI dwelling in secondary memory. That's four times the PS4's available capacity, and approximately four times the bandwidth, depending on the speed of GDDR6.
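To put rough numbers on that (a back-of-the-envelope sketch in Python; the ~5.5GB PS4 game allocation and the 14Gbps GDDR6 bin are my assumptions):

    # Hypothetical next-gen config vs. PS4, illustrative numbers only.
    ps4_game_ram_gb = 5.5        # rough capacity available to PS4 games (assumed)
    ps4_bandwidth_gbs = 176.0    # PS4 GDDR5 peak bandwidth, GB/s

    nextgen_total_gb = 24.0      # 24GB GDDR6 on a 384-bit bus
    nextgen_os_gb = 1.5          # assumed OS reservation
    bus_width_bits = 384
    pin_speed_gbps = 14.0        # assumed GDDR6 bin

    game_ram_gb = nextgen_total_gb - nextgen_os_gb
    bandwidth_gbs = bus_width_bits * pin_speed_gbps / 8   # 672 GB/s

    print(game_ram_gb / ps4_game_ram_gb)      # ~4.1x the available capacity
    print(bandwidth_gbs / ps4_bandwidth_gbs)  # ~3.8x the bandwidth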

What I'd love to see though - and I know this is highly unlikely - is two stacks of HBM3, totalling 24GB and 1TB/s. Plenty of bandwidth and enough capacity to last a good 8-10 years. Also, it'd use less power than the same capacity GDDR6, leaving power budget to be spent on higher clocks for the GPU/CPU.

There's a veeeeeeery slender chance we'll get that if there's any truth to the claim that HBM is the future because GDDR is beginning to hit its limits. I know nothing about the veracity of that claim.
 

We haven’t seen anyone talk about HBM3 in quite a while. It’s likely too far off, and can’t hit the volumes needed for consoles.

They could do 12Gb chips and get 18GB of GDDR6 on a 384-bit bus.
 
We haven’t seen anyone talk about HBM3 in quite a while. It’s likely too far off, and can’t hit the volumes needed for consoles.

Yeah, I think you're 99.999999% correct. But I'm hopeful that there'll be some unexpectedly positive news at CES or its ilk. Is there an equivalent event dedicated to memory?

I was really expecting more news regarding it this year, with 7nm ramping up, but news on it has been weirdly absent.

They could do 12Gb chips and get 18GB of GDDR6 on a 384-bit bus.

Cool. What bandwidth would we be looking at? And what's the formula to calculate that please?
 
Cool. What bandwidth would we be looking at? And what's the formula to calculate that please?

Number of chips times 32 pins per chip times the chips’ transfer speed divided by 8 to convert bits to bytes.

Example: 14Gbps chips on a 256-bit bus (8 chips) gives 448GB/s.
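As a quick sketch of that formula in Python (the pin speeds below are just example bins):

    def gddr_bandwidth_gbs(bus_width_bits, pin_speed_gbps):
        # Peak bandwidth in GB/s: total data pins x per-pin rate, bits -> bytes.
        return bus_width_bits * pin_speed_gbps / 8

    # Each GDDR5/GDDR6 chip contributes 32 data pins, so chips = bus width / 32.
    print(gddr_bandwidth_gbs(256, 14))  # 448.0 GB/s (8 chips at 14Gbps)
    print(gddr_bandwidth_gbs(384, 14))  # 672.0 GB/s (12 chips at 14Gbps)
    print(gddr_bandwidth_gbs(384, 12))  # 576.0 GB/s (12 chips at 12Gbps)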
 
If we assume 10-12TF is the most reasonable target, would 576GB/s be enough? They can choose the least expensive option between 16/18Gbps 256-bit, 12/14Gbps 384-bit, or two stacks of HBM2/HBM2E. Every two years the memory manufacturers have a new node which allows faster bins, making the previous top-end bin speeds ripple down into mainstream volume.

GDDR6 at 16Gbps on a 256-bit bus would be a good choice for a conservative 10TF on 512GB/s. Then add next-gen features like FP16 (20TF) and AI/RT acceleration (40TF); combined with a much faster CPU it's a nice next gen which can have its GPU doubled at mid-gen.

Four HBM2 stacks are out of the question, but two stacks are not impossible. It might fall to a reasonable cost in the next two years. It's the right bandwidth and would help a lot with power issues. And it will get a speed bump too, so 307GB/s per stack won't be the top speed anymore in 2020; it should be a midrange bin.

Then we have HBM3, which should fill those requirements with a single stack. If the major cost of HBM is integration, not the memory dies themselves, this option could be very low cost. Don't know if that can be ready for 2020 (rumored to be planned for 2019/2020, but that doesn't mean mainstream volume).
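For reference, a little Python sketch comparing the peak bandwidths of those candidate configurations (the HBM2 figure uses the 307GB/s per stack mentioned above; all GDDR6 speeds are assumed bins):

    # Peak bandwidth of the candidate memory configs, in GB/s.
    def gddr(bus_bits, gbps):
        return bus_bits * gbps / 8

    configs = {
        "GDDR6 256-bit @ 16Gbps": gddr(256, 16),    # 512 GB/s
        "GDDR6 256-bit @ 18Gbps": gddr(256, 18),    # 576 GB/s
        "GDDR6 384-bit @ 12Gbps": gddr(384, 12),    # 576 GB/s
        "GDDR6 384-bit @ 14Gbps": gddr(384, 14),    # 672 GB/s
        "2x HBM2 stacks @ 2.4Gbps/pin": 2 * 307.2,  # ~614 GB/s
    }
    for name, bw in configs.items():
        print(f"{name}: {bw:.0f} GB/s")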
 
If they go back to split memory pools, they'll need to have more total memory, probably 16GB of RAM and 8GB of VRAM. That's still enough for even the most high-end PC games right now. They might be able to get away with less if it's a single pool.
 
If we assume 10-12TF is the most reasonable target, would 576GB/s be enough?
Enough for what? That's the same question as how much RAM is enough, which is the same as asking how cheap the RAM needs to be. ;)

Given x amount of bandwidth and y amount of compute, devs will favour techniques that prefer bandwidth if there's an excess, or processing if BW is limited. The only case where BW becomes a real issue is when one platform has significantly less than the others (including PC targets) and cross-platform titles that aren't optimised for it tank the framerate (or simplistically reduce quality) to fit.

I suppose the real question you're posing is what's the typical BW per flop needed in current and predicted future GPU workloads. Someone might have typical data for BW-limited games, but future requirements are nigh impossible to pin down if we don't know what the hardware will be doing. If raytracing is a target, would some large L3 cache on the GPU, big enough to fit a useful data structure (a BVH tree), be useful? Or will we forever be thrashing main RAM and need as much main RAM BW as possible? Are we going to start to see materials computed in realtime instead of reading loads of 4K textures?
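To put the "BW per flop" framing in rough numbers (a sketch only; the PS4 and X1X lines use their public specs, the last line is the hypothetical 12TF / 576GB/s case discussed above):

    # Bytes of main-memory bandwidth available per FLOP for a few designs.
    systems = {
        "PS4 (1.84TF, 176GB/s)":        (1.84, 176.0),
        "Xbox One X (6TF, 326GB/s)":    (6.0, 326.0),
        "Hypothetical (12TF, 576GB/s)": (12.0, 576.0),
    }
    for name, (tflops, gbs) in systems.items():
        bytes_per_flop = gbs / (tflops * 1000)  # GB/s over GFLOP/s
        print(f"{name}: {bytes_per_flop:.3f} bytes per FLOP")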
 
Then we have HBM3, which should fill those requirements with a single stack. If the major cost of HBM is integration, not the memory dies themselves, this option could be very low cost.

It’s not just about cost.

https://www.extremetech.com/computi...double-hbm2-manufacturing-fail-to-meet-demand

On Tuesday at ISC 2018, Samsung discussed its Aquabolt HBM2 technology and made a rather unusual claim about demand for its high-end memory standard. According to the company, even if it doubled its manufacturing capacity for HBM2 today, it still wouldn’t be able to meet existing demand for the standard.

And yes, HBM burns less power, but the advantage of GDDR6 is that your power is mostly burned off-package. Burning 20W more power isn't that costly system-wise (perhaps a 10% bump), but with HBM, having to cool the die plus the memory stacks on one package, instead of only one or two discrete hot spots, makes your cooling system more complex. You also potentially save the cost of an interposer.
 
you could even procedurally generate everything and need even less RAM than we have now.
Hmm... possibly. Maybe someone else with more hardware knowledge could provide some insight on memory bandwidth usage when running a model. I know training takes a megaton.

We can use AI NNs to generate textures now. I mean, it's a long way out because I don't know what the workflow would look like, but in theory, if they trained models for all their textures in advance, the textures could be generated and re-generated per LOD parameters, and maybe even get up to super-close texture detail.

If that's to become mainstream, alongside DLSS and AI denoising, and now AI NN texture generation, the case for tensor hardware keeps growing. I'll take 100+ TF of tensor flops over standard compute power.
 
Does anyone know the current status of HBM3 in general?

HBM3 was announced in August 2016 at Hot Chips (Samsung, SK Hynix) and was expected to begin manufacturing in 2019 or 2020.

I did see this recent article from Tomshardware:

HBM2
HBM2 debuted in 2016. In December 2018, the JEDEC updated the HBM2 standard. It now allows up to 12 dies per stack for a max capacity of 24GB. The standard also pegs memory bandwidth at 307GB/s, delivered across a 1,024-bit memory interface separated by eight unique channels on each stack.
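For context, that 307GB/s per-stack figure falls straight out of the interface width and pin speed (a quick sketch, assuming the 2.4Gbps pins of Samsung's Aquabolt parts):

    # HBM2 per-stack bandwidth: 1024-bit interface, 2.4Gbps per pin.
    interface_bits = 1024
    pin_speed_gbps = 2.4
    print(interface_bits * pin_speed_gbps / 8)  # 307.2 GB/s per stack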
My understanding of Samsung's discussion at the time was a low-cost HBM standard that traded some of the bandwidth, power efficiency, and features of HBM away for cheaper and easier manufacturing and integration. That cost-reduced HBM was something Samsung said it was shopping around to see if there was any interest. I have not seen any announcement or discussion of that idea since.

Hynix mentioned some aspirations for HBM3 (the slide I saw said HBMx, however). Hynix essentially hoped that it should be faster, cheaper, and more broadly adopted without going into any detail as to what new elements would go into the standard, or how far along it was at the time.

There's something of a marketing conundrum even in calling it HBM3. From the JEDEC standard, for which there is a standard for each numbered memory type, there is only one HBM standard. HBM2 was the marketing name applied to the finalized variant of the standard. HBM was the preliminary version that was used in AMD's Fury products and seemingly nowhere else. HBM2 as we know it is what happened after various blank spaces in the JEDEC standard were filled in and tentative features were finalized/deleted. Unless I missed a separate HBM2 JEDEC standard being copy-pasted from the 2016 finalized revision, I'm not sure if the various manufacturers would keep an implicit +1 in all their marketing, or JEDEC could be compelled to skip 2 and go straight to 3.

Even if they use the clamshell mode?
The individual channels split their data lines between the two clamshell chips, 8 to each. GDDR5 does the same, it just starts with twice as many data pins before splitting them between chips.
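A small sketch of what clamshell does to the wiring and capacity (standard GDDR6 pin counts; the 16Gb density and 18Gbps speed are assumptions):

    # Clamshell: two chips share the 32 data pins normally wired to one chip
    # (16 pins each), so capacity doubles while bus width and peak bandwidth
    # stay the same.
    bus_width_bits = 256
    chip_density_gbit = 16
    pin_speed_gbps = 18  # assumed bin

    for pins_per_chip in (32, 16):  # normal vs. clamshell
        chips = bus_width_bits // pins_per_chip
        capacity_gb = chips * chip_density_gbit / 8
        bandwidth_gbs = bus_width_bits * pin_speed_gbps / 8
        print(f"{chips} chips: {capacity_gb:.0f}GB, {bandwidth_gbs:.0f}GB/s")

Output: 8 chips gives 16GB, 16 chips gives 32GB, both at 576GB/s.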

If you had enough power, you could even procedurally generate everything and need even less RAM than we have now.
I'm aware of games that used procedural generation at level load time or on a streaming basis as new assets were loaded. That allowed for a much smaller footprint in the game's on-disk storage and helped alleviate the SATA bus bottleneck. I'm not aware of games using it on a per-frame basis to save RAM capacity. The concept trades a potentially significant amount of computation to produce an asset, and it's a serial component that is coalesced into loading times or is given a fraction of the frame budget for new objects entering the margins of what is viewable. What is already loaded or generated is re-used for many frames, and there's not enough storage on-die to make the storage pool anywhere but RAM.
It would seemingly require a limited amount of transformation from input to output to fit the generation process into every frame's usage of an asset.

Someone might have typical data for BW-limited games, but future requirements are nigh impossible to pin down if we don't know what the hardware will be doing. If raytracing is a target, would some large L3 cache on the GPU, big enough to fit a useful data structure (a BVH tree), be useful? Or will we forever be thrashing main RAM and need as much main RAM BW as possible? Are we going to start to see materials computed in realtime instead of reading loads of 4K textures?
Some of Nvidia's early research into BVH acceleration and adapting GPU execution to ray-tracing had test scenes that could take the memory footprint into the tens to hundreds of MB, which on-die storage is unlikely to scale to.
The examples I found in https://users.aalto.fi/~ailat1/publications/aila2010hpg_paper.pdf were also from 2010, to give an idea of where the contemporary levels of complexity were at the time those figures were given.
 
Hmm... possibly. Maybe someone else with more hardware knowledge could provide some insight on memory bandwidth usage when running a model. I know training takes a megaton. We can use AI NNs to generate textures now.
You've got neural-net-itis! There are many ways to create textures procedurally. Perhaps the most straightforward is to execute the actions of the artists, so compile and execute a Substance material in realtime rather than baking it. See the classic .kkrieger FPS in 96kb.
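As a toy illustration of the "execute the recipe at runtime" idea (nothing to do with Substance's actual format; just a minimal value-noise material generated on demand in Python):

    import numpy as np

    def value_noise(size, cell, seed=0):
        # Cheap value noise: random lattice values, bilinearly upsampled.
        rng = np.random.default_rng(seed)
        lattice = rng.random((size // cell + 2, size // cell + 2))
        ys, xs = np.mgrid[0:size, 0:size] / cell
        x0, y0 = xs.astype(int), ys.astype(int)
        fx, fy = xs - x0, ys - y0
        top = lattice[y0, x0] * (1 - fx) + lattice[y0, x0 + 1] * fx
        bot = lattice[y0 + 1, x0] * (1 - fx) + lattice[y0 + 1, x0 + 1] * fx
        return top * (1 - fy) + bot * fy

    def rusty_metal(size=256):
        # The 'recipe', evaluated when needed instead of shipping baked textures.
        base = value_noise(size, 32, seed=1)              # broad patches
        grime = value_noise(size, 4, seed=2) * 0.3        # fine detail
        albedo = np.clip(0.4 + 0.5 * base - grime, 0, 1)  # greyscale albedo
        roughness = np.clip(0.6 + grime, 0, 1)
        return albedo, roughness

    albedo, roughness = rusty_metal()
    print(albedo.shape, roughness.shape)  # (256, 256) (256, 256)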

I'm aware of games that used procedural generation at level load time or on a streaming basis as new assets were loaded.
It's a mostly hypothetical statement illustrating the choices engineers face.

Some of Nvidia's early research into BVH acceleration and adapting GPU execution to ray-tracing had test scenes that could take the memory footprint into the tens to hundreds of MB, which on-die storage is unlikely to scale to.
That's what I imagined, but could a smaller subset of the top of the tree be stored on-die for faster sorting, so you'd only need to go to RAM when a ray descends into the lower levels? Potentially, have a permanent top-level map and a cache of the smaller lower levels, loaded in as needed? I suppose that only works with convergent rays, so reflections. Scattered light traces absolutely anywhere.

So for RT next-gen, BW is going to be at a premium?
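Going back to the cached-top-levels idea above, a very hand-wavy Python sketch of the concept (TOP_LEVELS_ON_DIE, the node layout and fetch_from_ram are all made up for illustration):

    from dataclasses import dataclass
    from typing import List, Optional

    TOP_LEVELS_ON_DIE = 4  # assumed number of BVH levels that fit in on-die cache

    @dataclass
    class Node:
        bbox_min: tuple
        bbox_max: tuple
        children: Optional[List["Node"]] = None  # None means leaf
        depth: int = 0

    def ray_hits_box(origin, inv_dir, bmin, bmax):
        # Standard slab test against an axis-aligned bounding box.
        tmin, tmax = 0.0, float("inf")
        for o, d, lo, hi in zip(origin, inv_dir, bmin, bmax):
            t1, t2 = (lo - o) * d, (hi - o) * d
            tmin, tmax = max(tmin, min(t1, t2)), min(tmax, max(t1, t2))
        return tmin <= tmax

    def traverse(node, origin, inv_dir, fetch_from_ram):
        # Nodes deeper than the cached levels have to be pulled from main RAM.
        if not ray_hits_box(origin, inv_dir, node.bbox_min, node.bbox_max):
            return
        if node.children is None:
            return  # leaf: triangle intersection would happen here
        for child in node.children:
            if child.depth > TOP_LEVELS_ON_DIE:
                fetch_from_ram(child)  # this is where main-memory BW gets spent
            traverse(child, origin, inv_dir, fetch_from_ram)

    # Tiny demo: one child sits below the cached levels, so it triggers a fetch.
    leaf = Node((0, 0, 0), (1, 1, 1), depth=TOP_LEVELS_ON_DIE + 1)
    root = Node((0, 0, 0), (1, 1, 1), children=[leaf])
    traverse(root, (-1.0, 0.5, 0.5), (1.0, 1e9, 1e9),
             lambda n: print("fetching subtree at depth", n.depth))

Coherent rays (like mirror reflections) tend to revisit the same subtrees, so a scheme like this keeps most traffic in the cached top; incoherent, scattered rays defeat it, which is the point above.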
 
anexanhume said:
On Tuesday at ISC 2018, Samsung discussed its Aquabolt HBM2 technology and made a rather unusual claim about demand for its high-end memory standard. According to the company, even if it doubled its manufacturing capacity for HBM2 today, it still wouldn’t be able to meet existing demand for the standard.

Interesting. And also, it looks like my 0.000001% chance of HBM appearing in a next-gen console has just evaporated :(
 
Kinda brought it up a while ago regarding GDDR6, but it operates with 2 channels x 16-bit I/O per chip, which may have (good) implications for bandwidth utilization and sharing in an APU configuration.

The individual channels split their data lines between the two clamshell chips, 8 to each. GDDR5 does the same, it just starts with twice as many data pins before splitting them between chips.
It just confirms for me that they're going to use 32GB in clamshell mode with either 18Gbps or 20Gbps GDDR6 chips (IMO). The two channels will reduce the memory contention problems we have on all current-gen consoles, and bandwidth will be sufficient.
 
My understanding of Samsung's discussion at the time was a low-cost HBM standard that traded some of the bandwidth, power efficiency, and features of HBM away for cheaper and easier manufacturing and integration. That cost-reduced HBM was something Samsung said it was shopping around to see if there was any interest. I have not seen any announcement or discussion of that idea since.

At this point, HBM3 is essentially our best hope to deliver on the premise of LCHBM. LCHBM has evaporated. Customers want high bandwidth and are high-dollar capable, so LCHBM and short-stack HBM have been de-prioritized.

https://translate.google.de/translate?sl=auto&tl=en&js=y&prev=_t&hl=de&ie=UTF-8&u=https://pc.watch.impress.co.jp/docs/column/kaigai/1112390.html&edit-text=

Some of Nvidia's early research into BVH acceleration and adapting GPU execution to ray-tracing had test scenes that could take the memory footprint into the tens to hundreds of MB, which on-die storage is unlikely to scale to.
The examples I found in https://users.aalto.fi/~ailat1/publications/aila2010hpg_paper.pdf were also from 2010, to give an idea of where the contemporary levels of complexity were at the time those figures were given.

Enter STT-MRAM. It's how we'll get more density than SRAM with near-SRAM level performance.

Kinda brought it up a while ago regarding GDDR6, but it operates with 2 channels x 16-bit I/O per chip, which may have (good) implications for bandwidth utilization and sharing in an APU configuration.

Thanks. I ended up adding this to my first revision because I thought it was relevant enough to consider, especially since HBM was considered on X1X but decided against with access granularity being one of the drawbacks mentioned.
 
You've got neural-net-itis!
:oops:
A symptom of now working in the field; it tends to dominate my thought process as of late.
But honestly, it's really effective stuff for tackling a variety of derivatives and variations, something hand-coded algorithms struggle a bit more with, especially when we get into content creation.

There are many ways to create textures procedurally. Perhaps the most straightforward is to execute the actions of the artists, so compile and execute a Substance material in realtime rather than baking it. See the classic .kkrieger FPS in 96kb.
Agreed, but the computational or labour resources may be significantly higher here. I prefer the NN approach ;) it seems to scale better for AAA environments, as those studios have the resources to purchase and make full use of the data sets and hardware their games' fidelity requires. Effectively, the same companies that do PBR and outsource their work can also build catalogues of images for NNs to train on, like different types of sheet metal etc.
 
Perhaps the most straightforward is to execute the actions of the artists, so compile and execute a Substance material in realtime rather than baking it. See the classic .kkrieger FPS in 96kb.

It's probably the third or fourth time I'm saying this, but my prediction is that implementing exactly that will be one of the next big things next gen. With virtual texturing, it can be cached in texture space. It greatly simplifies the real-time shaders, as a lot of the compositing is moved into the runtime texture-bake stage, so it solves the problem of too many different shaders, as discussed in the RT threads. Once you have a robust dynamic virtual texture system, you can start experimenting with texture-space shading/lighting much more easily too. Half the work has already been done.
And finally, if we do finally enter the age where all textures have dynamic displacement, be it through POM-like shaders or actual geometry tessellation, or a mix of both (tessellation for low-frequency, large-scale displacement; POM for high-frequency, pixel-sized displacement), compositing your textures in texture space also allows multiple heightmaps to be mixed and merged in various interesting ways, even for dynamic decals. That solves a common problem I see in many games today where some surfaces have multiple layers of materials on the same mesh, but only one of them has POM and the others seem to float above it.
It solves so many problems, and adds so many new possibilities, that I think the most forward-looking graphics programmers just have to try this out. And the fact that Nvidia added a texture-space shading extension to their API tells me they feel the same way.
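To make the "runtime texture bake into a virtual texture" idea concrete, a rough Python sketch (tile size, the layer format and the page key are all invented for illustration):

    import numpy as np

    TILE = 128  # texel resolution of one virtual-texture tile (assumption)

    def compose_tile(layers):
        # Runtime 'bake': merge material layers (albedo + height) in texture space.
        albedo = np.zeros((TILE, TILE, 3))
        height = np.zeros((TILE, TILE))
        for layer in layers:
            blend = layer["blend"]
            albedo = albedo * (1 - blend[..., None]) + layer["albedo"] * blend[..., None]
            height = np.maximum(height, layer["height"] * blend)  # decals keep relief
        return albedo, height

    class VirtualTextureCache:
        # Tiles are composited on first use and reused until evicted.
        def __init__(self):
            self.tiles = {}

        def get(self, page, layers):
            if page not in self.tiles:
                self.tiles[page] = compose_tile(layers)  # the bake step
            return self.tiles[page]

    # Hypothetical usage: base metal plus a painted decal on one tile.
    rng = np.random.default_rng(0)
    base = {"albedo": rng.random((TILE, TILE, 3)),
            "height": rng.random((TILE, TILE)),
            "blend": np.ones((TILE, TILE))}
    decal = {"albedo": rng.random((TILE, TILE, 3)),
             "height": rng.random((TILE, TILE)),
             "blend": rng.random((TILE, TILE))}
    cache = VirtualTextureCache()
    albedo, height = cache.get(page=(3, 7, 0), layers=[base, decal])
    print(albedo.shape, height.shape)

The per-frame shader would then just sample the cached albedo/height, which is the simplification being described.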
 