You've got neural net-itis! There are many ways to create textures procedurally. Perhaps the most straightforward is to execute the artists' authoring steps at runtime, i.e. compile and execute a Substance material in real time rather than baking it. See the classic .kkrieger, an FPS in 96 KB.
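Roughly, the idea looks like this (a toy sketch, not Substance's or kkrieger's actual op set; the op names and structure are made up for illustration): a tiny "recipe" of ops expands into a full texture when executed.

```cpp
// Toy sketch of procedural texture generation: a small list of ops is executed
// at load time to produce pixels, instead of shipping the pixels baked.
#include <cstdint>
#include <vector>

struct Op { enum Kind { Checker, Noise, Blend } kind; float param; };

static float hashNoise(int x, int y) {
    // Cheap integer hash mapped to [0,1); stands in for a real noise generator.
    uint32_t h = uint32_t(x) * 374761393u + uint32_t(y) * 668265263u;
    h = (h ^ (h >> 13)) * 1274126177u;
    return float(h & 0xFFFF) / 65536.0f;
}

std::vector<float> executeRecipe(const std::vector<Op>& ops, int w, int h) {
    std::vector<float> tex(size_t(w) * h, 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float v = 0.0f;
            for (const Op& op : ops) {
                switch (op.kind) {
                    case Op::Checker: {
                        int cell = op.param >= 1.0f ? int(op.param) : 1;  // cell size in pixels
                        v = float(((x / cell) ^ (y / cell)) & 1);
                        break;
                    }
                    case Op::Noise:
                        v = hashNoise(x, y);
                        break;
                    case Op::Blend:
                        v = v * (1.0f - op.param) + hashNoise(x / 4, y / 4) * op.param;
                        break;
                }
            }
            tex[size_t(y) * w + x] = v;
        }
    return tex; // a few hundred bytes of "recipe" can expand into megabytes of pixels
}
```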
My recollection of kkrieger's process was that it ran through the creation steps for the assets during the game's load time. The constructed game then occupied far more than 96 KB in RAM, which allowed it to reuse those results in successive frames. An algorithm fetching instructions, inputs, and looping intermediate buffers is going to generate more accesses than it would take to simply read those results back later, and the load time seemed long enough that the largely serial component of running through a compressed list of steps could not plausibly be hidden if done on the fly. It would be a net loss if those results were discarded almost immediately and regenerated the next frame.
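In pattern form, something like this (hypothetical names, not kkrieger's actual code): the expensive creation steps run once during load, and each frame only reads the finished results.

```cpp
#include <vector>

struct Texture { std::vector<float> pixels; };

// Stand-ins for the real generator and renderer (assumed, for illustration).
Texture runCreationSteps() { return Texture{std::vector<float>(1024 * 1024, 0.5f)}; }
void drawWithTexture(const Texture&) {}

static Texture g_texture;  // results stay resident in RAM after loading

void loadTime() {
    g_texture = runCreationSteps();  // paid once, hidden behind the load screen
}

void renderFrame() {
    drawWithTexture(g_texture);      // only reads the finished results
    // The net-loss alternative: calling runCreationSteps() here would redo all
    // the instruction fetches and intermediate-buffer traffic every frame.
}

int main() {
    loadTime();
    for (int i = 0; i < 3; ++i) renderFrame();
}
```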
That's what I imagined, but could a smaller subset of the top of the tree be stored for faster sorting, with trips to RAM only needed once you reach areas in the lower levels? Potentially a permanent top-level map plus a cache of a smaller lower-level region loaded in for the necessary spaces? I suppose that only works with convergent rays, i.e. reflections; scattered light traces absolutely anywhere.
The effectiveness of a top-level cache would depend on how many accesses hit the top-level versus the acceleration structure in the bottom-level. It seems like the majority of accesses in decently complex objects would be in the bottom level.
TLBs and page walker buffers tend to store a limited set of most recently used entries. The higher levels tend to change less frequently than the lower ones, and the buffers can leverage temporal locality to save misses to cache or memory.
A large table of top-level instances may still be too big for the storage available in the L1 or local buffers of an RT core, but if there's some level of spatial or temporal locality, a buffer holding the current object and some of the most recently traversed BVH nodes could be applied to multiple rays.
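As a rough sketch of what that kind of small buffer could look like (the entry count and round-robin replacement are just assumptions for illustration), same idea as the most-recently-used entries in a TLB:

```cpp
// Tiny most-recently-used buffer of top-level instance entries: coherent rays
// can reuse recently touched entries without re-fetching from cache/memory.
#include <array>
#include <cstdint>
#include <optional>

struct InstanceEntry { uint32_t instanceId; /* transform, BLAS pointer, ... */ };

class InstanceMRU {
    static constexpr int kEntries = 8;        // tiny, like a TLB's handful of slots
    std::array<InstanceEntry, kEntries> slots_{};
    std::array<bool, kEntries> valid_{};
    int nextVictim_ = 0;                      // simple round-robin replacement

public:
    std::optional<InstanceEntry> lookup(uint32_t id) const {
        for (int i = 0; i < kEntries; ++i)
            if (valid_[i] && slots_[i].instanceId == id)
                return slots_[i];             // hit: no trip to cache or memory
        return std::nullopt;                  // miss: caller fetches and inserts
    }
    void insert(const InstanceEntry& e) {
        slots_[nextVictim_] = e;
        valid_[nextVictim_] = true;
        nextVictim_ = (nextVictim_ + 1) % kEntries;
    }
};
```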
So for next-gen RT, BW is going to be at a premium?
RT does increase shading load and adds a compute burden from BVH construction/update as well. Bandwidth use can increase, though it's apparently early days in finding out how games in general behave with it. There may be future optimizations beyond just conserving raw bandwidth, such as finding better ways of controlling the divergence of accesses. Disjoint accesses could potentially lead to stalls in the RT hardware or memory subsystem, which would look deceptively low in raw bandwidth terms.
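One frequently discussed software-side example of controlling that divergence (a sketch under assumed types, not a claim about any shipping hardware or engine) is binning rays so that the ones traversed together tend to touch the same parts of the BVH:

```cpp
// Bin rays by direction octant before traversal; rays in the same bin tend to
// walk similar BVH nodes, so their memory accesses stay more coherent.
#include <array>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

static uint32_t directionOctant(const Ray& r) {
    return (r.dx < 0 ? 1u : 0u) | (r.dy < 0 ? 2u : 0u) | (r.dz < 0 ? 4u : 0u);
}

std::array<std::vector<Ray>, 8> binRaysByOctant(const std::vector<Ray>& rays) {
    std::array<std::vector<Ray>, 8> bins;
    for (const Ray& r : rays)
        bins[directionOctant(r)].push_back(r);
    return bins; // traverse one bin at a time; each bin's accesses overlap more
}
```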
Enter STT-MRAM. It's how we'll get more density than SRAM with near-SRAM-level performance.
Perhaps at some point, but the most recent products and announcements still had endurance falling short of SRAM and DRAM, which would make it less viable for on-die caches operating in the GHz range. Another trade-off for removing SRAM's standby current, besides endurance, is write energy, which has historically been significantly higher.
Thanks. I ended up adding this to my first revision because I thought it was relevant enough to consider, especially since HBM was considered for the X1X but decided against, with access granularity being one of the drawbacks mentioned.
What access granularity problem would there be? HBM has 8 independent 128-bit channels with a burst length of 2, so 256 bits per burst. GDDR5 has a 32-bit channel and burst length of 8, so 256 bits as well.
GDDR5X was the one that doubled prefetch on a 32-bit channel and got bursts of 512 bits.
One of the reasons cited for GDDR6's transition to two channels was to stop the increase in the width of the internal array accesses, so GDDR6 drops back down to 256 bits per access.
The other kind of granularity is the page width of the DRAM, which is usually around 2KB. GDDR5X in some configurations also doubles this, whereas HBM's pseudo-channel mode can actually halve the page width.
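To put the arithmetic from the last few posts in one place (the pseudo-channel parameters and the GDDR6 page width are my assumptions beyond what was stated above): minimum access granularity is channel width times burst length, and the row/page width is a separate, much larger granularity.

```cpp
#include <cstdio>

struct DramConfig { const char* name; int channelBits; int burstLength; int pageBytes; };

int main() {
    const DramConfig configs[] = {
        // Burst figures follow the posts above; entries marked "assumed" are my additions.
        {"HBM (legacy mode)",      128,  2, 2048},
        {"HBM (pseudo-channel)",    64,  4, 1024},  // 64-bit/BL4 assumed; page halved per the post above
        {"GDDR5",                   32,  8, 2048},
        {"GDDR5X (some configs)",   32, 16, 4096},  // doubled prefetch, doubled page
        {"GDDR6 (per channel)",     16, 16, 2048},  // page width assumed typical, not stated above
    };
    for (const DramConfig& c : configs) {
        int burstBits = c.channelBits * c.burstLength;
        std::printf("%-24s %4d bits (%2d bytes) per burst, ~%d B page\n",
                    c.name, burstBits, burstBits / 8, c.pageBytes);
    }
}
```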