Next Generation Hardware Speculation with a Technical Spin [2018]

Did they say no gaming products, or no consumer-level desktop GPUs? A custom SoC at 7nm is something different from a GPU, with guaranteed sales of probably >50 million units. That's a lot of silicon and worth doing if the contract arises. So yeah, I think something like current Vega dGPU specs (certainly architecture-wise) in an SoC is a clear option.
 
They’ve recovered from a recent exaggerated downturn, but the trend over that entire market-share chart is clearly a gradual conceding of the market to Nvidia. I also wonder how much the crypto craze was responsible for the ‘correction.’



AMD has admitted there will be no 7nm Vega gaming products. Still, the scaling from 14/16nm to 7nm is very good, better than a 50% area reduction. They could cram a Vega 56/64 into a ~350mm^2 APU with that kind of scaling.

On top of all of this there’s gaming performance per TF. I’m sure drivers play a big part, but comparing single-precision FLOPS to actual game performance shows an advantage for Nvidia as well.

If they keep Jaguar cores maybe, but if they upgrade to a beefier CPU that might take more space...
 

Sorry, the link is more supposition than I recall, but its conclusions make sense to me: https://www.pcgamesn.com/amd-7nm-vega-not-for-gaming


Rough math, assuming 50% scaling: Vega 56/64 is a 484 mm^2 die for reference, and Zeppelin (8-core Zen) is 213 mm^2. 242 + 107 = 349 mm^2. Keep in mind the repeated structures of a CPU and GPU will not scale the same, but 350 mm^2 seems a good ballpark.
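In code form, for anyone who wants to tweak the assumptions (the flat 50% shrink is the assumption here; SRAM and I/O won't scale that well in practice):

```python
# Back-of-envelope 7nm APU sizing, using the rough figures above.
# Assumes a uniform 50% area shrink from 14/16nm, which real designs
# won't hit exactly (SRAM, analog and I/O scale worse than logic).

SCALING = 0.50          # assumed 14/16nm -> 7nm area factor
vega_64_mm2 = 484       # Vega 56/64 die size at 14nm
zeppelin_mm2 = 213      # Zeppelin (8-core Zen) die size at 14nm

gpu_7nm = vega_64_mm2 * SCALING    # 242.0 mm^2
cpu_7nm = zeppelin_mm2 * SCALING   # 106.5 mm^2
print(f"{gpu_7nm:.1f} + {cpu_7nm:.1f} = {gpu_7nm + cpu_7nm:.1f} mm^2")
# -> 242.0 + 106.5 = 348.5 mm^2, i.e. the ~350 mm^2 ballpark
```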
 
I wonder if using 3 HBM2 stacks at 2.4Gbps (~920GB/s) would give the SoC enough system bandwidth to completely forgo the L3 cache for inter-core coherency.
That could reduce the die area of each CCX by about a third... at the cost of an extra HBM stack, of course.
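For reference, the ~920GB/s falls out of the per-stack math, assuming the standard 1024-bit interface per HBM2 stack:

```python
# HBM2 bandwidth per stack = pin speed * interface width.
pin_gbps = 2.4                                # Aquabolt-class pin speed
bits_per_stack = 1024                         # standard HBM2 stack interface
per_stack = pin_gbps * bits_per_stack / 8     # -> 307.2 GB/s per stack
print(3 * per_stack)                          # -> 921.6 GB/s for three stacks
```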
 
I don't know much about this, but I think the cache helps latency, preventing stalls, much more than it helps bandwidth. HBM doesn't have lower latency or better data granularity than DDR or GDDR. More banks and lower power, that's about it.
 
I would also venture that the packaging cost (TSV) increase would more than offset cost savings in the die size reduction and complicate the thermal solution.

Based on time of flight, you’re only talking latency improvements of hundreds of picoseconds versus a PCB-based solution. Maybe a cycle or two?
 
My logic had more to do with the bandwidth that would be necessary for the GPU regardless.
If Scorpio needs 320GB/s for a 6 TFLOPs GPU + low end CPU cores, then a hypothetical ~13 TFLOPs GPU + high-end CPU cores could need >640GB/s.
They can't get 640 GB/s with 2 stacks of HBM2, so if they need 3 stacks (~920 GB/s) anyway, I was wondering if they could forgo the L3 for the CPU cores.

3x Aquabolt stacks would give the console 24GB at 920GB/s. At a modest cost of 749€ per console, of course...
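The >640GB/s figure is just a linear scaling of Scorpio's bandwidth-per-TFLOP ratio, a crude yardstick since the CPU's share isn't broken out, but as a sketch:

```python
# Scale Scorpio's bandwidth/TFLOP ratio to a hypothetical 13 TFLOPS SoC.
scorpio_bw, scorpio_tf = 320, 6        # GB/s, TFLOPS
ratio = scorpio_bw / scorpio_tf        # ~53.3 GB/s per TFLOP
print(13 * ratio)                      # -> ~693 GB/s needed at 13 TFLOPS
```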
 
Everything built on standard DRAM technology (so any external memory) has a tRC of 40ns to 48ns, as far as I can find.

And while HBM has more banks, the data granularity is 256 bytes (a 2n prefetch, but across a 1024-bit interface?), so I'm curious whether that could cause problems with CPU workloads, or whether they could use the channels independently to improve this. I think Nvidia said at some point the ideal granularity is the size of a cache line.
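A rough way to frame it, treating minimum access granularity as interface width times prefetch (the 2n prefetch and 128-bit channel width are my assumptions about the HBM configuration being discussed here):

```python
# Minimum access granularity = (interface bits * prefetch) / 8 bytes.
def granularity_bytes(width_bits, prefetch):
    return width_bits * prefetch // 8

# Whole 1024-bit stack treated as one access (the 256-byte figure above):
print(granularity_bytes(1024, 2))   # -> 256 bytes

# Per independent 128-bit HBM channel, it's far smaller:
print(granularity_bytes(128, 2))    # -> 32 bytes

# GDDR5 for comparison (x32 chip, 8n prefetch):
print(granularity_bytes(32, 8))     # -> 32 bytes, vs a 64-byte cache line
```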
 
There is always hope the integration cost of HBM will see a breakthrough. There was a lot of buzz about organic interposers costing a fraction of silicon ones; the "low cost" HBM proposal was designed with that in mind.

At that point I want 4 stacks :runaway:
 

384-bit with 18Gb/s GDDR6 gets you to 864 GB/s. So I think HBM would have to be cost competitive to force the issue. Personally I’d take the hit to system power to simplify the packaging and its thermal solution, even if that means being stuck with a 384-bit interface through die shrinks.
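That 864 GB/s is straight arithmetic:

```python
# GDDR6 aggregate bandwidth = bus width * per-pin data rate.
bus_bits = 384
pin_gbps = 18
print(bus_bits * pin_gbps / 8)   # -> 864.0 GB/s
```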


I’m curious how EMIB compares cost-wise. Not that it’s an option, but still curious.
 

Well it should certainly help with yields initially, and the individual components can be used for other products. But once a process and production matures, the yield advantages and simplified production cost will favor a more integrated solution. AFAIK, functionally an HBM2 interposer could be built into a motherboard. It's not like Kaby Lake-G, where you need a certain level of customization and want to keep your IP technically discrete.
 

Oh, I’m sure it would be superior eventually due to the level of integration. I’m sure it simplifies the PCB greatly, and they may even be able to use a lower-cost material such as FR4 in the absence of a GDDR bus. The fact that only enthusiast and compute-focused cards are using HBM thus far doesn’t bode well for the projected initial cost, though. MS said they evaluated HBM for Scorpio, after all.

Microsoft kicked the tires of the kinds of HBM modules AMD pioneered in GPUs in 2015 and Nvidia uses now on Volta. “But for a consumer product HBM2 is too expensive and inflexible…its memory bandwidth is not as granular, and we would be locked into [an HBM] module,” Sell said.
 
384bit means a minimum of 12 memory chips.
I like to think console makers will want to avoid anything resembling this, if they can:

[image: a GPU board ringed by a dozen memory chips]



In future iterations, HBM could be implemented using TSVs on top of the SoC in a true 3D configuration (when thermals allow it). Compare that to being stuck with 12 memory chips around the SoC.


The decision for Scorpio's memory subsystem was probably made around 2015, and they did well not to go with HBM2; otherwise they'd have been subsidizing HBM2 production lines in an already chaotic environment for RAM production.

It should be a completely different discussion for a console launching in late 2019 or 2020.
 
All valid points, but here we are on the eve of more consumer graphics card launches and they’re looking to GDDR6, not HBM variants. At some point that has to cast serious doubt on HBM’s chances in a console. I would also like to know what they mean by inflexibility and the memory granularity issues.
 
It's rather simple: if HBM is twice the price of GDDR6 for the same bandwidth and capacity, it won't happen. If it's closer in price, maybe the lower power would compensate for cost elsewhere. The goal is to have the best cost/performance.

With high end parts, HBM will be a necessity because nothing else can provide the bandwidth. With laptop chips, it will be a similar argument, power consumption being more important than cost.
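That cost/performance framing is easy to make concrete. The prices below are invented purely to illustrate the "twice the price" scenario, not real quotes:

```python
# Toy cost-per-bandwidth comparison; the dollar figures are made up
# solely to illustrate the 'twice the price for similar bandwidth' case.
def cost_per_gbps(price, bandwidth_gbps):
    return price / bandwidth_gbps

gddr6 = cost_per_gbps(100, 864)   # hypothetical 384-bit/18Gbps GDDR6 setup
hbm2  = cost_per_gbps(200, 922)   # hypothetical 3-stack HBM2 setup
print(f"GDDR6 {gddr6:.3f} vs HBM2 {hbm2:.3f} $/GB/s")
# HBM2 loses on this metric unless its lower power saves cost elsewhere.
```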
 
I would also like to know what they mean by inflexibility and the memory granularity issues.

Perhaps the number of modules vs. bus width for a given amount of memory? 12 GB of HBM2 would've been insanely expensive and probably required a 3 x 1024-bit bus, going by Vega's implementation of 2 x 1024 for two stacks of 4GB each, and therefore a more complicated interposer. I'm not completely sure what the maximum capacity per stack currently is with HBM2.
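Going by Vega's implementation, the 12GB case works out like this (assuming 4-Hi stacks at 4GB each):

```python
# Stacks needed for 12GB with 4GB (4-Hi) HBM2 stacks, a la Vega.
stack_gb, stack_bits = 4, 1024
stacks = 12 // stack_gb                                   # -> 3 stacks
print(stacks, "stacks =", stacks * stack_bits, "bit bus") # -> 3072-bit bus
```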
 
I would also like to know what they mean by inflexibility and the memory granularity issues.

Probably the fact that only 4-Hi and 8-Hi stacks are made with HBM. With GDDR5 each x32 chip can go from 512MB all the way up to 2GB.
I don't know if it's possible to mix stacks of different heights (and not get a z-height problem, for example), but I'd say not yet.
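The flexibility gap shows up if you enumerate the capacities each option can reach (assuming 4-Hi = 4GB and 8-Hi = 8GB stacks with no mixed heights, and 512MB to 2GB GDDR5 chips):

```python
# Capacity options if stack heights cannot be mixed (as suggested above):
hbm_stack_gb = [4, 8]                  # 4-Hi or 8-Hi HBM2 stacks
for n in (2, 3):
    print(n, "stacks:", [n * s for s in hbm_stack_gb], "GB")
# 2 stacks: [8, 16] GB   3 stacks: [12, 24] GB  -- coarse jumps only

# GDDR5: twelve x32 chips, each 0.5/1/2 GB, per density:
print([12 * c for c in (0.5, 1, 2)])   # -> [6.0, 12.0, 24.0] GB
# ...and GDDR5 boards can also mix densities for steps in between.
```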
 

I had just assumed they’d go the 8GB + 4GB system RAM route a la PS4 Pro in that case, and 12GB was more about bandwidth, but I can appreciate that may not be the case, and going to 16GB would have been exorbitant.
 

My assumption. They had the ESRAM in the original-model Xbone to match up against. And if you're going to make the motherboard more complicated, you might as well do it to the benefit of the entire system. Even with 4 TFLOPS of compute, I don't see the real point of the 1GB of DDR3 in the PS4 Pro, unless checkerboarding requires a lot more extra memory in combination with the 4K output. But the base PS4 already has 256 MB of DDR3 as background memory, so using larger available modules in its place just makes sense and is to the benefit of developers. I'm sure it added but a few cents to the cost of the Pro.
 

That all seems fair.

I found a link with numbers on HBM2 cost and power that I trust (David Kanter as the source) showing a 3x cost factor on Vega for AMD, but with a 60% power ratio. The equation is a little different for a graphics card, which needs an integrated cooling solution, versus a console, which can alternatively put that heat on the PCB, but it looks like it comes down to how aggressively they can push down the price of the dies themselves.
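Plugging those ratios into numbers (the baseline cost and power figures below are placeholders to show the ratios, not values from the article):

```python
# Ratios from the linked analysis: HBM2 ~3x the cost and ~0.6x the power
# of the GDDR baseline. The baseline values themselves are invented.
base_cost, base_power_w = 1.0, 30.0     # arbitrary baseline units
hbm2_cost = 3.0 * base_cost             # the 3x cost factor on Vega
hbm2_power = 0.6 * base_power_w         # the 60% power ratio
print(f"cost x{hbm2_cost:.1f}, power {hbm2_power:.0f}W vs {base_power_w:.0f}W")
# The ~12W saved could come back as a cheaper cooler/PSU, but only if
# the memory price premium itself comes down.
```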
 