3D_world, you're out of your element. I hadn't had a good laugh for the day, so thanks for your laughable post.
Indeed... his nonsense posts brought a smile to my face....
Thanks, I am beginning to understand the advantages of the bottom logic of HMC as a standard part of the memory itself (defect mapping is a pretty good one, they could even do ECC for servers, and it could translate the interface to any future PHY). Is it plausible that the DRAM layers of HMC are practically the same thing as HBM, and cost the same to produce? I'm curious whether one has a pricing edge over the other depending on the application. I thought HMC would necessarily cost more, but if HMC allows defect mapping, the better yield for large-capacity chips would easily pay for the price of the bottom logic layer, and any logic in there would save the cost of having to implement it in the SoC (memory being so much higher volume than an SoC, any logic there would cost less).

HBM seems to be a more straightforward port of DRAM to a stacked interposer format.
The largest apparent difference that I can see is that there isn't a layer of logic at the bottom of the DRAM stack.
HMC provides controller logic and high-speed bidirectional links. A lot of other possibilities for topology, DRAM technology, and simplification of CPU and GPU memory controller logic can come with HMC.
It's possible the on-stack logic can effect some form of repair or defect compensation that HBM may not.
HBM seems to be okay if focusing on a limited subset of capabilities using existing DRAM tech, while HMC seems to offer more applicability, expandability, and future development.
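The defect-mapping/repair angle mentioned above can be pictured as a remap table living in the base logic die: accesses to rows that failed test get redirected to spare rows, invisibly to the host. A purely illustrative sketch of that idea follows; the class, row numbers, and spare-row pool are all hypothetical, not how any actual HMC logic layer works.

```python
# Hypothetical illustration of defect remapping in a base logic die:
# rows that failed test are transparently redirected to spare rows,
# which is one way a logic layer could raise the effective DRAM yield.
class BaseLogicRemap:
    def __init__(self, spare_rows):
        self.spare_rows = list(spare_rows)   # rows reserved for repair
        self.remap = {}                      # bad_row -> spare_row

    def mark_bad(self, row):
        """Record a defective row found during test and assign it a spare."""
        if row not in self.remap:
            if not self.spare_rows:
                raise RuntimeError("out of spare rows; stack would be scrapped")
            self.remap[row] = self.spare_rows.pop()

    def translate(self, row):
        # Every access already passes through the logic die, so the
        # redirection is invisible to the host memory controller.
        return self.remap.get(row, row)

ctrl = BaseLogicRemap(spare_rows=range(16_384, 16_400))
ctrl.mark_bad(1234)                  # row 1234 failed during test
print(ctrl.translate(1234))          # redirected to a spare row (16399)
print(ctrl.translate(42))            # healthy rows pass through unchanged
```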
Sesh Ramaswami, managing director at Applied Materials, showed a cost analysis which resulted in 300mm interposer wafer costs of $500-$650 / wafer. His cost analysis showed the major cost contributors are damascene processing (22%), front pad and backside bumping (20%), and TSV creation (14%).
[..]
Since one can produce ~286 200 mm² die on a 300 mm wafer, at $575 per wafer (his midpoint cost) this works out to roughly $2 per 200 mm² silicon interposer.
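A quick back-of-the-envelope check of those numbers (the die-per-wafer approximation and the way edge loss is estimated are my assumptions, not from the article):

```python
import math

# Sanity check of the interposer cost figures quoted above.
wafer_diameter_mm = 300.0
die_area_mm2 = 200.0
wafer_cost_usd = 575.0   # midpoint of the $500-$650 range

# Common gross-die-per-wafer approximation: wafer area / die area,
# minus a term for partial die lost around the edge.
gross_die = (math.pi * (wafer_diameter_mm / 2) ** 2) / die_area_mm2 \
            - (math.pi * wafer_diameter_mm) / math.sqrt(2 * die_area_mm2)
print(f"approx. gross die per wafer: {gross_die:.0f}")   # ~306, same ballpark as the ~286 cited

# Cost per interposer using the article's ~286 die/wafer figure.
print(f"cost per 200 mm^2 interposer: ${wafer_cost_usd / 286:.2f}")  # ~$2.01
```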
GPUs have had read-only caches for quite some time, and read/write caches are available from AMD and Nvidia.

Actually, what is the nature of the GPU? It is mostly SIMD computation logic: no cache, little branch prediction, very little instruction re-ordering. Thus it's exceptionally bad for anything except 3D math. For instance ray tracing - you've got to go through all the vertices in a scene to see what the bullet intersects with, for each bullet. Hard to do without a cache! The cache greatly accelerates it by storing it on the chip. Thus the Power7 is ideal, with up to 80 MB of eDRAM cache.
If anyone thought a 2.5D interposer solution would be cost prohibitive... I think it's looking much better today
http://www.electroiq.com/articles/ap/2012/12/lifting-the-veil-on-silicon-interposer-pricing.html
Ray tracers use space partitioning; I tried to find a very old one that didn't, and I failed.

For instance ray tracing - you've got to go through all the vertices in a scene to see what the bullet intersects with, for each bullet.
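To make the point concrete, here is a toy sketch of why a ray ends up testing only a handful of primitives rather than the whole scene. It builds a small bounding-volume hierarchy over random spheres (a BVH is strictly an object partition rather than a space partition, but the pruning idea is the same); the scene data, names, and structure are all made up for illustration, not taken from any real renderer.

```python
import random

def sphere_hit(origin, direction, center, radius):
    """Smallest positive t where the (roughly unit-length) ray hits the sphere, else None."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    disc = b * b - (sum(x * x for x in oc) - radius * radius)
    if disc < 0:
        return None
    t = -b - disc ** 0.5
    return t if t > 0 else None

def box_hit(origin, direction, lo, hi):
    """Slab test against an axis-aligned box (demo ray has no zero direction component)."""
    tmin, tmax = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        t1, t2 = (l - o) / d, (h - o) / d
        tmin, tmax = max(tmin, min(t1, t2)), min(tmax, max(t1, t2))
    return tmin <= tmax

def build_bvh(spheres, axis=0):
    """Median-split BVH over (center, radius) spheres; leaves hold at most 2."""
    lo = [min(c[i] - r for c, r in spheres) for i in range(3)]
    hi = [max(c[i] + r for c, r in spheres) for i in range(3)]
    if len(spheres) <= 2:
        return {"box": (lo, hi), "leaf": spheres}
    spheres = sorted(spheres, key=lambda s: s[0][axis])
    mid = len(spheres) // 2
    return {"box": (lo, hi),
            "children": [build_bvh(spheres[:mid], (axis + 1) % 3),
                         build_bvh(spheres[mid:], (axis + 1) % 3)]}

def trace(node, origin, direction, tests):
    """Closest hit distance, skipping whole subtrees whose boxes the ray misses."""
    if not box_hit(origin, direction, *node["box"]):
        return None
    if "leaf" in node:
        hits = []
        for center, radius in node["leaf"]:
            tests[0] += 1
            t = sphere_hit(origin, direction, center, radius)
            if t is not None:
                hits.append(t)
        return min(hits) if hits else None
    hits = [t for child in node["children"]
            if (t := trace(child, origin, direction, tests)) is not None]
    return min(hits) if hits else None

random.seed(1)
scene = [([random.uniform(-50, 50) for _ in range(3)], 0.5) for _ in range(10_000)]
bvh = build_bvh(scene)
origin, direction = (0.0, 0.0, -100.0), (0.01, 0.01, 0.99995)  # roughly unit length

tests = [0]
print("closest hit t:", trace(bvh, origin, direction, tests))
print("sphere tests with BVH:", tests[0], "out of", len(scene), "spheres in the scene")
```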
HBM, which looks like a higher-end variant of a future Wide I/O using interposers and TSVs, doesn't make mention of a repartitioning of the internal DRAM arrays.
Thanks for the information. Would an HMC interposer make sense as a dedicated I/O processor chip, to make process shrinks of either the separate CPU/GPU or the APU simpler? I'm wondering about a good chip-interconnect layout for using HMC, and it seems to make sense from my limited perspective.
Have you never heard of GPGPU? nVidia GPUs that accelerate physics? Or all sorts of other things? I suggest you take a moment to catch up with the technology of the 3rd millennium.

The GPU - no cache, no ability to analyze for dependencies, no out-of-order instruction execution - thus making it an order of magnitude slower for any code except extremely simple SIMD (3D math).
I don't know what they intend to change with HBM compared to Wide I/O, if anything (I mean other than frequency, DDR pumping, and width). Right now each memory channel has its own complete 128-bit interface, including addressing, control, power/ground, etc. Wide I/O 1 is 4 channels per chip and HBM will be 8 channels, so 2 chips on an AMD GPU would mean 16 channels that can be accessed concurrently. I think they said the stacking simply adds banks that are accessed like bog-standard memory, so vertical layers only add capacity. Read/write concurrency is proportional to the number of channels, as they are managed individually by the SoC; I think the Cadence Wide-IO controller is literally one independent controller per channel.

Without knowing more about the burst lengths and how many channels make up that 1024 I/O number for HBM, it's not clear how much worse it is than HMC, and on what workloads.
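A quick tally of the channel counts and aggregate interface widths implied by those figures (assuming, as stated above, 128-bit channels, 4 per Wide I/O 1 chip and 8 per HBM stack; the function name is just for illustration):

```python
# Channel count and aggregate bus width under the figures discussed above.
CHANNEL_WIDTH_BITS = 128  # each channel has its own complete 128-bit interface

def stack_io(channels_per_stack: int, stacks: int = 1) -> dict:
    """Independently addressable channels and total I/O width for N stacks."""
    channels = channels_per_stack * stacks
    return {"channels": channels, "io_bits": channels * CHANNEL_WIDTH_BITS}

print("Wide I/O 1, one stack:", stack_io(4))      # 4 channels,  512 bits
print("HBM, one stack:       ", stack_io(8))      # 8 channels, 1024 bits (the '1024 IO' figure)
print("HBM, two stacks:      ", stack_io(8, 2))   # 16 channels, 2048 bits
```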
Here an AMD China employee uses some old news to hint at the next-gen Xbox GPU, and also drops a keyword about the next-gen Xbox CPU:
http://club.tgfcer.com/thread-6586999-1-1.html
Can anyone explain a bit of what he says?
Sadly, there are no good Chinese-to-English web translation services. With Japanese translations you can usually get the meaning, but with Chinese... not so much.
On GAF they are commenting that the OP of that thread hints at 8 Jaguar cores at 1.6 GHz + a 76xx GPU in the APU + a discrete GPU from the 8800 series. It seems he claims it is Durango, and that the 8800 is the new 8870 with almost 4 TFLOPs.
Well, that thread is only talking about Durango; he didn't mention the PS4.
How do I know that? Because I know Chinese... wait, I'm Chinese.