AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Anarchist4000 · Feb 22, 2017

lanek said:
Something like their TOP-PIM research, maybe?

I was thinking more along the lines of removing the L2 from the core and stacking that on top of logic. L2 is roughly half of the current area and could still support PIM. HBM stacks without PIM would still exist with that model.

Spitballing here, but reduce core voltage 30% (1V to 0.7V), roughly cutting power in half, and cut base area in half by stacking the cache on top of the die. Might be able to stack more than one cache die on top for added capacity without thermal issues. Clocks would be lower, but the design would have effectively doubled the compute and doubled, possibly quadrupled, cache with a ton of chiplets. Then add temporal VLIW and cascaded SIMDs to supplement the op cache design. Haven't seen papers on that since 2009 though.
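
As a rough check on the voltage figure, here's a minimal sketch. It assumes dynamic power scales as C*V^2*f and ignores leakage; the function name is just for illustration.

```python
# Rough dynamic-power scaling: P ~ C * V^2 * f (leakage ignored, an assumption).
def relative_dynamic_power(v_new, v_old, f_ratio=1.0):
    """Return new/old dynamic power for a voltage change and a clock ratio."""
    return (v_new / v_old) ** 2 * f_ratio

same_clock = relative_dynamic_power(0.7, 1.0)
print(f"0.7 V vs 1.0 V at equal clock: {same_clock:.2f}x power")
```

That's where "roughly half" comes from on the power side. The area halving is a separate effect of moving the cache to its own layer, and any clock reduction would cut power further.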

3dilettante · Feb 22, 2017

Jawed said:
"Tahiti" chiplets:

http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf

Memory stacked upon GPU blocks (as processors, not merely ROPs) is looking like a focus of AMD's research.

It seems like this is the full embrace of AMD's dis-integration strategy, where it is insufficient to just be architecturally specialized, and fundamentally incompatible physical specialization ends AMD's troubled era of commingling CPU and GPU silicon. It then goes further by apparently divorcing digital logic silicon from its analog and mixed-signal components, meaning the digital silicon cannot exist without an interposer, and an active one in this case.

It's not that having the silicon on one die didn't have benefits, although it seems the physical tradeoffs were more difficult than anticipated. It was a money maker for a good number of nodes, although for AMD not quite as much.

Of the stock graphics for the CPU and GPU chiplets, the CPU one appears to be a Jaguar module from a marketing shot, and the GPU one appears to come from the GPU portion of a Carrizo marketing picture.
The GPU chiplets are supposed to be made this way so they can continue to be leveraged in commercial products, although something as specialized as NVT with partial asynchronous design and a peak of 1 GHz appears to walk back the clocks of the current FinFET GPUs.
Some of these features also may put a ceiling on how complex GCN can get, given NVT needing to put a lid on logic depth and the complexities of asynchronous logic being applied to more complex units.

Anarchist4000 said:
I was thinking more along the lines of removing the L2 from the core and stacking that on top of logic. L2 is roughly half of the current area and could still support PIM. HBM stacks without PIM would still exist with that model.

Which L2 is that much? It's not likely that the placeholder graphics are accurate. Zen doesn't devote that much area to L2, and any probable HPC device that needs high-end serial execution is coming from that line in the next 5 years at least. The demands for that portion may not allow something as sensitive as a low-latency inclusive L2 being shunted off by microbumps.

Spitballing here, but reduce core voltage 30% (1V to 0.7V), roughly cutting power in half, and cut base area in half by stacking the cache on top of the die. Might be able to stack more than one cache die on top for added capacity without thermal issues.

The GPU chiplets are projected to be using NVT, and a limited amount of asynchronous logic. The voltage is probably going to be lower, and might be paired with something like FDSOI. The CPU section's emphasis may not permit this, and may be close to modern top voltages on whatever high-end feature like FinFET is available at the time.

Anarchist4000 · Feb 22, 2017

3dilettante said:
Which L2 is that much? It's not likely that the placeholder graphics are accurate. Zen doesn't devote that much area to L2, and any probable HPC device that needs high-end serial execution is coming from that line in the next 5 years at least. The demands for that portion may not allow something as sensitive as a low-latency inclusive L2 being shunted off by microbumps.

The primary data cache for the GPU portion. I'm not convinced stacking memory on a CPU is the way to go. With a demand for high clocks it's less practical on a CPU like you mentioned. If you recall from that one Nvidia paper, the micro op cache was reducing MRF accesses by 30-70% if I remember correctly. That might actually address the microbump concern for a GPU. It's also possible the GPUs are a monolithic 3D design. FINFET already started that progression to some degree. With tiny chiplets that may actually be feasible in regards to yields, in which case microbumps aren't used. Latency would be interesting, as it's technically closer to the logic. Not to mention the relatively low clocks the chip is already operating at. The L2 could conceivably be clocked higher than the ALUs and L1. As for area, I don't think a GPU could realistically have too much cache.

If you recall from the exascale paper the CPU and GPU stacks were different chiplets. The GPUs being towers with HBM atop and the CPUs looking normal.

3dilettante said:
The GPU chiplets are projected to be using NVT, and a limited amount of asynchronous logic. The voltage is probably going to be lower, and might be paired with something like FDSOI. The CPU section's emphasis may not permit this, and may be close to modern top voltages on whatever high-end feature like FinFET is available at the time.

Sorry, was being general with those numbers. I wasn't reading anything into specific processes with that exascale paper. Just that they'd use the most power efficient process at that time. The actual voltage would likely be as low as possible with a push for a maximum amount of logic. Keep the power minimal if piling memory on top of the die.

3dilettante · Feb 22, 2017

Anarchist4000 said:
The primary data cache for the GPU portion.

Is there a reference for the GPU L2 area?
Capacity-wise, it's not that much. The fact that it services so many 64B transactions per clock is probably why it is significantly bigger than its capacity would indicate for larger GPUs, and in those it is still not that big.
However, the large number of signals translates into a large number of IOs, which isn't free. It's 3D-stacked, which has afforded somewhat tighter pitches than what HBM currently has, but still coarse relative to on-die.
If under an HBM stack, it's 8 cache slices at 64B each for read bandwidth just for data transport. If there are simultaneous writes it's higher still, and at a minimum since this is putting the L2 on the other side of the die from DRAM, memory reads into the L2 have to die hop again and may justify having that extra connectivity.

Anarchist4000 · Feb 22, 2017

3dilettante said:
Is there a reference for the GPU L2 area?

http://www.guru3d.com/news-story/amd-ryzen-die-shot-and-caches-shown-at-icc.html
This is for Ryzen, but I haven't seen anything specific for the GPUs. Just that the cache situation is likely worse.

http://www.redgamingtech.com/nintendo-nx-uses-polaris-gpu-vulkan-2x-performance-of-ps4-analysis/
This would be XB1 with the ESRAM added. Remove the Jag cores and half that chip is SRAM. As I recall Polaris increased relative cache size substantially.

3dilettante said:
Capacity-wise, it's not that much. The fact that it services so many 64B transactions per clock is probably why it is significantly bigger than its capacity would indicate for larger GPUs, and in those it is still not that big.
However, the large number of signals translates into a large number of IOs, which isn't free. It's 3D-stacked, which has afforded somewhat tighter pitches than what HBM currently has, but still coarse relative to on-die.
If under an HBM stack, it's 8 cache slices at 64B each for read bandwidth just for data transport. If there are simultaneous writes it's higher still, and at a minimum since this is putting the L2 on the other side of the die from DRAM, memory reads into the L2 have to die hop again and may justify having that extra connectivity.

Exception to that being the designs with ESRAM or possibly PIM. Designs with a large quantity of slow cores I'd expect to have a good deal of cache just to reduce power from IO. They did mention multi-level memory hierarchies. While speculative, the micro op cache was significantly reducing power to L1 in the designs. Growing the L1 to accommodate more, larger waves would make sense. Then maximizing the L2 to avoid going off chip altogether. It might not be unreasonable to view futuristic HBM stacks as L2 either. By the time you're stacking memory directly on the logic die those lines are becoming blurred.

3dilettante · Feb 22, 2017

Anarchist4000 said:
http://www.guru3d.com/news-story/amd-ryzen-die-shot-and-caches-shown-at-icc.html
This is for Ryzen, but I haven't seen anything specific for the GPUs. Just that the cache situation is likely worse.

http://www.redgamingtech.com/nintendo-nx-uses-polaris-gpu-vulkan-2x-performance-of-ps4-analysis/
This would be XB1 with the ESRAM added. Remove the Jag cores and half that chip is SRAM. As I recall Polaris increased relative cache size substantially.

This is hopping around between very dissimilar architectures.
Zen's L2s are not that big. The dominant area is a 16MB L3, which was not what was referenced.
The Xbox One's ESRAM is a 32 MB pool meant to enable the GPU to operate on a DRAM bus with only 68 GB/s of bandwidth in 2013, and its interface is 1024 bits in each direction.

Exception to that being the designs with ESRAM or possibly PIM. Designs with a large quantity of slow cores I'd expect to have a good deal of cache just to reduce power from IO.

GPU caches in general do not match the L1-L3 capacities of server chips.
GCN's L2 caches have a known size, with Fiji and Polaris having 2 MB for the whole chip.
There are pictures of both dies, and the L2s are most likely the rectangles on either side of the center line where the CU L1 caches line up.
That's where GPUs balance their designs, and the point of TOP-PIM is that the GPUs save power by having the DRAM stacked right on them.

On a side note, AMD's point of using GPU and CPU chiplets to reuse the design makes it seem like future non-HPC hardware is going to have an interposer element, otherwise how readily can these be reused?
Also with the separate interposers in the MCM, this really breaks AMD's idea of an APU, as it's not even 2.5D integration at that point unless there's another fan-out or interposer solution below the interposers.

Anarchist4000 · Feb 23, 2017

3dilettante said:
This is hopping around between very dissimilar architectures.
Zen's L2s are not that big. The dominant area is a 16MB L3, which was not what was referenced.
The Xbox One's ESRAM is a 32 MB pool meant to enable the GPU to operate on a DRAM bus with only 68 GB/s of bandwidth in 2013, and its interface is 1024 bits in each direction.

I was referencing them to give some perspective on the area occupied by the caches in recent designs.

In regards to the ESRAM, or some form of scratchpad, that's not necessarily about memory, but energy consumption at an exascale level. Memory stacked on the logic die could conceivably appear as L2/3 depending on function of the processor. L3 not existing in a traditional GPU, but those lines may be blurred with the HBCC in Vega and likely future models. Might be relevant to the CPU as well in an APU. As mentioned in the exascale paper NVM static energy is low, but dynamic rather high. Even traditional DRAM is more expensive than leaving the package. I'd expect the exascale designs to always have larger caches to save energy.

3dilettante said:
GPU caches in general do not match the L1-L3 capacities of server chips.
GCN's L2 caches have a known size, with Fiji and Polaris having 2 MB for the whole chip.
There are pictures of both dies, and the L2s are most likely the rectangles on either side of the center line where the CU L1 caches line up.
That's where GPUs balance their designs, and the point of TOP-PIM is that the GPUs save power by having the DRAM stacked right on them.

The capacities traditionally are designed with planar area in mind. In exascale with 3D designs energy is the larger concern. Going off package is simply bad. The design focus will always be keeping cache close to logic. Going vertical with the cache should yield shorter paths which consume less energy. The stacked DRAM I'd argue could be viewed as a relatively low level cache. The silicon also is clocked very conservatively in exchange for more silicon. Back to my original point, if the logic and cache were roughly equal in area, I'd imagine stacking them makes sense so long as the design can be manufactured. Worth noting even Nvidia implemented the op caches to reduce access energy to L1. That's already pretty close and still significant. Seems increasingly likely Vega did something similar and Zen has the op caches as well.

Gubbi · Feb 23, 2017

Anarchist4000 said:
Memory stacked on the logic die could conceivably appear as L2/3 depending on function of the processor. L3 not existing in a traditional GPU, but those lines may be blurred with the HBCC in Vega and likely future models.

The problem with a multi-gigabyte LLC cache in HBM is that the size of the tag arrays gets unwieldy. If you have 4GB HBM and a cacheline size of 64 bytes, you end up with 64M cache lines. Each of the tags for these lines needs 42-44 bits for address and a few bits for state (MOESI, etc).

I don't know how you would alleviate this, some ideas:
1. Let the HBM cache act as a memory cache. It would sit between the CPU and DDR4, intercepting memory transfers. Tags and each way of the cache could be looked up in parallel to reduce latency overhead. The downside is in a multi-socket system, the HBM cache only caches memory that is physically attached to the socket.

2. Massively increase cache line length. If you make your HBM cache lines 1024 bytes with 16 sectors instead, you'll have 4M lines, each with 38-40 bits for address, a few bits for MOESI state and 16 sector-valid bits. Instead of 400MB SRAM for tags, you'd need ~30MB. That's a lot in 14nm, but might be feasible in 7/10nm. The massive cacheline length will cost a lot of bandwidth overhead, though.

There's a third option. Increase the HBM capacity to the point where you use it as main memory and treat DDR4 as a really fast block device, swapping entire pages to and from HBM.
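
To put numbers on the tag overhead, a back-of-the-envelope sketch. The 4 state bits and the use of the upper end of each quoted tag width are assumptions, and the helper name is made up for illustration.

```python
# Tag-store sizing for an HBM-as-cache design.
# Assumptions: 4 state bits per line (MOESI plus spare), upper end of the
# quoted address-tag widths, one valid bit per sector.
GiB = 1 << 30
MiB = 1 << 20

def tag_store(cache_bytes, line_bytes, addr_tag_bits, state_bits, sectors=0):
    """Return (number of lines, tag storage in bytes)."""
    lines = cache_bytes // line_bytes
    bits_per_line = addr_tag_bits + state_bits + sectors
    return lines, lines * bits_per_line / 8

lines_64, bytes_64 = tag_store(4 * GiB, 64, addr_tag_bits=44, state_bits=4)
lines_1k, bytes_1k = tag_store(4 * GiB, 1024, addr_tag_bits=40, state_bits=4, sectors=16)

print(f"{lines_64 >> 20}M lines, {bytes_64 / MiB:.0f} MiB of tags")  # 64M lines, 384 MiB of tags
print(f"{lines_1k >> 20}M lines, {bytes_1k / MiB:.0f} MiB of tags")  # 4M lines, 30 MiB of tags
```

So the ~400MB and ~30MB figures hold up under these assumptions. Most of the saving comes from the 16x drop in line count, with the sector-valid bits eating a little of it back.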

Cheers

pTmdfx · Feb 23, 2017

Gubbi said:
There's a third option. Increase the HBM capacity to the point where you use it as main memory and treat DDR4 as a really fast block device, swapping entire pages to and from HBM.

This is apparently what's behind the "HBCC" moniker, managing the HBM "cache" through the virtual memory system, and taking pages from the main memory on demand.

"HW page management support" from the VideoCardZ collection of leaks.

3dilettante · Feb 23, 2017

Anarchist4000 said:
I was referencing them to give some perspective on the area occupied by the caches in recent designs.

The reason why I had questions was because of the claim about GPU L2 area. The area saved by moving the GPU L2 storage off the die is not that major, and for various reasons may not be a significant win if AMD's concept is considered evidence.

In regards to the ESRAM, or some form of scratchpad, that's not necessarily about memory, but energy consumption at an exascale level. Memory stacked on the logic die could conceivably appear as L2/3 depending on function of the processor.

TOP-PIM and the GPU chiplets in the paper already have an HBM or similar stack right on top of them.

The capacities traditionally are designed with planar area in mind. In exascale with 3D designs energy is the larger concern. Going off package is simply bad. The design focus will always be keeping cache close to logic. Going vertical with the cache should yield shorter paths which consume less energy.

The concept in question already stacks the DRAM on top of the logic, so the off-package issue is resolved.
What this does do with the current data flow between DRAM, L2, and on-die clients is add another 2 vertical movements: from DRAM to L2, then L2 to clients. For GCN's write-through L1 to L2, everything coherent is going off-die all the time. The cost for going through TSVs and bumps is slightly higher than a similar traversal through standard metal layers, but this is an incremental amount on top of the extra traversal(s).
Also, the horizontal traversal is still there, given the long line of CU clients and control processors laid out horizontally dwarfs the dimensions of the L2 blocks. Making the L2 big enough to take up horizontal space under the die just makes it very likely that the distance traversed approaches twice what a small on-die L2 would have (and this is again for a write-through L1).
Since it is below the logic, there are very likely density penalties (larger horizontal distance) from all the power and IO drilled through the L2 layer.
The L2 serves as a concentration point, so it by design is not local to everything that uses it. In theory, a more local change would be bigger L1/LDS/register files--but those make the TSV/interface problem even worse than my next concern.

The other concern is the space taken up by doing this, which AMD hasn't indicated will be fully addressed. The interface area for the 1024 IOs and power for an HBM stack looks like it might be 1/5 (more?) of the stack's footprint. An L2 that services that many channels and 64-byte lines would, just for read service, be 4 times wider, if it's not bidirectional like the ESRAM. The GPU chiplets are likely to lose a fair chunk of area to the DRAM stack's data lines and power/ground equivalent to what each stack layer loses to vertical connectivity, before making the L2 a 4x burden.
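
A quick sketch of where the 4x comes from, assuming 8 L2 slices each moving one 64-byte line per clock, compared against the 1024 data IOs of an HBM stack. The constant and function names are illustrative only.

```python
# Signal count for an off-die L2 versus an HBM stack's data interface.
# Assumptions: 8 slices, one 64-byte line per slice per clock, no serialization.
HBM_DATA_IOS = 1024
L2_SLICES = 8
LINE_BYTES = 64

def l2_signal_bits(slices, line_bytes, separate_write_path=False):
    """Data wires needed per clock; doubled if writes get their own path."""
    bits = slices * line_bytes * 8
    return bits * 2 if separate_write_path else bits

read_only = l2_signal_bits(L2_SLICES, LINE_BYTES)
read_write = l2_signal_bits(L2_SLICES, LINE_BYTES, separate_write_path=True)

print(read_only, read_only // HBM_DATA_IOS)    # 4096 4
print(read_write, read_write // HBM_DATA_IOS)  # 8192 8
```

Read-only traffic already needs 4x the HBM data signals, and a separate write path in the style of the ESRAM's per-direction interface would push it to 8x.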

Worth noting even Nvidia implemented the op caches to reduce access energy to L1. That's already pretty close and still significant. Seems increasingly likely Vega did something similar and Zen has the op caches as well.

The operand cache is physically small because it stays on-die. The current 2.5D and 3D integration methods use vias and pads that measure in tens of microns, with 40-55um pitch, in place of wires and features that measure somewhat over 10 nanometers. It makes sense versus the big wires and pads for off-package that measure in mm.

Gubbi said:
The problem with a multi-gigabyte LLC cache in HBM is that the size of the tag arrays gets unwieldy. If you have 4GB HBM and a cacheline size of 64 bytes, you end up with 64M cache lines. Each of the tags for these lines needs 42-44 bits for address and a few bits for state (MOESI, etc).

Since HBM is DRAM, the natural alignment is the 1 or 2KB page, at least if there is to be any power efficiency and bandwidth utilization. Atomics, false sharing, or other cache operations would incur more complexity if the 64B granularity is kludged into the DRAM arrays, worsened by the long latencies and other device restrictions of the DRAM.

When moving clusters of this size, perhaps it might start looking like an in-memory physical disk system? It might start doing things like clustering data, compressing tags, or doing some kind of search in a region-based or segmented storage system.

Anarchist4000 · Feb 23, 2017

Gubbi said:
I don't know how you would alleviate this, some ideas:
1. Let the HBM cache act as a memory cache. It would sit between the CPU and DDR4, intercepting memory transfers. Tags and each way of the cache could be looked up in parallel to reduce latency overhead. The downside is in a multi-socket system, the HBM cache only caches memory that is physically attached to the socket.

2. Massively increase cache line length. If you make your HBM cache lines 1024 bytes with 16 sectors instead, you'll have 4M lines, each with 38-40 bits for address, a few bits for MOESI state and 16 sector-valid bits. Instead of 400MB SRAM for tags, you'd need ~30MB. That's a lot in 14nm, but might be feasible in 7/10nm. The massive cacheline length will cost a lot of bandwidth overhead, though.

There's a third option. Increase the HBM capacity to the point where you use it as main memory and treat DDR4 as a really fast block device, swapping entire pages to and from HBM.

3dilettante said:
Since HBM is DRAM, the natural alignment is the 1 or 2KB page, at least if there is to be any power efficiency and bandwidth utilization. Atomics, false sharing, or other cache operations would incur more complexity if the 64B granularity is kludged into the DRAM arrays, worsened by the long latencies and other device restrictions of the DRAM.

When moving clusters of this size, perhaps it might start looking like an in-memory physical disk system? It might start doing things like clustering data, compressing tags, or doing some kind of search in a region-based or segmented storage system.

Here's a thought: 16K pages/lines with a L2 RF. Possibly partitioning that added memory into different caches. It stands to reason PIM isn't required for the entire pool and would add complexity. Or different capabilities would be more or less ideal for different cache/memory types. That would be 256VGPRs or a wave. Along with that you could increase the workgroup size, hopefully locality in regards to chiplet cache, and possibly keep more synchronization local to a chiplet.

3dilettante said:
The area saved by moving the GPU L2 storage off the die is not that major, and for various reasons may not be a significant win if AMD's concept is considered evidence.

For current designs no, but I'm envisioning a significant expansion to the cache sizes and waves in flight to help with locality.

3dilettante said:
The other concern is the space taken up by doing this, which AMD hasn't indicated will be fully addressed. The interface area for the 1024 IOs and power for an HBM stack looks like it might be 1/5 (more?) of the stack's footprint. An L2 that services that many channels and 64-byte lines would, just for read service, be 4 times wider, if it's not bidirectional like the ESRAM. The GPU chiplets are likely to lose a fair chunk of area to the DRAM stack's data lines and power/ground equivalent to what each stack layer loses to vertical connectivity, before making the L2 a 4x burden.

With PIM techniques or a NOC, it may be possible to narrow that link with a higher transmission rate. If the logic is only ~1GHz, getting a 4GHz(?) transmission rate wouldn't be unreasonable. All the DDR type transmission techniques could apply. That should significantly reduce those requirements. Uncommon for within a die, but the TSV or 3D designs might warrant that. It would cost some energy, but the cache increase and TSV simplification may offset that. At the very least it's still on package. The transmission rate should be possible thanks to the relatively low base clocks.
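
Roughly what that buys, as a sketch. It assumes the L2 needs to move 4096 bits per core clock (8 slices x 64 bytes, going by the earlier posts) and treats the "4GHz" figure as 4 Gb/s per wire, which is my assumption.

```python
import math

# Wires needed to carry a fixed per-clock payload at a given per-wire rate.
# Assumptions: 4096 bits per 1 GHz core clock; encoding/clocking overhead ignored.
def wires_needed(bits_per_core_clock, core_ghz, gbps_per_wire):
    """Minimum wire count to sustain the payload at the given signalling rate."""
    return math.ceil(bits_per_core_clock * core_ghz / gbps_per_wire)

at_core_rate = wires_needed(4096, 1.0, 1.0)
at_4x_rate = wires_needed(4096, 1.0, 4.0)
print(at_core_rate, at_4x_rate)  # 4096 1024
```

Under those assumptions a 4x faster link brings the wire count down to about what an HBM stack already uses for data, before counting writes or the energy cost of the faster signalling.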

Another possibility is removing the L2 to another layer might allow for an increase in L1 and less traffic. That might be useful for some compression techniques on top of L2.

3dilettante said:
The operand cache is physically small because it stays on-die. The current 2.5D and 3D integration methods use vias and pads that measure in tens of microns, with 40-55um pitch, in place of wires and features that measure somewhat over 10 nanometers. It makes sense versus the big wires and pads for off-package that measure in mm.

Small because it's only a few registers in size. It's also located near the logic to my understanding so linear distance is as low as possible. Avoid the capacitance of a longer wire. Nvidia's implementation was to reduce the energy consumption of going to L1, which is still relatively close and on die. The bigger benefit was effectively emulating more ports to the RF to facilitate scheduling.

3dilettante · Feb 24, 2017

Anarchist4000 said:
Here's a thought: 16K pages/lines with a L2 RF. Possibly partitioning that added memory into different caches. It stands to reason PIM isn't required for the entire pool and would add complexity.

Is this some conceptual architecture unrelated to AMD's proposal?
I'm now uncertain what implementation or concept we're talking about.

For AMD's concepts, PIM or the equivalent GPU chiplet isn't optional for the on-package pool.

For current designs no, but I'm envisioning a significant expansion to the cache sizes and waves in flight to help with locality.

This is envisioning a more fundamental reworking of the architecture than just a bigger cache, since the L2 is a physically distributed point of global visibility. It is by definition not local to its clients on the plane of the chip, and if it's local to one it is not well-shared with the others. The GPU itself is quite insensitive to the latency of the 32GB of HBM sitting on top of it, so the stack no longer seems that distant, especially relative to an L2 that is itself yet another stack. What the stack cannot do well is provide synchronization and bandwidth amplification in a practical manner, so long as the stack is HBM or something like it.

With PIM techniques or a NOC, it may be possible to narrow that link with a higher transmission rate.

The HBM stack uses DDR to get to its current rates, and it loses about 1/5 of its slice footprint and some amount of surrounding logic to the interface for that purpose.

If the logic is only ~1GHz, getting a 4GHz(?) transmission rate wouldn't be unreasonable.

What is being accomplished by placing an HBM stack on top of the GPU if we're turning around and putting a GDDR4/5 interface in for the L2? It's at a minimum 4x wider, but might be closer to 8, and HBM2 is already 2Gbps/pin.
If that were worthwhile, they'd put a 2K pin GDDR5 stack on top of the chip and be done with it.

Another possibility is removing the L2 to another layer might allow for an increase in L1 and less traffic.

Is this related to what AMD is proposing with a GCN-like GPU? There's barely any area gained by taking the L2 array cost out of them.

Small because it's only a few registers in size.

My point is that it is only doable if kept 2D, and cannot be contemplated if dealing with TSVs and microbumps. A high-performance 14nm 6T SRAM cell is documented by GF as being .08um2.
A single ALU's operand reuse cache for just one of its operands is 8 bytes (64 6T bits), or about 5um2. An example TSV has a diameter of 10 um, or 5 times the area of a single instance of one operand reuse lane. They're operating at scales multiple orders of magnitude apart.
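
A quick sanity check of the scale gap, as a sketch. It assumes a circular TSV cross-section and ignores keep-out zones and interface transistors, so the TSV side is a lower bound.

```python
import math

# Area comparison: 64 bits of 6T SRAM versus one TSV.
# Assumptions: circular TSV, no keep-out zone, 0.08 um^2 per bit cell.
SRAM_CELL_UM2 = 0.08
OPERAND_BITS = 64
TSV_DIAMETER_UM = 10.0

operand_area = SRAM_CELL_UM2 * OPERAND_BITS
tsv_area = math.pi * (TSV_DIAMETER_UM / 2) ** 2
ratio = tsv_area / operand_area
cells_per_tsv = tsv_area / SRAM_CELL_UM2

print(f"operand slot: {operand_area:.2f} um^2, TSV: {tsv_area:.1f} um^2")
print(f"TSV / operand slot: {ratio:.1f}x, bit cells per TSV: {cells_per_tsv:.0f}")
```

With a circular via the ratio comes out nearer 15x than 5x, and close to a thousand bit cells fit in the footprint of one TSV, which only widens the gap.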

Anarchist4000 · Feb 24, 2017

3dilettante said:
Is this some conceptual architecture unrelated to AMD's proposal?
I'm now uncertain what implementation or concept we're talking about.

For AMD's concepts, PIM or the equivalent GPU chiplet isn't optional for the on-package pool.

Expanding on the proposal with the actual implementation. The paper outlines techniques they envision using to reduce power consumption.

3dilettante said:
This is envisioning a more fundamental reworking of the architecture than just a bigger cache, since the L2 is a physically distributed point of global visibility. It is by definition not local to its clients on the plane of the chip, and if it's local to one it is not well-shared with the others. The GPU itself is quite insensitive to the latency of the 32GB of HBM sitting on top of it, so the stack no longer seems that distant, especially relative to an L2 that is itself yet another stack. What the stack cannot do well is provide synchronization and bandwidth amplification in a practical manner, so long as the stack is HBM or something like it.

3dilettante said:
Is this related to what AMD is proposing with a GCN-like GPU? There's barely any area gained by taking the L2 array cost out of them.

hpca2017_exascale_apu.pdf said:
The EHP uses eight GPU chiplets. Our initial configuration provisions 32 CUs per chiplet. Each chiplet is projected to provide two teraflops of double-precision computation, for a total of 16 teraflops. Based on the projected system size of 100,000 nodes, this would provide a total of 1.6 exaflops (we over-provision because real applications do not achieve 100% utilization).

I'm expanding on the ideas they laid out in the paper. My thinking being that each chiplet behaves as a traditional chip. Conceptually the design they proposed was 8 GPUs working together on an interposer. The paper didn't really go into GCN design changes. Just higher level how to arrange everything on an interposer and power optimization ideas. This still assumes Navi and exascale are related. My assumption was increasing the ability to hide latency (off chiplet) and increase locality (larger workgroups) by expanding the number of active waves without adding execution units. Hence more L1/L2 area and possibly stacking L2 as a means to decrease footprint. L2 also expanding significantly to avoid dram traffic with larger shared caches. Further than what was explored in that paper. A double layer 3D logic die with "future HBM" on top of it. Again, not covered in the paper, but an interesting possibility.
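
For reference, the headline numbers from that quote work out as follows. A minimal sketch; the 1 exaflop sustained target is my assumption for what "exascale" means here, not something stated in the quote.

```python
# Peak throughput implied by the quoted EHP configuration.
CHIPLETS_PER_NODE = 8
TFLOPS_PER_CHIPLET = 2.0   # double precision, from the quote
NODES = 100_000
TARGET_EXAFLOPS = 1.0      # assumption: sustained target for "exascale"

node_tflops = CHIPLETS_PER_NODE * TFLOPS_PER_CHIPLET
system_exaflops = node_tflops * NODES / 1e6   # 1 exaflop = 1e6 teraflops
required_utilization = TARGET_EXAFLOPS / system_exaflops

print(f"{node_tflops:.0f} TF per node, {system_exaflops:.1f} EF peak")
print(f"utilization needed for {TARGET_EXAFLOPS:.0f} EF: {required_utilization:.1%}")
```

So the over-provisioning leaves room for applications running at a bit over 60% of peak, under that assumed target.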

3dilettante said:
The HBM stack uses DDR to get to its current rates, and it loses about 1/5 of its slice footprint and some amount of surrounding logic to the interface for that purpose.

What is being accomplished by placing an HBM stack on top of the GPU if we're turning around and putting a GDDR4/5 interface in for the L2? It's at a minimum 4x wider, but might be closer to 8, and HBM2 is already 2Gbps/pin.
If that were worthwhile, they'd put a 2K pin GDDR5 stack on top of the chip and be done with it.

The HBM3 speculation I've seen was exploring lower voltage signaling and reducing the number of TSVs to facilitate manufacturing. So "stacked GDDR5" might not be an unreasonable analogy. Bit less bandwidth with narrower interface, more capacity, simpler manufacturing.

3dilettante said:
My point is that it is only doable if kept 2D, and cannot be contemplated if dealing with TSVs and microbumps. A high-performance 14nm 6T SRAM cell is documented by GF as being .08um2.
A single ALU's operand reuse cache for just one of its operands is 8 bytes (64 6T bits), or about 5um2. An example TSV has a diameter of 10 um, or 5 times the area of a single instance of one operand reuse lane. They're operating at scales multiple orders of magnitude apart.

I think you misunderstood me on this. Operand reuse cache in my mind is part of the logic. Realistically it's just a couple of latches per ALU and not all that large. Op cache with the ALU, L1 off to the side a bit, significantly expanded L2 possibly on another chip or layer of a 3D monolithic die (no TSVs) and "future" HBM spec on top of that. I'm not disagreeing with you on the TSV and microbump area. Just that L2 could grow to be as large as the logic+L1+NOC interface, etc as a means to reduce power. On par with taking a LDS unit and making a layer out of it.

3dilettante · Feb 24, 2017

Anarchist4000 said:
Hence more L1/L2 area and possibly stacking L2 as a means to decrease footprint.

The concern with stacking using TSVs and microbumps is that they are several orders of magnitude larger than what currently connects the L2 to the rest of the GPU. The DRAM stacks lose a very significant amount of area to the vertical interconnect, and that is for a DDR interface that is significantly narrower and lower in bandwidth.
What is being changed to remedy this? How does using TSVs and bumps 100-1000 times larger than a bit line in a 2D arrangement reduce footprint?
There seems to be an unspoken reinvention of far more than L2 footprint, which is currently not a major contributor to area or the lateral distance data must travel in the GPU. If you wanted to minimize that, it would be stacking resources with actual locality to the CUs, which the L2 by definition is not. However, the problem with increasing local CU resources is that they have as much or higher bit widths, and 32x as many of them.

The HBM3 speculation I've seen was exploring lower voltage signaling and reducing the number of TSVs to facilitate manufacturing.

It's not in the same order of magnitude of what is necessary, and sacrifices power efficiency to do it.

I'm not disagreeing with you on the TSV and microbump area. Just that L2 could grow to be as large as the logic+L1+NOC interface, etc as a means to reduce power.

The L2 grows as large as the whole chip minus its routing logic+silicon rendered useless by pillars to the GPU above+TSVs through them for the power+ground+signalling that need to get out of the stack. Drilling through the die requires very large pillars, and they take up additional area thanks to the process wrecking the silicon around them and requiring transistors specifically for interfacing with them.
As a result, the GPU above has apparently negative area for making use of the L2.

Unless there is a change, growing the L2 to the full size of the chip would produce a much longer lateral distance. The L2's dimensions are a minority contributor to the maximum distance traveled in current GPUs, because the L2 doesn't contribute much to the footprint at all.

Anarchist4000 · Feb 24, 2017

3dilettante said:
The concern with stacking using TSVs and microbumps is that they are several orders of magnitude larger than what currently connects the L2 to the rest of the GPU. The DRAM stacks lose a very significant amount of area to the vertical interconnect, and that is for a DDR interface that is significantly narrower and lower in bandwidth.

http://spectrum.ieee.org/semiconductors/design/the-rise-of-the-monolithic-3d-chip

This would be a hybrid approach as TSVs would still be required for the HBM or stacked DRAM. The 3D logic+cache should lower the footprint allowing for more vias. The cache would still be useful to avoid hitting the DRAM in the first place. The L2, or possibly L3, expansion would facilitate a tiled approach or possibly read/write through cache. Not sure it applies to Navi at all, but for an exascale design it might fall within that scope.

3dilettante · Feb 25, 2017

Anarchist4000 said:
http://spectrum.ieee.org/semiconductors/design/the-rise-of-the-monolithic-3d-chip

This would be a hybrid approach as TSVs would still be required for the HBM or stacked DRAM. The 3D logic+cache should lower the footprint allowing for more vias. The cache would still be useful to avoid hitting the DRAM in the first place. The L2, or possibly L3, expansion would facilitate a tiled approach or possibly read/write through cache. Not sure it applies to Navi at all, but for an exascale design it might fall within that scope.

This would at least bring the area cost down to something that might be feasible, and would make the suggestion that the L2 use a GDDR interface unnecessary.
The density might still be lower than same-plane integration. At least some of the marketing has the connections being "only" 100 times smaller than something up to 1000 times too large. Whether that would be sufficient for an L2 or a new tier is unclear.

Complexity and congestion may be concerns as well, which might make it a riskier proposition to adopt something this complex so early.
http://semimd.com/blog/2016/04/27/letis-coolcube-3d-transistor-stacking-improves-with-qualcomm-help/

The HPC concept assumes at least 2 generations of HBM past this one, and Navi is at most one more generation in memory type.
The 3D integration method in that link seems too far out for Navi, with the latest version of the technology still being researched and not planned for transfer to a manufacturing partner for at least two years. In Navi's probable time frame it would already be in manufacturing, and it is unlikely to be able to adopt something just barely being transferred to a foundry.
The inter-die physical integration in AMD's proposal is rather aggressive already, and may not be projecting that this would be included in that version. Elements like the apparent use of near-threshold computing need to minimize the impact of variation, which is helped by the die manufacturing being more mature. There is some level of degradation in transistor quality in the additional layer as well, and regardless of which layer the CU or memory is placed on there are some downsides based on the impact of TSVs, the complexity of the interlayer wiring, and reduced performance.
In some ways, it would be nicer if a layer of cache took the TSV hit, although in this stack configuration that would give the CUs inferior silicon.

Regardless of this type of 3D integration, one unfortunate element to having the GPU below the DRAM is thermal. If it were the other way around, cooling would be more straightforward, but unfortunately would put too many demands on the DRAM below in terms of how much would have to pass through it and out of its base.

Jawed · Feb 26, 2017

We set the per-node power budget to 160W to leave enough power for cooling, inter-node network, etc. so that the total system-wide power would not exceed 20MW.

Naively, assuming that the CPUs consume 0 power, that means 8 GPU chiplets share a budget of 160W, 20W each. I think it's worth remembering, therefore, that these proposed chiplets are extremely low power compared with current GPUs.

Alessio1989 · Feb 27, 2017

I want a coffee machine feature, or at least an embedded DisplayPort Bialetti moka pot in those GPUs, because I pretend to have a totally new rasterizer with Vega.

3dilettante · Feb 27, 2017

Jawed said:
Naively, assuming that the CPUs consume 0 power, that means 8 GPU chiplets share a budget of 160W, 20W each. I think it's worth remembering, therefore, that these proposed chiplets are extremely low power compared with current GPUs.

The TOP-PIM paper capped the GPU layer of a stack at 10W for thermal reasons. For the ENA, that leaves 80W for the GPU chiplets themselves. A current HBM stack could draw several watts, and may be that much in the future.
Then there are the active interposers, the 8 channels of off-package memory, and those modules' draw.
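A quick sketch of the budget arithmetic from the last two posts, in Python. The 160W node budget, the 8 GPU chiplets and the 10W TOP-PIM cap are the figures from the posts. The function names are mine, and I've left out the HBM stacks, interposers and off-package memory since no numbers were given for those.

```python
NODE_BUDGET_W = 160.0      # per-node budget from the quoted paper
GPU_CHIPLETS = 8           # GPU chiplets per node
TOP_PIM_GPU_CAP_W = 10.0   # per-stack GPU layer cap from the TOP-PIM paper


def naive_per_chiplet_w(node_budget_w=NODE_BUDGET_W, chiplets=GPU_CHIPLETS):
    """Naive split: CPUs and everything else assumed to draw 0W."""
    return node_budget_w / chiplets


def capped_budget_w(node_budget_w=NODE_BUDGET_W, chiplets=GPU_CHIPLETS,
                    cap_w=TOP_PIM_GPU_CAP_W):
    """Return (total GPU chiplet power at the cap, remainder for the rest)."""
    gpu_total = chiplets * cap_w
    return gpu_total, node_budget_w - gpu_total


gpu_total, remainder = capped_budget_w()
print(f"naive per-chiplet budget: {naive_per_chiplet_w():.0f} W")
print(f"GPU chiplets at the cap:  {gpu_total:.0f} W")
print(f"left for everything else: {remainder:.0f} W")
```

The remainder is what the CPUs, DRAM stacks, interposers and off-package memory would have to share under that cap.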

Rootax · Jul 15, 2017

Quick question about Navi :

Navi will be the first GPU "designed" by Raja Koduri since he's back with AMD, no? I mean, he came back mid 2013, and I guess work on the Vega architecture had already started? Or am I overestimating the time it takes to make a GPU?